On Disaster Recovery and my SQL Rally 2011 Presentation

Yesterday, I saw a Twitter post regarding the speaker evaluation results from SQL Rally 2011 in Orlando, FL last May. I was surprised to see that my session was among the top 3 sessions of the conference. I dug up the Excel spreadsheet containing my session evaluation results and began to read. One comment fascinated me (it came from the only evaluation where I got very low scores) because it concerned the speaker’s knowledge of the subject. The comment was: “copy and paste coder.” I’ve been giving this specific presentation for almost 5 years now, with a few tweaks every once in a while based on feedback from attendees. Yes, I live and breathe disaster recovery as part of my day-to-day job. However, there are several reasons why I neither type nor write code during my presentations. Here are a few of them:

  1. A presentation is a performance: Many will disagree with me on this, especially experts who believe that to demonstrate their expertise, they should be writing code and doing live demos during a presentation. Whenever I go up on stage to deliver a presentation, I always think about the audience. My goal is not to display my expertise nor to brag about what I can do that the audience cannot. I always remember that my presentations are not about me, but about the audience. That is why I do a lot of preparation prior to delivery – research, writing an appropriate storyline (you got it right – storyline), selecting the right demos, building test environments, writing demo scripts, rehearsing my presentation, etc. Yes, I rehearse my presentations and I say them out loud. I do the best that I can to make sure that the audience will be entertained, engaged, enlightened, educated and encouraged. If I’m doing a presentation on disaster recovery, I even plan out what type of disaster I will be simulating. Doing this helps me make sure that I don’t go beyond the time limit that was allotted for my session while still covering all of the items that I intend to. I’d be very happy if the audience walked out of my presentation with something that they will do when they get back to their regular routine. I keep in mind what Dr. Nick Morgan, one of America’s top communication theorists and coaches, always says: “The only reason to give a speech is to change the world.” So, if you attend a presentation I’m delivering in the future, I assure you that you won’t be disappointed.

Disaster Recovery Is More Than Just Technology Part 3: The Lion, The Switch and The Wardrobe

You are in your favourite bar one Saturday night when, suddenly, you hear your mobile phone ring. You pick up the phone and hear a screaming voice on the other end (no, it’s not your wife telling you to go home and take out the trash). The background noise prevents you from understanding what is actually being said. You check the number that registered on the phone – it’s your manager. You step out of the bar to hear more clearly, and you barely catch the last phrase: “the production database is in recovering state for more than an hour now…” And then your battery goes dead. Sound familiar?

In my previous blog post, I talked about the different acronyms that come with the term disaster recovery. In this blog post, I’ll talk about key items that we sometimes tend to ignore when creating a disaster recovery strategy – the lion, the “switch” and the wardrobe (I’ve been a fan of the Narnia movie series, from which I got the idea). And, yes, I did get a phone call like that one while I was driving with my family, so I had to pull over and guide the other person on the line as they tried to recover the database.

Managing production systems: document, test, verify, try again.

A couple of days ago I was reading the paper “Paxos Made Live – An Engineering Perspective,” written by Google engineers. It is an interesting read about implementing the Paxos algorithm to build a fault-tolerant database. But one paragraph made me think I was reading something very familiar:

We decided to err on the side of caution and to rollback our system to the old version of Chubby (based on 3DB) in one of our data centers. At that point, the rollback mechanism was not properly documented (because we never expected to use it), its use was non-intuitive, the operator performing the roll-back had no experience with it, and when the rollback was performed, no member of the development team was present. As a result, an old snapshot was accidentally used for the rollback. By the time we discovered the error, we had lost 15 hours of data and several key datasets had to be rebuilt.

This really looked like one of the incident notifications we at Pythian send to our customers in case of a production outage or any other significant issue. Don’t get me wrong, I am not saying: “Look, big guys like Google have problems too!” The point here is that when you manage a production environment of any scale, whether it is a multi-terabyte, heavily loaded system or a “one database” website, you face similar organizational problems. This short paragraph points to some very important questions you should be asking yourself every day if you are in charge of a production system.

  • Are all of your standard procedures, like backup/restore, well documented? You never know who will be dealing with issues when thunder strikes.
  • Do you have proper escalation procedures, so every production support team member knows where to look for help when they are stuck or in doubt?
  • Do you crosscheck work done by others? This can help you catch things like the wrong backup being used for a restore, suggest a way to improve someone’s work, or learn something new from your colleagues and make the existing process better.
  • Stop trusting your own procedures. Test and verify them from time to time. Things tend to change, sometimes unnoticed. So if you haven’t tried to restore your backup for 3 months, you can’t really be sure it works.
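The last point can be turned into a routine check. The following is only a minimal Python sketch, not a description of any particular tool: the 30-day threshold, the `unverified_backups` function and the database names are all hypothetical, and it assumes you already record the date of each successful restore test somewhere.

```python
from datetime import date, timedelta

# Hypothetical threshold: a backup counts as unverified if it has not been
# restored successfully within the last 30 days.
MAX_AGE = timedelta(days=30)


def unverified_backups(last_restore_test, today, max_age=MAX_AGE):
    """Return the systems whose last restore test is missing or too old.

    last_restore_test maps a system name to the date of its most recent
    successful restore test, or None if it has never been tested.
    """
    stale = []
    for name, tested_on in last_restore_test.items():
        if tested_on is None or today - tested_on > max_age:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    # Made-up inventory, for illustration only.
    inventory = {
        "orders_db": date(2011, 11, 20),
        "billing_db": date(2011, 8, 1),
        "wiki_db": None,
    }
    print(unverified_backups(inventory, today=date(2011, 12, 5)))
```

With this made-up inventory the check flags `billing_db` and `wiki_db`. A system that has never been restored is treated the same as one tested too long ago, since in both cases you cannot be sure the backup works.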

    Disaster Recovery Is More Than Just Technology Part 2: The Acronyms

    In my previous blog post, I talked about high availability and disaster recovery (HADR) and how it is more than just the underlying technology that keeps the entire strategy intact. In this blog post, I’ll describe a few acronyms – sometimes called buzzwords – that are commonly referred to in HADR projects and implementations (I know I use them a lot when addressing questions regarding HADR). These acronyms fall under the second P in my PPT for HADR – PROCESS. Every HADR project or implementation should define these acronyms well before anyone purchases the hardware, software and technologies they intend to use. Let’s get going.

    Disaster Recovery is More Than Just Technology Part 1

    While I was at the PASS Summit 2010, I spent a fair amount of time at the Ask-the-Experts table on high availability, disaster recovery and virtualization. Conference attendees with different requirements for high availability and disaster recovery came to these tables to ask questions.

    I’ve spent a fair amount of time doing high availability and disaster recovery (HADR) in my previous life as a data center engineer focusing on the Microsoft platform. My previous organization sold high availability and disaster recovery solutions to customers like crazy, highlighting the fact that the solutions are more than just the technology aspect. Every time I talk about HADR in my presentations, I focus on the three main ingredients to have a successful implementation – people, process and technology (PPT). Note that technology is at the end of the list as the people and the process components should come first.

    What I heard at the PASS Summit gave me insight into how people approach HADR (and I thought I only saw this on the newsgroups and forums where I answer questions). Most SQL Server DBAs (and maybe even a lot of IT professionals) want a technical answer to their HADR problem. They want to know whether failover clustering, database mirroring, replication or log shipping is the best solution for their requirement. What’s funny is that when I ask them what their RPO/RTO/SLAs are, they scratch their heads and ask what those acronyms mean. And when I start explaining these acronyms to them, they still want to hear what the best solution is for their requirement.

    As I prepare for my presentation on Disaster Recovery Techniques for SQL Saturday #61 in Washington DC, I’ll be writing a series of articles about disaster recovery, what RPO/RTO/SLAs are and how they fit into an overall disaster recovery strategy. Before I dive into the “technology” part of the PPT ingredients for a successful HADR implementation, I will talk about the people and the process parts first. Why? Because these two will drive the technology part of the whole strategy. And if you’re in the Washington DC area, feel free to drop by the SQL Saturday event.
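To show why the RPO/RTO question has to be answered before any technology is chosen, here is a minimal Python sketch. It assumes the usual meanings of the acronyms: RPO (recovery point objective) is the maximum tolerable data loss expressed as a span of time, and RTO (recovery time objective) is the maximum tolerable time to bring the service back. The function names, the backup schedule and the targets are all made up for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Time between the last backup taken before the failure and the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup predates the failure")
    return failure_time - max(usable)


def meets_objectives(backup_times, failure_time, restore_duration, rpo, rto):
    """Compare an (assumed) failure scenario against the RPO and RTO targets."""
    loss = data_loss_window(backup_times, failure_time)
    return {"rpo_met": loss <= rpo, "rto_met": restore_duration <= rto}


if __name__ == "__main__":
    # Hypothetical schedule: a backup every six hours.
    backups = [datetime(2011, 1, 15, hour) for hour in (0, 6, 12)]
    result = meets_objectives(
        backups,
        failure_time=datetime(2011, 1, 15, 17, 30),
        restore_duration=timedelta(minutes=45),
        rpo=timedelta(hours=1),
        rto=timedelta(hours=2),
    )
    print(result)
```

In this made-up scenario the restore finishes well inside the two-hour RTO, but five and a half hours of data are lost against a one-hour RPO. Whether that is acceptable is a business decision, which is why the question comes before the choice between clustering, mirroring, replication or log shipping.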

    “Oracle RMAN 11g Backup and Recovery” book

    One of the most critical skills of any Oracle DBA is the ability to prevent a system crash and to restore and recover the system in case of a disaster. The “Oracle RMAN 11g Backup and Recovery” book by Robert G. Freeman and Matthew Hart is a resource that can definitely help you acquire that skill. I recently received my early copy of it, and I am honored to have contributed to Chapter 5, “Oracle Secure Backup.”

    My acquaintance with Robert G. Freeman began last summer when I found the “Merge me baby!!” contest for a new book on his blog. I have always loved solving puzzles and that one seemed interesting to me. As it turned out, I won the contest. When I learned that Robert was looking for someone who could help him with the chapter about Oracle Secure Backup (OSB) for his new book, I immediately offered him my help and my experience with OSB.

    I highly recommend the book. It has not been published yet but can be pre-ordered on Amazon.com or on McGraw-Hill Professional Books.

    Product Details:
    * Paperback: 688 pages
    * Publisher: McGraw-Hill Osborne Media; 1 edition (May 17, 2010)
    * Language: English
    * ISBN-10: 0071628606
    * ISBN-13: 978-0071628600

    When Was Your Last Disaster Recovery Test?

    If you answer anything other than something like “last month and every month before that”, then you are probably in trouble. Learn from Wikipedia’s Data Center Overheating.

    It doesn’t mean that they didn’t regularly test their disaster recovery process. Maybe they did but the failover mechanism was broken after the last test.

    Regular DR procedure validation is designed to minimize the risk of a broken process going unnoticed. If the failure is detected during a regular switchover, you are far better prepared to handle it (or you can simply leave services on the current primary site) than during an emergency failover, when you get to the “Oh shit!” moment under tremendous pressure to get services back.

    Backups in SQL Server 2005/2008, Part 1: The Basics

    This is the first post in a series dedicated to exploring the backup and availability options in SQL Server 2005 and 2008. It is aimed at anyone unfamiliar with the database backup options in SQL Server 2005 and 2008. I’m not going to explore every single option or scenario; the goal is to give you the language and the tools to do deep dives where you need to.

    SQL Server 2005 has several DBA-job-saving options available to the would-be administrator. Think of Database Backup as the technology to save data and Database Availability as the technology to keep it online and available to its consumers.

    A very brief introduction to SQL Server databases

    It’s important to know a few SQL Server database basics in order to understand the backup options. If you know what a recovery model is, and the difference between an .ldf and an .mdf file, you can skip this section. If this is as good as a foreign language to you, read on.

    Different Technology Stacks On Production and DR?

    Last week, I was at the NetApp office in North Sydney for a presentation on NetApp SnapManager for Oracle. It was a good opportunity to learn more about NetApp snapshots while working on a project for one of our clients in Sydney. It was an especially interesting topic as I have some experience using Veritas Checkpoints (see my presentation on test system refreshes), and I wanted to see what’s different and new in the NetApp implementation. But I digress.

    I learned that NetApp can provide access to the same LUNs via either Fibre Channel (FC) or iSCSI. And this is when the interesting argument surfaced. Apparently, some companies aim to have the technology stack on their disaster-recovery site as different as possible from the primary production site. Their argument is that if one technology fails at the primary site (like FC to access storage), then the DR site using a different technology stack will more likely be unaffected.

    Hrm . . .  I had never thought about this, and when I consider it now, it still doesn’t appeal to me. If I design a highly-available solution with a disaster-recovery site in place, one of my priorities would be to switch between the sites comfortably at any time. The more differences two sites have, the lower my comfort level is.

    The only reason I can think of for some companies to “demand” different storage technology stacks at production and DR is to justify a more convenient (cheaper?) implementation.

    Thoughts? Comments?
