Managing Production Systems: Document, test, verify, try again.
Jan 27, 2011 / By Danil Zburivsky
A couple of days ago I was reading a paper Paxos Made Live – An Engineering Perspective written by Google engineers. It is an interesting reading about implementation of Paxos algorithm for building a fault-tolerant database. But one paragraph made me think I am reading something very familiar:
We decided to err on the side of caution and to rollback our system to the old version of Chubby (based on 3DB) in one of our data centers. At that point, the rollback mechanism was not properly documented (because we never expected to use it), its use was non-intuitive, the operator performing the roll-back had no experience with it, and when the rollback was performed, no member of the development team was present. As a result, an old snapshot was accidentally used for the rollback. By the time we discovered the error, we had lost 15 hours of data and several key datasets had to be rebuilt.
This really looked like one of the incident notification we at Pythian send to our customers in case of production outage or any other significant issue. Don’t get me wrong, I am not saying: “Look, big guys, like Google, have problems too!”. The point here is that when you manage production environment of any scale, whether it is a multi-terabyte heavily loaded system, or a “one database” website, you face similar organizational problems. This short paragraph points to some very important questions you should be asking yourself everyday if you are in charge of a production system.
- Are all of your standard procedures, like backup/restore well documented? You never know who will be dealing with issues when thunder strikes.
- Do you have a proper escalation procedures, so every production support team member knows where to look for help, in case he is stuck or in doubt?
- Do you crosscheck work done by others? This can help you catch things like wrong backup used for restore, suggest a way to improve one’s work, or learn something new from your colleagues and make existing process better.
- Stop trusting your own procedures. Test and verify them from time to time. Things tend to change, sometimes unnoticed. So if you haven’t tried to restore your backup for 3 months, you can’t really be sure it works.
Leave a Reply