My thoughts on the resilience of Cassandra
This post is part 1 of a 2-part series. It will be different from my previous posts, as it is more about some decisions you can make with Cassandra regarding the resilience of your system. I will talk about this topic in depth at the upcoming DataStax Days in London; this is more of an introduction!
TL;DR: Cassandra is tough!
DataStax describes Cassandra as delivering “…continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful data model designed for maximum flexibility and fast response times”. In a production system, having a failure-tolerant persistence layer is a big deal. Even more so when you can easily make it resilient to the failure of entire locations through geographic replication.
As in any production system, you need to plan for failure. Should we blindly trust Cassandra's resilience and forget about the plan because “Cassandra can handle it”? Reading the documentation, some may think that having several data centers and a high enough replication factor means we are covered. In part this is true. Cassandra will handle servers going down, even a full DC (or several!). But you should always prepare for chaos anyway! Failure will increase pressure on your remaining servers, latency will increase, etc. And when things come back up, will it just work? Are you ready to get all the data back in sync? Did you forget about gc_grace_seconds? There are lots of variables and small details that can be forgotten if you don’t plan ahead. And in the middle of a problem, having forgotten those details will not help!
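To make the gc_grace_seconds question concrete: a node that stays down longer than a table's gc_grace_seconds may have missed deletes whose tombstones the other replicas have already purged, so bringing it back as-is can resurrect deleted data. Below is a minimal sketch of that check, assuming the default gc_grace_seconds of 864000 (10 days). The function name `safe_to_rejoin` and the dates are hypothetical, used only for illustration.

```python
from datetime import datetime, timedelta

# Cassandra's default gc_grace_seconds (10 days). Check your own table settings,
# since this value can be overridden per table.
DEFAULT_GC_GRACE_SECONDS = 864_000


def safe_to_rejoin(down_since, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Return True if the node was down for less than gc_grace_seconds.

    Hypothetical helper: past that window, the node may hold data whose
    tombstones were already purged elsewhere, so it needs special handling
    before it rejoins.
    """
    return (now - down_since) < timedelta(seconds=gc_grace_seconds)


down_since = datetime(2024, 1, 1)
print(safe_to_rejoin(down_since, datetime(2024, 1, 5)))   # down for 4 days
print(safe_to_rejoin(down_since, datetime(2024, 1, 15)))  # down for 14 days
```

A check like this belongs in the recovery runbook, so the decision is already made before the outage happens.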
My experience tells me that you must take Cassandra failures seriously, and plan for them! Having a plan B is never a bad thing, and even a plan C. Also, make sure those plans work! So for this short introduction I will leave a few recommendations:
- Test your system against Cassandra delivering a bad service (timeouts, high latency, etc.).
- Set a “bare minimum” for your system to work (how low can we go on consistency, for example).
- Test not only what happens when your system goes down, but also what happens when it comes back up!
- Keep calm! Cassandra will help you!
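As an illustration of the “bare minimum” recommendation, here is a minimal sketch of falling back to a weaker consistency level when a stronger one cannot be satisfied. It does not use a real driver: `Unavailable`, `execute_with_fallback`, and `fake_query` are hypothetical stand-ins, and only the consistency level names (QUORUM, LOCAL_QUORUM, ONE) come from Cassandra itself. Whether downgrading is acceptable at all depends on your application.

```python
# Consistency levels to try, strictest first; the last entry is the "bare minimum".
FALLBACK_LEVELS = ["QUORUM", "LOCAL_QUORUM", "ONE"]


class Unavailable(Exception):
    """Hypothetical stand-in for a driver error meaning too few live replicas."""


def execute_with_fallback(run_query, levels=FALLBACK_LEVELS):
    """Call run_query(level) at each level in order; return (level, result).

    Re-raises the last Unavailable error if even the weakest level fails.
    """
    if not levels:
        raise ValueError("at least one consistency level is required")
    last_error = None
    for level in levels:
        try:
            return level, run_query(level)
        except Unavailable as err:
            last_error = err  # not enough replicas at this level, try a weaker one
    raise last_error


# Simulated degraded cluster: only requests at ONE can be served.
def fake_query(level):
    if level != "ONE":
        raise Unavailable(level)
    return ["row"]


print(execute_with_fallback(fake_query))
```

Returning the level that actually succeeded lets the caller log or alert when the system is running below its normal guarantees, which is useful when testing the failure scenarios listed above.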
Overall, Cassandra is a tough and robust system. I’ve had major problems with network, storage, Cassandra itself, etc. And in the end Cassandra not only survived, it gave me no downtime. Every problem I had increased my knowledge and awareness of what I could expect. This led to planning for major problems (which did happen), and that planning, combined with the natural resilience of Cassandra, got me through those events without downtime.
Feel free to comment on or discuss it in the comment section below! Juicy details will be left for London!