My thoughts on the resilience of Cassandra

Apr 13, 2015 12:00:00 AM

The Resilience of Apache Cassandra: An Introduction

This post is part 1 of a 2-part series. It is a bit different from my previous posts, as it is more about the decisions you can make with Cassandra regarding the resilience of your system. I will cover this topic in depth at the upcoming DataStax Days in London; this is more of an introduction!

TL;DR: Cassandra is tough!

Built for Failure: Continuous Availability and Scalability

DataStax describes Cassandra as delivering “…continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure, along with a powerful data model designed for maximum flexibility and fast response times”. In a production system, having a fault-tolerant persistence layer is a big deal. Even more so when you can (easily) make it resilient to the failure of an entire location through geographic replication.

Beyond the Documentation: Planning for Real-World Chaos

As in any production system, you need to plan for failure. Should we blindly trust Cassandra’s resilience and forget about the plan because “Cassandra can handle it”? Reading the documentation, some may conclude that with several data centers and a high enough replication factor we are covered. In part this is true: Cassandra will handle servers going down, even a full DC (or several!).
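To make “a high enough replication factor” a bit more concrete, here is a minimal sketch of the quorum arithmetic, assuming a hypothetical two-DC topology with three replicas in each (the DC names are made up for illustration). A quorum is floor(n / 2) + 1 replicas, so the sketch counts how many replicas can be lost before QUORUM or LOCAL_QUORUM stops being reachable.

```python
from __future__ import annotations


def quorum(replicas: int) -> int:
    """Replicas that must respond for a quorum: floor(n / 2) + 1."""
    return replicas // 2 + 1


def tolerated_failures(rf_per_dc: dict[str, int]) -> dict[str, int]:
    """How many replicas of a row may be down before each level fails."""
    total = sum(rf_per_dc.values())
    result = {"QUORUM": total - quorum(total)}
    for dc, rf in rf_per_dc.items():
        result[f"LOCAL_QUORUM@{dc}"] = rf - quorum(rf)
    return result


def quorum_survives_dc_loss(rf_per_dc: dict[str, int], lost_dc: str) -> bool:
    """True if a cluster-wide QUORUM is still reachable with one DC gone."""
    total = sum(rf_per_dc.values())
    remaining = total - rf_per_dc[lost_dc]
    return remaining >= quorum(total)


# Hypothetical topology: two data centers, three replicas each.
topology = {"dc_london": 3, "dc_dublin": 3}
print(tolerated_failures(topology))
print(quorum_survives_dc_loss(topology, "dc_london"))
```

With this particular topology, losing a whole DC leaves three of six replicas, which is below the cluster-wide quorum of four, while LOCAL_QUORUM in the surviving DC is still satisfiable. That is the kind of detail worth checking against your own replication settings before the outage, not during it.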

But you should always prepare for chaos anyway! Failure will increase pressure on your remaining servers, latency will rise, and so on. And when things come back up, will it just work? Are you ready to get all the data back in sync? Did you forget about gc_grace_seconds? There are lots of variables and small details that are easy to overlook if you don’t plan ahead, and the middle of an incident is the worst time to discover you forgot them!
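As an illustration of why gc_grace_seconds matters when a node comes back, here is a rough rule-of-thumb sketch. It assumes Cassandra’s default values (gc_grace_seconds of 864000, i.e. 10 days, and a hint window of 3 hours); the function name and the three labels it returns are hypothetical, and your own runbook may well draw the lines differently.

```python
from datetime import timedelta

# Default gc_grace_seconds for a table: 864000 seconds (10 days).
DEFAULT_GC_GRACE = timedelta(seconds=864_000)
# Default hint window (max_hint_window_in_ms = 10800000): 3 hours.
DEFAULT_HINT_WINDOW = timedelta(hours=3)


def recovery_action(
    downtime: timedelta,
    gc_grace: timedelta = DEFAULT_GC_GRACE,
    hint_window: timedelta = DEFAULT_HINT_WINDOW,
) -> str:
    """Suggest how to bring a node back, based on how long it was down.

    Hypothetical rule of thumb, not an official procedure:
    - "rebuild": down for gc_grace or longer, so tombstones may already be
      purged elsewhere and the node could resurrect deleted data.
    - "repair": hints are no longer stored for it, so it needs a repair.
    - "hints": hinted handoff should cover the missed writes.
    """
    if downtime >= gc_grace:
        return "rebuild"
    if downtime > hint_window:
        return "repair"
    return "hints"


for outage in (timedelta(hours=1), timedelta(days=2), timedelta(days=11)):
    print(outage, "->", recovery_action(outage))
```

If your tables override gc_grace_seconds, or your hint window is configured differently, pass those values in instead of relying on the defaults assumed here.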

Core Recommendations for Cassandra Resilience

My experience tells me that you must take Cassandra failures seriously, and plan for them! Having a plan B is never a bad thing, and even a plan C. Also, make sure those plans work! So for this short introduction I will leave a few recommendations:

Test your system against Cassandra delivering a bad service (timeouts, high latency, etc.).
Set a “bare minimum” for your system to work (how low can we go on consistency, for example).
Test not only what happens when your system goes down, but also what happens when it comes back up!
Keep calm! Cassandra will help you!
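
The “bare minimum” idea from the list above can be sketched as a read that steps down through an agreed list of consistency levels and refuses to go below the last one. Everything here is a stand-in: ReadTimeout, read_with_floor and flaky_cluster are hypothetical names, the cluster is simulated, and the ordering of levels is just one possible choice. In a real application the run_query callable would wrap your driver and catch its own timeout or unavailable errors.

```python
from typing import Callable, Sequence, Tuple

# Strongest first. The last entry is the "bare minimum" the application accepts.
FALLBACK_LEVELS = ("QUORUM", "LOCAL_QUORUM", "ONE")


class ReadTimeout(Exception):
    """Stand-in for whatever timeout/unavailable error your driver raises."""


def read_with_floor(
    run_query: Callable[[str], object],
    levels: Sequence[str] = FALLBACK_LEVELS,
) -> Tuple[str, object]:
    """Try each allowed level in order; never go below the last one."""
    last_error = None
    for level in levels:
        try:
            return level, run_query(level)
        except ReadTimeout as err:
            last_error = err  # remember the failure, then try the next level
    raise RuntimeError("no allowed consistency level could be met") from last_error


def flaky_cluster(level: str) -> dict:
    """Simulated degraded cluster: only a single replica answers in time."""
    if level != "ONE":
        raise ReadTimeout(level)
    return {"id": 42}


print(read_with_floor(flaky_cluster))
```

Returning the level that actually succeeded lets the caller log or alert on degraded reads, which is useful when you test how your system behaves against a Cassandra that is delivering a bad service.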

Final Thoughts: Surviving the Outage

Overall, Cassandra is a tough and robust system. I’ve had major problems with network, storage, Cassandra itself, and more. In the end, Cassandra not only survived, it gave me no downtime. Every problem I had increased my knowledge and awareness of what I could expect. This led to planning for major problems (which did happen), and that planning, combined with the natural resilience of Cassandra, got me through those events without downtime.

Feel free to comment or discuss in the comment section below! Juicy details will be left for London!
