I have a database server that has these features:
With the list of features above, why don’t we all use Cassandra for all our database needs? This is the hype I hear at conferences and from some commercial entities pushing their version of Cassandra. Unfortunately, some people believe it. Especially now when many users of proprietary database technologies like Oracle and SQL Server are looking to get out of massive license fees. The (apparent) low cost of open-source in combination with the list of features above, make Cassandra very attractive to many corporate CTOs and CFOs. What they are missing is the core features they assume a database has, but are missing from Cassandra.
I am a database architect and consultant. I have been working with Cassandra since version 0.7. came out in 2010.
I like, and often promote Cassandra to my customers—for the right use cases.
Unfortunately, I often find myself being asked to help after the choice was already made and it turned out to be a poor use case for Cassandra, or they made some poor choices in their data modeling for Cassandra.
In this blog post I am going to discuss some of the pitfalls to avoid, suggest a few good use cases for Cassandra and offer just a bit of data modeling advice.
Cassandra projects tend to fail as a result of one or more of these reasons:
To be honest, it doesn’t help that Cassandra has a bunch of features that probably shouldn’t be there. Features leading one to believe you can do some of the things everyone expects a relational database to do:
Using any of the above features the way you would expect them to work in a traditional database is certain to result in serious performance problems and in some cases a broken database.
Another major mistake developers make in building a Cassandra database is making a poor choice for partition keys.
Cassandra is distributed. This means you need to have a way to distribute the data across multiple nodes. Cassandra does this by hashing a part of every table’s primary key called the partition key and assigning the hashed values (called tokens) to specific nodes in the cluster. It is important to consider the following rules when choosing you partition keys:
Typical real-world partition keys are user id, device id, account number etc. To manage partition size, often a time modifier like year and month or year are added to the partition key.
If you get this wrong, you will suffer greatly. I should probably point out that this is true in one way or another of all distributed databases. The key word here is distributed.
If you have a database where you depend on any of the following things– Cassandra is wrong for your use case. Please don’t even consider Cassandra. You will be unhappy.
If you are thinking about using Cassandra with any of the above requirements, you likely don’t have an appropriate use case. Please think about using another database technology that might better meet your needs.
Every database server ever designed was built to meet specific design criteria. Those design criteria define the use cases where the database will fit well and the use cases where it will not.
Cassandra’s design criteria are the following:
There is nothing in the list about ACID, support for relational operations or aggregates. At this point you might well say, “what is it going to be good for?” ACID, relational and aggregates are critical to the use of all databases. No ACID means no Atomic and without Atomic operations, how do you make sure anything ever happens correctly–meaning consistently. The answer is you don’t. If you were thinking of using Cassandra to keep track of account balances at a bank, you probably should look at alternatives.
It turns out that Cassandra is really very good for some applications.
The ideal Cassandra application has the following characteristics:
Some of my favorite examples of good use cases for Cassandra are:
Frequently, executives and developers look at the feature set of a technology without understanding the underlying design criteria and the methods used to implement those features. When dealing with distributed databases, it’s also very important to recognize how the data and workload will be distributed. Without understanding the design criteria, implementation and distribution plan, any attempt to use a distributed database like Cassandra is going to fail. Usually in a spectacular fashion.
Whether you’re considering an open source or commercial Cassandra deployment, planning to implement it, or already have it in production, Pythian’s certified experts can work with your team to ensure the success of your project at every phase. Learn more about Pythian Services for Cassandra.
Ready to handle massive data volumes with zero downtime?