The things I hate about Apache Cassandra
Intro
First, let me start by saying I do not hate Cassandra. I love Cassandra. In its place, Cassandra is a powerful tool, well designed to scale to millions of operations per second over geographically distributed locations while operating in a highly available manner. I have worked on Cassandra clusters that have operated for years without interruption. What everyone likes about Cassandra:
- Linear scaling to hundreds of nodes is a reality
- Online maintenance is the norm and downtime windows do not exist
- There are no masters and slaves, only nodes, all part of a truly democratic cluster
- Write times are typically equal to a network round trip between the client and a small number of other nodes local to each other
- Read times are only slightly slower, on the order of a millisecond or two over network time
- A failed node can go unnoticed for hours or even days if the cluster is not properly monitored
Java/JVM
Cassandra is written in Java. This stems in part from the time when it was written and the community’s choice of language at that time. It also has some very real advantages. Java programs are compiled into Java byte code, which is the same no matter what the host machine’s architecture is. While there are some performance enhancements to Cassandra that use native libraries where available, Cassandra can run on byte code alone without those extensions. This makes Java remarkably portable, and it also makes Cassandra easy to build and run on platforms as diverse as Raspberry Pis, Intel x86 machines, and IBM mainframes. But Java also brings with it some not-so-nice things.
Garbage collection
Java is an object-oriented language. Almost every popular modern computer language is either object-oriented or has object-oriented features. But the design of most object-oriented languages makes memory management a function of the language runtime, effectively taking it out of the hands of the developer. In theory, this eliminates memory leaks. Largely, it does. Nevertheless, due to some of the more unusual practices of some developers, it is still possible to get memory leaks where the garbage collector is unable to determine when objects should be released. Unfortunately, the garbage collector needs to lock out the application for short periods of time, called collection pauses, to clean up after the developer’s often wasteful use of objects. Fortunately, the Cassandra developers pretty much all know the tricks to minimize unnecessary object creation. Even with all that effort and planning, under some workloads GC pauses occasionally become a major factor in response time and cluster stability. When that happens, life can become stressful for the operations staff.
JMX
JMX (Java Management Extensions) is in many ways a developer’s dream come true. It makes metrics collection from an application almost painless. The Cassandra community has made extensive use of JMX, and the Cassandra server collects an immense wealth of metrics all the time, which are especially useful for monitoring the performance, health, and well-being of the server. Unfortunately, JMX has a couple of well-known undesirable side effects.
- The JMX port – when you assign the JMX port, the JVM (Java Virtual Machine) looks around at all the available TCP/IP stacks and binds it to them all. You cannot tell it which stack to bind to. Well, you can, but it ignores you.
- RMI (Remote Method Invocation) – think remote procedure call, but for object methods. The RMI protocol is a profligate slob in its use of objects. Make too many JMX calls and you are going to put pressure on the garbage collector, which I have already complained about. Fortunately, there are other ways to collect JMX metrics than by making RMI calls. But, sadly, not everyone seems to be aware of those alternatives, or those alternatives do not meet their needs.
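Two standard mitigations are worth noting here: keep JMX bound to localhost unless you truly need remote access, and poll it sparingly. The stock cassandra-env.sh that ships with Cassandra exposes both knobs; a sketch of the relevant settings (not a complete file):

```shell
# cassandra-env.sh (fragment): stock variable names from the shipped file
LOCAL_JMX=yes     # bind JMX to localhost only, rather than every TCP/IP stack
JMX_PORT="7199"   # the port the JVM claims on all interfaces when LOCAL_JMX=no
```

With LOCAL_JMX=yes, nodetool still works on the node itself, and remote metric collection can go through a metrics reporter instead of repeated RMI calls.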
Limits on the heap
The JVM stores the majority of its working data on the heap. There are whole books, or at least whole chapters of books, written about the Java heap. I am not going to go into detail about how it works here, but it turns out that, due to two things, our old friend the garbage collector and heap addressing, you want to avoid making the heap too large, which limits the amount of work a Cassandra node can perform on a machine. There are ways around this, but they all involve added operational complexity, of which I am not a big fan.
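One concrete reason for the size limit is compressed object pointers: below roughly 32 GB the JVM can use 4-byte object references, and above that it must fall back to 8-byte ones, so a modestly larger heap can actually hold less data. A minimal sketch of the trade-off (the threshold constant is approximate, not an exact JVM figure):

```python
# Sketch: why Cassandra heaps are usually capped around 31 GB.
# Above ~32 GB the JVM can no longer use compressed (32-bit) object
# pointers, so a 40 GB heap can hold *less* data than a 31 GB one.
COMPRESSED_OOPS_LIMIT_GB = 32  # approximate HotSpot threshold

def effective_pointer_bytes(heap_gb: float) -> int:
    """Object reference size the JVM will use for a given heap size."""
    return 4 if heap_gb < COMPRESSED_OOPS_LIMIT_GB else 8

for heap in (8, 31, 40):
    print(heap, "GB heap ->", effective_pointer_bytes(heap), "byte references")
```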
Single storage engine
Cassandra uses a single Log-Structured Merge (LSM) Tree storage engine. For many use cases, especially those where write requests exceed read requests by an order of magnitude or more, this is in many ways an ideal storage engine. But for many other use cases, especially where reads outnumber writes by an order of magnitude or more, it is far from ideal. Other databases like MySQL/MariaDB and MongoDB have solved the single storage engine dilemma by putting in a pluggable storage engine interface. Cassandra does not have such an interface, and as a result building and integrating an alternative storage engine is a major challenge. It has been done by some users, and as an exercise to demonstrate the capability of some new technology, but the idea has not yet gained enough traction for new storage engines to appear for Cassandra. I really would like to see this change someday, but I have learned not to hold my breath on the subject. Blue is just not a good look for my face.
Limited compaction strategy options
In a real sense, this is an extension of the storage engine topic, but here my dreams might have a chance of bearing fruit before I get too old to care anymore. First, a note on compaction strategies. LSM Tree storage engines need a compactor to prevent running out of disk and to keep reads from slowing to the point of unusability. A compactor goes through the data stored on disk and creates copies of the existing files, dropping old versions and deleted data from the new copies. One measure of how beneficial a storage engine can be is its effective write amplification, which is in effect the number of times a piece of data has to be rewritten in a given period of time. Write amplification exists in all databases, but it is particularly important to measure in LSM Trees. Cassandra comes with three compaction strategies:
- Size Tiered Compaction Strategy (STCS) – the original strategy. It has moderate write amplification, but its read path characteristics are not great, and it requires a lot of empty available disk space to manage the copy part of the compaction cycle.
- Leveled Compaction Strategy (LCS) – solves the extra space and read performance problems of STCS, but its write amplification is very high. It should only be used in cases where reads exceed writes by at least an order of magnitude and write volume is at most moderate.
- Time Window Compaction Strategy (TWCS) – similar to STCS except that data on disk is stored in windows or buckets, and data from one window is never compacted with another window. It is very useful for storing time-series data as long as all the data arrives within the window. The windows are based on data arrival time, not the timestamp created by the clients writing the data. As a result, if a feed gets interrupted and some of the data does not arrive until after the window closes, it goes into the “wrong” window. This is not normally a major problem, since Cassandra does not actually assume anything about the data stored, including the primary key. But it can make deleting the data a challenge. TWCS has the lowest write amplification, but it also has some severe limitations in how it can be used.
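To make write amplification concrete, here is a toy model of size-tiered compaction (my simplification, not Cassandra's actual STCS code): flush fixed-size sstables and merge whenever four accumulate in a size tier, counting every byte written along the way:

```python
# Toy size-tiered compaction model: flush fixed-size sstables, and
# whenever `fanout` sstables share a size tier, merge them into one.
# Write amplification = total bytes written / logical bytes ingested.
def stcs_write_amplification(flushes: int, flush_mb: int = 100, fanout: int = 4) -> float:
    written = 0
    tiers = {}  # sstable size -> count of sstables at that size
    for _ in range(flushes):
        written += flush_mb                      # the initial flush
        size = flush_mb
        tiers[size] = tiers.get(size, 0) + 1
        while tiers.get(size, 0) >= fanout:      # a tier filled up: merge it
            tiers[size] -= fanout
            size *= fanout                       # merged sstable (no dedup assumed)
            written += size                      # the merged data is rewritten
            tiers[size] = tiers.get(size, 0) + 1
    return written / (flushes * flush_mb)

# 256 flushes = 4 levels of merging, so each byte is written 5 times total.
print(stcs_write_amplification(256))  # -> 5.0
```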
General lack of great operational tooling
Database servers like PostgreSQL and MySQL/MariaDB have an enormous number of open-source tools available in their respective communities. Cassandra, unfortunately, does not. There are some, but not nearly as many as I would like to see. This is partly because Cassandra is a bit more than 10 years younger than MySQL and more like 25 years younger than Postgres. The fact that Cassandra is also more of a niche database technology plays a role as well. Tools we need in the community:
- Backup and restore – I would like to see a good backup and restore tool. The Last Pickle has adopted Medusa and is pushing it as the tool of choice. Unfortunately, we have tried to use it in several of our customers' environments with highly variable success. Conceptually it should work. In practice, it's hard to get it to work.
- Cassandra-aware automation tools – C-Star from The Last Pickle kind of works but also has limitations that often frustrate me. See: https://thelastpickle.com/blog/2018/10/01/introduction-to-cstar.html The DataStax K8s operator for Cassandra sounds like it will be useful in the future, but it's very new. We at Pythian have a variety of Ansible scripts, and they too have their limitations.
- Repair management – here we have a pretty good tool. Reaper, also supported by The Last Pickle folks, does a great job. See: http://cassandra-reaper.io/
- Database health – there are no community tools I am aware of that can poke around a Cassandra cluster, report back where there are problems, and recommend fixes. Note: we do have our own internal Pythian tool that does some of this, but it needs a lot more work.
Large Partitions
Cassandra stores data on disk for a table in partitions. A partition is identified by taking part, or in some cases all, of the primary key and hashing it to get a token, which is then assigned to some of the nodes in the cluster based on how many copies you want to keep. Generally, when the partition key equals the primary key, partitions won’t get very large, but often we want more than one row of data stored together with others for performance and easy-access reasons. When you store a lot of rows in a single partition, more than 1 million columns or more than 100 megabytes of data, problems start to develop. Work has been done over the years to make 100 megabytes and 1 million columns not so much of a problem, but it still becomes a problem somewhere between 100 megabytes and 1 gigabyte, or between 1 million and 10 million columns. Regardless of the impact on Cassandra’s internal performance, when you have a few very large partitions in your table, your workload won’t spread evenly across your nodes, and you get performance problems that way as well. There are ways to break up large partitions, but they almost always come at a price: making the developer’s life a bit more difficult. MongoDB’s chunking mechanism solves this problem but introduces different ones, so I don’t see an easy fix.
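The usual way to break up a large partition is to add a synthetic "bucket" column to the partition key, e.g. `PRIMARY KEY ((sensor_id, bucket), reading_time)`, so no partition outgrows one bucket's worth of data. A minimal sketch of deriving such a bucket (the column names are hypothetical, not from the article):

```python
# Sketch: derive a per-day bucket value for a time-series partition key
# such as ((sensor_id, bucket), reading_time). Capping a partition at one
# day of data keeps it well under the danger zone, at the price of the
# client having to query several buckets for multi-day reads.
from datetime import datetime, timezone

def day_bucket(ts: datetime) -> str:
    """Bucket component of the partition key: one bucket per UTC day."""
    return ts.strftime("%Y-%m-%d")

ts = datetime(2020, 5, 17, 14, 30, tzinfo=timezone.utc)
print(day_bucket(ts))  # -> 2020-05-17
```

This is exactly the "price" mentioned above: the developer now has to know the bucketing scheme to read the data back.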
Secondary indexes
In a traditional database, secondary indexes are generally used for two different purposes.
- Alternate access path to rows – For example, you might have a primary key of user id with secondary indexes on first and last name. If you have any of those items you have a chance of finding the row you want.
- Filtering – You might want to read only rows with a value of “new” in a status column. A secondary index allows you to avoid reading the whole table, instead letting the server pull up only those rows with the “new” value in the status column.
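The catch in Cassandra, and the source of the index trouble the next section refers back to, is that its secondary indexes are node-local: a query by an indexed column has to be fanned out to every node and the results merged. A toy model of that scatter-gather (the data is invented for illustration):

```python
# Sketch: each Cassandra node indexes only the rows it owns, so a query
# by an indexed column (here, status) must visit every node.
nodes = [
    {"u1": "new", "u2": "done"},   # node 1: user_id -> status
    {"u3": "new"},                 # node 2
    {"u4": "done", "u5": "new"},   # node 3
]

def query_by_status(status: str) -> list[str]:
    hits = []
    for local_rows in nodes:       # one round trip per node, every time
        hits += [uid for uid, s in local_rows.items() if s == status]
    return sorted(hits)

print(query_by_status("new"))  # -> ['u1', 'u3', 'u5']
```

Three nodes means three round trips here; three hundred nodes means three hundred, which is why indexed queries get slower as the cluster grows.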
Materialized views
When Cassandra 3.0 was released, a new feature was announced called materialized views. In effect, materialized views are a combination of a traditional materialized view, much like you might find in Oracle but without the join part, and a global secondary index. See above for all the trouble with indexes. The community cheered. Then they found out there were a few bugs. Then they found out the bugs were not exactly bugs, but instead an incompletely coded feature. Incompletely coded because the developers were not sure how to finish the code for the feature. So, we have a broken feature that everyone wants to use but everyone hates once they realize it is broken. Sigh. The Scylla community claims to have implemented both global secondary indexes and materialized views correctly. Maybe some of the Cassandra community developers can look at their code (in C/C++) and translate it into Java. For now, if you are tempted to use materialized views, don’t. You will not be happy.
Counters
Cassandra was designed from the very beginning to implement two legs of the CAP theorem (Consistency, Availability, Partition tolerance). The theorem argues you can never get more than two. The Cassandra development team chose Availability and Partition tolerance. This means that if any node or nodes of the cluster are available, then queries against the cluster will get a response. If the data is not currently available, the query will say so, but Cassandra will not fail to respond. What Cassandra sacrifices to be able to do this is Consistency, i.e. the cluster is never fully consistent unless all data modifications stop. The cluster can be described as eventually consistent. This basic design decision impacts everything that happens in Cassandra. By telling Cassandra you want a quorum of responses to your query from the nodes that are expected to have the data, you will get essentially consistent data. But some things are not as easy to do as just asking for quorum responses. Counters, to be accurate, require an atomic operation. Cassandra, as part of its eventually consistent design, does not do atomic operations. So… how do you do counters? The Cassandra community “solved” the problem with a special counter type that performs a locked read-before-write on the replicas to apply each update, a sort of atomic operation. It is expensive, at least 6X the cost of a simple write, and it's only mostly accurate. For increments, it is almost perfect. For decrements, not so much. It is a bad feature. Do not use it. If you really need to do counting, ask me about it some time and I can suggest several low-cost alternatives that are far better.
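I will not speak for the author's preferred alternatives, but one common low-cost pattern is event-sourced counting: write one immutable row per increment, keyed by an event id so retries are idempotent, and sum at read time or in a periodic roll-up. A toy sketch (a dict stands in for the events table):

```python
# Sketch: event-sourced counting. Each increment is a plain insert keyed
# by (counter_id, event_id); a retried write overwrites the same key
# instead of double-counting, and no read-before-write is needed.
events = {}  # (counter_id, event_id) -> delta; stands in for a table

def record(counter_id: str, event_id: str, delta: int) -> None:
    events[(counter_id, event_id)] = delta

def value(counter_id: str) -> int:
    """Sum the events at read time (or in a periodic roll-up job)."""
    return sum(d for (cid, _), d in events.items() if cid == counter_id)

record("page_views", "evt-1", 1)
record("page_views", "evt-2", 1)
record("page_views", "evt-1", 1)   # retried write: counted once
print(value("page_views"))         # -> 2
```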
Light Weight Transactions
Light Weight Transactions (LWT) are not transactions. Not in any way, shape, or form. What they are is a way to synchronize all of the nodes in the cluster owning a particular token so they can all agree on who gets to change the data owned by that token. You can think of them as a kind of short-lived semaphore. They are implemented using the PAXOS consensus algorithm; for details on PAXOS see: https://en.wikipedia.org/wiki/Paxos_(computer_science) The principal difference from counters is that LWT works flawlessly. Well, flawlessly, but it is still awfully expensive, and it seems to cause havoc in a Cassandra cluster when used with any frequency beyond the occasional. For a very high-level overview of how Cassandra does both counters and Light Weight Transactions, see: https://blog.pythian.com/lightweight-transactions-cassandra/
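What an LWT actually buys you is single-row compare-and-set semantics, nothing more. A toy model of `INSERT ... IF NOT EXISTS` behavior (names are illustrative; no cluster involved):

```python
# Sketch: LWT as a compare-and-set, i.e. a short-lived semaphore, not a
# multi-statement transaction. Claiming succeeds only if nobody already has.
row = {"lock_owner": None}  # stands in for one Cassandra row

def acquire(owner: str) -> bool:
    """INSERT ... IF NOT EXISTS semantics: apply only if unclaimed."""
    if row["lock_owner"] is None:
        row["lock_owner"] = owner
        return True          # [applied] = true
    return False             # [applied] = false; someone else holds it

print(acquire("worker-a"))   # -> True
print(acquire("worker-b"))   # -> False
```

Behind each of these calls Cassandra runs a full PAXOS round among the replicas, which is where the expense comes from.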
Batch
Batch refers to performing a series of operations, often a series of inserts, as part of a single database call. For most types of database servers, it is generally much more efficient than making a series of database calls to accomplish the same thing. Cassandra has a batch feature that on the surface looks similar in function to other databases' batch operations. In addition, if you read the DataStax documentation, you will see right there, in the description of how batch works, a statement about operations within a batch being atomic. If you read carefully, you will be disappointed. The whole atomic thing is only valid if you do everything right. And of course, many developers mostly do not do it right. No! Cassandra batch operations are not transactions. There is no roll-back, no isolation, and they are, at best, best effort. No! Cassandra batch operations are generally not atomic. They are only sort-of atomic if you do everything correctly, which, as I have already said, developers very often do not. Using batch operations to touch multiple tables with different partition keys, or even a single table with more than one unique partition key, breaks the sort-of-atomic thing and, in addition, adds undesirable stress to the cluster. Most uses of the Cassandra batch feature I have seen over the years have been problematic.
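A simple defensive habit is to group statements by partition key on the client and send one batch per partition, so the sort-of-atomic guarantee can actually hold. A sketch of that grouping step (the statement pairs are illustrative, not a driver API):

```python
# Sketch: keep Cassandra batches single-partition by grouping client-side.
# A batch whose statements all share one partition key can be applied
# atomically; mixing partition keys in one batch loses that and stresses
# the coordinator.
from collections import defaultdict

def group_by_partition(statements):
    """statements: iterable of (partition_key, cql_text) pairs.
    Returns one statement list (one batch) per partition key."""
    batches = defaultdict(list)
    for pkey, cql in statements:
        batches[pkey].append(cql)
    return dict(batches)

stmts = [("u1", "INSERT a"), ("u2", "INSERT b"), ("u1", "INSERT c")]
print(group_by_partition(stmts))  # -> {'u1': ['INSERT a', 'INSERT c'], 'u2': ['INSERT b']}
```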
The Cassandra Slow Query log
Unless you are using Cassandra 3.10 (which you should not be) or 3.11, there is no slow query log. But if you are using 3.11, then you do have a slow query log. Compared to MySQL’s slow query log, it's kind of pathetic, but it's better than what was available before 3.11. My biggest complaint at this point is the Cassandra logger’s tendency to deduplicate multiple log entries into a message saying more than X log entries have been dropped. Hey, can you tell me how many?
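For what it's worth, the 3.11 slow query threshold is a single knob in cassandra.yaml (fragment; 500 ms is the shipped default as I recall):

```yaml
# cassandra.yaml (3.11+): queries taking longer than this are logged
# as slow; set to 0 to disable the slow query log entirely.
slow_query_log_timeout_in_ms: 500
```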
Repairs
Cassandra’s eventually consistent architecture invites data loss. Even in fully consistent databases, the risk of data loss is a real problem, and much code has been written to protect against it. With eventual consistency, it would appear that the Cassandra community is asking to lose data. They are not. There are three anti-entropy mechanisms in Cassandra to avoid data loss. They are:
- Hinted Handoff – when a coordinator node is unable to deliver a write request to destination nodes, or some of the nodes do not respond in time, the coordinator node stores the write request in a hinted handoff file and periodically retries the request.
- Read Repair – When quorum reads return mismatched data to the coordinator node, the coordinator determines the most recent value, returns it to the client, and also writes the corrected data back to the node(s) that supplied stale data.
- Repair – This is a background process, launched via JMX, that compares data in specific token ranges and tables between nodes that should have the same data. When a discrepancy is discovered, the corrected data is written to the node with missing or out-of-date data.
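The reconciliation step behind read repair can be sketched in a few lines: keep the value with the newest write timestamp and note which replicas returned something older (a toy model, not Cassandra's internals):

```python
# Sketch of read-repair reconciliation: the coordinator keeps the value
# with the newest write timestamp and writes it back to any replica that
# returned something older.
def reconcile(responses):
    """responses: dict of replica -> (value, write_timestamp).
    Returns (winning_value, replicas_needing_repair)."""
    winner = max(responses.values(), key=lambda vt: vt[1])
    stale = [r for r, vt in responses.items() if vt != winner]
    return winner[0], stale

value, repairs = reconcile({
    "node1": ("blue", 100),
    "node2": ("green", 250),   # newest write wins
    "node3": ("blue", 100),
})
print(value, repairs)  # -> green ['node1', 'node3']
```

The same last-write-wins rule is why accurate client clocks (or server-side timestamps) matter so much in Cassandra.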