More effective anti-entropy repair in Cassandra
1. IntroductionCassandra offers three different repair mechanisms to make sure data from different replicas are consistent: Hinted Hand-off, Read-Repair, and Anti-Entropy Repair. The first two mechanisms are kind of "automatic" mechanisms that will be triggered internally within Cassandra, depending on the configuration parameter values that are set for them either in cassandra.yaml file (for Hinted Hand-off) or in table definition (for Read-Repair). The last mechanism (Anti-Entropy Repair), however, needs to be triggered with manual intervention, by running "nodetool repair" command (possibly with various options associated with it). Despite the "manual" nature of Anti-Entropy repair, it is nevertheless necessary, because the first two repair mechanisms cannot guarantee fixing all data consistency scenarios for Cassandra. For example, 1) what if a node is down longer than "max_hint_window_in_ms" (which defaults to 3 hours)? and 2) For deletes, Anti-Entropy repair has to be executed before "gc_grace_seconds" (defaults to 10 days) in order to avoid tombstone resurrection. At the same time, however, due to how Anti-Entropy repair is designed and implemented (Merkle-Tree building and massive data streaming across nodes), it is also a very expensive operation and can cause a big burden on all hardware resources (CPU, memory, hard drive, and network) within the cluster. In this case, how to run Anti-Entropy repair effectively becomes a constant topic within Cassandra community. In this post, I'd like to explore and summarize some of the techniques that can help achieve a more effective Anti-Entropy repair for Cassandra.
2. What is the issue of an Anti-Entropy RepairBefore Cassandra version 2.2, a sequential full Anti-Entropy repair is the default behavior when "nodetool repair" command (with no specific options) is executed. When this happens, 1) The node that initiates the operation, called coordinator node, scans the partition ranges that it owns (either as primary or replica) one by one. For each partition range, it sends the request to each of the peer/replica nodes to build a Merkle tree. 2) The peer/replica node scans all SSTables and a major, or validation, compaction is triggered, which reads every row in the SSTables, generates a hash for it, and then adds the result to a Merkle tree. 3) Once the peer node finishes building the Merkle tree, it sends the result back to the coordinator. The coordinator node compares every Merkle tree with all other trees. If difference is detected, data is exchanged between differing nodes. Looking at this highly summarized procedure, there are a few things that immediately caught my attention: First, because building Merkletree requires hashing every row of all SSTables, it is a very expensive operation, stressing CPU, memory, and disk I/O. Second, when Merkletree difference is detected, network bandwidth can be overwhelmed to stream the large amount of data to remote nodes. Lastly, although not obvious, it is a worse problem in my opinion, which is that this operation can cause computation repetition and therefore waste resources unnecessarily. Please see this post for an example of how this can happen. Based on the high cost related with it, a default (pre-Cassandra 2.2) sequential full Anti-Entropy repair is, in practice, rarely considered to be a routine task to run in production environment. Actually, a common trick that many people do with Cassandra repair is simply touching all data and let read-repair do the rest of work. However, there do have situations when an Anti-Entropy repair is required, for example, to recover from data loss. What can we do in these cases?
3. What can help to achieve a more effective Anti-Entropy repair?Over the time, different options for "nodetool repair" command have been introduced in different Cassandra versions. These options represent different techniques that can help achieve more effective Anti-Entropy repair, which I'll go through in more details in this section. Meanwhile, I will also try to clarify some of the confusion that new Cassandra users might have around the different nodetool repair command options (and related jargon's) of different Cassandra versions.
3.1 Primary range repairIn Cassandra, each node manages several (token) ranges of data. One of them is the primary range which is the token range that is assigned to the node according to the token setup within the ring. Other ranges of data are actually replica of primary ranges from other nodes. Running "nodetool repair" with option "-pr" (or "--partition-range") on a node means that the node only repairs the data of the primary range, but not other ranges managed on this node. The default behavior of a repair (without this option) is to repair all ranges of data managed by a node. When using this option, it avoids the cost of doing Merkle tree calculation on non-primary range data. It also helps reduce excessive data streaming across the network. One caveat of using this option is that since each node only repairs one range of data that is managed by a node, this option needs to run on ALL nodes in the ring in order to repair all the data. This option is available in very early release (0.x) of Cassandra.
3.2 Sequential repair"nodetool repair" option " -snapshot" (or " --with-snapshot") means a sequential repair (also called snapshot repair) and is a feature introduced in Cassandra version 1.2 and made as default in Cassandra version 2.0. Actually, in DataStax document for Cassandra 2.0 and later, you won't find this option. In Cassandra 2.0 and beyond, you can specify option "-par" (or "--parallel") to tell Cassandra to run in parallel. What does this really mean? As we mentioned in section 3.1, each node manages several different ranges of data, either as primary or replica. When "-par" option is used, it instructs Cassandra to run repair on all these ranges at the same, which means the expensive repair operation (especially the Merkletree building part) happens on multiple nodes concurrently. This could be problematic and may slow down the entire cluster. But when using "-snapshot" option (or default in Cassandra 2.0 and beyond), for each range of data, a snapshot is first taken and the repair operation is executed on the snapshot sequentially. This means that at any time for a given replica set of data, only one replica is being repaired, allowing other replica to satisfy the application requests in a more performant way.
3.3 Incremental repairSince Cassandra version 2.1, a new option "-inc" (or "--incremental ") is introduced for incremental repair. Starting from Cassandra version 2.2, this option becomes the default option and at the same time, option "-full" (or "--full") is used to specify a full repair. The option of incremental repair is a great feature that will help overcome all 3 issues as listed in Section 2. Instead of building a Merkle tree out of all SSTables (repaired or not), this option only builds the tree out of un-repaired SSTables. Therefore,
- The size of the Merkle tree to be built is way smaller and therefore requires much less computing resources
- When doing Merkle tree comparison, less data will be compared and potentially streamed over the network
- Already repaired data don't need to be computed and compared again, thus avoiding a lot of unnecessary repetition.
3.4 Sub-range repairSo far in this post when talking about repairing a range of data, it means the entire range (either primary or non-primary), which is sometimes also called as endpoint range. Beginning with Cassandra version 1.1.11, "nodetool repair" has options of "-st" (or "--start-token") and "-et" (or "--end-token") to specify a particular sub-range of data to repair. Conceptually, sub-range repair is much like primary range repair, except that each sub-range repair operation focuses even smaller subset of data. So in general sub-range repair shares much of the pros and cons as primary range repair. One key benefit of using sub-range repair is the freedom of specifying the repair granularity through command line, which gives the DevOps team much more flexibility regarding how to set up a repair schedule to best match the cluster's workload status. It is also doable to consider parallel running of multiple sub-range repairs on different sets of data at the same time, either on the same node, or on different nodes. This option has actually been deemed as one of the advanced repair techniques, as per post: Advanced repair techniques Despite the flexibility of this technique, it has to be emphasized that no matter how the repair schedule is set for sub-range repairs, all ranges of data in the ring has to be repaired at least once before gc_grace_seconds limit is reached.
3.4.1 How to calculate valid sub-range token valuesWhen invalid sub-range values are provided to the "-st" or "-et" option of "nodetool repair" command, most likely the command will fail and throw errors about "invalid tokens". So in order to make sub-range repair work effectively, we need a systematic method that can generate valid sub-range token values. We know that in Cassandra, data spans multiple nodes using a token ring. Each node is responsible for one or more slices of that ring. Each slice is a token range that has the start and end point. By this understanding, a natural way to generate valid sub-range token values would be: 1) find out all token ranges associated with one node in the entire ring; 2) for each token range associated with the node, divide the range into smaller chunks of sub-ranges. The start and end point of these sub-ranges would be valid values to be fed into the "-st" or "-et" option of "nodetool repair" command. The actual method and granularity to divide a node's token range into smaller sub-ranges is where the flexibility comes from and can be adjusted accordingly to best suit the needs of the system. There are different ways to find token ranges that are associated with a Cassandra node. Some of them are summarized below:
- Run "nodetool ring" command. "nodetool status" command will also do for single-token set-up.
- Run CQL query against "local" or "peers" tables in "system" keyspace to get the token values associated with a host.
- Cassandra client drivers provides APIs to get such information as well. For example, the "Metadata" class (in package com.datastax.driver.core) of the Java client API provides the following two methods:
* public Set<TokenRange> getTokenRanges() :
Returns the token ranges that define data distribution in the ring.
* public Set<TokenRange> getTokenRanges(String keyspace,Host host) :
Returns the token ranges that are replicated on the given host, for the given keyspace.
- Based on the thrift Java API describe_splits call, an open-source utility tool called "cassandra-list-subranges" has been developed to list valid sub-range values for a specific keyspace and table on a particular Cassandra node. For details of this tool, please check the GitHub repository page at https://github.com/pauloricardomg/cassandra-list-subranges
3.4.2 DataStax Enterprise (DSE) repair serviceIf the Cassandra cluster is running under DataStax Enterprise version, the OpsCenter provides a "repair service" starting from version 4.0. The way that this service works is to continuously repairing small chunks of data (using sub-range repair) in a cluster in the background until the entire cluster is repaired and then it starts the next cycle of processing. From the OpsCenter Window, there is only one parameter to specify the maximum amount of time it takes to repair the entire cluster once, which is typically less than gc_grace_seconds setting. The default is 9 days, compared with default 10 days of gc_grace_seconds setting. More advanced setting, such as the parallelism of concurrent repairs, the maximum pending repairs allowed to be running on a node at one time, and etc. can be set either in opscenterd.conf file (for all clusters) or in cluster_name.conf file (for a specific cluster). More detailed information about DSE repair service and its configuration can be found in the following documents:
3.5 Data-Center vs. Cluster wide repairSince Cassandra version 1.2, Cassandra starts to provide options for "nodetool repair" to offer the following repair capabilities regarding repair locality/scope:
- Cassandra 1.2: "-local" (or "--in-local-dc"), only repair nodes in the same data center as the node on which the "nodetool repair" command is executed.
- Cassandra 2.0: "-dc <dc_name>" (or "--in-dc <dc_name>"), repairs nodes in the named data center
- Cassandra 2.1:
- "-dcpar" (or "--dc-parallel"), repairs data center in parallel;
- "-hosts" (or "--in-hosts"), repairs specified hosts
4. ConclusionIn this post, we discussed in details some of the techniques that can help with a more effective Anti-Entropy repair in Cassandra. To summarize:
- Primary range repair is recommended for maintenance purpose and when it is used, it has to be executed on all nodes in the ring one by one to make sure the whole data range in the ring is covered.
- When available (Cassandra version 2.1 and beyond), incremental repair will bring the biggest benefit compared with other techniques. But please be aware of the caveats related with it. Please also be noted that with the bug as specified in CASSANDRA-9935, incremental repair may have issues with Level Compaction Strategy (LCS).
- Sub-range repair can be very useful, especially when incremental repair is not an option. But please make sure that whatever schedule that is set up using this technique has to make sure all range of data to be finished within gc_grace_seconds limit.
- If a repair is needed to recover from data loss, a sequential, full repair is needed. Neither primary range repair nor incremental repair technique works properly for this purpose. But due to the high cost associated with this operation, sometimes it might be faster to simply wipe out the data directory on a node and let it do bootstrapping.