Incremental Repair: Problems and a Solution
Because incremental repairs can significantly reduce the time and IO cost of performing a repair, they can seem like a great idea. However, practical implementation carries a few pitfalls that can cause severe damage to a production cluster, especially when using LCS (leveled compaction strategy) or DTCS (date-tiered compaction strategy).
Incremental repair relies on an operation called anticompaction. Once an incremental repair session ends, each SSTable involved is split into two SSTables: one containing the data that was repaired in the session, and another with the remaining unrepaired data. The new SSTable containing repaired data is marked with a repairedAt timestamp. On the next incremental repair, Cassandra skips SSTables whose repairedAt timestamp is higher than 0 and compares only the unrepaired data.
Incremental repair in production can cause a lot of nightmares, and it has caused many delays for me when working on clients' clusters. I have seen clients running incremental repairs both by mistake and intentionally. In both cases, I have had to revert it and move my clients back to subrange repair.
Problems with incremental repair:
1. Overstreaming, especially when using LCS
While anticompaction is running during an incremental repair, normal compaction can still pick up an SSTable involved in the repair. The SSTable may be compacted away on one node but not on the others, which leaves its repaired / unrepaired status inconsistent across nodes: the data is marked as repaired on the other nodes but not on that one. The next incremental run can then generate a large amount of overstreaming. This bug, reported in CASSANDRA-9143, badly affects tables using LCS. Leveled compaction creates SSTables of a fixed, relatively small size (160 MB by default in recent Cassandra versions, 5 MB in early ones), which are grouped into "levels." During repair, LCS tables can accumulate tens of thousands of small SSTables in L0, which can affect the entire cluster and may even bring the node down.
2. Significant increase in disk usage because of anticompaction
Anticompaction rewrites all SSTables on disk to separate repaired from unrepaired data. The first incremental repair can therefore take a long time and create a lot of SSTables, which can lead to high disk utilization.
3. Tombstones
SSTables affected by incremental repair are marked as repaired. In subsequent compactions, these SSTables are compacted separately from SSTables that have not been repaired. If tombstones are in unrepaired SSTables and the shadowed data is in repaired SSTables (or vice versa), the data cannot be dropped, because Cassandra will not compact repaired and unrepaired SSTables together. Tombstones that linger this way can lead to additional problems such as degraded read performance.
Finding the presence of incremental repairs
If incremental repairs are or ever were turned on, the data can be spread across SSTables with different statuses (repaired and unrepaired).
- In versions 2.2+, incremental repair is on by default. You'll need to use the --full flag with repairs to avoid it. However, in the latest version, full repair also performs anticompaction so the problem remains even when incremental repairs are off.
- To check whether an existing SSTable has been incrementally repaired, use the sstablemetadata tool and view the "Repaired at:" line. 0 means the SSTable has never been incrementally repaired; any other value means it has been incrementally repaired.
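The "Repaired at:" check can be scripted when there are many SSTables to inspect. The following is a minimal sketch, assuming sstablemetadata is on the PATH and prints a line that starts with "Repaired at:" followed by the value. The helper names repaired_at and repair_status are hypothetical and not part of Cassandra.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative names, not Cassandra tools).
# Both read sstablemetadata output from stdin.

# Print the value that follows "Repaired at:" (first token only, so a
# trailing human-readable date, if any, is ignored). Prints nothing if
# the line is absent.
repaired_at() {
    awk '/^Repaired at:/ { sub(/^Repaired at:[ \t]*/, ""); print $1; exit }'
}

# Classify the SSTable: "unrepaired" when the value is 0, "repaired"
# for any other value, "unknown" when no "Repaired at:" line is found.
repair_status() {
    value=$(repaired_at)
    case "$value" in
        "") echo "unknown" ;;
        0)  echo "unrepaired" ;;
        *)  echo "repaired" ;;
    esac
}

# Example use (assumption: sstablemetadata is on PATH; path is a placeholder):
#   for f in <data directory path>/<keyspace>/*/*Data.db; do
#       printf '%s %s\n' "$f" "$(sstablemetadata "$f" | repair_status)"
#   done
```

Any SSTable reported as "repaired" indicates that incremental repair (or anticompaction after a full repair) has touched that table at some point.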
Procedure to revert incremental repairs
Execute these steps on every node, one node at a time. Cassandra must be stopped on the node while its SSTables are marked as unrepaired.
1. Take a snapshot of the keyspace / table for which you are reverting incremental repairs.
    nodetool -u <user> -pw <password> snapshot <Keyspace>
2. Flush and drain the data before stopping the node so there is no in-memory data left.
    nodetool -u <user> -pw <password> flush
    nodetool -u <user> -pw <password> drain
3. Stop Cassandra on the node.
    nodetool -u <user> -pw <password> stopdaemon
4. Use sstablerepairedset to mark all SSTables as unrepaired and start Cassandra.
    find <data directory path>/<keyspace> -iname "*Data.db" > find_data_paths.txt
    sudo <cassandra installation directory>/tools/bin/sstablerepairedset --really-set --is-unrepaired -f find_data_paths.txt
    sudo runuser -l cassandra -c <cassandra installation directory>/bin/cassandra
5. Check for any error / warning related to the procedure.
    grep -E 'ERROR|WARN' <debug file path>/debug.log
6. Run tablestats and check for percent repaired. The value should be 0.
    nodetool -u <user> -pw <password> tablestats <Keyspace> | grep Percent
It's important to remember that Cassandra will never compact repaired and unrepaired SSTables together. If you simply stop performing incremental repairs once you have started, the SSTables already marked as repaired stay segregated and data on disk can become outdated. Subrange repair is the only option that avoids anticompaction and the other problems created by incremental repair, as running a repair with --full also triggers anticompaction. You can run subrange repair with the help of Reaper or a subrange repair script. We hope that future releases will make incremental repair better and provide more advantages.
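The final tablestats check from the revert procedure can be automated instead of read by eye. This is a rough sketch, assuming tablestats prints one "Percent repaired: <number>" line per table. The function name all_unrepaired is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name, not a Cassandra tool).
# Reads `nodetool tablestats` output from stdin and exits with status 0
# only if at least one "Percent repaired" line is present and every such
# line reports a value of zero. Exits with status 1 otherwise.
all_unrepaired() {
    awk -F': *' '
        /Percent repaired/ { seen = 1; if ($2 + 0 != 0) bad = 1 }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Example use (placeholders as in the procedure):
#   nodetool -u <user> -pw <password> tablestats <Keyspace> | all_unrepaired \
#       && echo "all tables unrepaired" \
#       || echo "some tables still report repaired data"
```

Treating "no Percent repaired line found" as a failure is a deliberate choice here, so that an empty or unexpected tablestats output is not mistaken for a successful revert.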