Every few months I have a customer come to me with the following concern: my compactions for one of my Cassandra tables are stuck, or my repairs fail when referencing one of the nodes in my Cassandra cluster. I take a look, or just ask a couple of questions, and it becomes apparent that the problem is a broken SSTable file. Occasionally, they will come to me in a panic because they have looked at their logs and discovered a broken SSTable file.

Don't panic. A broken SSTable file is not a crisis. It does not represent lost data or an unusable database. Well, that's true unless you are using a Replication Factor (RF) of ONE. The cluster will still operate, and queries should be working just fine. But it does need to be fixed. There are four ways to fix the problem, which I will explain in this post. One of them, I freely admit, is not one of the community's recommended approaches, but it is often the simplest and quickest, with minimal downside risk.

Before I explain the ways to repair an SSTable, I will spend a few lines explaining what an SSTable file is. Then I will walk you through the four options, from easiest and safest to most difficult and risky.

An SSTable file is not a single file. It's a set of eight files. One of those eight contains the actual data; the others contain metadata used by Cassandra to find specific partitions and rows in the data file. Here is a sample list of the files:
File | Contents |
--- | --- |
mc-131-big-CRC.db | Checksums of chunks of the data file. |
mc-131-big-Data.db | The data file that contains all of the rows and columns. |
mc-131-big-Digest.crc32 | A single checksum of the entire data file. |
mc-131-big-Filter.db | A Bloom filter over the partition keys, used to quickly check whether a partition might be present in this SSTable. |
mc-131-big-Index.db | A single-level index of the partition and clustering keys in the data file. |
mc-131-big-Statistics.db | Metadata that Cassandra keeps about this SSTable, including information about the columns, tombstones, etc. |
mc-131-big-Summary.db | A sample of the index file, making this effectively a second-level index. |
mc-131-big-TOC.txt | A list of the component file names. No idea why it exists. |
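If you want to see these components on disk for one of your own tables, they all live together in the table's data directory and share a generation number. A quick sketch; the directory below is the one used later in this post, and mc-131 is just the example generation from the table above:

$ cd /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/
$ ls mc-131-big-*              # the eight component files for generation 131
$ ls *-big-Data.db | wc -l     # how many SSTable generations the table currently has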
The four options are:
- Nodetool scrub command – Performed online with little difficulty. In my personal experience it has a low success rate.
- Offline sstablescrub – Must be performed offline. The tool is in /usr/bin with a package install; otherwise it's in $CASSANDRA_HOME/bin. Its success rate is significantly better than nodetool scrub's, but it requires the node to be down to work. And it takes forever…
- rm -f – Performed offline. It must be followed immediately by a nodetool repair when you bring the node back up. This is the method I have successfully used most often, but it carries some consistency risk until the repair completes.
- Bootstrap the node – Similar to option 3, but with less theoretical impact on consistency.
Let's get into the details
It starts out like this: you are running a nodetool repair and you get an error.

$ nodetool repair -full
[2018-08-09 17:00:51,663] Starting repair command #2 (4c820390-9c17-11e8-8e8f-fbc0ff4d2cb8), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
error: Repair job has failed with the error message: [2018-08-15 09:59:41,659] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2018-08-15 09:59:41,659] Some repair failed

You see the error, but it doesn't tell you a whole lot: just that the repair failed. The next step is to look at the Cassandra system.log file to see the errors:
$ grep -n -A10 ERROR /var/log/cassandra/system.log
ERROR [RepairJobTask:8] 2018-08-08 15:15:57,726 RepairRunnable.java:277 - Repair session 2c5f89e0-9b39-11e8-b5ee-bb8feee1767a for range [(-1377105920845202291,-1371711029446682941], (-8865445607623519086,-885162575564883.... 425683885]]] Sync failed between /192.168.1.90 and /192.168.1.92
/var/log/cassandra/debug.log:ERROR [RepairJobTask:4] 2018-08-09 16:16:50,722 RepairSession.java:281 - [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8] Session completed with the following error
/var/log/cassandra/debug.log:ERROR [RepairJobTask:4] 2018-08-09 16:16:50,726 RepairRunnable.java:277 - Repair session 25682740-9c11-11e8-8e8f-fbc0ff4d2cb8...... 7115161941975432804,7115472305105341673], (5979423340500726528,5980417142425683885]]] Validation failed in /192.168.1.88
/var/log/cassandra/system.log:ERROR [ValidationExecutor:2] 2018-08-09 16:16:50,707 Validator.java:268 - Failed creating a merkle tree for [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8 on keyspace1/standard1,

The first error message, Sync failed, is misleading, although sometimes it can be a clue. Looking further, you see Validation failed in /192.168.1.88, which tells us that the error occurred on 192.168.1.88, which just happens to be the node we are on. Finally, we get the message showing the keyspace and table the error occurred on. Depending on the message, you might see the SSTable file number mentioned; in this case it was not (the short search sketched after the file listing below is one way to narrow it down). Looking in the directory tree, we see that we have the following SSTable files:
Size (bytes) | File |
--- | --- |
4,417,919,455 | mc-30-big-Data.db |
8,831,253,280 | mc-45-big-Data.db |
374,007,490 | mc-49-big-Data.db |
342,529,995 | mc-55-big-Data.db |
204,178,145 | mc-57-big-Data.db |
83,234,470 | mc-59-big-Data.db |
3,223,224,985 | mc-61-big-Data.db |
24,552,560 | mc-62-big-Data.db |
2,257,479,515 | mc-63-big-Data.db |
2,697,986,445 | mc-66-big-Data.db |
5,285 | mc-67-big-Data.db |
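When the log message doesn't name the offending file, the corruption errors elsewhere in the logs usually do, because exceptions such as CorruptSSTableException include the full path of the broken file. A small sketch of the kind of search I mean; the log locations are the stock package-install paths, so adjust them for your cluster:

$ grep -i corrupt /var/log/cassandra/system.log /var/log/cassandra/debug.log | tail -20
$ # Any mc-<generation>-big-Data.db path that shows up here is your candidate for
$ # scrubbing or deleting in the options below.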
Online SSTable repair – Nodetool scrub
This command is easy to perform. It is also the option least likely to succeed. The steps are below, and they are also gathered into a single sketch after the list:
- Find out which SSTable is broken.
- Run nodetool scrub keyspace tablename.
- Run nodetool repair.
- Run nodetool listsnapshots.
- Run nodetool clearsnapshot -t <snapshot name> to remove the automatic pre-scrub snapshot once you are happy with the result.
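Gathered together, the whole online pass looks roughly like this. The individual commands and their output are walked through below; the snapshot name is whatever nodetool listsnapshots reports on your node:

$ nodetool scrub keyspace1 standard1           # online scrub; takes an automatic pre-scrub snapshot first
$ nodetool repair -full keyspace1 standard1    # pull back any rows the scrub had to throw away
$ nodetool listsnapshots                       # find the pre-scrub-<timestamp> snapshot name
$ nodetool clearsnapshot -t pre-scrub-1533897462847   # clean it up to reclaim disk space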
$ nodetool scrub keyspace1 standard1
After the scrub, we have fewer SSTable files and their names have all changed. There is also less space consumed and, very likely, some rows are missing.
Size (bytes) | File |
--- | --- |
2,257,479,515 | mc-68-big-Data.db |
342,529,995 | mc-70-big-Data.db |
3,223,224,985 | mc-71-big-Data.db |
83,234,470 | mc-72-big-Data.db |
4,417,919,455 | mc-73-big-Data.db |
204,178,145 | mc-75-big-Data.db |
374,007,490 | mc-76-big-Data.db |
2,697,986,445 | mc-77-big-Data.db |
1,194,479,930 | mc-80-big-Data.db |
$ nodetool repair -full
[2018-08-09 17:00:51,663] Starting repair command #2 (4c820390-9c17-11e8-8e8f-fbc0ff4d2cb8), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
[2018-08-09 18:14:09,799] Repair session 4cadf590-9c17-11e8-8e8f-fbc0ff4d2cb8 for range [(-1377105920845202291,…
[2018-08-09 18:14:10,130] Repair completed successfully
[2018-08-09 18:14:10,131] Repair command #2 finished in 1 hour 13 minutes 18 seconds

After the repair, we have almost twice as many SSTable files, with data pulled in from other nodes to replace the corrupted data lost by the scrub process.
Size (bytes) | File |
--- | --- |
2,257,479,515 | mc-68-big-Data.db |
342,529,995 | mc-70-big-Data.db |
3,223,224,985 | mc-71-big-Data.db |
83,234,470 | mc-72-big-Data.db |
4,417,919,455 | mc-73-big-Data.db |
204,178,145 | mc-75-big-Data.db |
374,007,490 | mc-76-big-Data.db |
2,697,986,445 | mc-77-big-Data.db |
1,194,479,930 | mc-80-big-Data.db |
1,209,518,945 | mc-88-big-Data.db |
193,896,835 | mc-89-big-Data.db |
170,061,285 | mc-91-big-Data.db |
63,427,680 | mc-93-big-Data.db |
733,830,580 | mc-95-big-Data.db |
1,747,015,110 | mc-96-big-Data.db |
16,715,886,480 | mc-98-big-Data.db |
49,167,805 | mc-99-big-Data.db |
$ nodetool listsnapshots
Snapshot Details:
Snapshot name            Keyspace name  Column family name  True size  Size on disk
pre-scrub-1533897462847  keyspace1      standard1           35.93 GiB  35.93 GiB

$ nodetool clearsnapshot -t pre-scrub-1533897462847
Requested clearing snapshot(s) for [all keyspaces] with snapshot name [pre-scrub-1533897462847]

If the repair still fails to complete, we get to try one of the other methods.
Offline SSTable repair utility – sstablescrub
This option is a bit more complex to do, but it often works when the online version won't. Warning: it is very slow. Steps:
- Bring the node down.
- Run the sstablescrub command.
- Start the node back up.
- Run nodetool repair on the table.
- Run nodetool clearsnapshot to remove the pre-scrub snapshot.
$ nodetool drain
$ pkill java
$ ps -ef | grep cassandra
root 18271 14813 0 20:39 pts/1 00:00:00 grep --color=auto cassandra

Then issue the sstablescrub command, with the -n option unless you have the patience of a saint. Without -n, every column in every row in every SSTable file is validated, single threaded. It will take forever. In preparing for this blog post, I forgot to use -n and found that it took 12 hours to scrub 500 megabytes of a 30 GB table. Not willing to wait 30 days for the scrub to complete, I stopped it and switched to the -n option, completing the scrub in only… hang on for this… 6 days. So maybe this isn't going to be useful in most real-world situations unless you have really small tables.
$ sstablescrub -n keyspace1 standard1
Pre-scrub sstables snapshotted into snapshot pre-scrub-1533861799166
Scrubbing BigTableReader(path='/home/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/mc-88-big-Data.db') (1.126GiB)…

Unfortunately, this took more time than I wanted to spend on this blog post. Once the table is scrubbed, you restart Cassandra, run the repair on the table, and then delete the pre-scrub snapshot.
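Because the offline scrub can run for days, it is worth starting it in a way that survives a dropped SSH session and lets you check on progress. A minimal sketch, assuming a package install where sstablescrub is already on the PATH; the log file name is just an example:

$ nodetool drain                      # flush memtables and stop accepting traffic
$ sudo systemctl stop cassandra       # make sure the node is really down before scrubbing
$ nohup sstablescrub -n keyspace1 standard1 > /tmp/scrub-standard1.log 2>&1 &
$ tail -f /tmp/scrub-standard1.log    # each SSTable is reported as it is scrubbed

Stopping the service with systemctl instead of pkill is just a matter of taste; the point is that Cassandra must not be running while sstablescrub works on its files.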
Delete the file and do a Nodetool repair – rm
This option works every time. It is no more difficult than the offline sstablescrub command, and its success rate is 100%. It's also usually much faster than the offline sstablescrub option: in my prep for this blog post, this approach took only two hours for my 30 GB table. The only drawback I can see is that, while the post-delete repair runs, there is an increased risk of consistency problems, especially if you are reading at consistency level ONE (CL=ONE), which should be a fairly uncommon use case. Steps:
- Stop the node.
- cd to the offending keyspace and sstable directory.
- If you know which SSTable file is bad (if you learned about the problem from stalled compactions, you will know), just delete it and its companion component files (see the sketch after this list). If not, delete all of the files in the directory.
- Restart the node.
- Run nodetool repair on the table.
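If you know which generation is broken, the delete in step 3 can be limited to that generation's component files rather than the whole directory. A sketch, using generation 63 from the earlier file listing purely as an example:

$ cd /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/
$ ls mc-63-big-*           # confirm the eight component files for that generation
$ sudo rm -f mc-63-big-*   # remove only that generation; backups/ and snapshots/ are untouched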
$ nodetool drain
$ pkill java
$ ps -ef | grep cassandra
root 18271 14813 0 20:39 pts/1 00:00:00 grep --color=auto cassandra
$ cd /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/
$ pwd
/var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc

If you know which SSTable file you want to delete, you can delete just that one generation with rm -f *nnn*. If not, as in this case, you delete them all.
$ sudo rm -f *
rm: cannot remove 'backups': Is a directory
rm: cannot remove 'snapshots': Is a directory
$ ls
backups  snapshots
$ systemctl start cassandra
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.88  1.35 MiB   256     100.0%            c92d9374-cf3a-47f6-9bd1-81b827da0c1e  rack1
UN  192.168.1.90  41.72 GiB  256     100.0%            3c9e61ae-8741-4a74-9e89-cfa47768ac60  rack1
UN  192.168.1.92  30.87 GiB  256     100.0%            c36fecad-0f55-4945-a741-568f28a3cd8b  rack1
$ nodetool repair keyspace1 standard1 -full
[2018-08-10 11:23:22,454] Starting repair command #1 (51713c00-9cb1-11e8-ba61-01c8f56621df), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [standard1], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
[2018-08-10 13:02:36,097] Repair completed successfully
[2018-08-10 13:02:36,098] Repair command #1 finished in 1 hour 39 minutes 13 seconds

The SSTable file list now looks like this:
Size (bytes) | File |
--- | --- |
229,648,710 | mc-10-big-Data.db |
103,421,070 | mc-11-big-Data.db |
1,216,169,275 | mc-12-big-Data.db |
76,738,970 | mc-13-big-Data.db |
773,774,270 | mc-14-big-Data.db |
17,035,624,448 | mc-15-big-Data.db |
83,365,660 | mc-16-big-Data.db |
170,061,285 | mc-17-big-Data.db |
758,998,865 | mc-18-big-Data.db |
2,683,075,900 | mc-19-big-Data.db |
749,573,440 | mc-1-big-Data.db |
91,184,160 | mc-20-big-Data.db |
303,380,050 | mc-21-big-Data.db |
3,639,126,510 | mc-22-big-Data.db |
231,929,395 | mc-23-big-Data.db |
1,469,272,390 | mc-24-big-Data.db |
204,485,420 | mc-25-big-Data.db |
345,655,100 | mc-26-big-Data.db |
805,017,870 | mc-27-big-Data.db |
50,714,125 | mc-28-big-Data.db |
11,578,088,555 | mc-2-big-Data.db |
170,033,435 | mc-3-big-Data.db |
1,677,053,450 | mc-4-big-Data.db |
62,245,405 | mc-5-big-Data.db |
8,426,967,700 | mc-6-big-Data.db |
1,979,214,745 | mc-7-big-Data.db |
2,910,586,420 | mc-8-big-Data.db |
14,097,936,920 | mc-9-big-Data.db |
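Since stuck compactions are often how this problem shows up in the first place, it is worth a quick check that the node really is back to normal once the repair finishes. A small sketch of the checks I would run; nothing here is specific to this method, and it works after any of the four options:

$ nodetool compactionstats                          # pending compactions should be draining, not stuck
$ grep -ic corrupt /var/log/cassandra/system.log    # count of corruption errors; it should stop growing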
Bootstrap the node
If you are reading at consistency level (CL) ONE, or you are really concerned about consistency overall, use this approach instead of the rm -f approach. It ensures that the node with missing data will not participate in any reads until all of its data has been restored. Depending on how much data the node has to recover, it will often take longer than the other approaches; although, since bootstrapping can stream in parallel, it may not. Steps (a sketch for monitoring the streaming follows the list):
- Shut down the node.
- Remove all of the files under the Cassandra data directory, usually /var/lib/cassandra.
- Modify /etc/cassandra/conf/cassandra-env.sh to add the replace_address option (shown below).
- Start Cassandra. When the server starts with no files, it will connect to one of its seeds, recreate the schema, and request that the other nodes stream data to it to replace the data it has lost. It will not select new token ranges unless you try to restart it with a different IP address than it had before.
- Modify the /etc/cassandra/conf/cassandra-env.sh file to remove the change made in Step 3.
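While the replacement node is streaming in step 4, you can watch its progress from the node itself. A sketch; filtering out the lines that already show 100% is just a convenient habit, not an official part of the procedure:

$ # Active streaming sessions, hiding files that have already completed
$ watch -n 30 'nodetool netstats | grep -v "100%"'
$ # Once streaming finishes, the node should show UN in nodetool status
$ nodetool status | grep 192.168.1.88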
$ nodetool drain
$ sudo pkill java
$ ps -ef | grep java
$ vi /etc/cassandra/conf/cassandra-env.sh

Add this line at the end of the file:

JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.1.88"
$ systemctl start cassandra
Wait for the node to join the cluster. During the bootstrap, we see messages like this in the log:
INFO [main] 2018-08-10 13:39:06,780 StreamResultFuture.java:90 - [Stream #47b382f0-9cc4-11e8-a010-51948a7598a1] Executing streaming plan for Bootstrap
INFO [StreamConnectionEstablisher:1] 2018-08-10 13:39:06,784 StreamSession.java:266 - [Stream #47b382f0-9cc4-11e8-a010-51948a7598a1] Starting streaming to /192.168.1.90

Later on we see:
INFO [main] 2018-08-10 14:18:16,133 StorageService.java:1449 - JOINING: Finish joining ring
INFO [main] 2018-08-10 14:18:16,482 SecondaryIndexManager.java:509 - Executing pre-join post-bootstrap tasks for: CFS(Keyspace='keyspace1', ColumnFamily='standard1')
INFO [main] 2018-08-10 14:18:16,484 SecondaryIndexManager.java:509 - Executing pre-join post-bootstrap tasks for: CFS(Keyspace='keyspace1', ColumnFamily='counter1')
INFO [main] 2018-08-10 14:18:16,897 StorageService.java:2292 - Node /192.168.1.88 state jump to NORMAL
WARN [main] 2018-08-10 14:18:16,899 StorageService.java:2324 - Not updating token metadata for /192.168.1.88 because I am replacing it

When we do a nodetool status, we see:
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.88  30.87 GiB  256     100.0%            c92d9374-cf3a-47f6-9bd1-81b827da0c1e  rack1
UN  192.168.1.90  41.72 GiB  256     100.0%            3c9e61ae-8741-4a74-9e89-cfa47768ac60  rack1
UN  192.168.1.92  30.87 GiB  256     100.0%            c36fecad-0f55-4945-a741-568f28a3cd8b  rack1

The node is up and running in less than one hour, quicker than any of the other options. Makes you think about your choices, doesn't it? Note that if you have a keyspace with RF=1, then options 3 and 4 are not viable: you will lose data. Although with RF=1 and a corrupted SSTable file, you are going to lose some data anyway.

A last look at the list of SSTable files shows this:
Size (bytes) | File |
--- | --- |
773,774,270 | mc-10-big-Data.db |
17,148,617,040 | mc-11-big-Data.db |
749,573,440 | mc-1-big-Data.db |
170,033,435 | mc-2-big-Data.db |
1,677,053,450 | mc-3-big-Data.db |
62,245,405 | mc-4-big-Data.db |
8,426,967,700 | mc-5-big-Data.db |
229,648,710 | mc-6-big-Data.db |
103,421,070 | mc-7-big-Data.db |
1,216,169,275 | mc-8-big-Data.db |
76,738,970 | mc-9-big-Data.db |
Conclusion
If you run into corrupted SSTable files, don't panic. It won't have any impact on your operations in the short term unless you are using RF=ONE or reading at CL=ONE. Find out which node has the broken SSTable file. Then, because it's the easiest and lowest-risk option, try the online nodetool scrub command. If that does not work, you have three choices. The offline scrub works, but is usually too slow to be useful. Rebuilding the whole node seems like overkill, but it will work, and it will maintain consistency on reads. If you have a lot of data and you want to solve the problem fairly quickly, just remove the offending SSTable file and do a repair.

All approaches have an impact on the other nodes in the cluster. The first three require a repair, which computes Merkle trees and streams data to the node being fixed. The amount of data to be streamed is largest with the delete, but the total recovery time was still lower in my example; that may not always be the case. In the bootstrap example, the total time was very similar to the delete case because my test case had only one large table. If there were several large tables, the delete approach would have been the fastest way to get the node back to normal.

Approach | Scrub phase | Repair phase | Total recovery time (hh:mm) |
--- | --- | --- | --- |
Online Scrub | 1:06 | 1:36 | 2:42 |
*Offline Scrub | 144:35 | 1:37 | 146:22 |
Delete files | 0:05 | 1:36 | 1:41 |
Bootstrap | 0:05 | 1:45 | 1:50 |

*The offline scrub phase is dominated by the multi-day sstablescrub run described above.
For reference, a command like this, which overwrites the first bytes of a Data.db file in place, is how an SSTable can be deliberately corrupted for testing. Obviously, never do this on a cluster you care about.

$ printf '\x31\xc0\xc3' | dd of=mc-8-big-Data.db bs=1 seek=0 count=100 conv=notrunc