Every few months I have a customer come to me with the same concern: compactions for one of their Cassandra tables are stuck, or repairs fail whenever they involve one of the nodes in the cluster. I take a look, or just ask a couple of questions, and it becomes apparent that the problem is a broken SSTable file. Occasionally, they come to me in a panic because they have looked at their logs and discovered a broken SSTable file. Don’t panic. A broken SSTable file is not a crisis. It does not mean lost data or an unusable database. Well, that’s true unless you are using a Replication Factor (RF) of ONE. The cluster will still operate, and queries should work just fine.
But… it does need to be fixed. There are four ways to fix the problem, which I will explain in this post. One of them, I freely admit, is not one of the community's recommended approaches, but it is often the simplest and quickest, with minimal downside risk. Before explaining the repair options, I will spend a few lines on what an SSTable file actually is, and then walk you through the four options, from the easiest and safest to the most difficult and risky. An SSTable “file” is not a single file. It is a set of eight files. One of the eight contains the actual data; the others contain metadata used by Cassandra to find specific partitions and rows in the data file. Here is a sample list of the files:
| File | Contents |
| --- | --- |
| mc-131-big-CRC.db | Checksums of chunks of the data file. |
| mc-131-big-Data.db | The data file that contains all of the rows and columns. |
| mc-131-big-Digest.crc32 | Single checksum of the data file. |
| mc-131-big-Filter.db | Bloom filter of the partition keys, used to skip reading the file for partitions it cannot contain. |
| mc-131-big-Index.db | A single-level index of the partition and clustering keys in the data file. |
| mc-131-big-Statistics.db | Metadata that Cassandra keeps about this file, including information about the columns, tombstones, etc. |
| mc-131-big-Summary.db | An index into the index file, making this a second-level index. |
| mc-131-big-TOC.txt | A list of the file names that make up this SSTable. |
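If you want to see these components for yourself, list the table's data directory. A minimal sketch, assuming a default package install and the keyspace and table names used later in this post:

```
# Assumed default data path; the directory name ends in the table's UUID, which varies per cluster.
cd /var/lib/cassandra/data/keyspace1/standard1-*/
# All eight components of one SSTable share a generation number (131 here is illustrative).
ls -l mc-131-big-*
```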
The “mc” is the SSTable file version. It changes whenever a new release of Cassandra changes anything about the way data is stored in any of the files listed in the table above. The number 131 is the sequence (generation) number of the SSTable. It increases for each new SSTable written through a memtable flush, a compaction, or streaming from another node. The word “big” identifies the SSTable storage format; it was added to file names starting in Cassandra 2.2, and “big” is the only format open-source Cassandra uses. The rest of the file name parts are explained in the table above.

When you get the dreaded error telling you an SSTable file is broken, it is almost always because an internal consistency check failed, something like “column too long”, or because one of the checksums failed to validate. This has relatively little effect on normal reads against the table, apart from the request where the failure took place. It has a serious effect on compactions and repairs, stopping them in their tracks. Failing repairs can lead to long-term consistency issues between nodes and, eventually, to the application returning incorrect results. Failing compactions degrade read performance in the short term and cause storage space problems in the long term. So… what are the four options?
It starts out like this: you are running a nodetool repair and you get an error.

```
$ nodetool repair -full
[2018-08-09 17:00:51,663] Starting repair command #2 (4c820390-9c17-11e8-8e8f-fbc0ff4d2cb8), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
error: Repair job has failed with the error message: [2018-08-15 09:59:41,659] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2018-08-15 09:59:41,659] Some repair failed
```

You see the error, but it doesn’t tell you a whole lot. Just that the repair failed.
The next step is to look at the Cassandra system.log file to see the errors:

```
$ grep -n -A10 ERROR /var/log/cassandra/system.log
ERROR [RepairJobTask:8] 2018-08-08 15:15:57,726 RepairRunnable.java:277 - Repair session 2c5f89e0-9b39-11e8-b5ee-bb8feee1767a for range [(-1377105920845202291,-1371711029446682941], (-8865445607623519086,-885162575564883.... 425683885]]] Sync failed between /192.168.1.90 and /192.168.1.92
/var/log/cassandra/debug.log:ERROR [RepairJobTask:4] 2018-08-09 16:16:50,722 RepairSession.java:281 - [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8] Session completed with the following error
/var/log/cassandra/debug.log:ERROR [RepairJobTask:4] 2018-08-09 16:16:50,726 RepairRunnable.java:277 - Repair session 25682740-9c11-11e8-8e8f-fbc0ff4d2cb8...... 7115161941975432804,7115472305105341673], (5979423340500726528,5980417142425683885]]] Validation failed in /192.168.1.88
/var/log/cassandra/system.log:ERROR [ValidationExecutor:2] 2018-08-09 16:16:50,707 Validator.java:268 - Failed creating a merkle tree for [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8 on keyspace1/standard1,
```
The first error message, Sync failed, is misleading, although sometimes it can be a clue. Looking further, you see Validation failed in /192.168.1.88. This tells us that the error occurred on 192.168.1.88, which happens to be the node we are on. Finally, we get the message showing the keyspace and table the error occurred on. Depending on the message, you might see the SSTable file number mentioned; in this case it was not. Looking in the directory tree, we see that we have the following SSTable files (a quick way to produce a listing like this is sketched after the table):
| Size (bytes) | File |
| --- | --- |
| 4,417,919,455 | mc-30-big-Data.db |
| 8,831,253,280 | mc-45-big-Data.db |
| 374,007,490 | mc-49-big-Data.db |
| 342,529,995 | mc-55-big-Data.db |
| 204,178,145 | mc-57-big-Data.db |
| 83,234,470 | mc-59-big-Data.db |
| 3,223,224,985 | mc-61-big-Data.db |
| 24,552,560 | mc-62-big-Data.db |
| 2,257,479,515 | mc-63-big-Data.db |
| 2,697,986,445 | mc-66-big-Data.db |
| 5,285 | mc-67-big-Data.db |
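Here is the quick listing sketch mentioned above, along with a grep you can use when the log does name a file. The paths assume a default package install; the table directory's UUID suffix will differ in your cluster:

```
# Assumed default data directory; adjust keyspace, table, and UUID suffix for your environment.
cd /var/lib/cassandra/data/keyspace1/standard1-*/
# Print each data file with its size, similar to the listing above.
ls -l *-Data.db | awk '{printf "%16s  %s\n", $5, $9}'
# If the log did name a generation (45 is just an example), look for other mentions of it.
grep 'mc-45-big' /var/log/cassandra/system.log /var/log/cassandra/debug.log
```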
At this point we have our repair options. From the easiest and safest to the most difficult and risky, they are: an online scrub with nodetool scrub, an offline scrub with the sstablescrub utility, deleting the offending SSTable files and repairing, and rebuilding the node with a replacement bootstrap. I’ll take them one at a time.
The first option is the online scrub. We did the whole “find out which table is broken” exercise just above, so we aren’t going to do it again; we start with step two, the scrub itself. Scrub takes a snapshot and rebuilds your SSTable files. The corrupt one(s) will disappear. You will lose at least a few rows, and possibly all the rows from the corrupted SSTable files; hence the need to do a repair afterwards.

```
$ nodetool scrub keyspace1 standard1
```

After the scrub, we have fewer SSTable files, and their names have all changed. There is also less space consumed, and very likely some rows missing:
| Size (bytes) | File |
| --- | --- |
| 2,257,479,515 | mc-68-big-Data.db |
| 342,529,995 | mc-70-big-Data.db |
| 3,223,224,985 | mc-71-big-Data.db |
| 83,234,470 | mc-72-big-Data.db |
| 4,417,919,455 | mc-73-big-Data.db |
| 204,178,145 | mc-75-big-Data.db |
| 374,007,490 | mc-76-big-Data.db |
| 2,697,986,445 | mc-77-big-Data.db |
| 1,194,479,930 | mc-80-big-Data.db |
So we do a repair.

```
$ nodetool repair -full
[2018-08-09 17:00:51,663] Starting repair command #2 (4c820390-9c17-11e8-8e8f-fbc0ff4d2cb8), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
[2018-08-09 18:14:09,799] Repair session 4cadf590-9c17-11e8-8e8f-fbc0ff4d2cb8 for range [(-1377105920845202291,…
[2018-08-09 18:14:10,130] Repair completed successfully
[2018-08-09 18:14:10,131] Repair command #2 finished in 1 hour 13 minutes 18 seconds
```

After the repair, we have almost twice as many SSTable files, with data pulled in from other nodes to replace the corrupted data lost by the scrub process:
| Size (bytes) | File |
| --- | --- |
| 2,257,479,515 | mc-68-big-Data.db |
| 342,529,995 | mc-70-big-Data.db |
| 3,223,224,985 | mc-71-big-Data.db |
| 83,234,470 | mc-72-big-Data.db |
| 4,417,919,455 | mc-73-big-Data.db |
| 204,178,145 | mc-75-big-Data.db |
| 374,007,490 | mc-76-big-Data.db |
| 2,697,986,445 | mc-77-big-Data.db |
| 1,194,479,930 | mc-80-big-Data.db |
| 1,209,518,945 | mc-88-big-Data.db |
| 193,896,835 | mc-89-big-Data.db |
| 170,061,285 | mc-91-big-Data.db |
| 63,427,680 | mc-93-big-Data.db |
| 733,830,580 | mc-95-big-Data.db |
| 1,747,015,110 | mc-96-big-Data.db |
| 16,715,886,480 | mc-98-big-Data.db |
| 49,167,805 | mc-99-big-Data.db |
Once the scrub and repair are complete, you are almost done. One side effect of the scrub is a snapshot called pre-scrub-<timestamp>. If you don’t want to run out of disk space, you will want to remove it, preferably with nodetool:

```
$ nodetool listsnapshots
Snapshot Details:
Snapshot name            Keyspace name  Column family name  True size  Size on disk
pre-scrub-1533897462847  keyspace1      standard1           35.93 GiB  35.93 GiB

$ nodetool clearsnapshot -t pre-scrub-1533897462847
Requested clearing snapshot(s) for [all keyspaces] with snapshot name [pre-scrub-1533897462847]
```
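If repeated scrubs have left several of these snapshots behind across tables, a small loop can clear them all at once. A minimal sketch, assuming the default pre-scrub-<timestamp> naming shown above:

```
# Clear every snapshot whose name starts with "pre-scrub-" on this node.
for snap in $(nodetool listsnapshots | awk '$1 ~ /^pre-scrub-/ {print $1}' | sort -u); do
  nodetool clearsnapshot -t "$snap"
done
```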
If the repair still fails to complete, we get to try one of the other methods. The second option is an offline scrub using the sstablescrub utility. Start by taking Cassandra down on the node:

```
$ nodetool drain
$ pkill java
$ ps -ef |grep cassandra
root     18271 14813  0 20:39 pts/1    00:00:00 grep --color=auto cassandra
```

Then issue the sstablescrub command with the -n option, unless you have the patience of a saint. Without -n, every column in every row in every SSTable file will be validated, single threaded. It will take forever. In preparing for this blog post, I forgot to use -n and found that it took 12 hours to scrub 500 megabytes of a 30 GB table. Not willing to wait 30 days for the scrub to complete, I stopped it and switched to the -n option, completing the scrub in only… hang on for this, 6 days. So, um, maybe this isn’t going to be useful in most real-world situations unless you have really small tables.

```
$ sstablescrub -n keyspace1 standard1
Pre-scrub sstables snapshotted into snapshot pre-scrub-1533861799166
Scrubbing BigTableReader(path='/home/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/mc-88-big-Data.db') (1.126GiB)…
```

Unfortunately, even that took more time than I wanted to spend on this blog post. Once the table is scrubbed, you restart Cassandra, clear the pre-scrub snapshot, and run the repair as before.
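As an aside: if you only want to identify which SSTables are corrupt, without rewriting anything, the sstableverify tool can do that. A minimal sketch, assuming your distribution ships the tool (it lives alongside sstablescrub and, like it, should be run while Cassandra is down on the node):

```
# Assumption: sstableverify is available in your Cassandra version's tools.
# It walks the SSTables for the table and reports corruption without modifying them.
$ sstableverify keyspace1 standard1
```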
The third option is to delete the corrupted SSTable file outright. Stop Cassandra on the node, then go to the table's data directory:

```
$ nodetool drain
$ pkill java
$ ps -ef |grep cassandra
root     18271 14813  0 20:39 pts/1    00:00:00 grep --color=auto cassandra
$ cd /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/
$ pwd
/var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc
```

If you know which SSTable file you want to delete, you can delete just that one with rm -f *nnn*, where nnn is its sequence number (a slightly safer variant is sketched after the file listing below). If not, as in this case, you delete them all:

```
$ sudo rm -f *
rm: cannot remove 'backups': Is a directory
rm: cannot remove 'snapshots': Is a directory
$ ls
backups  snapshots
```

Then restart Cassandra, confirm the node is back up, and repair:

```
$ systemctl start cassandra
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.88  1.35 MiB   256     100.0%            c92d9374-cf3a-47f6-9bd1-81b827da0c1e  rack1
UN  192.168.1.90  41.72 GiB  256     100.0%            3c9e61ae-8741-4a74-9e89-cfa47768ac60  rack1
UN  192.168.1.92  30.87 GiB  256     100.0%            c36fecad-0f55-4945-a741-568f28a3cd8b  rack1

$ nodetool repair keyspace1 standard1 -full
[2018-08-10 11:23:22,454] Starting repair command #1 (51713c00-9cb1-11e8-ba61-01c8f56621df), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [standard1], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
[2018-08-10 13:02:36,097] Repair completed successfully
[2018-08-10 13:02:36,098] Repair command #1 finished in 1 hour 39 minutes 13 seconds
```

The SSTable file list now looks like this:
| Size (bytes) | File |
| --- | --- |
| 229,648,710 | mc-10-big-Data.db |
| 103,421,070 | mc-11-big-Data.db |
| 1,216,169,275 | mc-12-big-Data.db |
| 76,738,970 | mc-13-big-Data.db |
| 773,774,270 | mc-14-big-Data.db |
| 17,035,624,448 | mc-15-big-Data.db |
| 83,365,660 | mc-16-big-Data.db |
| 170,061,285 | mc-17-big-Data.db |
| 758,998,865 | mc-18-big-Data.db |
| 2,683,075,900 | mc-19-big-Data.db |
| 749,573,440 | mc-1-big-Data.db |
| 91,184,160 | mc-20-big-Data.db |
| 303,380,050 | mc-21-big-Data.db |
| 3,639,126,510 | mc-22-big-Data.db |
| 231,929,395 | mc-23-big-Data.db |
| 1,469,272,390 | mc-24-big-Data.db |
| 204,485,420 | mc-25-big-Data.db |
| 345,655,100 | mc-26-big-Data.db |
| 805,017,870 | mc-27-big-Data.db |
| 50,714,125 | mc-28-big-Data.db |
| 11,578,088,555 | mc-2-big-Data.db |
| 170,033,435 | mc-3-big-Data.db |
| 1,677,053,450 | mc-4-big-Data.db |
| 62,245,405 | mc-5-big-Data.db |
| 8,426,967,700 | mc-6-big-Data.db |
| 1,979,214,745 | mc-7-big-Data.db |
| 2,910,586,420 | mc-8-big-Data.db |
| 14,097,936,920 | mc-9-big-Data.db |
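Before moving on, here is the safer single-file variant mentioned in option 3: instead of rm, move all the components of the offending SSTable aside, so you can put them back if you picked the wrong generation. A minimal sketch; the generation number 45 and the quarantine path are illustrative:

```
# Run only while Cassandra is stopped on this node.
mkdir -p /var/lib/cassandra/quarantine
# Generation 45 is an example; substitute the number named in your error messages.
mv mc-45-big-* /var/lib/cassandra/quarantine/
# Once the node restarts cleanly and the repair succeeds, delete the quarantined files.
```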
The fourth option is to rebuild the node entirely, bootstrapping it back in as a replacement for itself. Stop Cassandra and clear out the node's data directories as in the previous option (a node that still has its old system data will refuse to replace itself), then tell Cassandra to replace its own address on the next start:

```
$ nodetool drain
$ sudo pkill java
$ ps -ef |grep java
$ vi /etc/cassandra/conf/cassandra-env.sh
```

Add this line at the end of the file:

```
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.1.88"
```

Start Cassandra and wait for the node to join the cluster:

```
$ systemctl start cassandra
```

During the bootstrap we see messages like this in the log:

```
INFO  [main] 2018-08-10 13:39:06,780 StreamResultFuture.java:90 - [Stream #47b382f0-9cc4-11e8-a010-51948a7598a1] Executing streaming plan for Bootstrap
INFO  [StreamConnectionEstablisher:1] 2018-08-10 13:39:06,784 StreamSession.java:266 - [Stream #47b382f0-9cc4-11e8-a010-51948a7598a1] Starting streaming to /192.168.1.90
```

Later on we see:

```
INFO  [main] 2018-08-10 14:18:16,133 StorageService.java:1449 - JOINING: Finish joining ring
INFO  [main] 2018-08-10 14:18:16,482 SecondaryIndexManager.java:509 - Executing pre-join post-bootstrap tasks for: CFS(Keyspace='keyspace1', ColumnFamily='standard1')
INFO  [main] 2018-08-10 14:18:16,484 SecondaryIndexManager.java:509 - Executing pre-join post-bootstrap tasks for: CFS(Keyspace='keyspace1', ColumnFamily='counter1')
INFO  [main] 2018-08-10 14:18:16,897 StorageService.java:2292 - Node /192.168.1.88 state jump to NORMAL
WARN  [main] 2018-08-10 14:18:16,899 StorageService.java:2324 - Not updating token metadata for /192.168.1.88 because I am replacing it
```

You can watch the streaming progress while this runs; see the sketch after the file listing below. When we do a nodetool status we see:

```
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.88  30.87 GiB  256     100.0%            c92d9374-cf3a-47f6-9bd1-81b827da0c1e  rack1
UN  192.168.1.90  41.72 GiB  256     100.0%            3c9e61ae-8741-4a74-9e89-cfa47768ac60  rack1
UN  192.168.1.92  30.87 GiB  256     100.0%            c36fecad-0f55-4945-a741-568f28a3cd8b  rack1
```

Remember to remove the replace_address line from cassandra-env.sh once the bootstrap completes, so it does not take effect on a future restart. The node is up and running in less than an hour, quicker than any of the other options. Makes you think about your choices, doesn’t it? One caveat: if you have a keyspace with RF=1, options 3 and 4 are not viable; you will lose data. Although with RF=1 and a corrupted SSTable file, you are going to lose some data anyway. A last view at the list of SSTable files shows this:
| Size (bytes) | File |
| --- | --- |
| 773,774,270 | mc-10-big-Data.db |
| 17,148,617,040 | mc-11-big-Data.db |
| 749,573,440 | mc-1-big-Data.db |
| 170,033,435 | mc-2-big-Data.db |
| 1,677,053,450 | mc-3-big-Data.db |
| 62,245,405 | mc-4-big-Data.db |
| 8,426,967,700 | mc-5-big-Data.db |
| 229,648,710 | mc-6-big-Data.db |
| 103,421,070 | mc-7-big-Data.db |
| 1,216,169,275 | mc-8-big-Data.db |
| 76,738,970 | mc-9-big-Data.db |
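Here is the streaming-progress sketch mentioned above. nodetool netstats reports the active streams; the watch interval and the grep filter (a common idiom for hiding files that are already done) are my own choices, not anything the bootstrap requires:

```
# Refresh the streaming view every 30 seconds; completed (100%) file transfers are filtered out.
watch -n 30 "nodetool netstats | grep -v '100%'"
```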
If you run into corrupted SSTable files, don’t panic. It won’t have any impact on your operations in the short term unless you are using RF=ONE or reading at CL=ONE. Find out which node has the broken SSTable file. Then, because it’s the easiest and lowest-risk approach, try the online nodetool scrub command. If that does not work, you have three choices. The offline scrub works, but it is usually too slow to be useful. Rebuilding the whole node seems like overkill, but it will work, and it will maintain consistency on reads. If you have a lot of data and you want to solve the problem fairly quickly, just remove the offending SSTable file and do a repair. All approaches have an impact on the other nodes in the cluster. The first three require a repair, which computes Merkle trees and streams data to the node being fixed. The amount of data streamed is greatest with the delete approach, yet its total recovery time was the least in my example. That may not always be the case. In the bootstrap example, the total time was very similar to the delete case because my test had only one large table. If there were several large tables, the delete approach would have been the fastest way to get the node back to normal.
| Approach | Scrub phase (h:mm) | Repair phase (h:mm) | Total recovery time (h:mm) |
| --- | --- | --- | --- |
| Online Scrub | 1:06 | 1:36 | 2:42 |
| *Offline Scrub | 144:35 | 1:37 | 146:22 |
| Delete files | 0:05 | 1:36 | 1:41 |
| Bootstrap | 0:05 | 1:45 | 1:50 |
All sample commands show the user in normal Linux user mode; that is because in my test environment the Cassandra cluster belonged to my user id. Most production Cassandra clusters run as the cassandra Linux user, in which case some amount of user id switching or sudo would be required to do the work. The offline scrub time (marked with * above) was estimated; I did not want to wait six days to see if it was really going to take that long. All sample output provided here came from a three-node cluster running Cassandra 3.11.2 on Fedora 28, using a vanilla Cassandra install with pretty much everything in cassandra-env.sh defaulted. I corrupted the SSTable file using this command:

```
$ printf '\x31\xc0\xc3' | dd of=mc-8-big-Data.db bs=1 seek=0 count=100 conv=notrunc
```
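If you try this at home and want to confirm the corruption landed where you aimed it, a quick hex dump of the file's first bytes shows the overwritten region (assuming xxd is installed):

```
# Dump the first 16 bytes of the data file; the injected bytes appear at offset 0.
$ xxd -l 16 mc-8-big-Data.db
```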