Every few months I have a customer come to me with the following concern: my compactions for one of my Cassandra tables are stuck, or my repairs fail when referencing one of the nodes in my Cassandra cluster. I take a look, or just ask a couple of questions, and it becomes apparent that the problem is a broken SSTable file. Occasionally, they will come to me in a panic because they have looked at their logs and discovered a broken SSTable file.

Don't panic. A broken SSTable file is not a crisis. It does not represent lost data or an unusable database. Well, that's true unless you are using a Replication Factor (RF) of ONE. The cluster will still operate, and queries should be working just fine. But it does need to be fixed. There are four ways to fix the problem, which I will explain in this post. One of them, I freely admit, is not one of the community's recommended approaches, but it is often the simplest and quickest, with minimal downside risk.

Before I explain the ways to repair an SSTable, I will spend a few lines explaining what an SSTable file is. Then I will walk you through the four options, from easiest and safest to most difficult and risky.

An SSTable file is not a single file. It's a set of eight files. One of those eight contains the actual data; the others contain metadata used by Cassandra to find specific partitions and rows in the data file. Here is a sample list of the files:
File | Contents |
--- | --- |
mc-131-big-CRC.db | Checksums of chunks of the data file. |
mc-131-big-Data.db | The data file that contains all of the rows and columns. |
mc-131-big-Digest.crc32 | A single checksum of the entire data file. |
mc-131-big-Filter.db | A Bloom filter over the partition keys, used to quickly check whether a partition might be present in this SSTable. |
mc-131-big-Index.db | A single-level index of the partition and clustering keys in the data file. |
mc-131-big-Statistics.db | Metadata that Cassandra keeps about this SSTable, including information about the columns, tombstones, etc. |
mc-131-big-Summary.db | A sample of the index file, making this effectively a second-level index. |
mc-131-big-TOC.txt | A list of the component file names. No idea why it exists. |
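If you want to see these components on disk for one of your own tables, they all live together in the table's data directory and share a generation number. A quick sketch; the directory below is the one used later in this post, and mc-131 is just the example generation from the table above:

$ cd /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/
$ ls mc-131-big-*              # the eight component files for generation 131
$ ls *-big-Data.db | wc -l     # how many SSTable generations the table currently has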
The four options are:
- Nodetool scrub command – Performed online with little difficulty. In my personal experience it has a low success rate.
- Offline sstablescrub – Must be performed offline. The tool is in /usr/bin with a package install; otherwise it's in $CASSANDRA_HOME/bin. Its success rate is significantly better than nodetool scrub's, but it requires the node to be down to work. And it takes forever…
- rm -f – Performed offline. It must be followed immediately by a nodetool repair when you bring the node back up. This is the method I have successfully used most often, but it carries some consistency risk until the repair completes.
- Bootstrap the node – Similar to option 3, but with less theoretical impact on consistency.
Let's get into the details
It starts out like this: you are running a nodetool repair and you get an error.

$ nodetool repair -full
[2018-08-09 17:00:51,663] Starting repair command #2 (4c820390-9c17-11e8-8e8f-fbc0ff4d2cb8), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
error: Repair job has failed with the error message: [2018-08-15 09:59:41,659] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2018-08-15 09:59:41,659] Some repair failed

You see the error, but it doesn't tell you a whole lot: just that the repair failed. The next step is to look at the Cassandra system.log file to see the errors:
$ grep -n -A10 ERROR /var/log/cassandra/system.log
ERROR [RepairJobTask:8] 2018-08-08 15:15:57,726 RepairRunnable.java:277 - Repair session 2c5f89e0-9b39-11e8-b5ee-bb8feee1767a for range [(-1377105920845202291,-1371711029446682941], (-8865445607623519086,-885162575564883.... 425683885]]] Sync failed between /192.168.1.90 and /192.168.1.92
/var/log/cassandra/debug.log:ERROR [RepairJobTask:4] 2018-08-09 16:16:50,722 RepairSession.java:281 - [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8] Session completed with the following error
/var/log/cassandra/debug.log:ERROR [RepairJobTask:4] 2018-08-09 16:16:50,726 RepairRunnable.java:277 - Repair session 25682740-9c11-11e8-8e8f-fbc0ff4d2cb8...... 7115161941975432804,7115472305105341673], (5979423340500726528,5980417142425683885]]] Validation failed in /192.168.1.88
/var/log/cassandra/system.log:ERROR [ValidationExecutor:2] 2018-08-09 16:16:50,707 Validator.java:268 - Failed creating a merkle tree for [repair #25682740-9c11-11e8-8e8f-fbc0ff4d2cb8 on keyspace1/standard1,

The first error message, Sync failed, is misleading, although sometimes it can be a clue. Looking further, you see Validation failed in /192.168.1.88, which tells us that the error occurred on 192.168.1.88, which just happens to be the node we are on. Finally, we get the message showing the keyspace and table the error occurred on. Depending on the message, you might see the SSTable file number mentioned; in this case it was not (the short search sketched after the file listing below is one way to narrow it down). Looking in the directory tree, we see that we have the following SSTable files:
Size (bytes) | File |
--- | --- |
4,417,919,455 | mc-30-big-Data.db |
8,831,253,280 | mc-45-big-Data.db |
374,007,490 | mc-49-big-Data.db |
342,529,995 | mc-55-big-Data.db |
204,178,145 | mc-57-big-Data.db |
83,234,470 | mc-59-big-Data.db |
3,223,224,985 | mc-61-big-Data.db |
24,552,560 | mc-62-big-Data.db |
2,257,479,515 | mc-63-big-Data.db |
2,697,986,445 | mc-66-big-Data.db |
5,285 | mc-67-big-Data.db |
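When the log message doesn't name the offending file, the corruption errors elsewhere in the logs usually do, because exceptions such as CorruptSSTableException include the full path of the broken file. A small sketch of the kind of search I mean; the log locations are the stock package-install paths, so adjust them for your cluster:

$ grep -i corrupt /var/log/cassandra/system.log /var/log/cassandra/debug.log | tail -20
$ # Any mc-<generation>-big-Data.db path that shows up here is your candidate for
$ # scrubbing or deleting in the options below.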
Online SSTable repair – Nodetool scrub
This command is easy to perform. It is also the option least likely to succeed. The steps are below, and they are also gathered into a single sketch after the list:
- Find out which SSTable is broken.
- Run nodetool scrub keyspace tablename.
- Run nodetool repair.
- Run nodetool listsnapshots.
- Run nodetool clearsnapshot -t <snapshot name> to remove the automatic pre-scrub snapshot once you are happy with the result.
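Gathered together, the whole online pass looks roughly like this. The individual commands and their output are walked through below; the snapshot name is whatever nodetool listsnapshots reports on your node:

$ nodetool scrub keyspace1 standard1           # online scrub; takes an automatic pre-scrub snapshot first
$ nodetool repair -full keyspace1 standard1    # pull back any rows the scrub had to throw away
$ nodetool listsnapshots                       # find the pre-scrub-<timestamp> snapshot name
$ nodetool clearsnapshot -t pre-scrub-1533897462847   # clean it up to reclaim disk space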
$ nodetool scrub keyspace1 standard1
After the scrub, we have fewer SSTable files and their names have all changed. There is also less space consumed and, very likely, some rows are missing.
Size (bytes) | File |
--- | --- |
2,257,479,515 | mc-68-big-Data.db |
342,529,995 | mc-70-big-Data.db |
3,223,224,985 | mc-71-big-Data.db |
83,234,470 | mc-72-big-Data.db |
4,417,919,455 | mc-73-big-Data.db |
204,178,145 | mc-75-big-Data.db |
374,007,490 | mc-76-big-Data.db |
2,697,986,445 | mc-77-big-Data.db |
1,194,479,930 | mc-80-big-Data.db |
$ nodetool repair -full
[2018-08-09 17:00:51,663] Starting repair command #2 (4c820390-9c17-11e8-8e8f-fbc0ff4d2cb8), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
[2018-08-09 18:14:09,799] Repair session 4cadf590-9c17-11e8-8e8f-fbc0ff4d2cb8 for range [(-1377105920845202291,…
[2018-08-09 18:14:10,130] Repair completed successfully
[2018-08-09 18:14:10,131] Repair command #2 finished in 1 hour 13 minutes 18 seconds

After the repair, we have almost twice as many SSTable files, with data pulled in from other nodes to replace the corrupted data lost by the scrub process.
Size (bytes) | File |
--- | --- |
2,257,479,515 | mc-68-big-Data.db |
342,529,995 | mc-70-big-Data.db |
3,223,224,985 | mc-71-big-Data.db |
83,234,470 | mc-72-big-Data.db |
4,417,919,455 | mc-73-big-Data.db |
204,178,145 | mc-75-big-Data.db |
374,007,490 | mc-76-big-Data.db |
2,697,986,445 | mc-77-big-Data.db |
1,194,479,930 | mc-80-big-Data.db |
1,209,518,945 | mc-88-big-Data.db |
193,896,835 | mc-89-big-Data.db |
170,061,285 | mc-91-big-Data.db |
63,427,680 | mc-93-big-Data.db |
733,830,580 | mc-95-big-Data.db |
1,747,015,110 | mc-96-big-Data.db |
16,715,886,480 | mc-98-big-Data.db |
49,167,805 | mc-99-big-Data.db |
$ nodetool listsnapshots
Snapshot Details:
Snapshot name            Keyspace name  Column family name  True size  Size on disk
pre-scrub-1533897462847  keyspace1      standard1           35.93 GiB  35.93 GiB

$ nodetool clearsnapshot -t pre-scrub-1533897462847
Requested clearing snapshot(s) for [all keyspaces] with snapshot name [pre-scrub-1533897462847]

If the repair still fails to complete, we get to try one of the other methods.
Offline SSTable repair utility – sstablescrub
This option is a bit more complex to do, but it often works when the online version won't. Warning: it is very slow. Steps:
- Bring the node down.
- Run the sstablescrub command.
- Start the node back up.
- Run nodetool repair on the table.
- Run nodetool clearsnapshot to remove the pre-scrub snapshot.
$ nodetool drain
$ pkill java
$ ps -ef | grep cassandra
root 18271 14813 0 20:39 pts/1 00:00:00 grep --color=auto cassandra

Then issue the sstablescrub command, with the -n option unless you have the patience of a saint. Without -n, every column in every row in every SSTable file is validated, single threaded. It will take forever. In preparing for this blog post, I forgot to use -n and found that it took 12 hours to scrub 500 megabytes of a 30 GB table. Not willing to wait 30 days for the scrub to complete, I stopped it and switched to the -n option, completing the scrub in only… hang on for this… 6 days. So maybe this isn't going to be useful in most real-world situations unless you have really small tables.
$ sstablescrub -n keyspace1 standard1
Pre-scrub sstables snapshotted into snapshot pre-scrub-1533861799166
Scrubbing BigTableReader(path='/home/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/mc-88-big-Data.db') (1.126GiB)…

Unfortunately, this took more time than I wanted to spend on this blog post. Once the table is scrubbed, you restart Cassandra, run the repair on the table, and then delete the pre-scrub snapshot.
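Because the offline scrub can run for days, it is worth starting it in a way that survives a dropped SSH session and lets you check on progress. A minimal sketch, assuming a package install where sstablescrub is already on the PATH; the log file name is just an example:

$ nodetool drain                      # flush memtables and stop accepting traffic
$ sudo systemctl stop cassandra       # make sure the node is really down before scrubbing
$ nohup sstablescrub -n keyspace1 standard1 > /tmp/scrub-standard1.log 2>&1 &
$ tail -f /tmp/scrub-standard1.log    # each SSTable is reported as it is scrubbed

Stopping the service with systemctl instead of pkill is just a matter of taste; the point is that Cassandra must not be running while sstablescrub works on its files.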
Delete the file and do a Nodetool repair – rm
This option works every time. It is no more difficult than the offline sstablescrub command, and its success rate is 100%. It's also usually much faster than the offline sstablescrub option: in my prep for this blog post, this approach took only two hours for my 30 GB table. The only drawback I can see is that, while the post-delete repair runs, there is an increased risk of consistency problems, especially if you are reading at consistency level ONE (CL=ONE), which should be a fairly uncommon use case. Steps:
- Stop the node.
- cd to the offending keyspace and sstable directory.
- If you know which SSTable file is bad (if you learned about the problem from stalled compactions, you will know), just delete it and its companion component files (see the sketch after this list). If not, delete all of the files in the directory.
- Restart the node.
- Run nodetool repair on the table.
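If you know which generation is broken, the delete in step 3 can be limited to that generation's component files rather than the whole directory. A sketch, using generation 63 from the earlier file listing purely as an example:

$ cd /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/
$ ls mc-63-big-*           # confirm the eight component files for that generation
$ sudo rm -f mc-63-big-*   # remove only that generation; backups/ and snapshots/ are untouched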
$ nodetool drain
$ pkill java
$ ps -ef | grep cassandra
root 18271 14813 0 20:39 pts/1 00:00:00 grep --color=auto cassandra
$ cd /var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc/
$ pwd
/var/lib/cassandra/data/keyspace1/standard1-afd416808c7311e8a0c96796602809bc

If you know which SSTable file you want to delete, you can delete just that one generation with rm -f *nnn*. If not, as in this case, you delete them all.
$ sudo rm -f *
rm: cannot remove 'backups': Is a directory
rm: cannot remove 'snapshots': Is a directory
$ ls
backups  snapshots
$ systemctl start cassandra
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.88  1.35 MiB   256     100.0%            c92d9374-cf3a-47f6-9bd1-81b827da0c1e  rack1
UN  192.168.1.90  41.72 GiB  256     100.0%            3c9e61ae-8741-4a74-9e89-cfa47768ac60  rack1
UN  192.168.1.92  30.87 GiB  256     100.0%            c36fecad-0f55-4945-a741-568f28a3cd8b  rack1
$ nodetool repair keyspace1 standard1 -full
[2018-08-10 11:23:22,454] Starting repair command #1 (51713c00-9cb1-11e8-ba61-01c8f56621df), repairing keyspace keyspace1 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [standard1], dataCenters: [], hosts: [], # of ranges: 768, pull repair: false)
[2018-08-10 13:02:36,097] Repair completed successfully
[2018-08-10 13:02:36,098] Repair command #1 finished in 1 hour 39 minutes 13 seconds

The SSTable file list now looks like this:
Size (bytes) | File |
--- | --- |
229,648,710 | mc-10-big-Data.db |
103,421,070 | mc-11-big-Data.db |
1,216,169,275 | mc-12-big-Data.db |
76,738,970 | mc-13-big-Data.db |
773,774,270 | mc-14-big-Data.db |
17,035,624,448 | mc-15-big-Data.db |
83,365,660 | mc-16-big-Data.db |
170,061,285 | mc-17-big-Data.db |
758,998,865 | mc-18-big-Data.db |
2,683,075,900 | mc-19-big-Data.db |
749,573,440 | mc-1-big-Data.db |
91,184,160 | mc-20-big-Data.db |
303,380,050 | mc-21-big-Data.db |
3,639,126,510 | mc-22-big-Data.db |
231,929,395 | mc-23-big-Data.db |
1,469,272,390 | mc-24-big-Data.db |
204,485,420 | mc-25-big-Data.db |
345,655,100 | mc-26-big-Data.db |
805,017,870 | mc-27-big-Data.db |
50,714,125 | mc-28-big-Data.db |
11,578,088,555 | mc-2-big-Data.db |
170,033,435 | mc-3-big-Data.db |
1,677,053,450 | mc-4-big-Data.db |
62,245,405 | mc-5-big-Data.db |
8,426,967,700 | mc-6-big-Data.db |
1,979,214,745 | mc-7-big-Data.db |
2,910,586,420 | mc-8-big-Data.db |
14,097,936,920 | mc-9-big-Data.db |
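Since stuck compactions are often how this problem shows up in the first place, it is worth a quick check that the node really is back to normal once the repair finishes. A small sketch of the checks I would run; nothing here is specific to this method, and it works after any of the four options:

$ nodetool compactionstats                          # pending compactions should be draining, not stuck
$ grep -ic corrupt /var/log/cassandra/system.log    # count of corruption errors; it should stop growing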
Bootstrap the node
If you are reading at consistency level (CL) ONE, or you are really concerned about consistency overall, use this approach instead of the rm -f approach. It ensures that the node with missing data will not participate in any reads until all of its data has been restored. Depending on how much data the node has to recover, it will often take longer than the other approaches; although, since bootstrapping can stream in parallel, it may not. Steps (a sketch for monitoring the streaming follows the list):
- Shut down the node.
- Remove all of the files under the Cassandra data directory, usually /var/lib/cassandra.
- Modify /etc/cassandra/conf/cassandra-env.sh to add the replace_address option (shown below).
- Start Cassandra. When the server starts with no files, it will connect to one of its seeds, recreate the schema, and request that the other nodes stream data to it to replace the data it has lost. It will not select new token ranges unless you try to restart it with a different IP address than it had before.
- Modify the /etc/cassandra/conf/cassandra-env.sh file to remove the change made in Step 3.
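While the replacement node is streaming in step 4, you can watch its progress from the node itself. A sketch; filtering out the lines that already show 100% is just a convenient habit, not an official part of the procedure:

$ # Active streaming sessions, hiding files that have already completed
$ watch -n 30 'nodetool netstats | grep -v "100%"'
$ # Once streaming finishes, the node should show UN in nodetool status
$ nodetool status | grep 192.168.1.88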
$ nodetool drain
$ sudo pkill java
$ ps -ef | grep java
$ vi /etc/cassandra/conf/cassandra-env.sh

Add this line at the end of the file:

JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=192.168.1.88"
$ systemctl start cassandra
Wait for the node to join the cluster. During the bootstrap, we see messages like this in the log:
INFO [main] 2018-08-10 13:39:06,780 StreamResultFuture.java:90 - [Stream #47b382f0-9cc4-11e8-a010-51948a7598a1] Executing streaming plan for Bootstrap
INFO [StreamConnectionEstablisher:1] 2018-08-10 13:39:06,784 StreamSession.java:266 - [Stream #47b382f0-9cc4-11e8-a010-51948a7598a1] Starting streaming to /192.168.1.90

Later on we see:
INFO [main] 2018-08-10 14:18:16,133 StorageService.java:1449 - JOINING: Finish joining ring
INFO [main] 2018-08-10 14:18:16,482 SecondaryIndexManager.java:509 - Executing pre-join post-bootstrap tasks for: CFS(Keyspace='keyspace1', ColumnFamily='standard1')
INFO [main] 2018-08-10 14:18:16,484 SecondaryIndexManager.java:509 - Executing pre-join post-bootstrap tasks for: CFS(Keyspace='keyspace1', ColumnFamily='counter1')
INFO [main] 2018-08-10 14:18:16,897 StorageService.java:2292 - Node /192.168.1.88 state jump to NORMAL
WARN [main] 2018-08-10 14:18:16,899 StorageService.java:2324 - Not updating token metadata for /192.168.1.88 because I am replacing it

When we do a nodetool status, we see:
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.1.88  30.87 GiB  256     100.0%            c92d9374-cf3a-47f6-9bd1-81b827da0c1e  rack1
UN  192.168.1.90  41.72 GiB  256     100.0%            3c9e61ae-8741-4a74-9e89-cfa47768ac60  rack1
UN  192.168.1.92  30.87 GiB  256     100.0%            c36fecad-0f55-4945-a741-568f28a3cd8b  rack1

The node is up and running in less than one hour, quicker than any of the other options. Makes you think about your choices, doesn't it? Note that if you have a keyspace with RF=1, then options 3 and 4 are not viable: you will lose data. Although with RF=1 and a corrupted SSTable file, you are going to lose some data anyway.

A last look at the list of SSTable files shows this:
Size (bytes) | File |
--- | --- |
773,774,270 | mc-10-big-Data.db |
17,148,617,040 | mc-11-big-Data.db |
749,573,440 | mc-1-big-Data.db |
170,033,435 | mc-2-big-Data.db |
1,677,053,450 | mc-3-big-Data.db |
62,245,405 | mc-4-big-Data.db |
8,426,967,700 | mc-5-big-Data.db |
229,648,710 | mc-6-big-Data.db |
103,421,070 | mc-7-big-Data.db |
1,216,169,275 | mc-8-big-Data.db |
76,738,970 | mc-9-big-Data.db |
Conclusion
If you run into corrupted SSTable files, don't panic. It won't have any impact on your operations in the short term unless you are using RF=ONE or reading at CL=ONE. Find out which node has the broken SSTable file. Then, because it's the easiest and lowest-risk option, try the online nodetool scrub command. If that does not work, you have three choices. The offline scrub works, but is usually too slow to be useful. Rebuilding the whole node seems like overkill, but it will work, and it will maintain consistency on reads. If you have a lot of data and you want to solve the problem fairly quickly, just remove the offending SSTable file and do a repair.

All approaches have an impact on the other nodes in the cluster. The first three require a repair, which computes Merkle trees and streams data to the node being fixed. The amount of data to be streamed is largest with the delete, but the total recovery time was still lower in my example; that may not always be the case. In the bootstrap example, the total time was very similar to the delete case because my test case had only one large table. If there were several large tables, the delete approach would have been the fastest way to get the node back to normal.

Approach | Scrub phase | Repair phase | Total recovery time (hh:mm) |
--- | --- | --- | --- |
Online Scrub | 1:06 | 1:36 | 2:42 |
*Offline Scrub | 144:35 | 1:37 | 146:22 |
Delete files | 0:05 | 1:36 | 1:41 |
Bootstrap | 0:05 | 1:45 | 1:50 |

*The offline scrub phase is dominated by the multi-day sstablescrub run described above.
For reference, a command like this, which overwrites the first bytes of a Data.db file in place, is how an SSTable can be deliberately corrupted for testing. Obviously, never do this on a cluster you care about.

$ printf '\x31\xc0\xc3' | dd of=mc-8-big-Data.db bs=1 seek=0 count=100 conv=notrunc