Problem:
The client has a two-datacenter (DC1 and DR1) Cassandra cluster. They encountered a failure while running nodetool repair on a node in DC1, which was traced to data corruption on a node in DR1. The logs indicated a corruption error in a specific SSTable file.
Solution:
Step 1. Initial Diagnosis:
Ran nodetool repair on the DC1 nodes; it succeeded on all nodes except one. The failing repair was traced to a corruption error reported in the logs of a DR1 node.
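For reference, a diagnosis along these lines might look as follows. The keyspace name and log path are illustrative placeholders, not details from the client's environment:

    # Run a repair on each DC1 node and note which one fails
    nodetool repair my_keyspace

    # On the DR1 node implicated by the failure, search the Cassandra
    # system log for the corrupted SSTable named in the error
    grep -i "CorruptSSTableException" /var/log/cassandra/system.log

The exception message normally includes the full path of the affected SSTable file, which identifies the keyspace, table, and generation to target in the next steps.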
Step 2. Immediate Actions:
Advised running nodetool scrub and nodetool repair on all running nodes. If repair still fails, proceed with the following on the affected node (a command sketch follows this list):
- Stopping the affected node.
- Running sstablescrub offline on the corrupted keyspace and table.
- Changing ownership of the resulting files back to the Cassandra user.
- Removing the corrupted SSTable files identified in the logs.
- Restarting Cassandra.
- Running nodetool repair again on the affected node and then on all nodes.
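A sketch of this sequence on the affected node is shown below. The keyspace, table, data-directory paths, SSTable generation, and service name are placeholders and will differ per installation:

    # 1. Stop the affected node (service name varies by install)
    sudo systemctl stop cassandra

    # 2. Scrub the corrupted keyspace/table offline
    sstablescrub my_keyspace my_table

    # 3. If scrub ran as root, restore ownership to the Cassandra user
    sudo chown -R cassandra:cassandra /var/lib/cassandra/data/my_keyspace

    # 4. Remove all components of the SSTable named in the corruption
    #    error (generation 1234 here is a placeholder from the log)
    rm /var/lib/cassandra/data/my_keyspace/my_table-*/mc-1234-big-*

    # 5. Restart Cassandra
    sudo systemctl start cassandra

    # 6. Repair the affected node, then the rest of the cluster
    nodetool repair my_keyspace

Removing the SSTable is safe only because repair afterwards re-streams the missing data from replicas in the other datacenter; it should not be done on data that exists at replication factor 1.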
Step 3. Encountering Issues with sstablescrub:
The client received errors during sstablescrub pointing to specific rows in the SSTable. Notably, no new (scrubbed) SSTable files were generated as expected; only temporary files (tmp-*) were left behind.
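To confirm this state, the leftover temporaries can be listed; the data path below is a placeholder:

    # List any temporary SSTables left behind by the failed scrub
    find /var/lib/cassandra/data/my_keyspace -name 'tmp-*'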
Step 4. Final:
Advised moving the corrupted SSTable files out of the data directory to a safe location, then starting Cassandra and running nodetool repair again on the affected table on that node.
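A minimal sketch of this final step, again with placeholder paths, file pattern, and service name; moving the files aside (rather than deleting) preserves them for later inspection:

    # With Cassandra still stopped, move the corrupted SSTable
    # components aside rather than deleting them
    sudo mkdir -p /backup/corrupted-sstables
    sudo mv /var/lib/cassandra/data/my_keyspace/my_table-*/mc-1234-big-* /backup/corrupted-sstables/

    # Start the node and repair just the affected table
    sudo systemctl start cassandra
    nodetool repair my_keyspace my_table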
Conclusion:
The process involved identifying the source of the corruption, running a series of commands to scrub and repair the SSTables, and handling the remaining issues by moving the corrupted files aside and re-running repairs. This systematic approach addressed the corruption effectively while maintaining data integrity, since repair restored the removed data from healthy replicas.