Problem:

The client reported that Cassandra 2.2.5, running in a single-node configuration, had crashed and would not start. The error logs showed a “CorruptSSTableException” for the “system_traces.events” table, indicating that one of its SSTable files was corrupted. Because the table belongs to a system keyspace, the client needed a safe way to bypass the corrupted data and restart the Cassandra service.

Process:

Step 1 – Initial Assessment:

Upon receiving the error, the client provided the system log file and requested detailed instructions on how to resolve the issue without needing to recover the data in the corrupted table.

The error logs indicated corruption in the following file: “/users/gen/ogwwrk1/cassandra/data/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/la-7-big-Data.db”. The client did not require the data in this table and wanted to drop/recreate it after bypassing the corruption.
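As a minimal sketch of how the path in the log can be read, the snippet below splits it into keyspace, table, and SSTable identifiers. It assumes the usual Cassandra 2.2 on-disk layout (`<keyspace>/<table>-<table id>/<version>-<generation>-<format>-<component>`); the function name `describe_sstable` is hypothetical and not part of any Cassandra tooling.

```python
from pathlib import PurePosixPath


def describe_sstable(path_str):
    """Split an SSTable file path into its parts (assumes the 2.2-style layout)."""
    path = PurePosixPath(path_str)
    # The parent directory is "<table>-<table id>", and its parent is the keyspace.
    table, _, table_id = path.parent.name.rpartition("-")
    keyspace = path.parent.parent.name
    # The file name is "<version>-<generation>-<format>-<component>".
    version, generation, fmt, component = path.name.split("-", 3)
    return {
        "keyspace": keyspace,
        "table": table,
        "table_id": table_id,
        "generation": int(generation),
        "component": component,
        # All component files of the same SSTable share this prefix.
        "prefix": f"{version}-{generation}-{fmt}-",
    }


info = describe_sstable(
    "/users/gen/ogwwrk1/cassandra/data/data/system_traces/"
    "events-8826e8e9e16a372887533bc1fc713c25/la-7-big-Data.db"
)
print(info["keyspace"], info["table"], info["prefix"])
```

The shared prefix is useful because an SSTable normally consists of several component files (data, index, filter, and so on), not only the “Data.db” file named in the exception.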

Step 2 – Proposed Solution:

Our expert suggested moving the corrupted SSTable files to a location outside Cassandra’s data directory, so that Cassandra would no longer try to read them at startup. Once the service was running, the expert recommended using “cqlsh” to drop and recreate the “system_traces.events” table.
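The move itself could be scripted roughly as follows. This is a sketch under stated assumptions, not the procedure the client actually ran: it assumes Cassandra is stopped, that every component file sharing the corrupted file’s prefix (for example “la-7-big-”) should be moved together, and that the quarantine directory is outside the data directory. The function name `quarantine_sstable` and its `dry_run` parameter are hypothetical.

```python
import shutil
from pathlib import Path


def quarantine_sstable(data_file, quarantine_dir, dry_run=True):
    """Move all component files of one SSTable out of the data directory.

    Run only while Cassandra is stopped. With dry_run=True nothing is
    moved; the function only returns the destinations it would use.
    """
    data_file = Path(data_file)
    target = Path(quarantine_dir)
    # "la-7-big-Data.db" -> "la-7-big-", the prefix shared by all components.
    prefix = data_file.name.rsplit("-", 1)[0] + "-"
    destinations = []
    for component in sorted(data_file.parent.glob(prefix + "*")):
        if not component.is_file():
            continue
        dest = target / component.name
        if not dry_run:
            target.mkdir(parents=True, exist_ok=True)
            shutil.move(str(component), str(dest))
        destinations.append(dest)
    return destinations


# Example call (the quarantine path is a placeholder, not from the case):
# quarantine_sstable(
#     "/users/gen/ogwwrk1/cassandra/data/data/system_traces/"
#     "events-8826e8e9e16a372887533bc1fc713c25/la-7-big-Data.db",
#     "/path/outside/cassandra/quarantine",
#     dry_run=True,
# )
```

Defaulting to a dry run is a deliberate choice here: it lets the operator review the list of files before anything leaves the data directory, and moving rather than deleting keeps the files available for later inspection.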

Step 3 – Client Concerns and Additional Validation:

The client expressed concerns about the safety of this approach, particularly given that the corrupted table was a system table. They requested confirmation that the service would start without the corrupted file and asked for the expert to review the system log file for additional issues.

After reviewing the logs, the expert confirmed that, while the situation was not entirely safe due to the corruption, moving the files out of Cassandra’s data directory was the best course of action given the circumstances. The expert also identified unrelated errors involving a third-party plugin (“cassandra-lucene-index-plugin”) that might be contributing to system instability.

Step 4 – Implementation:

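The client moved the corrupted SSTable files to another directory outside Cassandra’s data directory. Upon restarting Cassandra, the service came up successfully, and the “system_traces.events” table was available again without a manual drop and recreate. This is likely because it is a system table that Cassandra sets up itself, and because moving SSTable files removes only the table’s data, not its schema definition.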

Step 5 – Root Cause Investigation:

The client asked what might have caused the corruption and how to prevent it in the future. The expert explained that corruption can stem from several factors, including hard disk problems (e.g., bad sectors), JVM crashes, or abrupt system reboots. The expert recommended upgrading Cassandra to a more recent version and avoiding third-party plugins such as the Lucene index plugin, which could introduce further instability.

Solution:

The client successfully resolved the “CorruptSSTableException” by moving the corrupted SSTable files out of Cassandra’s data directory, allowing the service to restart and automatically recreate the necessary system table. The expert advised upgrading Cassandra and avoiding third-party plugins to reduce the risk of future corruption.

Conclusion:

By moving the corrupted SSTables aside under expert guidance, the client was able to restart their Cassandra service without having to recover the affected data. The recommendations to upgrade Cassandra and review system stability practices should help reduce the likelihood of similar incidents in the future.