Problem:
The client reported encountering memory-related issues and tombstone cell messages in the system.log file of their Apache Cassandra deployment. Notable log entries included warnings about maximum memory usage, tombstone cells, and concerns about a hanging repair process. The issue seemed to have improved temporarily after increasing the heap size but resurfaced after two days.
Process:
Analysis pointed to a large number of tombstones combined with a relatively small heap as the likely cause. The recommended actions were as follows:
- Share the content of the Cassandra config folder.
- Perform clean-up operations on each node: nodetool cleanup, nodetool compact, and nodetool repair.
- Ensure the execution of these commands is sequential and not in parallel.
- Provide the cassandra-env.sh and cassandra.yaml files for further examination.
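The sequential run described in the list above can be sketched as a small shell function. This is a minimal sketch under stated assumptions, not the procedure the client actually used: the host names and the NODETOOL variable are hypothetical, and it assumes nodetool can reach each node with -h. It visits one node at a time and stops at the first failing step, so no two operations overlap.

```shell
#!/usr/bin/env bash
# Sketch: run cleanup, compact, and repair on one node at a time.
# NODETOOL is a hypothetical override for the nodetool binary path.
NODETOOL="${NODETOOL:-nodetool}"

run_maintenance() {
  # Arguments: one or more host names, processed strictly in order.
  local host step
  for host in "$@"; do
    for step in cleanup compact repair; do
      echo "[$host] nodetool $step"
      # Each call blocks until the operation returns; stop at the first
      # failure so the next step never starts after an incomplete one.
      "$NODETOOL" -h "$host" "$step" || {
        echo "[$host] nodetool $step failed" >&2
        return 1
      }
    done
  done
}

# Example with hypothetical host names:
# run_maintenance cass-node1 cass-node2 cass-node3
```

Keeping the node loop on the outside means a node finishes all three operations before the next node is touched, which matches the advice to avoid parallel execution.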
Solution:
- Identify and address delete queries generating tombstones.
- Execute non-destructive operations on each node sequentially: nodetool cleanup, nodetool compact, and nodetool repair.
- Increase the heap size to 20GB to accommodate the large volume of tombstones and fragmented data.
- Implement major compaction on all Cassandra nodes using nodetool compact.
- Follow up with nodetool cleanup to delete unused replicas and nodetool repair to rebuild missing data.
- Consider changing garbage collection (GC) settings to G1 if using a heap size greater than 16GB.
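Before changing the heap size or the collector, it can help to confirm what each node currently uses. The sketch below is an illustration, not part of the original case: show_jvm_settings is a hypothetical helper name, and it assumes the JVM flags live in cassandra-env.sh or in a jvm*.options file inside the config folder, which varies by Cassandra version.

```shell
#!/usr/bin/env bash
# Sketch: list the active (uncommented) heap and GC settings found in a
# Cassandra config folder. The file names are assumptions about the layout.
show_jvm_settings() {
  # $1: path to the Cassandra config folder
  local conf_dir="$1" f
  for f in "$conf_dir"/cassandra-env.sh "$conf_dir"/jvm*.options; do
    [ -f "$f" ] || continue
    echo "== $f"
    # Drop comment lines, then keep heap sizes and collector selections.
    grep -vE '^[[:space:]]*#' "$f" \
      | grep -E -e '-Xm[sxn]' \
                -e '-XX:\+Use[A-Za-z0-9]*GC' \
                -e '(MAX_HEAP_SIZE|HEAP_NEWSIZE)=' \
      || true
  done
  return 0
}

# Example with a hypothetical path:
# show_jvm_settings /etc/cassandra
```

If the output shows CMS flags such as -XX:+UseConcMarkSweepGC, they would typically need to be removed or commented out when -XX:+UseG1GC is enabled, since the JVM refuses to start with conflicting collector selections.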
Conclusion:
The recommended actions aimed at optimizing Cassandra storage, removing tombstones, and addressing memory-related concerns. Regularly running cleanup, compaction, and repair operations, along with increasing the heap size, were suggested to ensure the stability and performance of the Cassandra cluster. The client was advised to monitor the system after implementing these changes and to consider further adjustments based on future observations.
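One simple way to monitor the system after the changes is to track how many tombstone warnings appear in system.log over time. The following is a minimal sketch, assuming the warnings are logged at WARN level and contain the word "tombstone"; the exact message wording differs between Cassandra versions, and count_tombstone_warnings is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count WARN-level log lines that mention tombstones.
count_tombstone_warnings() {
  # $1: path to system.log (the location depends on the installation)
  # grep -c exits non-zero when the count is 0, so ignore its status.
  grep 'WARN' "$1" | grep -ci 'tombstone' || true
}

# Example with a hypothetical path:
# count_tombstone_warnings /var/log/cassandra/system.log
```

Comparing the count before and after the maintenance run gives a rough indication of whether the tombstone load is actually decreasing.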
By taking these steps, the client can expect a more stable and optimized Apache Cassandra deployment, reducing the likelihood of memory-related issues and improving overall system performance.