Database in the Cassandra cluster generates a large number of commitlogs

Problem:

In the Cassandra cluster, the database generated a large number of commit logs and did not delete them. As a result, the commit log filesystem filled up and the database crashed. The issue affected all nodes.

Process:

Step 1: Initial Investigation and Information Gathering from the Client

Initial troubleshooting and information gathering from the client includes the following points:

Hardware Specifications: Detail the hardware specifications of each Cassandra node.
Disk Space Check: Execute “df -h” command on all nodes to check disk space and share results.
System Logs Check: Run “tail -n 100 /path/to/cassandra/logs/system.log” on each node and share results. Provide a zip of the Cassandra config folder.
Commitlog Size and Write Activity: Determine the size of the commitlog directory and assess write activity on the database in terms of size and frequency.
Configuration File: Provide configuration files from all nodes or confirm if configurations are identical across all nodes.
Performance Metrics: Gather performance metrics such as memory usage for all Cassandra nodes.
Log Purge Process: Investigate if any log purge process is currently running on the system.
Memtable Cleanup Threshold: Verify and calculate memtable_cleanup_threshold values, ensuring alignment with recommended settings.
Java Version: Confirm the Java version used by Cassandra nodes and ensure it’s a 64-bit installation to avoid potential issues.
Additional Insights: Share relevant Stack Overflow threads and Jira issues, such as those related to growing commit logs for further insights.
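
The disk space and commitlog size checks in the list above can also be scripted so that every node reports the same numbers. The following is a minimal sketch, not a definitive tool: the function names are hypothetical, and the path `/var/lib/cassandra/commitlog` is an assumption that should be replaced with the `commitlog_directory` value from each node's cassandra.yaml.

```python
import os
import shutil


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # A file may disappear while the scan is running; skip it.
                continue
    return total


def commitlog_report(commitlog_dir):
    """Return the commitlog size and the usage of the filesystem holding it."""
    usage = shutil.disk_usage(commitlog_dir)
    size = directory_size_bytes(commitlog_dir)
    return {
        "commitlog_bytes": size,
        "filesystem_total_bytes": usage.total,
        "filesystem_free_bytes": usage.free,
        "commitlog_share_of_filesystem": size / usage.total,
    }


if __name__ == "__main__":
    # Assumed path: adjust to the commitlog_directory set in cassandra.yaml.
    COMMITLOG_DIR = "/var/lib/cassandra/commitlog"
    if os.path.isdir(COMMITLOG_DIR):
        for key, value in commitlog_report(COMMITLOG_DIR).items():
            print(f"{key}: {value}")
    else:
        print(f"{COMMITLOG_DIR} not found; point COMMITLOG_DIR at the node's commitlog directory")
```

Running this periodically on each node and comparing the results over time gives a rough picture of how quickly the commitlog directory grows relative to the free space left on its filesystem.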

Step 2: Deeper Investigation

For further investigation, we met with our experts to cover the main reasons why commit logs fill up frequently and to look for a workaround. We analyzed the commit logs to understand why hints were not getting cleaned up, and decided to request the files from the client to analyze the issue further.

Solution:

After the investigation, our experts suggested the following solutions:

Throttle down the write load: An easy way to do this is to put an event streamer or message broker (such as Pulsar or Kafka) between the writing application and the cluster, then build a consumer that reads the messages and writes them into Cassandra.
Adjust memtable_flush_writers: Initially, we can try increasing the memtable_flush_writers parameter to a value no greater than the number of cores (8). This lowers the memtable_cleanup_threshold ratio. As the comment in cassandra.yaml explains, a smaller memtable_cleanup_threshold (mct) means smaller flushes and hence more compaction, so data is flushed more frequently.

The formula is: memtable_cleanup_threshold = 1 / (memtable_flush_writers + 1).

We can try setting memtable_flush_writers = 6 and observe the performance. If things don’t improve, try memtable_flush_writers = 8 and observe again for some time. Keep track of the CPU usage of all nodes as well.
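
To see what the two candidate settings mean for the threshold, the formula above can be evaluated directly. This is a small worked sketch; the function name is hypothetical and it only reproduces the default calculation stated above.

```python
def memtable_cleanup_threshold(memtable_flush_writers):
    """Default threshold per the formula above: 1 / (memtable_flush_writers + 1)."""
    if memtable_flush_writers < 1:
        raise ValueError("memtable_flush_writers must be at least 1")
    return 1 / (memtable_flush_writers + 1)


if __name__ == "__main__":
    for writers in (6, 8):
        threshold = memtable_cleanup_threshold(writers)
        print(f"memtable_flush_writers={writers} -> memtable_cleanup_threshold={threshold:.3f}")
```

With 6 flush writers the threshold is 1/7, roughly 0.143; with 8 it drops to 1/9, roughly 0.111.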

Adjust memtable_flush_writers in the cassandra.yaml file (requires a restart of all nodes).
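
The write-throttling suggestion can be illustrated with a paced consumer. The sketch below is an assumption-heavy illustration rather than the client's actual setup: a standard-library `queue.Queue` stands in for the Pulsar or Kafka topic, `write_row` is a hypothetical callable standing in for the Cassandra write, and `drain_at_fixed_rate` is a name chosen here for illustration. A real consumer would use the broker's client library and a Cassandra driver, neither of which is shown.

```python
import queue
import time


def drain_at_fixed_rate(messages, write_row, max_writes_per_second,
                        clock=time.monotonic, sleep=time.sleep):
    """Take messages off the queue and pass each to write_row, spacing the
    writes so that at most max_writes_per_second are issued."""
    min_interval = 1.0 / max_writes_per_second
    next_allowed = clock()
    written = 0
    while True:
        try:
            message = messages.get_nowait()
        except queue.Empty:
            break  # a real consumer would keep polling the broker instead
        now = clock()
        if now < next_allowed:
            # Too early for the next write: wait out the rest of the interval.
            sleep(next_allowed - now)
        write_row(message)
        written += 1
        next_allowed = max(next_allowed, clock()) + min_interval
    return written


if __name__ == "__main__":
    backlog = queue.Queue()
    for i in range(3):
        backlog.put({"id": i})
    # print stands in for the hypothetical Cassandra write.
    count = drain_at_fixed_rate(backlog, print, max_writes_per_second=100)
    print(f"wrote {count} rows")
```

The point of the design is that bursts from the writing application accumulate in the broker instead of in Cassandra's memtables and commit log, while the consumer's rate cap bounds how fast the cluster has to absorb them.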

Conclusion:

The Cassandra cluster faced issues with accumulating commit logs, causing filesystem congestion and database crashes across all nodes.

The initial investigation covered gathering hardware specifications, checking disk space and system logs, assessing commit log size, verifying configuration files and performance metrics, and confirming the Java version. The deeper investigation involved discussions with our experts to address the commit log filling and hint cleanup issues, and a request for files from the client for further analysis.

The proposed solutions are to throttle the write load by introducing an intermediary message broker, such as Pulsar or Kafka, and to adjust the memtable_flush_writers parameter in the cassandra.yaml file to manage flushes and compaction effectively. Because changing memtable_flush_writers requires restarting all nodes, testing and monitoring are recommended before settling on a final value.