Problem:

The customer experienced issues with their Cassandra database, specifically with write failures and slow performance during nodetool repair operations. These issues were affecting the application’s ability to interact with the database, resulting in delays and failure to write data. The Cassandra cluster, consisting of 3 nodes in each of two data centers (US East and US West), was running version 4.1.5, with a replication factor of 3 for most keyspaces. Despite attempts to address the problem by increasing resources on the virtual machines hosting Cassandra, the performance issues persisted.
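For context, a keyspace replicated three ways in each of two data centers is normally defined with NetworkTopologyStrategy. The sketch below, written against the DataStax Python driver, shows what such a definition typically looks like; the contact point, keyspace name, and data center names are illustrative rather than taken from the customer’s environment.

    from cassandra.cluster import Cluster

    # Contact point, keyspace name, and data center names are placeholders.
    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect()

    # One keyspace spanning both data centers with 3 replicas in each,
    # matching the replication factor described above.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS app_data
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us_east': 3,
            'us_west': 3
        }
    """)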

Process:

Step 1: Initial Investigation

Log Analysis: The expert reviewed the logs provided by the customer, which contained the following warning:

 "Temporary storage exception while acquiring id block - retrying in PT0.6S" 

This is a warning from the JanusGraph ID management system: acquiring a new ID block from the storage backend (Cassandra) took longer than the configured wait threshold of 0.3 seconds, pointing to slow backend writes. The "PT0.6S" in the message is an ISO-8601 duration, meaning the allocation would be retried after 0.6 seconds.

Step 2: System Resource Monitoring

The expert advised the customer to monitor system resource usage, including CPU, memory, JVM heap usage, and garbage collection activity, to identify any resource bottlenecks. The team increased the heap size from the default to 16 GB and reviewed the Cassandra nodes’ resource allocation, but the performance issues persisted even with the additional resources.
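One lightweight way to gather these numbers over time is to sample standard nodetool subcommands on each node. The sketch below is a minimal example of that approach, assuming Python and nodetool are available on the node; it is illustrative only and not the customer’s actual monitoring setup.

    import subprocess
    import time

    # Standard nodetool subcommands for heap usage, GC pauses, and
    # thread-pool backlog.
    COMMANDS = [
        ["nodetool", "info"],      # heap usage, uptime, load
        ["nodetool", "gcstats"],   # GC pause statistics since the last call
        ["nodetool", "tpstats"],   # pending/blocked thread-pool tasks
    ]

    def sample_once():
        for cmd in COMMANDS:
            result = subprocess.run(cmd, capture_output=True, text=True)
            print(f"--- {' '.join(cmd)} ---")
            print(result.stdout)

    if __name__ == "__main__":
        # Sample every 60 seconds; adjust the interval as needed.
        while True:
            sample_once()
            time.sleep(60)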

Step 3: Performance Under Load

As part of the testing phase, the team issued a series of data requests to test Cassandra’s performance under load. Although some requests completed successfully, performance remained inconsistent, especially under high load.
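A load test along these lines can be driven from the DataStax Python driver by issuing asynchronous writes and counting failures. The sketch below assumes a hypothetical app_data keyspace with a load_test(id uuid PRIMARY KEY, payload text) table; the names, schema, and request count are illustrative.

    import uuid

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Placeholders: contact point, keyspace, and a load_test table created
    # for the exercise.
    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("app_data")

    insert = SimpleStatement(
        "INSERT INTO load_test (id, payload) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,  # the level in use at the time
    )

    # Fire a batch of asynchronous writes, then wait for each one and count
    # the requests that fail or time out.
    futures = [
        session.execute_async(insert, (uuid.uuid4(), "x" * 1024))
        for _ in range(5000)
    ]
    errors = 0
    for future in futures:
        try:
            future.result()
        except Exception:
            errors += 1
    print(f"{errors} failed writes out of {len(futures)}")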

Step 4: Replication Factor and Configuration Adjustment

After further discussions, the expert recommended reducing the replication factor from 3 to 2 for some keyspaces to ease the load on the system. Additionally, the team considered running a manual (major) compaction to consolidate SSTables and reduce fragmentation, in the hope of improving write performance.
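Applied together, those two recommendations would look roughly like the sketch below: an ALTER KEYSPACE to drop the per-data-center replication factor to 2, followed by nodetool cleanup and a manual compaction. The keyspace and data center names are placeholders, and the nodetool steps must be run on every Cassandra node rather than from a single client.

    import subprocess

    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect()

    # Drop the replication factor from 3 to 2 in each data center for a
    # placeholder keyspace; real keyspace and data center names will differ.
    session.execute("""
        ALTER KEYSPACE app_data
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us_east': 2,
            'us_west': 2
        }
    """)

    # On each node: cleanup removes the replicas the node no longer owns after
    # the replication factor was lowered, and a manual (major) compaction
    # consolidates the keyspace's SSTables.
    subprocess.run(["nodetool", "cleanup", "app_data"], check=True)
    subprocess.run(["nodetool", "compact", "app_data"], check=True)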

Step 5: Review of Cassandra Configuration

Further investigation suggested that the consistency level used for database operations might be contributing to the problem. The team was advised to try a data-center-local consistency level, such as LOCAL_QUORUM, for write operations: because it only requires acknowledgement from a quorum of replicas in the coordinator’s own data center, it avoids cross-data-center round trips on each request and the timeouts they can cause.
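With the DataStax Python driver, the consistency level can be overridden per statement, which allows LOCAL_QUORUM to be trialled on writes without touching the rest of the application. The snippet below is a minimal sketch using the same hypothetical keyspace and table as above.

    import uuid

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("app_data")

    # LOCAL_QUORUM on a single write: the coordinator waits for a quorum of
    # replicas in its own data center only, not a quorum spanning both
    # data centers.
    write = SimpleStatement(
        "INSERT INTO load_test (id, payload) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    session.execute(write, (uuid.uuid4(), "example payload"))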

Solution:

The customer implemented the recommendation to switch the consistency level from QUORUM to LOCAL_QUORUM, which alleviated the issues. With LOCAL_QUORUM, the coordinator waits only for acknowledgements from a quorum of replicas in its local data center; data is still replicated to the remote data center in the background, but the client no longer blocks on that cross-data-center communication and so avoids the associated delays.
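The exact change depends on how the application talks to Cassandra, which is not recorded here. As one hedged illustration, the DataStax Python driver lets LOCAL_QUORUM be set as the session-wide default through an execution profile, paired with a data-center-aware load-balancing policy; if the writes instead flow through JanusGraph, the analogous settings are its storage.cql.write-consistency-level and storage.cql.read-consistency-level options. The data center name below is a placeholder.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # Default execution profile: route requests to coordinators in the local
    # data center and require only a local quorum of acknowledgements.
    # The data center name is a placeholder.
    profile = ExecutionProfile(
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc="us_east")
        ),
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )

    cluster = Cluster(["10.0.0.1"], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect("app_data")

    # Every statement executed through this session now defaults to LOCAL_QUORUM;
    # replication to the remote data center still happens in the background,
    # but the client no longer waits for it.

Keeping the load-balancing policy pinned to the local data center matters here: LOCAL_QUORUM is evaluated relative to the coordinator’s data center, so requests must be routed to local coordinators for the setting to have the intended effect.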

Once the change was made, the application showed significant improvement, and the Cassandra cluster’s write and repair operations performed better under load. Slow nodetool repair runs and write failures decreased markedly, and the application was able to function normally again.

Conclusion:

This case study highlights the importance of tuning the consistency level and resource configuration in Cassandra to meet the performance needs of an application. In this case, switching to LOCAL_QUORUM for write operations significantly reduced cross-data-center coordination load on the Cassandra cluster and improved the application’s performance. The expert also recommended monitoring the health of the Cassandra nodes periodically and considering further optimizations, such as compaction and, for smaller clusters, a reduced replication factor.

We appreciate the customer’s proactive approach and willingness to adjust configurations to optimize the database performance. By implementing the recommended changes, the client was able to resolve the issues and move forward with their operations successfully.