Problem:
The client was experiencing performance issues with a self-managed Cassandra database cluster hosted on Azure VMs. A recent surge in data traffic led to high CPU utilization, causing significant system slowness and increased latency.
The environment utilized SSDs for storage, but earlier attempts at recommended SSD optimizations yielded no significant improvement. In light of these challenges, the client sought recommendations on whether to increase VM sizes, expand the number of nodes in the cluster, or pursue other compute optimizations to address the high CPU spikes and degraded performance.
Process:
Step 1 – Initial Assessment
The client’s environment included a six-node Cassandra cluster running version 4.1.6, distributed across two data centers (three nodes per data center) with a replication factor (RF) of 1. To investigate the performance issues, the client shared detailed Azure monitoring data, output from nodetool cfstats and nodetool cfhistograms, and the Cassandra YAML configuration file. Together, these resources provided a comprehensive view of system bottlenecks and highlighted the areas requiring optimization.
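The kind of triage described above can be partly scripted. A minimal sketch, assuming percentile output in the style of nodetool cfhistograms / tablehistograms (the sample text and the 100 MB threshold below are illustrative, not the client's real data, and exact column layout varies by Cassandra version):

```python
# Flag oversized partitions from `nodetool tablehistograms`-style output.
# SAMPLE and the threshold are hypothetical values for illustration.

SAMPLE = """\
Percentile      Read Latency    Write Latency   SSTables    Partition Size    Cell Count
                (micros)        (micros)                    (bytes)
50%             454.83          88.15           2.00        1109              36
95%             4055.27         182.79          5.00        25109160          1109
99%             17436.92        219.34          8.00        268650950         12108
Max             29521398.00     263.21          10.00       802187438         129557
"""

LARGE_PARTITION_BYTES = 100 * 1024 * 1024  # 100 MB warning threshold (assumption)

def max_partition_bytes(histogram_text: str) -> int:
    """Return the 'Max' partition size in bytes from histogram-style output."""
    for line in histogram_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "Max":
            return int(float(parts[4]))  # 5th column: Partition Size (bytes)
    raise ValueError("no Max row found")

size = max_partition_bytes(SAMPLE)
if size > LARGE_PARTITION_BYTES:
    print(f"WARN: max partition {size / 1024**2:.0f} MB exceeds threshold")
```

Running a check like this per table makes it easy to spot the handful of tables whose partitions dominate compaction and read costs.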
Step 2 – Identify Key Issues
The investigation revealed high write volumes and unusually large partitions in certain tables, which put significant strain on CPU and memory during compactions and read operations. The compaction throughput cap of 16 MB/s was identified as insufficient for the write-heavy workload, and the fact that earlier SSD optimizations had yielded minimal improvement suggested the bottleneck lay in the storage tier itself rather than in its tuning.
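For reference, the throughput cap in question lives in cassandra.yaml. A fragment matching the undersized value found here (in Cassandra 4.1 the data-rate style key is `compaction_throughput`; older releases use `compaction_throughput_mb_per_sec: 16`):

```yaml
# cassandra.yaml (fragment) — the value found during the investigation.
compaction_throughput: 16MiB/s
```

16 MiB/s is the long-standing default, which a heavy write workload can easily outrun, letting pending compactions pile up.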
Step 3 – Develop Solutions
Based on the findings, the expert proposed several solutions. Scaling out the cluster to 10 nodes (five per data center) was recommended to distribute the workload and reduce per-node stress. Adjustments to partitioning strategies and data models were suggested to mitigate hotspots and improve data distribution. Key system parameters such as compaction throughput, JVM garbage collection settings, and write concurrency were also identified for tuning to better handle increased traffic and future growth.
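The expected per-node relief from the proposed scale-out is simple arithmetic. A sketch, assuming uniform token distribution and RF staying at 1 per data center, as in this cluster:

```python
# Back-of-envelope per-node load after scaling each data center from
# 3 to 5 nodes. Assumes evenly distributed tokens and RF=1.

def per_node_share(nodes_per_dc: int, rf: int = 1) -> float:
    """Fraction of a data center's dataset each node owns."""
    return rf / nodes_per_dc

before = per_node_share(3)           # ~33% of the DC's data per node
after = per_node_share(5)            # 20% per node
reduction = 1 - after / before       # 40% less data (and work) per node
print(f"per-node data share drops by {reduction:.0%}")
```

The same 40% reduction applies, to a first approximation, to per-node write traffic and compaction work, which is what drove the CPU spikes.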
Step 4 – Implementation of Solutions
The implementation involved a phased approach to minimize disruptions. Two additional nodes were added to each data center, and data was rebalanced across the cluster with proper token allocations. The compaction throughput was increased to 64–128 MB/s, significantly improving compaction efficiency without overloading the storage system. JVM heap size and garbage collection parameters were optimized, with a shift to the G1 Garbage Collector for more effective memory management.
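The JVM changes described above are typically made in the Cassandra JVM options file (jvm11-server.options on Cassandra 4.x under Java 11; the file name varies by install). A fragment with illustrative values, not the client's exact settings:

```
# jvm11-server.options (fragment) — illustrative values only.
# Heap is fixed (Xms == Xmx) to avoid resize pauses.
-Xms16G
-Xmx16G

# Switch to G1 with a modest pause-time target.
-XX:+UseG1GC
-XX:MaxGCPauseMillis=300
```

The matching compaction change can be applied live with nodetool setcompactionthroughput and then persisted in cassandra.yaml so it survives restarts.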
The data model was refined to distribute large partitions more evenly by introducing additional partition keys, reducing the resource load on individual nodes. Concurrent write settings were adjusted to align with the increased number of CPU cores per node. Azure premium storage was adopted to enhance disk I/O performance, and monitoring tools like Prometheus were implemented to provide continuous insights into cluster performance.
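One common way to realize the "additional partition keys" mentioned above is time-bucketing: a bucket column joins the partition key so each partition holds only a bounded window of data. A sketch with hypothetical names, since the case study does not disclose the actual schema:

```python
# Minimal sketch of time-bucketing a partition key. `sensor_id` and the
# daily bucket are hypothetical. A CQL equivalent would be roughly:
#   PRIMARY KEY ((sensor_id, day_bucket), event_time)

from datetime import datetime, timezone

def partition_key(sensor_id: str, event_time: datetime) -> tuple:
    """Compose the bucketed partition key: one partition per sensor per day."""
    day_bucket = event_time.strftime("%Y-%m-%d")
    return (sensor_id, day_bucket)

ts = datetime(2024, 11, 3, 14, 30, tzinfo=timezone.utc)
print(partition_key("sensor-42", ts))  # ('sensor-42', '2024-11-03')
```

The bucket granularity (day, week, hash modulo) is chosen so that a single partition stays well under the size at which compactions and reads begin to hurt.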
Solution:
The optimization process resolved the high CPU utilization and improved overall system performance. Scaling out the cluster distributed the workload more effectively, while adjustments to compaction and JVM settings reduced latency and compaction delays. Refinements to the partitioning strategy addressed large partitions, ensuring smoother operations for resource-intensive tables.
The adoption of premium Azure storage further eliminated disk I/O bottlenecks, and the introduction of advanced monitoring tools provided the client with actionable insights for ongoing performance management. The cluster was stabilized and prepared to handle future traffic increases and data growth.
Conclusion:
This case demonstrates the importance of a systematic approach to diagnosing and addressing performance issues in distributed systems like Cassandra. By scaling out infrastructure, tuning key configurations, and optimizing the data model, the client achieved a more resilient and high-performing database setup.
The case also underscores the value of proactive monitoring and long-term planning in managing rapidly growing workloads. With these enhancements, the client’s Cassandra cluster is now well-equipped to support their expanding projects while maintaining optimal performance.