Problem:
The client faced performance challenges while running nodetool cleanup
on an Apache Cassandra 4.1.5 cluster during a node addition activity in a production environment. Specifically, the cleanup process was taking an unexpectedly long time on nodes with over 600GiB of data load, raising concerns about the overall timeline and impact on production workflows.
The client sought assistance in estimating the cleanup duration and requested an expert walkthrough of Cassandra’s architecture to better understand cleanup behavior.
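A rough lower bound on cleanup duration can be derived from the node's data load and its compaction throughput cap, since cleanup rewrites every SSTable on the node. The sketch below assumes cleanup runs at the full compaction throughput cap (real runs also depend on disk I/O, SSTable count, and concurrent load) and uses an assumed default cap of 64 MiB/s; check `nodetool getcompactionthroughput` for the actual value on your cluster:

```shell
# Back-of-the-envelope estimate: data volume divided by compaction throughput.
# $1 = data load in GiB, $2 = compaction throughput cap in MiB/s
estimate_cleanup_hours() {
  awk -v gib="$1" -v mibs="$2" 'BEGIN { printf "%.1f\n", gib * 1024 / mibs / 3600 }'
}

estimate_cleanup_hours 600 64    # ~2.7 hours at an assumed 64 MiB/s cap
estimate_cleanup_hours 600 100   # ~1.7 hours after raising the cap to 100 MiB/s
```

Treat the result as a floor, not a forecast: in practice cleanup on a 600GiB node can take considerably longer.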
Process:
Step 1: Initial Assessment
- The expert confirmed the cluster was likely using default throughput parameters, which are deliberately conservative to avoid resource contention.
- To address the prolonged cleanup duration, the expert suggested monitoring and tuning key performance parameters:
- `stream_throughput_outbound_megabits_per_sec` (renamed `stream_throughput_outbound` in the 4.1 configuration)
- `compaction_throughput_mb_per_sec` (renamed `compaction_throughput` in the 4.1 configuration)
- `concurrent_compactors`
Step 2: Tuning Recommendations
- The expert provided a step-by-step approach to evaluate and adjust throughput settings:
- Retrieve current settings using:
  `nodetool getstreamthroughput`
  `nodetool getcompactionthroughput`
- Increase throughput values for faster streaming and compaction (stream throughput is expressed in megabits per second, compaction throughput in megabytes per second):
  `nodetool setstreamthroughput 100`
  `nodetool setcompactionthroughput 100`
- Verify and adjust the number of concurrent compactors with:
  `nodetool getconcurrentcompactors`
  `nodetool setconcurrentcompactors <desired_value>`
- It was emphasized that these adjustments are per-node and must be applied individually across all nodes in the cluster.
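Because the settings are per-node, applying them cluster-wide means repeating the commands on every node. A minimal sketch, assuming `nodetool -h` can reach each node's JMX port; the addresses are hypothetical, and the commands are echoed so the loop is a dry run (remove `echo` to apply for real):

```shell
# Hypothetical node addresses; replace with the cluster's real ones.
NODES="10.0.0.1 10.0.0.2 10.0.0.3"

apply_tuning() {
  # $1 = node address. Commands are echoed for a dry run; drop 'echo' to apply.
  echo nodetool -h "$1" setstreamthroughput 100
  echo nodetool -h "$1" setcompactionthroughput 100
  echo nodetool -h "$1" setconcurrentcompactors 4
}

for node in $NODES; do
  apply_tuning "$node"
done
```

The compactor count of 4 is illustrative, not a recommendation; size it to each node's CPU and disk capacity.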
Step 3: Clarification on Persistence
- The client sought clarification on whether these parameter changes persist across sessions or node restarts.
- The expert clarified that:
- These settings are **runtime adjustments**, remaining active until the next node restart.
- Upon restart, Cassandra will revert to the configuration file values unless explicitly updated there.
- Cluster-wide synchronization is manual—each node must be configured separately.
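To make the tuning survive restarts, the matching entries in `cassandra.yaml` would need to be updated on each node. A sketch using the 4.1-style names with unit suffixes (values are illustrative, not recommendations; note that the yaml stream setting takes MiB/s, while `nodetool setstreamthroughput` takes megabits per second):

```yaml
# cassandra.yaml -- persistent equivalents of the runtime nodetool settings
stream_throughput_outbound: 100MiB/s
compaction_throughput: 100MiB/s
concurrent_compactors: 4
```

A restart is required for `cassandra.yaml` edits to take effect.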
Solution:
- Retrieve and analyze current throughput and concurrency settings.
- Increase `stream_throughput` and `compaction_throughput` to optimize cleanup speed.
- Adjust the number of concurrent compactors to utilize system resources more effectively.
- Ensure all changes are applied per-node across the cluster.
- Update configuration files if persistent changes are desired post-restart.
Conclusion:
By fine-tuning throughput and compaction parameters, the client could significantly accelerate the nodetool cleanup
process, reducing operational delays during node addition activities. Clear guidance on parameter persistence and cluster-wide application ensured long-term maintainability. The proactive adjustments restored confidence in managing large data volumes in production while safeguarding cluster stability.