Problem:

The client faced performance challenges while running nodetool cleanup on an Apache Cassandra 4.1.5 cluster during a node addition in a production environment. Specifically, cleanup was taking unexpectedly long on nodes with over 600 GiB of data load, raising concerns about the overall timeline and the impact on production workflows.

The client sought assistance in estimating the cleanup duration and requested an expert walkthrough of Cassandra’s architecture to better understand cleanup behavior.
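Because cleanup rewrites every SSTable on a node, a rough lower bound on per-node duration can be sketched from the data load divided by the compaction throughput cap. The figures below are illustrative assumptions (600 GiB data load, a 100 MiB/s throughput target), not measurements from the client's cluster:

```shell
#!/bin/sh
# Back-of-the-envelope lower bound for cleanup on one node.
# Cleanup must read and rewrite roughly the node's full data load,
# throttled by the compaction throughput setting.
DATA_GIB=600            # assumed data load per node (see nodetool status)
THROUGHPUT_MIB_S=100    # assumed compaction throughput cap in MiB/s
SECONDS_EST=$(( DATA_GIB * 1024 / THROUGHPUT_MIB_S ))
echo "Estimated cleanup lower bound: ${SECONDS_EST} s (~$(( SECONDS_EST / 3600 )) h)"
```

Real runs take longer, since disk I/O, concurrent load, and the number of concurrent compactors all factor in; the arithmetic only bounds the best case.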

Process:

Step 1: Initial Assessment

  1. The expert confirmed the cluster was likely using default throughput parameters, which are deliberately conservative to avoid resource contention.
  2. To address the prolonged cleanup duration, the expert suggested monitoring and tuning key throughput parameters. Because cleanup runs through the compaction machinery, the compaction settings gate cleanup itself, while stream throughput governs the data streamed during the node addition:
    • stream_throughput_outbound_megabits_per_sec (renamed to stream_throughput_outbound in 4.1; the old name remains a deprecated alias)
    • compaction_throughput_mb_per_sec (renamed to compaction_throughput in 4.1)
    • concurrent_compactors

Step 2: Tuning Recommendations

  1. The expert provided a step-by-step approach to evaluate and adjust throughput settings:
    • Retrieve current settings using:
      • nodetool getstreamthroughput
      • nodetool getcompactionthroughput
    • Increase throughput values for faster streaming and compaction:
      • nodetool setstreamthroughput 100 (value in megabits per second)
      • nodetool setcompactionthroughput 100 (value in MB per second; 0 disables throttling)
    • Verify and adjust concurrent compactors with:
      • nodetool getconcurrentcompactors
      • nodetool setconcurrentcompactors <desired_value>
  2. It was emphasized that these adjustments are per-node and must be applied individually across all nodes in the cluster.
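Since the adjustments are per-node, they are commonly scripted as a loop over the cluster's hosts. The sketch below uses hypothetical node addresses and an assumed compactor count of 4; it prints the commands rather than executing them, so it is safe to adapt (remove the printf wrapper, or pipe the output to sh, to run for real):

```shell
#!/bin/sh
# Hypothetical host list; substitute your cluster's node addresses.
NODES="10.0.0.1 10.0.0.2 10.0.0.3"

# Build the per-node tuning commands. nodetool -h targets a specific
# host; the values (100, 100, 4) are the illustrative targets above.
CMDS=$(for h in $NODES; do
  printf 'nodetool -h %s setstreamthroughput 100\n' "$h"
  printf 'nodetool -h %s setcompactionthroughput 100\n' "$h"
  printf 'nodetool -h %s setconcurrentcompactors 4\n' "$h"
done)
echo "$CMDS"
```

Driving the loop from one operator host keeps the cluster-wide application auditable: the printed list is exactly what will run on each node.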

Step 3: Clarification on Persistence

  1. The client sought clarification on whether these parameter changes persist across sessions or node restarts.
  2. The expert clarified that:
    • These settings are **runtime adjustments**, remaining active until the next node restart.
    • Upon restart, Cassandra will revert to the configuration file values unless explicitly updated there.
    • Cluster-wide synchronization is manual: each node must be configured separately.
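To make the values survive a restart, the equivalent keys must be set in cassandra.yaml on each node. A sketch of such a fragment, using the pre-4.1 key names that Cassandra 4.1 still accepts as deprecated aliases (the renamed 4.1 keys, compaction_throughput and stream_throughput_outbound, take unit-suffixed values instead); the numbers mirror the illustrative runtime targets above:

```yaml
# cassandra.yaml fragment -- persistent equivalents of the runtime
# nodetool adjustments. Values are illustrative, not recommendations.
stream_throughput_outbound_megabits_per_sec: 100
compaction_throughput_mb_per_sec: 100
concurrent_compactors: 4   # assumed value; size to CPU cores and disks
```

A restart (or the next restart) picks these up, after which runtime and on-disk settings stay in agreement.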

Solution:

  1. Retrieve and analyze current throughput and concurrency settings.
  2. Increase stream_throughput and compaction_throughput to optimize cleanup speed.
  3. Adjust the number of concurrent compactors to utilize system resources more effectively.
  4. Ensure all changes are applied per-node across the cluster.
  5. Update configuration files if persistent changes are desired post-restart.
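Once applied, the active values and the cleanup progress can be confirmed per node with read-only nodetool subcommands (cleanup tasks appear in compactionstats output). The sketch prints the commands rather than executing them, so it has no side effects:

```shell
#!/bin/sh
# Read-only verification checklist for one node: confirm the runtime
# throughput/compactor values took effect and watch cleanup progress.
# Printing keeps the sketch side-effect free; drop the echo to run.
CHECKS="getstreamthroughput getcompactionthroughput getconcurrentcompactors compactionstats"
for c in $CHECKS; do
  echo nodetool "$c"
done
```

Re-running compactionstats periodically (e.g. under watch) gives remaining-task counts from which the cleanup ETA can be refined as the run progresses.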

Conclusion:

By fine-tuning throughput and compaction parameters, the client could significantly accelerate the nodetool cleanup process, reducing operational delays during node addition activities. Clear guidance on parameter persistence and cluster-wide application ensured long-term maintainability. The proactive adjustments restored confidence in managing large data volumes in production while safeguarding cluster stability.