Problem:

The client had a 10-node cluster across two data centers, with 5 nodes in each. They ran nodetool repair on one keyspace, but after 6 hours it was still stuck at 99% completion. They asked for guidance on how to proceed, noting that the cluster was running Cassandra 2.2.5 (‘ReleaseVersion: 2.2.5’).

Process:

Step 1 – Initial investigation and troubleshooting
After the investigation, the expert team provided troubleshooting measures to fix the issue. The client then performed the following actions (a sketch of the resulting per-node command sequence follows the list):

  1. Increased the heap memory from 8 GB to 32 GB. The parameter was changed on each node in turn.
  2. After increasing the heap on the 1st node, the client checked nodetool compactionstats and confirmed that the compactions that had started automatically had finished.
  3. Ran nodetool cleanup to remove data the node no longer owns.
  4. Ran nodetool compact at the table level on the 1st node.
  5. Ran nodetool scrub at the table level on the 1st node.
  6. Ran nodetool repair at the table level on the 1st node.
  7. Repeated the sequence from step 1 on node 2 and continued through the remaining nodes, one at a time.
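
For reference, a minimal sketch of the per-node sequence above, assuming nodetool is on the PATH and the node is managed by systemd; the keyspace and table names (ks1, tbl1) are placeholders for the actual objects involved:

    # Step 1: raise the heap in cassandra-env.sh (see the Solution section for the
    # configuration sketch) and restart the node so the new heap takes effect.
    sudo systemctl restart cassandra

    # Step 2: confirm that the compactions triggered after the restart have finished.
    nodetool compactionstats

    # Step 3: remove data this node no longer owns.
    nodetool cleanup ks1

    # Steps 4-6: per-table compaction, scrub, and repair (ks1.tbl1 is a placeholder).
    nodetool compact ks1 tbl1
    nodetool scrub ks1 tbl1
    nodetool repair ks1 tbl1

    # Step 7: repeat the sequence on the next node only after this one finishes.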

Step 2 – Further investigation and a meeting with the client

At the meeting, the expert team discussed the following topics:

1. Running Clean-up and Compaction on Machines:

The team discussed running clean-up and compaction on the machines to ensure the validation step completed successfully. They considered performing these steps in DC two as well, even if those runs failed, to determine whether Cassandra’s native tables could be repaired.

2. Changing Password for Administrative Access:

The team encountered issues changing the password for administrative access in DC two. After exploring several approaches without success (the existing password could not be recalled), they decided to send an email with detailed instructions on how to change it.
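
The emailed instructions themselves are not reproduced here; as a minimal sketch, assuming a working superuser login is available, the administrative password on Cassandra 2.2 can be changed from cqlsh as follows (the user name and both passwords are placeholders):

    # Connect with an existing superuser account and set a new password.
    cqlsh -u cassandra -p 'OldPassword' \
      -e "ALTER USER cassandra WITH PASSWORD 'NewStrongPassword';"

If no superuser credentials are available at all, recovery typically requires a more involved procedure (temporarily relaxing authentication in cassandra.yaml and resetting the stored credentials), which is the kind of detail the emailed instructions would need to cover.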

3. Monitoring the Progress of the Repair Process:

The team discussed monitoring the progress of the repair process by checking compaction statistics (nodetool compactionstats).
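
As a minimal sketch of this monitoring, assuming shell access to the nodes (the log path is a placeholder that depends on the installation):

    # Validation compactions triggered by the repair appear here.
    nodetool compactionstats

    # Streaming between replicas during the repair appears here.
    nodetool netstats

    # Repair session messages can also be followed in the system log.
    grep -i repair /var/log/cassandra/system.log | tail -n 20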

Solution:

The experts advised increasing the heap memory (from 8 GB to 32 GB) on all the Cassandra nodes. This fixed the issue.
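
For reference, a minimal sketch of where this change typically lives, assuming a default package layout (the file path and the HEAP_NEWSIZE value are environment-dependent placeholders):

    # /etc/cassandra/cassandra-env.sh (path varies by installation)
    MAX_HEAP_SIZE="32G"    # raised from the previous 8 GB
    HEAP_NEWSIZE="800M"    # placeholder; size to the environment

    # Apply with a rolling restart, one node at a time:
    sudo systemctl restart cassandra
    nodetool status        # confirm the node is back Up/Normal before continuing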

Conclusion:

The client faced an issue with a 10-node cluster spanning two data centers where a nodetool repair on one keyspace stalled at 99% completion after 6 hours. During the initial investigation, the experts suggested increasing Cassandra’s heap memory to 32 GB and running maintenance jobs (cleanup, compaction, scrub, and repair) table by table. In later meetings, they discussed running clean-up and compaction, changing the administrative password, monitoring repair progress through compaction statistics, and sharing detailed instructions via email. Ultimately, increasing the heap memory on all Cassandra nodes resolved the problem.