Problem:
The client's Apache Cassandra repairs were failing because of corrupted hint files. In addition, a node in the cluster went down and could not be brought back up, raising concerns about data consistency and cluster stability.
Process:
Step 1: Initial Investigation
The client observed errors related to corrupted hint files, including “digest mismatch” errors, which led to repair failures. The logs showed issues such as:
    WARN  [HintsDispatcher:2] - Failed to read a hint - digest mismatch at position...
    ERROR [HintsDispatcher:2] - Failed to dispatch hints file: file is corrupted...
The expert analyzed the errors and inquired whether the client was running a manual nodetool repair or if it was a regular read repair failure.
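To confirm which hint files were affected, the errors could be located directly on the node. A minimal sketch, assuming the default log and hints locations (confirm against hints_directory in cassandra.yaml):

    # Find hint-dispatch errors in the system log
    grep -E "HintsDispatcher|digest mismatch|corrupted" /var/log/cassandra/system.log

    # List the hint files on disk; the file named in the error is the corrupted one
    ls -lh /var/lib/cassandra/hints/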
Step 2: Initial Recommendations
The expert suggested a quick fix to address the issue:
- Stop the affected node.
- Move the corrupted hint files to another location (to be removed later once stability is ensured).
- Restart the Cassandra service.
- Run nodetool repair (a sketch of the full sequence follows).
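A minimal sketch of this quick fix, assuming a systemd-managed service and the default hints directory (adjust paths to the client's environment):

    # Stop Cassandra on the affected node
    sudo systemctl stop cassandra

    # Move the corrupted hint files aside rather than deleting them outright
    sudo mkdir -p /var/backups/cassandra-hints
    sudo mv /var/lib/cassandra/hints/*.hints /var/backups/cassandra-hints/

    # Restart the service and run a repair once the node reports as UN (Up/Normal)
    sudo systemctl start cassandra
    nodetool repair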
Step 3: Further Issues and Analysis
The client continued to experience issues with repair failures. The expert identified two main types of errors in the logs:
- Validation Failed Errors: Indicating possible stuck repair processes or node failures.
- File Exists Errors: Suggesting permission issues in the Cassandra data directory or high disk I/O.
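Before applying the recommendations below, the permission and disk I/O suspects could be checked directly on the node; a minimal sketch, assuming the default data directory and the sysstat package for iostat:

    # Verify ownership of the Cassandra data directory and confirm the service user can write to it
    ls -ld /var/lib/cassandra/data
    sudo -u cassandra touch /var/lib/cassandra/data/.write_test && sudo rm /var/lib/cassandra/data/.write_test

    # Watch disk utilization for sustained high I/O (extended stats, 5-second intervals, 3 samples)
    iostat -x 5 3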
The expert recommended:
- Ensuring all nodes were up and running.
- Checking for stuck repair processes using nodetool compactionstats and nodetool netstats.
- Running nodetool scrub on the affected node (see the commands sketched below).
- If issues persisted, removing the node from the cluster and rejoining it after cleaning data directories.
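A minimal sketch of these checks, run on the affected node (the keyspace and table in the commented line are placeholders; nodetool scrub with no arguments covers all keyspaces):

    # Look for long-running validation or anticompaction tasks that indicate a stuck repair
    nodetool compactionstats -H

    # Check streaming activity between nodes
    nodetool netstats

    # Rebuild the node's SSTables; optionally limit the scope to one keyspace or table
    nodetool scrub
    # nodetool scrub my_keyspace my_table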
Step 4: Addressing Node Downtime
The client encountered additional issues when attempting to restart a downed node. The nodetool status command showed the node as “Down” (DN), despite efforts to restore it. The expert advised the following steps:
- Checking network connectivity between nodes and verifying AWS security groups.
- Modifying commitlog_segment_size_in_mb in cassandra.yaml from 32 to 64.
- Restarting the nodes one by one, followed by running nodetool scrub and nodetool repair on each node sequentially (see the sketch below).
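A minimal sketch of this procedure for a single node, assuming cassandra.yaml lives at /etc/cassandra/cassandra.yaml, the service is managed by systemd, and <peer-ip> is a placeholder; the same steps are repeated on each node before moving to the next:

    # Confirm inter-node connectivity on the default storage and native transport ports
    nc -zv <peer-ip> 7000
    nc -zv <peer-ip> 9042

    # Raise the commit log segment size (adjust the sed pattern if the line is commented out or formatted differently)
    sudo sed -i 's/^commitlog_segment_size_in_mb: 32/commitlog_segment_size_in_mb: 64/' /etc/cassandra/cassandra.yaml

    # Restart this node and wait until it reports UN in nodetool status
    sudo systemctl restart cassandra
    nodetool status

    # Then scrub and repair before moving on to the next node
    nodetool scrub
    nodetool repair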
Step 5: Alternative Recovery Approaches
The client was concerned about the time required to add a new node to the cluster and requested a method to recover the failed node instead. The expert provided two possible solutions:
Option 1: Adding a New Node (Recommended)
- Ensure no network restrictions exist between nodes.
- Add a fresh node to the cluster.
- Verify that the new node is not listed in the peers list.
- Once the new node joins the cluster, remove the failed node using nodetool removenode (sketched below).
- Verify the cluster state to ensure only three nodes remain.
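A minimal sketch of the removal step, run from any healthy node once the new node has finished joining (the Host ID is a placeholder taken from nodetool status output):

    # Identify the Host ID of the failed (DN) node
    nodetool status

    # Remove it by Host ID and track progress
    nodetool removenode <host-id-of-failed-node>
    nodetool removenode status

    # Confirm the final topology: three nodes, all UN
    nodetool status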
Option 2: Recovering the Existing Node
- Ensure no network restrictions exist.
- Force remove the dead node using nodetool removenode.
- Clean up all data, commit logs, and caches from the failed node.
- Ensure configuration files are in sync.
- Restart the node and rejoin the cluster.
- Run nodetool repair, nodetool cleanup, and nodetool compact sequentially on all nodes (see the sketch below).
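A minimal sketch of Option 2, assuming default data directories and a systemd-managed service; <HOST_ID> is a placeholder for the dead node's Host ID from nodetool status:

    # On the failed node: stop Cassandra and clear its local state
    sudo systemctl stop cassandra
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/saved_caches/* \
                /var/lib/cassandra/hints/*

    # From a healthy node: remove the dead node; use force only if the plain removal stalls
    nodetool removenode <HOST_ID>
    nodetool removenode force

    # Back on the cleaned node: confirm cassandra.yaml matches the other nodes, then start it
    sudo systemctl start cassandra

    # Once the node has rejoined, on each node in turn:
    nodetool repair
    nodetool cleanup
    nodetool compact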
The expert warned that operating a three-node cluster with a replication factor of three could pose risks and recommended reducing the replication factor to two.
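Should the client decide to lower the replication factor, it can be changed per keyspace through cqlsh; the keyspace name and strategy below are placeholders, and running nodetool cleanup afterwards removes data the nodes no longer own:

    # Lower the replication factor for one keyspace (name and strategy are placeholders)
    cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"

    # Remove data that nodes no longer own after the change
    nodetool cleanup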
Step 6: Final Resolution
During the recovery process, another issue was discovered: a newly added node was not building replicas properly. The expert identified that the issue was caused by the node being incorrectly listed as a seed node in cassandra.yaml (a node that considers itself a seed skips the bootstrap process and so never streams existing data). The fix included:
- Stopping the Cassandra service on the new node.
- Removing the node’s hostname from the seed node list.
- Restarting Cassandra and monitoring the logs.
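A minimal sketch of this fix on the new node, assuming cassandra.yaml at /etc/cassandra/cassandra.yaml and example addresses in the seed list:

    # Stop Cassandra on the new node
    sudo systemctl stop cassandra

    # In cassandra.yaml, remove the new node's own address from the seed_provider list, e.g.
    #   before:  - seeds: "10.0.0.11,10.0.0.14"   (10.0.0.14 is the new node itself)
    #   after:   - seeds: "10.0.0.11"
    sudo vi /etc/cassandra/cassandra.yaml

    # Start Cassandra again and watch the logs while the node streams its replicas
    sudo systemctl start cassandra
    tail -f /var/log/cassandra/system.log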
After applying the expert’s recommendations, the client successfully restored the node, and the cluster returned to a stable state.
Solution:
The expert’s recommendations allowed the client to resolve hint file corruption, fix repair failures, and recover the failed Cassandra node. The implementation of proper network settings, node configuration adjustments, and strategic repair operations resulted in a fully functional Cassandra cluster.
Conclusion:
This case demonstrated the complexity of maintaining a multi-node Apache Cassandra cluster, especially when dealing with node failures and repair processes. It highlighted the importance of proper configuration, network connectivity, and best practices in handling failures to ensure data consistency and cluster stability.