Problem:
The client encountered a high-availability issue in their Cassandra cluster of five nodes deployed on AWS EC2. After two servers (10.51.44.25 and 10.51.46.144) were shut down, it became impossible to connect to the database even though the remaining nodes were still online. The issue manifested as an authentication error when connecting over CQL:
Connection error: ('Unable to connect to any servers', {'10.51.45.173:9042': AuthenticationFailed('Failed to authenticate to 10.51.45.173:9042: Error from server: code=0100 [Bad credentials] message="Unable to perform authentication: Cannot achieve consistency level QUORUM"')})
The client also reported that the application could not connect to the database whenever two servers were down at the same time, pointing to a problem with the fault-tolerance configuration and the consistency levels in use.
Process:
Step 1: Review of Cluster Configuration
The expert requested and reviewed the Cassandra configuration, including the cassandra.yaml file used on all nodes. It was noted that authentication operations were executed at the QUORUM consistency level, which requires a majority of the replicas to be available for a query to succeed.
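A minimal sketch of how such a review can be done, assuming a standard package installation (the /etc/cassandra path is illustrative and may differ on the client's instances):

    # Confirm which authentication backend the nodes are running
    grep -E '^(authenticator|authorizer|role_manager):' /etc/cassandra/cassandra.yaml
    # A typical password-protected setup reports:
    #   authenticator: PasswordAuthenticator
    #   role_manager: CassandraRoleManager
    # With PasswordAuthenticator, credentials live in the system_auth keyspace and are
    # read by the server itself during every login.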
Step 2: Authentication Issue Analysis
Based on the error message, the expert confirmed that the loss of two nodes made the QUORUM consistency level required for authentication unattainable, which produced the "Bad credentials" error. The application itself was configured for LOCAL_ONE, but that setting did not match the authentication path: the credential lookup is performed server-side at QUORUM (the default for the built-in cassandra superuser), so the login fails before any application-level consistency setting takes effect.
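The arithmetic behind the error, plus a way to check how each keyspace is replicated once a connection can be established (the node address is taken from the error above; user and password are placeholders):

    # QUORUM = floor(RF / 2) + 1 replicas must respond.
    #   RF = 3  ->  QUORUM = 2: a row becomes unreadable only if two of its three replicas are down.
    #   RF = 2  ->  QUORUM = 2: losing either replica of a row is already enough to fail.
    cqlsh 10.51.45.173 -u <admin_user> -p '<password>' \
      -e "SELECT keyspace_name, replication FROM system_schema.keyspaces;"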
Step 3: Cluster State Assessment
The expert then requested the current cluster state via the nodetool status command. All nodes except the two stopped servers were reported in a normal state, confirming that the cluster itself was healthy and that the connection failures were caused solely by the unattainable QUORUM requirement for authentication.
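The check itself is a single command on any reachable node; the status column is what matters:

    # Run on any live node; every peer is listed with a two-letter state code.
    nodetool status
    # "UN" = Up/Normal (healthy); the two stopped servers appear as "DN" (Down/Normal).
    # nodetool describecluster additionally confirms that the surviving nodes agree on the schema.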
Step 4: Recommendations for Improving Fault Tolerance
The expert provided a set of recommendations to enhance fault tolerance and database availability when multiple nodes fail.
Solution:
- Replication Factor Adjustment: The expert recommended raising the replication factor of the system_traces keyspace from 2 to 3, so that each row is stored on three of the five nodes. This keeps data available even with two servers down and allows the cluster to tolerate node failures far more effectively (a sketch of the change follows this list).
- Consistency Level Configuration: To avoid connection issues, the expert suggested aligning the application's consistency level to LOCAL_QUORUM. This level is better suited to multi-node clusters, balances performance against data consistency, and prevents connection problems during partial node failures (see the cqlsh sketch after this list).
- Node Failure Management: It was recommended to avoid taking two nodes down at the same time. Instead, nodes should be shut down and restored one at a time using the nodetool repair and nodetool decommission commands, which keeps the cluster balanced and minimizes the risk of data loss (the command sequence is outlined after this list).
- Migration Recommendations: The expert also provided a clear migration methodology: relocate nodes gradually and run nodetool repair, nodetool cleanup, and nodetool compact to synchronize data and preserve integrity, following the same node-by-node sequence sketched below. After migration, it was crucial to verify that all new nodes were added to the seed list and configured correctly for proper cluster operation.
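A sketch of the replication-factor change from the first recommendation, assuming the keyspace still uses its default SimpleStrategy (with NetworkTopologyStrategy the per-datacenter count would be raised to 3 instead); user and password are placeholders:

    # Raise the replication factor from 2 to 3 (run once, from any node):
    cqlsh 10.51.45.173 -u <admin_user> -p '<password>' -e "
      ALTER KEYSPACE system_traces
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
    # Then build the new replicas, one node at a time:
    nodetool repair system_traces

The same ALTER KEYSPACE pattern applies to any other keyspace whose replication factor needs to be raised.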
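For the consistency-level recommendation, the new level can first be exercised from cqlsh before the application driver is reconfigured to the same default; the keyspace and table below are hypothetical:

    cqlsh 10.51.45.173 -u <app_user> -p '<password>'     # open a session on a live node
    cqlsh> CONSISTENCY LOCAL_QUORUM;
    cqlsh> SELECT * FROM app_keyspace.app_table LIMIT 1;
    # With RF = 3 in a single datacenter, LOCAL_QUORUM waits for 2 of a row's 3 replicas,
    # so queries keep succeeding as long as no row loses more than one replica.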
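The node-management and migration advice reduces to a node-by-node command sequence; a hedged outline, with each step run to completion on one node before the next node is touched:

    nodetool repair          # bring this node's replicas fully in sync with the cluster
    nodetool decommission    # stream its data to the other nodes and leave the ring cleanly
    # ...start the replacement node and add it to the seed list in cassandra.yaml everywhere...
    nodetool cleanup         # on the remaining nodes, drop data they no longer own
    nodetool compact         # optional major compaction to reclaim space after cleanup
    nodetool status          # confirm every node is back to UN before moving on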
Conclusion:
The Cassandra high-availability issue was resolved by implementing the key recommendations: adjusting the replication factor and aligning the consistency level, which eliminated the authentication errors and kept the system stable during node failures. Together, these changes significantly improved the fault tolerance of the Cassandra cluster and helped ensure database availability even when multiple nodes fail.