Problem:

The client’s production Apache Cassandra cluster experienced a sudden native transport failure with significant operational impact. Initial attempts to diagnose the problem from the system and debug logs did not identify the root cause. The debug logs were dominated by native transport errors, particularly SSLPeerUnverifiedException, which pointed to certificate-based authentication failures for multiple nodes in the cluster.

Process:

Upon receiving the client’s request for assistance, an expert analyzed the debug and system logs provided from three Cassandra nodes. The investigation revealed a common error: certificate validation was failing for several peers within the cluster. It surfaced as SSLPeerUnverifiedException, meaning the peer’s SSL certificate could not be verified.
To proceed with troubleshooting, the expert requested the following information from the client:
The cassandra.yaml file to review the cluster’s configuration settings, including security configurations related to SSL certificates and native transport.
Confirmation of the cluster’s current operational status.
The IP addresses of the nodes mentioned in the logs, for further investigation.
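
The log review described above can be partly automated. The following is a minimal sketch, not the expert's actual tooling: it assumes each relevant log line names SSLPeerUnverifiedException and carries the peer's IPv4 address on the same line, which may not hold for every Cassandra version or log layout. The function name count_failures_by_peer is hypothetical.

```python
import re
from collections import Counter

# Assumption: the exception name and the peer's IPv4 address appear on the
# same log line. Adjust the pattern if your logs use hostnames or IPv6.
EXCEPTION_NAME = "SSLPeerUnverifiedException"
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def count_failures_by_peer(log_lines):
    """Count lines mentioning the exception, grouped by IPv4 address on the line."""
    counts = Counter()
    for line in log_lines:
        if EXCEPTION_NAME not in line:
            continue
        # set() so an address repeated on one line is counted only once
        for address in set(IPV4_PATTERN.findall(line)):
            counts[address] += 1
    return counts
```

Passing an open log file to count_failures_by_peer and printing the result of most_common() gives a quick ranking of the peers that fail verification most often, which helps decide which nodes' certificates to inspect first.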

Solution:

Master Nodes as Critical Seeds: The expert identified the master nodes as the cluster’s seed nodes, which other nodes contact to discover the cluster through gossip. After a complete cluster outage, restart the seed nodes first, then the remaining nodes, so that communication and coordination among nodes can be re-established.

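Communication Testing: Test communication between peers and seed nodes in both directions to confirm connectivity. This means verifying ping reachability and telnet access to the Cassandra ports (by default 7000 for inter-node traffic, 7001 for TLS-encrypted inter-node traffic, and 9042 for native transport).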
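
Where telnet is not installed, the same reachability check can be scripted. The sketch below is illustrative only: the port numbers are Cassandra's defaults and are an assumption here, so confirm storage_port, ssl_storage_port, and native_transport_port in cassandra.yaml first. The helper names port_is_reachable and check_node are hypothetical.

```python
import socket

# Assumed default ports; the real values come from cassandra.yaml.
DEFAULT_PORTS = {"storage": 7000, "ssl_storage": 7001, "native_transport": 9042}


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_node(host, ports=DEFAULT_PORTS, timeout=3.0):
    """Map each named port to True/False for a single host."""
    return {name: port_is_reachable(host, port, timeout) for name, port in ports.items()}
```

Running check_node against each seed node from every peer, and against each peer from the seed nodes, covers both directions. A successful TCP connection only shows that the port is open; it does not prove that the TLS handshake or certificate verification will succeed.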

Certificate Validation: Validate the certificates in each node’s keystore to confirm their integrity and check whether any have expired. Renewing or replacing expired certificates may resolve the authentication failures.
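
The expiry check can be scripted once a certificate's end date is known. The sketch below assumes a workflow not described in the original incident: the certificate has been exported from the keystore (for example with keytool -exportcert -rfc) and its end date printed with openssl x509 -enddate -noout, which produces a line of the form notAfter=Jan 11 00:00:00 2030 GMT. The function names and the 30-day warning threshold are hypothetical.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter date (negative if expired).

    not_after uses the OpenSSL text form, e.g. "Jan 11 00:00:00 2030 GMT".
    A leading "notAfter=" prefix is stripped if present.
    """
    text = not_after.strip()
    if text.startswith("notAfter="):
        text = text[len("notAfter="):]
    expires_at = ssl.cert_time_to_seconds(text)  # seconds since the epoch (UTC)
    current = time.time() if now is None else now
    return (expires_at - current) / SECONDS_PER_DAY


def is_expiring(not_after, warn_days=30, now=None):
    """True if the certificate has expired or expires within warn_days."""
    return days_until_expiry(not_after, now) <= warn_days
```

A scheduled job that calls is_expiring for every node certificate and raises an alert on True is one way to implement the regular certificate monitoring recommended in this write-up.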

Documentation and Prevention: Documenting the troubleshooting steps and implementing preventive measures, such as regular certificate maintenance and monitoring, can help prevent similar incidents in the future.

Conclusion:

By following these steps, it is possible to identify the root cause of the native transport failure and implement appropriate solutions to restore and safeguard the stability and performance of the Apache Cassandra cluster.