Problem:
The client was deploying a production Cassandra cluster on OpenShift across two datacenters, each hosting three Cassandra nodes, and ran into persistent communication problems between them. Despite initial configuration and verification, the nodes in Datacenter 1 (“RCKL”) and Datacenter 2 (“CLSP”) could not communicate consistently. The symptoms were SSL handshake failures and inconsistent results from OpenSSL connectivity tests, which pointed to a network or configuration problem.
Process:
Upon receiving the issue report, the experts began by analyzing the configuration files and logs provided by the client. They compared cassandra.yaml and cassandra-rackdc.properties across the staging and production environments to identify discrepancies. They then ran connectivity tests with OpenSSL and ncat to verify inter-datacenter communication. Some connections succeeded, but SSL handshakes failed when the CAfile was not specified.
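The two kinds of check described above (plain TCP reachability, then a TLS handshake with or without a CA file) can be sketched in Python. This is a minimal illustration, not the commands the experts ran. The function names, host, port, and CA path are hypothetical, and hostname checking is switched off on the assumption that nodes are addressed by IP.

```python
import socket
import ssl
from typing import Optional


def build_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying TLS client context.

    With ca_file=None only the system trust store is loaded, so a
    certificate issued by a private CA is expected to fail verification.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    # Assumption for this sketch: nodes are reached by IP address,
    # so hostname matching is disabled. The chain is still verified.
    ctx.check_hostname = False
    return ctx


def probe_tls(host: str, port: int, ca_file: Optional[str] = None,
              timeout: float = 5.0) -> str:
    """Report whether the TCP connect or the TLS handshake is what fails."""
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"tcp-failure: {exc}"
    try:
        with build_context(ca_file).wrap_socket(raw) as tls:
            return f"tls-ok: {tls.version()}"
    except ssl.SSLError as exc:
        # Includes certificate verification errors.
        return f"tls-failure: {exc}"
    except OSError as exc:
        # TCP connected, but the handshake timed out or was reset.
        return f"tls-stalled: {exc}"
    finally:
        raw.close()
```

A call such as probe_tls("10.0.0.1", 7001, ca_file="/path/to/ca.pem") uses placeholder values. 7001 is Cassandra's default ssl_storage_port, but the client's actual port is not stated here. Comparing the result with and without ca_file separates a trust-store problem ("tls-failure") from a handshake that never completes ("tls-stalled").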
The experts then consulted Verizon to confirm the network configuration and firewall settings between the datacenters. It was established that no firewalls were blocking traffic between the specified subnets. With firewalls ruled out, the remaining suspect was a Maximum Transmission Unit (MTU) mismatch, in which larger packets are fragmented or dropped along the path while smaller ones pass. That would be consistent with connections that succeed only some of the time.
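The arithmetic behind an MTU check is small enough to sketch. The header sizes below are the standard minimums for IPv4, ICMP, and TCP without options. The function names and the example host are hypothetical, and the 9000-byte value is used only as an illustrative larger MTU, not as the value found in this case.

```python
from typing import List

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header
TCP_HEADER = 20   # bytes, TCP header without options


def max_icmp_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def tcp_mss(mtu: int) -> int:
    """Largest TCP segment payload for the given MTU (no TCP options)."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits(packet_size: int, path_mtu: int) -> bool:
    """True if a packet of this size crosses the path without fragmentation."""
    return packet_size <= path_mtu


def ping_probe_command(host: str, mtu: int, count: int = 3) -> List[str]:
    """Linux iputils ping with the Don't Fragment bit set (-M do)."""
    return ["ping", "-M", "do", "-s", str(max_icmp_payload(mtu)),
            "-c", str(count), host]


print(max_icmp_payload(1500))  # 1472
print(tcp_mss(1500))           # 1460
print(fits(9000, 1500))        # False: a 9000-byte packet exceeds a 1500 path
print(" ".join(ping_probe_command("10.0.0.1", 1500)))
```

If a ping built this way succeeds at one payload size and fails at a larger one, the path MTU lies between the two, which is one way to confirm a suspected mismatch before changing any configuration.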
Solution:
To address the MTU issue, the experts recommended adjusting the MTU of the net1 interface, which is defined in the NetworkAttachmentDefinition (NAD), in both the RCKL and CLSP datacenters. Testing showed that setting the MTU to 1500 resolved the fragmentation problem and allowed the Cassandra nodes to establish stable communication channels across the datacenters.
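For orientation, a NetworkAttachmentDefinition that pins the MTU might look roughly like the manifest generated below. This is a sketch under stated assumptions, not the client's configuration: the macvlan plugin type, the master interface eth1, and the name and namespace are all hypothetical, and IPAM settings are left out because they are not described here. Only the MTU of 1500 comes from this case.

```python
import json
from typing import Any, Dict


def build_nad(name: str, namespace: str, master: str, mtu: int) -> Dict[str, Any]:
    """Build a NetworkAttachmentDefinition manifest as a plain dict."""
    cni_config = {
        "cniVersion": "0.3.1",
        "name": name,
        "type": "macvlan",  # assumption: the actual plugin type is not stated
        "master": master,   # assumption: host interface name is hypothetical
        "mtu": mtu,         # MTU applied to the attached pod interface
        # IPAM settings omitted; the client's values are not known.
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": name, "namespace": namespace},
        # The CNI configuration is carried as a JSON string in spec.config.
        "spec": {"config": json.dumps(cni_config)},
    }


nad = build_nad("cassandra-net", "cassandra", "eth1", 1500)
print(json.dumps(nad, indent=2))
```

Multus names the first secondary interface in a pod net1 by default, which is why the interface in this case carries that name. Generating the manifest from one function for both datacenters is one way to keep the MTU value identical on each side.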
Following the MTU adjustment, connectivity tests with ncat and the output of nodetool status confirmed successful communication and synchronization between the nodes in RCKL and CLSP. SSL handshake failures ceased, and nodetool status consistently reported all nodes as operational across both datacenters.
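A check like the nodetool status verification above can be automated with a small parser. The sketch below assumes the usual nodetool status layout: a "Datacenter:" header per datacenter, followed by one row per node that starts with a two-letter status/state code such as UN (Up/Normal). The function names are hypothetical, and the expected count of three nodes per datacenter is taken from this case.

```python
import re
from typing import Dict, List, Tuple

# A node row starts with status (U/D) and state (N/L/J/M), then the address.
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def nodes_by_datacenter(status_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each datacenter name to a list of (address, status_code) pairs."""
    result: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("Datacenter:"):
            current = line.split(":", 1)[1].strip()
            result[current] = []
        elif current is not None:
            match = NODE_LINE.match(line)
            if match:
                result[current].append((match.group(2), match.group(1)))
    return result


def all_up_normal(status_text: str, expected_per_dc: int = 3) -> bool:
    """True if every datacenter lists the expected node count, all in UN."""
    datacenters = nodes_by_datacenter(status_text)
    return bool(datacenters) and all(
        len(nodes) == expected_per_dc
        and all(code == "UN" for _, code in nodes)
        for nodes in datacenters.values()
    )
```

Passing the captured text of nodetool status to all_up_normal gives a single yes/no answer that can be run repeatedly after a change, which suits the "consistently reported" part of the verification.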
Conclusion:
The resolution highlighted how much network configuration, and MTU settings in particular, matter for communication within distributed systems such as Cassandra clusters deployed on OpenShift. By aligning the MTU configuration with the constraints of the network and verifying inter-datacenter connectivity, the client achieved reliable cluster operation across its production environment. The case underscores the importance of thorough network testing and configuration management when diagnosing connectivity problems in containerized environments.