Problem:
The client encountered a critical issue while starting one of their production pods during a Cassandra migration. Although the PostgreSQL migration completed successfully, the Cassandra migration failed with a com.datastax.oss.driver.api.core.AllNodesFailedException, indicating that the driver could not connect to any Cassandra nodes. This blocked the production deployment.
Process:
Step 1 – Initial Analysis
The logs revealed that the PostgreSQL migration step completed without issues. The problem appeared during the Cassandra migration phase, where the application attempted to connect to the Cassandra cluster using the DataStax driver but failed. The error trace showed a NotYetConnectedException and a ConnectTimeoutException for the node at:
prod-cassandra-cl1-01.prod.nextgenbilling.t-mobile.com/100.64.112.171:9042
Both exceptions are raised when the TCP connection to the node cannot be established, which suggested a network connectivity failure rather than a problem in the migration scripts themselves.
Step 2 – Expert Review of Environment Variables and Defaults
The Cassandra migration tool was running with a set of environment variables, many of which were not defined and defaulted to internal settings (e.g., keyspace strategy, read timeout). However, this did not appear to be the root cause, since the migration failed at the initial connection phase, not during schema execution.
Step 3 – Diagnostic Recommendations Provided by the Expert
The expert suggested a series of network and configuration checks to isolate the root cause:
- Hostname/IP Configuration Validation:
  Ensure that both the hostname (prod-cassandra-cl1-01.prod.nextgenbilling.t-mobile.com) and its corresponding IP address (100.64.112.171) are correct and resolvable from the pod or machine running the migration.
- Basic Network Connectivity Test (SSH):
  Attempt an SSH connection to the target host using both the hostname and IP address to validate network-level access (port 22). A successful SSH login confirms routing to the host but does not prove that port 9042 is open, which is why the port-specific test follows:
  ssh <username>@prod-cassandra-cl1-01.prod.nextgenbilling.t-mobile.com
  ssh <username>@100.64.112.171
- Application Port Test (Telnet):
  Verify that port 9042 (the default port for Cassandra's native CQL protocol) is open and reachable by testing both the hostname and IP:
  telnet prod-cassandra-cl1-01.prod.nextgenbilling.t-mobile.com 9042
  telnet 100.64.112.171 9042
- Cassandra Service Availability:
  Confirm that the Cassandra service is running and listening on port 9042 on the specified node.
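The checks above can also be scripted, which helps when telnet or an SSH client is not installed in the pod image. The following is a minimal sketch and not part of the expert's original recommendations: the helper names (resolve_host, check_port, listening_on) are hypothetical, and it assumes that getent, timeout, bash, awk and ss are available where it runs.

```shell
# Illustrative helpers (hypothetical names); assumes getent, timeout, bash and awk exist.

# Print the first address the system resolver returns for a hostname.
# Fails (non-zero) if the name does not resolve.
resolve_host() {
  addr=$(getent hosts "$1") || return 1
  printf '%s\n' "${addr%% *}"
}

# Try to open a TCP connection via bash's /dev/tcp pseudo-device.
# Usage: check_port HOST PORT [TIMEOUT_SECONDS]; returns 0 only if the connect succeeds.
check_port() {
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Read `ss -ltn` output on stdin and succeed if any socket listens on PORT.
# Usage (on the Cassandra node): ss -ltn | listening_on 9042
listening_on() {
  awk -v p="$1" '
    NR > 1 { n = split($4, a, ":"); if (a[n] == p) found = 1 }  # $4 is Local Address:Port
    END    { exit !found }
  '
}

# Example invocations, run from the pod that executes the migration:
#   resolve_host prod-cassandra-cl1-01.prod.nextgenbilling.t-mobile.com
#   check_port prod-cassandra-cl1-01.prod.nextgenbilling.t-mobile.com 9042
#   check_port 100.64.112.171 9042
```

If resolve_host succeeds but check_port fails for both the hostname and the IP, the likely culprits are routing or firewall rules rather than DNS. If check_port fails only for the hostname, the DNS record probably points somewhere other than the expected IP.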
Solution:
The expert diagnosed the root cause as a network connectivity issue, most likely the inability to reach the Cassandra node at the specified address and port. The solution required the client to:
- Ensure DNS resolution and routing to the Cassandra node were correctly configured.
- Open required firewall rules or security groups to allow connections on port 9042.
- Validate that the SSL truststore/keystore setup did not interfere with the connection handshake.
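For the last item, one way to separate a TLS problem from a plain network problem is to attempt the handshake outside the driver. The sketch below assumes that client-to-node encryption is enabled on port 9042 and that the CA certificate from the truststore has been exported to a PEM file; neither is confirmed by the case notes, and both the function name check_tls and the file path in the example are hypothetical.

```shell
# Illustrative helper (hypothetical name); assumes openssl and timeout are installed.

# Attempt a TLS handshake and verify the server certificate against a PEM CA file.
# Usage: check_tls HOST PORT CA_FILE
# Returns 0 only if the TCP connect, the handshake and the certificate check all succeed.
check_tls() {
  timeout 10 openssl s_client \
    -connect "$1:$2" \
    -servername "$1" \
    -CAfile "$3" \
    -verify_return_error \
    </dev/null >/dev/null 2>&1
}

# Example invocation (the CA path is an assumption, adjust to your environment):
#   check_tls prod-cassandra-cl1-01.prod.nextgenbilling.t-mobile.com 9042 /path/to/ca.pem
```

A failure here while a plain TCP test to port 9042 succeeds would point at certificates or TLS settings. A failure of both points back at DNS, routing or firewall rules.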
Conclusion:
The failure was not due to Cassandra configuration or migration logic, but rather an environmental/networking issue preventing the client's migration tool from reaching the target Cassandra nodes. By walking through methodical network-level diagnostics, starting with basic SSH access and moving on to port-specific testing, the client was equipped to identify and resolve the underlying connectivity problem. Once resolved, the Cassandra migration could proceed as expected.