Problem:
The client needed to scale their production Cassandra cluster from 6 nodes to 12 (from 3 to 6 nodes per data center) without any downtime. The existing setup runs Cassandra 4.1.6 across two data centers (PROD and DR), each containing 3 nodes, forming a 6-node cluster with a replication factor of 3 and ConsistencyLevel set to ONE. They expect 20–30 million records in the coming year and needed a seamless, reliable scale-up plan under tight timelines.
Process:
Step 1 – Initial Analysis
The expert reviewed the current architecture, including version, replication setup, consistency level, and anticipated data volume. It was confirmed that the client uses a multi-DC ring topology and that both data centers (PROD and DR) are part of the same cluster.
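To make the effect of the planned expansion concrete, the short sketch below estimates how much of a data center's data each node stores before and after scaling. It is an illustration rather than part of the expert's analysis: it assumes the replication factor of 3 applies per data center and that token ownership is spread evenly across nodes, and the function name `replica_share` is hypothetical.

```python
def replica_share(nodes_in_dc: int, replication_factor: int) -> float:
    """Approximate fraction of a data center's data stored on each node.

    Assumes even token ownership; real clusters will deviate somewhat.
    """
    if nodes_in_dc < 1 or replication_factor < 1:
        raise ValueError("nodes_in_dc and replication_factor must be >= 1")
    # A node can never hold more than one full copy of the data.
    return min(1.0, replication_factor / nodes_in_dc)


before = replica_share(nodes_in_dc=3, replication_factor=3)
after = replica_share(nodes_in_dc=6, replication_factor=3)
print(f"before: {before:.0%} per node, after: {after:.0%} per node")
```

Under these assumptions each node goes from holding a full copy of its data center's data to holding roughly half of it, which is why the original nodes are left with data they no longer own once the new nodes have joined.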
Step 2 – Validating Scaling Strategy
The expert confirmed that Cassandra supports live scaling by adding nodes one at a time. The expert emphasized that although the procedure is straightforward, each step must be followed carefully and in order to avoid issues.
Step 3 – Providing the Rollout Plan
- Install Cassandra on the new server.
- Ensure configuration files are identical across all nodes (except for node-specific settings such as listen_address, which must be set to the node’s IP).
- Start new nodes one at a time, waiting for each to fully join the cluster (status UN in nodetool status).
- Once a node is fully added, run nodetool cleanup on all existing nodes to remove obsolete replica data.
- Confirm compaction and cleanup are completed before proceeding to the next node.
- After all new nodes are added, run nodetool repair and nodetool compact on each node (optional but recommended).
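The ordering in the plan above can be sketched as a small script that only prints a checklist and does not execute anything against the cluster. This is a minimal sketch under stated assumptions, not the expert's tooling: the host names and the function `build_rollout_plan` are hypothetical, "existing nodes" is interpreted as every node that was already a member before the new node joined, and `nodetool compactionstats` is used here as one way to check that cleanup and compaction have finished.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (host to act on, command or manual action)


def build_rollout_plan(existing: List[str], new: List[str]) -> List[Step]:
    """Return the ordered steps for adding `new` nodes one at a time."""
    plan: List[Step] = []
    members = list(existing)  # nodes already in the cluster
    for node in new:
        # Bring up exactly one new node and wait for it to finish joining.
        plan.append((node, "start Cassandra; wait for UN in `nodetool status`"))
        # Remove data the earlier members no longer own.
        for host in members:
            plan.append((host, "nodetool cleanup"))
        # Check that cleanup/compaction work has drained before continuing.
        for host in members:
            plan.append((host, "nodetool compactionstats"))
        members.append(node)
    # Optional final pass once every new node has joined.
    for host in members:
        plan.append((host, "nodetool repair"))
        plan.append((host, "nodetool compact"))
    return plan


# Hypothetical host names for one data center.
plan = build_rollout_plan(
    existing=["prod-1", "prod-2", "prod-3"],
    new=["prod-4", "prod-5", "prod-6"],
)
for host, action in plan:
    print(f"{host}: {action}")
```

Generating the checklist per data center keeps the sequence explicit and makes it harder to start a second node before the first has finished joining.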
Solution:
A detailed, low-risk scale-up procedure was provided, allowing the client to double the number of nodes in each data center without downtime. The expert ensured the client understood each critical step and the importance of sequential execution to maintain cluster stability and data consistency.
Conclusion:
With a clear action plan tailored to their topology and version, the client is equipped to perform the scale-out operation confidently and safely. This proactive approach ensures high availability and performance as their data demands grow, without compromising uptime during the scale-out.