Problem:
A 5-node Apache Cassandra 2.2.5 cluster spanning two data centers showed severe per-node disk imbalance. Each node was configured with five data_file_directories (e.g. /cassandra/data1/data … /cassandra/data5/data), but some mount points on individual nodes were nearly full (the examples showed mounts at 93% and 95% used). On one node, a single keyspace (jessi) held large sstable directories under one data path: service_monitoring ≈100GB and service_monitoring_payload ≈690GB, both located under /cassandra/data4/data/.
Process:
Step 1: Confirmed reported configuration and on-disk layout
Reviewed the supplied cassandra.yaml snippet, which listed five entries in data_file_directories, and inspected the provided df output, which showed uneven utilization across /cassandra/data1..data5 on multiple nodes. Noted the exact sstable directory path for the large tables under /cassandra/data4/data/.
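The df review in this step can be sketched in Python as a small parser that flags data mounts above a fill threshold. This is a minimal illustration, not the tooling actually used: the function names, the 90% threshold, the device names, and the sample output are all hypothetical. Only the 93% and 95% figures come from the report, and which mounts they belonged to is assumed here.

```python
def parse_df(output):
    """Map mount point -> percent used, given `df -P` style text."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        # POSIX layout: filesystem, blocks, used, available, capacity%, mount
        usage[fields[5]] = int(fields[4].rstrip("%"))
    return usage


def hot_mounts(usage, prefix="/cassandra/data", threshold=90):
    """Return (mount, percent) pairs at or above the threshold, fullest first."""
    hot = [(m, p) for m, p in usage.items()
           if m.startswith(prefix) and p >= threshold]
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Hypothetical sample: device names, sizes and mount assignment are invented.
SAMPLE_DF = """\
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sdb1 1000000 410000 590000 41% /cassandra/data1
/dev/sdc1 1000000 930000 70000 93% /cassandra/data2
/dev/sdd1 1000000 520000 480000 52% /cassandra/data3
/dev/sde1 1000000 950000 50000 95% /cassandra/data4
/dev/sdf1 1000000 380000 620000 38% /cassandra/data5
"""

if __name__ == "__main__":
    for mount, pct in hot_mounts(parse_df(SAMPLE_DF)):
        print(f"{mount}: {pct}% used")
```

Running the same check against each node's real df -P output gives a quick per-node list of the mounts that need attention first.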
Step 2: Assessed safety of moving sstables between data_file_directories
Examined how Cassandra references sstable locations and metadata, and concluded that arbitrarily changing the on-disk path of an sstable (moving it into a differently named data directory) effectively invalidates that sstable for the running node and requires streaming fresh replicas (repair/bootstrap) to restore consistency. This mattered because a blind file move would not be a safe online operation and would risk inconsistent or missing data until a full repair completed.
Step 3: Evaluated cluster capacity and risk of node decommissioning
Aggregated the per-node disk usage information the operator supplied and calculated that most nodes already held multiple terabytes (some nodes >3TB), exceeding practical per-node recommendations for a Cassandra 2.x deployment. Determined that decommissioning a node without first adding capacity would push additional token ranges onto the remaining nodes and could drive other mounts to full capacity, so a naïve decommission posed high risk.
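The decommission risk can be illustrated with a rough estimate of how full each remaining node would become. The sketch below assumes the departing node's data spreads evenly over the survivors, which is a simplification: real placement depends on token ownership, replication strategy, and data-center layout. The node names, per-node usage, and 4 TB capacity are hypothetical figures chosen only to match the ">3TB on some nodes" observation.

```python
def usage_after_decommission(used_tb, capacity_tb, leaving):
    """Estimate each remaining node's fill fraction after `leaving` departs.

    Assumes an even split of the departing node's data across the
    remaining nodes (a simplification, see the note above).
    """
    remaining = [name for name in used_tb if name != leaving]
    if not remaining:
        raise ValueError("cannot decommission the only node")
    extra = used_tb[leaving] / len(remaining)
    return {name: (used_tb[name] + extra) / capacity_tb[name]
            for name in remaining}


# Hypothetical per-node figures in TB; not taken from the report.
USED_TB = {"n1": 3.2, "n2": 2.8, "n3": 3.0, "n4": 2.6, "n5": 3.4}
CAPACITY_TB = {name: 4.0 for name in USED_TB}

if __name__ == "__main__":
    projected = usage_after_decommission(USED_TB, CAPACITY_TB, leaving="n5")
    for name, fraction in sorted(projected.items()):
        flag = "  <-- over capacity" if fraction > 1.0 else ""
        print(f"{name}: {fraction:.0%}{flag}")
```

With these invented numbers at least one node is projected past 100%, which is the kind of outcome that motivated adding capacity before any decommission.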
Step 4: Developed two viable remediation approaches and trade-offs
Outlined a preferred long-term fix: add new nodes (or capacity) so per-node data falls below recommended levels, then decommission old nodes one-by-one and reconfigure data_file_directories to a single mount. Also presented an in-place consolidation alternative that the operator could execute immediately: attach a temporary large disk to each node, copy node data to that disk, create a software RAID0 across the original disks, restore the data to a single RAID-backed mount, then start Cassandra and run nodetool repair/cleanup. Each option was evaluated for impact on streaming volume, maintenance windows, and operational complexity.
Step 5: Produced a concrete, ordered operational procedure
Provided an explicit sequence for the operator to follow on each node, one node at a time: attach, format, and mount temporary storage; quiesce the node (run nodetool drain, then stop Cassandra) and copy /cassandra to the temporary mount; wipe and repartition the original disks, build an mdadm RAID0 array, create a filesystem on it (XFS recommended), and mount it at the canonical /cassandra path; copy the data back, restore ownership, start Cassandra, run nodetool repair, and then run nodetool cleanup cluster-wide. Emphasized monitoring nodetool status during decommission/bootstrap and running repair after each node rejoins to ensure consistency. This procedure became the fix that was implemented.
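The ordering of the per-node procedure can be captured as a dry-run plan that only builds command strings and executes nothing. This is a sketch under stated assumptions, not the exact commands that were run: the device names, the /mnt/tmp mount, the /dev/md0 array name, the `service cassandra` start/stop commands, and the cassandra:cassandra owner are all hypothetical, and the site-specific repartitioning of the original disks is deliberately left out.

```python
import shlex


def consolidation_plan(devices, old_mounts, tmp_mount="/mnt/tmp",
                       array="/dev/md0", target="/cassandra"):
    """Return the ordered commands for one node as strings (nothing is run)."""
    backup = f"{tmp_mount}/cassandra"
    steps = [
        ["nodetool", "drain"],                 # flush memtables, stop accepting writes
        ["service", "cassandra", "stop"],      # assumption: init-script install
        ["rsync", "-a", f"{target}/", f"{backup}/"],  # copy data to temporary disk
    ]
    steps += [["umount", mount] for mount in old_mounts]
    # Wiping and repartitioning the original disks is site-specific; omitted.
    steps += [
        ["mdadm", "--create", array, "--level=0",
         f"--raid-devices={len(devices)}", *devices],
        ["mkfs.xfs", array],
        ["mount", array, target],
        ["rsync", "-a", f"{backup}/", f"{target}/"],  # copy data back
        ["chown", "-R", "cassandra:cassandra", target],  # assumption: service user
        ["service", "cassandra", "start"],
        ["nodetool", "repair"],
    ]
    return [" ".join(shlex.quote(part) for part in step) for step in steps]


# Hypothetical device and mount names for a five-disk node.
DEVICES = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1", "/dev/sdf1"]
MOUNTS = [f"/cassandra/data{i}" for i in range(1, 6)]

if __name__ == "__main__":
    for number, command in enumerate(consolidation_plan(DEVICES, MOUNTS), 1):
        print(f"{number:2d}. {command}")
```

Printing the plan before touching a node makes the order easy to review. The cluster-wide nodetool cleanup and the cassandra.yaml change to a single data_file_directories entry are not part of this per-node list.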
Solution:
The implemented change was an in-place consolidation per node: each server received a temporary large disk, data was copied off, the five original disks were re-partitioned and assembled into a single RAID0 array with an XFS filesystem mounted at /cassandra, data was copied back, cassandra.yaml was normalized to a single data_file_directories entry (e.g. /cassandra/data), and nodes were brought back online one at a time. After each node rejoin, nodetool repair and nodetool cleanup were executed across the cluster.
This works for Apache Cassandra because it preserves sstable on-disk paths while changing only the underlying block device topology. RAID0 striping eliminates per-device fill imbalance and yields more even I/O distribution, and rejoining nodes sequentially with repair/cleanup keeps token ranges and replica data consistent without introducing missing data.
Conclusion:
Post-change results: each node's data now sits on a single array, so utilization no longer varies between mounts and the hotspots on individual mounts were eliminated; node rejoin and repair completed without data loss; and subsequent nodetool metrics showed steady I/O behavior. The change reduced the operational risk of per-mount saturation and simplified future capacity management.