Problem:

The data mount was approaching full capacity. Investigation showed that the backups/ directories under each keyspace's table directories in the Cassandra data path were consuming most of the space (≈150 GB). No snapshots existed on the cluster when the issue was reported. The cluster had incremental_backups: true set in cassandra.yaml because the customer wanted to retain incremental backups and to be able to remove incremental files older than the most recent snapshot.

Process:

Step 1: Verify on-disk symptom and snapshot state

Observed high disk usage on the data mount and inspected the directory tree under the Cassandra data path; the backups/ subdirectories were the top consumers. Ran nodetool listsnapshots on each node and found no snapshots. This mattered because, without a snapshot as an anchor, deleting files from backups/ could remove the last remaining hardlinks to SSTable data that had been kept as backup copies.
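The directory inspection above can be approximated with a short script. The following is a minimal Python sketch, not the tooling used in this incident. It assumes the standard <data_root>/<keyspace>/<table>/backups layout, and the function names backups_usage and top_consumers are hypothetical. It reports apparent file sizes, so space shared with live SSTables through hardlinks is counted even though deleting only those links would not free it.

```python
from pathlib import Path


def backups_usage(data_root):
    """Return {backups_dir: total_bytes} for each <keyspace>/<table>/backups directory.

    Assumes the layout <data_root>/<keyspace>/<table>/backups (an assumption
    of this sketch, matching the directory structure described above).
    """
    usage = {}
    for backups_dir in Path(data_root).glob("*/*/backups"):
        if not backups_dir.is_dir():
            continue
        # Sum apparent sizes of all regular files below the backups directory.
        total = sum(
            p.lstat().st_size for p in backups_dir.rglob("*") if p.is_file()
        )
        usage[str(backups_dir)] = total
    return usage


def top_consumers(data_root, limit=10):
    """Return the largest backups directories as (path, bytes), biggest first."""
    usage = backups_usage(data_root)
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:limit]
```

Calling top_consumers with the node's data path (for example /var/lib/cassandra/data, a common package default that may differ on a given cluster) lists the backups directories that dominate disk usage.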

Step 2: Review configuration that controls backup behavior

Reviewed cassandra.yaml and confirmed incremental_backups: true (snapshot_before_compaction was also checked for completeness). This explained the many hardlinked files in backups/: with incremental backups enabled, Cassandra hardlinks each newly flushed SSTable into the table's backups/ directory. Knowing the setting guided the pruning approach: establish a full snapshot as an anchor before removing incremental hardlinks that predate it.
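The configuration check can be scripted for many nodes. The sketch below is a deliberately simple, standard-library-only illustration: it assumes each setting appears as an uncommented top-level "key: true" or "key: false" line, and the helper name read_bool_setting is hypothetical. A real YAML parser would be more robust.

```python
import re


def read_bool_setting(yaml_text, key):
    """Return True/False for a top-level `key: true|false` line, or None if absent.

    Commented-out lines are ignored because the key must start the line.
    """
    pattern = re.compile(
        r"^" + re.escape(key) + r"\s*:\s*(true|false)\b",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(yaml_text)
    if match is None:
        return None
    return match.group(1).lower() == "true"


def backup_settings(yaml_text):
    """Collect the two settings reviewed in this step."""
    keys = ("incremental_backups", "snapshot_before_compaction")
    return {key: read_bool_setting(yaml_text, key) for key in keys}
```

Passing the contents of cassandra.yaml to backup_settings returns a small dictionary that can be compared across nodes to confirm the setting is consistent.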

Step 3: Assess safety of deleting files and common concerns

Investigated hardlink semantics rather than open file handles: incremental backup files are hardlinks to SSTable files. Confirmed that a snapshot creates its own hardlinks to the same inodes; deleting other hardlinks to an inode reduces its link count but does not remove the data while a snapshot link remains. Process-level open-file checks and restarts were therefore unnecessary as prerequisites for deletion. The finding determined the safe sequence: create a snapshot first, then prune incremental files older than that snapshot.
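The hardlink behavior described above can be reproduced outside Cassandra. This is a small illustrative Python sketch using temporary files; the file names only mimic the roles of an SSTable, an incremental backup link, and a snapshot link, and the function name hardlink_demo is hypothetical. It assumes a filesystem that supports hardlinks.

```python
import os
import tempfile


def hardlink_demo():
    """Show that data survives while at least one hardlink to the inode remains."""
    with tempfile.TemporaryDirectory() as tmp:
        sstable = os.path.join(tmp, "sstable-Data.db")
        backup = os.path.join(tmp, "backup-Data.db")
        snapshot = os.path.join(tmp, "snapshot-Data.db")

        with open(sstable, "wb") as fh:
            fh.write(b"payload")

        # Two additional names for the same inode.
        os.link(sstable, backup)
        os.link(sstable, snapshot)
        links_before = os.stat(sstable).st_nlink

        # Remove the incremental backup link and the original name.
        os.remove(backup)
        os.remove(sstable)
        links_after = os.stat(snapshot).st_nlink

        # The data is still readable through the remaining link.
        with open(snapshot, "rb") as fh:
            data = fh.read()

    return links_before, links_after, data
```

The link count drops as names are removed, but the content stays readable through the remaining link, which is the property the pruning sequence relies on.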

Step 4: Design and test a node-level pruning workflow

Tested the sequence on a single node: ran nodetool snapshot to create a point-in-time snapshot, then removed incremental backup files older than the snapshot timestamp across all data directories. Verified that space was reclaimed and that Cassandra continued serving reads and writes without errors. The test confirmed that the snapshot preserved the required SSTable inodes while allowing older incremental hardlinks to be deleted.
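The first half of the tested sequence might be sketched as follows. This is an assumption-laden illustration rather than the script used here: the function names and the idea of recording the cutoff just before the snapshot are choices of this sketch. It uses the nodetool snapshot command with its -t option to name the snapshot, and the run parameter exists only so the command runner can be replaced in tests.

```python
import subprocess
import time


def snapshot_command(tag):
    """Build the nodetool invocation that creates a snapshot named `tag`."""
    return ["nodetool", "snapshot", "-t", tag]


def take_snapshot(tag, run=subprocess.run):
    """Create a snapshot and return the epoch time to use as the prune cutoff.

    The cutoff is recorded before the snapshot starts, so any file written
    while the snapshot is being taken is treated as newer and is kept.
    """
    cutoff_epoch = time.time()
    run(snapshot_command(tag), check=True)
    return cutoff_epoch
```

Recording the cutoff before the snapshot is the conservative choice: files created during the snapshot survive until the next pruning cycle instead of being deleted early.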

Step 5: Implement cluster-wide coordinated pruning and automation

Coordinated snapshot creation on every node, using the same snapshot name on each, and ran a scripted prune across the data directories to remove backup files older than the snapshot. The script compared the filesystem mtime of each file in backups/ against the snapshot time and deleted the files that were older. After pruning, ran nodetool listsnapshots and checked disk usage on each node to confirm the expected ~150 GB reduction and that no snapshots had been inadvertently removed.
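The mtime-based prune described above could look roughly like this. It is a hedged sketch, not the production script: the function names find_prunable and prune, the dry_run default, and the <data_root>/<keyspace>/<table>/backups layout are assumptions of this example. Only files under backups/ are considered; snapshots/ and live SSTables are never touched.

```python
from pathlib import Path


def find_prunable(data_root, cutoff_epoch):
    """Return files under */*/backups whose mtime is older than cutoff_epoch."""
    prunable = []
    for backups_dir in Path(data_root).glob("*/*/backups"):
        if not backups_dir.is_dir():
            continue
        for path in backups_dir.rglob("*"):
            if path.is_file() and path.lstat().st_mtime < cutoff_epoch:
                prunable.append(path)
    return sorted(prunable)


def prune(data_root, cutoff_epoch, dry_run=True):
    """Delete prunable files unless dry_run is set; return the candidate list."""
    candidates = find_prunable(data_root, cutoff_epoch)
    if not dry_run:
        for path in candidates:
            path.unlink()
    return candidates
```

Running with dry_run=True first and reviewing the returned list before deleting gives a chance to catch a wrong cutoff or data path before any file is removed.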

Solution:

Implemented a two-step, safe pruning policy for Apache Cassandra incremental backups: (1) create a coordinated full snapshot on each node (nodetool snapshot) to establish a consistent restore anchor; (2) run a controlled prune that deletes files in each keyspace/table backups/ directory whose modification time predates the snapshot. The incremental_backups setting was left enabled so that incremental backup files continue to be produced; pruning was automated with a scheduled job that uses the most recent snapshot timestamp as the retention boundary.

This works because Cassandra snapshot and incremental backup files are hardlinks to SSTable inodes: the snapshot hardlinks preserve the SSTable data required for restore, so removing older incremental hardlinks frees space without losing the data anchored by the snapshot. No Cassandra restart was required, and the nodes continued serving traffic during pruning.

Conclusion:

Pruning recovered the targeted ~150 GB of disk space and restored headroom on the data mount. The cluster retains consistent restore points via snapshots, while incremental backup files continue to be produced and are routinely pruned. The process reduced the operational risk of full mounts and established an automated, repeatable retention mechanism that preserves restorability for Apache Cassandra.