Problem:
The client reported a sudden and significant drop in Apache Cassandra performance on a 4-node cluster. The issue appeared without any recent configuration or infrastructure changes. The application started experiencing frequent timeouts, and restarting Cassandra services on all nodes did not resolve the problem. The client provided various monitoring graphs, system logs, and other diagnostics.
Process:
Step 1 – Initial Analysis
The expert reviewed the logs and immediately noticed that one node in the cluster (IP: 10.235.14.83) was triggering frequent connection issues: timeouts and host up/down events. This indicated potential network issues, hardware faults, or system resource exhaustion on that node. The expert recommended verifying network connectivity between the nodes and running nodetool cleanup, compact, and repair.
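A minimal command-level sketch of those first checks, assuming default Cassandra ports (7000 for inter-node traffic, 9042 for CQL clients) and that nodetool is run locally on each node; exact flags vary by Cassandra version:

    # Verify that the suspect node is reachable and its Cassandra ports answer
    ping -c 3 10.235.14.83
    nc -zv 10.235.14.83 7000    # inter-node (gossip/storage) port
    nc -zv 10.235.14.83 9042    # CQL native transport port

    # Maintenance on one node at a time, to limit the extra load
    nodetool cleanup            # drop data the node no longer owns
    nodetool compact            # force a major compaction (use sparingly)
    nodetool repair -pr         # repair only this node's primary token ranges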
Step 2 – Reviewing System Configuration
The client clarified that each server had 16 vCPUs, 64 GB of RAM, and 200 GB of SAN storage (140 GB in use). CPU utilization was below 4%. Cassandra and the application were on the same subnet with no firewall rules between them. nodetool output and system statistics were provided, but only up to 09:30 because of the service restarts.
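The statistics referred to here can be gathered with standard tooling; the following is a sketch of what such a collection typically looks like (the exact set of commands the client used is an assumption):

    # Cluster and per-node state
    nodetool status             # ring membership and load per node
    nodetool tpstats            # pending/blocked thread pool tasks
    nodetool compactionstats    # compactions currently in flight

    # Host-level statistics
    iostat -x 5                 # per-device utilisation and await
    vmstat 5                    # run queue, memory, IO wait, steal time
    sar -u 5                    # CPU breakdown sampled over time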
Step 3 – Identifying Anomalies
The expert observed a discrepancy between the data size reported by nodetool status (less than 50 GB) and the actual disk usage (over 140 GB), suggesting that old replicas or snapshots might still be present. Elevated IO wait times were also reported, hinting at a possible storage subsystem bottleneck. Node 10.235.14.83 remained the key suspect. The expert requested full syslogs and additional statistics on CPU, IO, load average, and networking.
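The size discrepancy and the snapshot hypothesis can be checked directly on each node; a sketch, assuming the default data directory /var/lib/cassandra/data and a Cassandra release recent enough to ship nodetool listsnapshots:

    # Live data as Cassandra sees it vs. what is actually on disk
    nodetool status                  # "Load" column counts live SSTables only
    du -sh /var/lib/cassandra/data   # includes snapshots and obsolete SSTables

    # Snapshots are excluded from "Load" but still consume disk space
    nodetool listsnapshots

    # Confirm the elevated IO wait on the suspect node
    iostat -x 5 3                    # watch %util and await on the SAN device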
Step 4 – Root Cause Analysis via Syslog Review
The client provided system logs from the nodes. The expert identified Linux kernel messages related to soft lockups and RCU stalls, which indicated problems at the hypervisor level—specifically, CPU oversubscription. The Cassandra service was not the root cause, but rather a victim of insufficient CPU scheduling time due to virtualization constraints.
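The symptoms described are usually visible from inside the guest as well; a sketch of the checks, assuming a distribution that logs to /var/log/syslog (the path differs on RHEL-style systems):

    # Kernel messages that indicate the guest was starved of CPU
    dmesg -T | grep -iE "soft lockup|rcu"
    grep -iE "soft lockup|rcu" /var/log/syslog

    # Steal time (%st) is CPU taken by the hypervisor for other guests;
    # sustained non-zero values point to oversubscription
    vmstat 5 5          # "st" is the last column
    top -b -n 1 | head  # "%st" appears in the CPU summary line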
Solution:
The expert recommended the following actions:
- Review VMware host configuration for signs of CPU oversubscription.
- Reduce the number of vCPUs assigned to the Cassandra virtual machines.
- Analyze the load on all VMs on the same hypervisor to detect noisy neighbor effects.
- Run nodetool cleanup and nodetool clearsnapshot to remove obsolete data.
- Optionally, take node 10.235.14.83 offline temporarily to evaluate the impact (a command-level sketch of these last two items follows below).
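A hedged sketch of how the VMware check and the last two list items could be carried out; esxtop is run on the ESXi host, the Cassandra commands on the affected node, and a systemd-managed service is assumed:

    # On the ESXi host: high CPU ready time (%RDY) per VM is the classic
    # signature of vCPU oversubscription
    esxtop                          # switch to the CPU view and watch %RDY

    # On the Cassandra node: remove data the node no longer owns and old snapshots
    nodetool cleanup
    nodetool clearsnapshot          # newer releases require --all to clear every snapshot

    # Take node 10.235.14.83 out of traffic without powering off the VM
    nodetool drain                  # flush memtables and stop accepting connections
    sudo systemctl stop cassandra   # assumes the packaged systemd unit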
Conclusion:
The issue was not due to Cassandra itself, but rather the result of CPU scheduling delays caused by hypervisor-level oversubscription. The Cassandra nodes were unable to receive adequate CPU time, leading to cascading performance issues. After reviewing the findings with their infrastructure team, the client acknowledged the root cause and closed the case.