Problem:

The client reported recurring crashes of a Cassandra node with errors related to “too many open files”. Despite increasing the maximum open files limit, the issue persisted. The problem was observed primarily during high server load, with regular crashes around 01:15 AM. The client suspected that network instability or heavy operations, such as running nodetool repair, could be contributing factors.
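
A quick way to confirm whether a raised open-files limit actually reached the Cassandra process is to compare the JVM's effective nofile limit with the number of descriptors it currently holds. The following is a minimal sketch assuming a Linux host with /proc available; the match string used to find the Cassandra JVM ("CassandraDaemon") is an assumption and may need adjusting.

```python
#!/usr/bin/env python3
"""Sketch: compare a Cassandra process's effective open-files limit with the
number of file descriptors it currently holds. Assumes Linux /proc and that
the JVM command line contains 'CassandraDaemon' (adjust for your setup)."""

import os
import re

MATCH = "CassandraDaemon"  # assumption: how the Cassandra JVM is identified


def find_cassandra_pid():
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                if MATCH.encode() in f.read():
                    return int(pid)
        except OSError:
            continue  # process exited or permission denied
    return None


def nofile_limit(pid):
    # /proc/<pid>/limits contains a line like:
    # "Max open files    100000    100000    files"
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                soft, hard = re.findall(r"(\d+|unlimited)", line)[:2]
                return soft, hard
    return None, None


def open_fds(pid):
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    pid = find_cassandra_pid()
    if pid is None:
        raise SystemExit("Cassandra process not found")
    soft, hard = nofile_limit(pid)
    print(f"pid={pid} soft_limit={soft} hard_limit={hard} open_fds={open_fds(pid)}")
```

If the reported soft limit is still the old value, the new limit was applied to the shell or config file but never reached the running service, which is a common reason the "too many open files" errors persist after a ulimit change.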

Process:

The expert reviewed the error logs and system configurations provided by the client, focusing on system load, file limits, and potential network issues. To identify the root cause, the expert recommended installing a monitoring agent to collect metrics during problematic times, including CPU usage, network usage, disk IO, and Cassandra open files.
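
Until a full monitoring agent is in place, even a small sampling script can capture the metrics the expert asked for around the crash window. The sketch below uses only the Python standard library on Linux and appends one sample per minute to a CSV file; the PID, output path, and interval are assumptions, and a production setup would use the agent of whichever monitoring stack is chosen.

```python
#!/usr/bin/env python3
"""Sketch: sample load average, overall CPU utilisation and the Cassandra
process's open file descriptors once a minute, appending to a CSV file.
Linux-only; the PID and output path are assumptions for illustration."""

import csv
import os
import time

CASSANDRA_PID = 12345                        # assumption: discover the real PID
OUTFILE = "/var/log/cassandra_metrics.csv"   # assumption
INTERVAL = 60                                # seconds between samples


def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        values = list(map(int, f.readline().split()[1:]))
    idle = values[3] + values[4]             # idle + iowait
    return idle, sum(values)


def open_fds(pid):
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        return -1                            # process gone or no permission


def main():
    prev_idle, prev_total = cpu_times()
    with open(OUTFILE, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            time.sleep(INTERVAL)
            idle, total = cpu_times()
            busy_pct = 100.0 * (1 - (idle - prev_idle) / (total - prev_total))
            prev_idle, prev_total = idle, total
            load1, load5, load15 = os.getloadavg()
            writer.writerow([time.strftime("%Y-%m-%dT%H:%M:%S"),
                             round(busy_pct, 1), load1, load5, load15,
                             open_fds(CASSANDRA_PID)])
            f.flush()


if __name__ == "__main__":
    main()
```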

The client provided a sar report indicating that CPU usage and load average spiked during the crashes. The expert identified nodetool repair as a potential contributor to that load and suggested scheduling it during non-peak hours. The expert also suspected issues with the bonding interfaces and recommended disabling bonding, connecting the Cassandra nodes instead via a single NIC to a single switch. Ongoing monitoring of key metrics such as load average, CPU usage, Cassandra heap usage, and garbage collection pauses was advised, using tools like InfluxDB, Graphite, or Prometheus, with Grafana for visualization.
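
As one concrete illustration of feeding such metrics into a back end, Graphite's carbon daemon accepts a plaintext "path value timestamp" line over TCP (port 2003 by default). The host name and metric prefix below are assumptions; Prometheus or InfluxDB would use their own ingestion paths instead.

```python
#!/usr/bin/env python3
"""Sketch: push gauges to Graphite's carbon plaintext listener.
The plaintext protocol is one 'path value timestamp\n' line per metric.
Host, port and metric prefix here are assumptions."""

import os
import socket
import time

CARBON_HOST = "graphite.example.internal"  # assumption
CARBON_PORT = 2003                         # default carbon plaintext port
PREFIX = "cassandra.node1"                 # assumption


def send_metrics(metrics):
    """Send an iterable of (name, value) pairs as one carbon payload."""
    now = int(time.time())
    lines = "".join(f"{PREFIX}.{name} {value} {now}\n" for name, value in metrics)
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(lines.encode())


if __name__ == "__main__":
    load1, _, _ = os.getloadavg()
    send_metrics([("load.1min", load1)])
```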

Solution:

The solution proposed by the expert involved the following key actions:

  • Monitoring and Metric Collection: Install a monitoring agent to gather detailed metrics during the crashes;
  • Adjust nodetool repair Scheduling: Run nodetool repair during non-peak hours to minimize its impact on system performance (a scheduling sketch follows this list);
  • Network Configuration Changes: Disable bonding interfaces and use a single NIC connection to improve network stability;
  • Ongoing Analysis: Continuously monitor and analyze the collected metrics to identify and address any recurring issues.
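
One way to keep nodetool repair out of peak hours is a small wrapper that runs the repair keyspace by keyspace, but only inside a fixed maintenance window, and is itself started from cron or a systemd timer. The window, keyspace list, and nodetool path below are assumptions; the -pr option restricts each run to the node's primary token ranges, which spreads the cost when the script is executed on every node in turn.

```python
#!/usr/bin/env python3
"""Sketch: run 'nodetool repair -pr' keyspace by keyspace, but only inside
an off-peak maintenance window. Window, keyspaces and binary path are
assumptions; in practice this would be launched from cron or a systemd timer."""

import datetime
import subprocess
import sys

NODETOOL = "/usr/bin/nodetool"        # assumption
KEYSPACES = ["my_keyspace"]           # assumption: list the real keyspaces here
WINDOW_START = datetime.time(2, 0)    # assumption: window opens at 02:00 local
WINDOW_END = datetime.time(5, 0)      # assumption: window closes at 05:00 local


def in_window(now=None):
    now = (now or datetime.datetime.now()).time()
    return WINDOW_START <= now <= WINDOW_END


def main():
    if not in_window():
        print("outside maintenance window, skipping repair")
        return 0
    for ks in KEYSPACES:
        print(f"repairing keyspace {ks} ...")
        # -pr repairs only this node's primary token ranges
        result = subprocess.run([NODETOOL, "repair", "-pr", ks])
        if result.returncode != 0:
            print(f"repair of {ks} failed with code {result.returncode}")
            return result.returncode
    return 0


if __name__ == "__main__":
    sys.exit(main())
```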

Conclusion:

The proposed solution is effective because it targets both immediate and underlying issues contributing to the Cassandra node crashes. By implementing detailed monitoring, the client gains visibility into the system’s performance during critical times, enabling them to pinpoint the exact cause of the crashes. Adjusting the nodetool repair scheduling reduces the strain on system resources during peak hours, mitigating one potential source of instability. Additionally, addressing the network configuration by disabling bonding interfaces eliminates a likely cause of network-related disruptions. This holistic approach not only resolves the immediate problem but also fortifies the system against future occurrences, ensuring more stable and reliable Cassandra operations.