Problem:

On July 8th, 2024, all Docker containers on all nodes within a Docker Swarm cluster suddenly crashed. The cluster consisted of 13 nodes: 1 master, 2 reachable, and 10 worker nodes. The initial logs indicated a problem with the RAFT consensus algorithm attempting and failing to elect a leader multiple times.

Process:

Upon receiving the initial log snippets, the following steps were taken to diagnose and address the issue:

  • Network Connectivity Check: Verified that all nodes could communicate over the required ports (2377, 7946, 4789). No network issues were detected.
  • Docker Swarm Topology Analysis: Collected the current Docker Swarm topology details and output from docker node ls to understand the cluster configuration.
  • Docker Daemon Logs Review: Requested and reviewed Docker daemon logs from various nodes, spanning from two hours before the incident until the time of review.
  • System Metrics Review: Collected basic metrics such as disk usage (du -h) and CPU load averages (top) to assess system health.
  • Container Logs Analysis: Analyzed the output and logs from the containers to identify any potential issues before the crash occurred.

Solution:

Short-term Resolution:

The immediate solution involved redeploying all microservices within the Docker Swarm cluster, which restored functionality. This indicated that the problem might be related to the inability of the services to recover from the initial issue.

Long-term Recommendations:

  • Improve Observability: Set up a comprehensive log collection solution, such as ELK (Elasticsearch, Logstash, Kibana), to collect and analyze all logs (syslog, Docker daemon logs, and application logs).
  • Monitoring and Alerting: Implement monitoring and alerting on basic system resources (CPU, memory) and application health to quickly identify and respond to issues.
  • Resource Utilization Analysis: Given the high CPU usage observed during the incident, monitor CPU utilization closely and investigate which processes are consuming the most resources.
  • Self-Healing Mechanisms: Enhance the resilience of the microservices to improve their ability to recover from failures and avoid cluster-wide crashes.

Conclusion:

The crash of all Docker containers within the Docker Swarm cluster was initially resolved by redeploying the microservices. The root cause was likely related to high CPU utilization, which interfered with the RAFT consensus algorithm’s ability to elect a leader. No significant network issues were detected. By implementing the recommended improvements in observability, monitoring, and system resilience, the client can better prevent and respond to similar incidents in the future.