Docker Swarm Configuration and Container Recovery Issues

Problem:

The client experienced issues with Docker Swarm configuration in production. Specifically, when a container restarted, the application failed to recover properly. The client requested a review of the configuration to identify the root cause and potential improvements to enhance the cluster’s functionality.

Process:

Step 1: Initial Investigation

The client provided details of the Docker Swarm setup, including node configurations, service lists, and network details. Additionally, they shared the output of docker info, docker node ls, docker service ls, docker network ls, and docker stats to assist in the review.

The expert identified two key concerns:

No explicitly configured restart policy for the services.
No CPU/memory limits on containers, allowing unrestricted resource consumption and possible resource contention.

The expert requested additional information:

Service restart policies: docker service inspect server-ms_billing-arrangement-ms --format '{{json .Spec.TaskTemplate.RestartPolicy}}'
Service logs: docker service logs server-ms_billing-arrangement-ms
CPU and memory usage: docker stats
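
To run the restart-policy check across every service rather than one at a time, the inspect command above can be wrapped in a small loop. This is a minimal sketch, assuming it runs on a manager node with the docker CLI on PATH; the function name list_restart_policies is hypothetical and not part of the original exchange.

```shell
#!/bin/sh
# Print "<service name><TAB><restart policy JSON>" for every service.
# Assumption: run on a Swarm manager node (service commands fail on workers).
list_restart_policies() {
  docker service ls --format '{{.Name}}' | while read -r name; do
    # Redirect stdin so the inner command cannot consume the list of names.
    policy=$(docker service inspect "$name" \
      --format '{{json .Spec.TaskTemplate.RestartPolicy}}' < /dev/null)
    printf '%s\t%s\n' "$name" "$policy"
  done
}
```

Calling list_restart_policies prints one line per service. A value of null would typically indicate that no policy is set explicitly in the service spec.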

Step 2: Analysis and Recommendations

The expert reviewed the provided data and recommended the following improvements:

Implement CPU/Memory Limits:

No CPU or memory limits were set, so any single container could consume a node's resources and starve the others. The expert suggested adding resource limits and reservations in the stack configuration:

deploy:
  resources:
    limits:
      cpus: '0.5'
      memory: 4G
    reservations:
      cpus: '0.5'
      memory: 4G
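
The same limits can also be applied to a running service from the command line, which may be useful for testing values before committing them to the stack file. The sketch below assumes a manager node; the function name set_service_resources is hypothetical. Settings applied this way are likely to be overwritten by the next docker stack deploy unless the stack file is updated as well.

```shell
#!/bin/sh
# Apply identical CPU/memory limits and reservations to one service.
# Usage: set_service_resources <service_name> <cpus> <memory>
# Example values mirroring the stack configuration: 0.5 CPUs and 4G of memory.
set_service_resources() {
  if [ "$#" -ne 3 ]; then
    echo "usage: set_service_resources <service_name> <cpus> <memory>" >&2
    return 2
  fi
  docker service update \
    --limit-cpu "$2" --limit-memory "$3" \
    --reserve-cpu "$2" --reserve-memory "$3" \
    "$1"
}
```

For example: set_service_resources server-ms_billing-arrangement-ms 0.5 4G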

Set Restart Policies:

The services had no explicitly configured restart policy, so recovery after a failure depended on Swarm's defaults rather than on settings chosen for the application. The expert recommended setting a restart policy for all services using:

$ docker service update --restart-condition any <service_name>

Alternatively, the same policy can be declared for each service in the stack file:

deploy:
  restart_policy:
    condition: any
    delay: 10s
    max_attempts: 5
    window: 120s
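
Where editing the stack file is not immediately possible, the equivalent settings can be pushed to all existing services with docker service update. This is a sketch under the assumption that every service should receive the same policy; the function name apply_restart_policy_all is hypothetical. Note that updating a service normally redeploys its tasks, so this is best done in a maintenance window.

```shell
#!/bin/sh
# Apply the same restart policy to every service in the swarm.
# Values match the stack configuration: condition any, 10s delay,
# 5 attempts, 120s window.
apply_restart_policy_all() {
  docker service ls --format '{{.Name}}' | while read -r name; do
    docker service update \
      --restart-condition any \
      --restart-delay 10s \
      --restart-max-attempts 5 \
      --restart-window 120s \
      "$name" < /dev/null
  done
}
```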

Ensure Data Consistency with Persistent Volumes:

For services that store data, the expert advised mounting named volumes so that the data survives container replacement:

volumes:
  db_data:
services:
  database:
    image: postgres
    volumes:
      - db_data:/var/lib/postgresql/data
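
To confirm that a service actually mounts the expected volume, its spec and task placement can be inspected. The sketch below is illustrative; the function name show_service_storage is hypothetical. Placement matters because a named volume using the default local driver lives on the node where the task runs, so a task rescheduled to another node would start with a different, possibly empty, volume.

```shell
#!/bin/sh
# Show which mounts a service declares and which nodes its running tasks are on.
# Usage: show_service_storage <service_name>
show_service_storage() {
  echo "Mounts:"
  docker service inspect "$1" \
    --format '{{json .Spec.TaskTemplate.ContainerSpec.Mounts}}'
  echo "Running on nodes:"
  docker service ps "$1" --filter desired-state=running --format '{{.Node}}'
}
```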

Address Logstash Overload Issues:

Service logs showed connection limits being reached and Logstash becoming overloaded. If Filebeat was the log shipper, the expert suggested tuning the Logstash output in filebeat.yml and restarting the service:

output.logstash:
  hosts: ["172.22.36.22:5044"]
  worker: 4
  bulk_max_size: 4096
  timeout: 60s
  backoff.init: 1s
  backoff.max: 60s

Restarting Filebeat:

systemctl restart filebeat
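
Before restarting, it can help to validate the edited file and the connection to Logstash, so that a typo does not stop log shipping. This is a minimal sketch using Filebeat's test subcommands; the function name restart_filebeat_checked is hypothetical, and the default path /etc/filebeat/filebeat.yml is an assumption that holds for typical package installs.

```shell
#!/bin/sh
# Validate the Filebeat configuration and its output connection, then restart.
# Assumption: the config is at /etc/filebeat/filebeat.yml unless a path is given.
restart_filebeat_checked() {
  config=${1:-/etc/filebeat/filebeat.yml}
  # Stop early if the file does not parse or Logstash is unreachable.
  filebeat test config -c "$config" || return 1
  filebeat test output -c "$config" || return 1
  systemctl restart filebeat
}
```

The restart only happens if both checks succeed; otherwise the function returns a non-zero status and leaves the running service untouched.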

Step 3: Implementation & Monitoring

The client was advised to apply the recommendations and monitor the cluster performance. Key focus areas included:

Verifying restart policies were correctly applied.
Monitoring CPU and memory usage after the resource limits were applied.
Ensuring persistent volumes were correctly configured for data integrity.
Checking Logstash performance after updating Filebeat configurations.
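
The resource and recovery checks above could be scripted roughly as follows. This is a sketch, not the procedure the client used; the function names resource_snapshot and stopped_tasks are hypothetical. Note that docker stats reports only containers on the node where it is run, so the snapshot would need to be taken on each node.

```shell
#!/bin/sh
# One-shot CPU/memory snapshot for containers on the local node.
resource_snapshot() {
  docker stats --no-stream \
    --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}'
}

# List tasks of a service that were shut down, with any recorded error.
# Usage: stopped_tasks <service_name>
stopped_tasks() {
  docker service ps "$1" --no-trunc --filter desired-state=shutdown \
    --format '{{.Name}}\t{{.CurrentState}}\t{{.Error}}'
}
```

A growing list from stopped_tasks after the changes would suggest that tasks are still failing and being replaced, which is worth correlating with the service logs.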

Solution:

Following the expert’s recommendations, the client implemented resource constraints and restart policies. These changes resulted in:

Improved container recovery after unexpected restarts
Better resource allocation and reduced contention
Enhanced system stability and reduced downtime

Conclusion:

By optimizing the Docker Swarm configuration, the client successfully resolved the container restart issues and improved overall cluster reliability. This case highlights the importance of setting proper resource limits and restart policies to ensure smooth containerized application operations in a production environment.