Problem:
The client experienced issues with its Docker Swarm configuration in production: when a container restarted, the application failed to recover properly. The client requested a review of the configuration to identify the root cause and potential improvements to enhance the cluster’s functionality.
Process:
Step 1: Initial Investigation
The client provided details of the Docker Swarm setup, including node configurations, service lists, and network details. Additionally, they shared the output of docker info, docker node ls, docker service ls, docker network ls, and docker stats to assist in the review.
The expert identified two key concerns:
- Lack of a restart policy for services.
- Containers consuming resources freely without CPU/memory limits, leading to possible resource contention.
The expert requested additional information:
- Service restart policies:
docker service inspect server-ms_billing-arrangement-ms --format '{{json .Spec.TaskTemplate.RestartPolicy}}'
- Service logs:
docker service logs server-ms_billing-arrangement-ms
- CPU and memory usage:
docker stats
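To make the restart-policy check repeatable across services, the JSON printed by the inspect command above can be interpreted with a small script. This is a minimal sketch, assuming the output is either null (no policy in the service spec) or an object with Condition, Delay and MaxAttempts fields, where Delay is reported in nanoseconds; the function name describe_restart_policy is hypothetical.

```python
import json


def describe_restart_policy(raw: str) -> str:
    """Summarize the JSON emitted by:
    docker service inspect <service> --format '{{json .Spec.TaskTemplate.RestartPolicy}}'
    """
    policy = json.loads(raw)
    if not policy:
        return "no restart policy in the service spec"
    condition = policy.get("Condition", "unspecified")
    # The Docker API reports durations in nanoseconds.
    delay_s = policy.get("Delay", 0) / 1e9
    attempts = policy.get("MaxAttempts", 0)
    # A MaxAttempts of 0 means the number of restarts is not capped.
    limit = "unlimited attempts" if attempts == 0 else f"at most {attempts} attempts"
    return f"condition={condition}, delay={delay_s:g}s, {limit}"
```

The inspect output can be piped into a script that passes sys.stdin.read() to this function, once per service.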
Step 2: Analysis and Recommendations
The expert reviewed the provided data and recommended the following improvements:
Implement CPU/Memory Limits:
Containers were using memory freely, which could lead to resource contention. The expert suggested adding resource limits under each service's deploy key in the stack configuration:
deploy:
  resources:
    limits:
      cpus: '0.5'
      memory: 4G
    reservations:
      cpus: '0.5'
      memory: 4G
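Before redeploying, it may help to confirm that no service was missed. The following is a rough sketch that scans a stack definition, already parsed into a dictionary (for example by a YAML loader), and lists services without both limits. The function name services_without_limits and the example service names and images are hypothetical, not taken from the client's stack.

```python
def services_without_limits(stack: dict) -> list[str]:
    """Return services that lack a CPU limit, a memory limit, or both."""
    missing = []
    for name, service in (stack.get("services") or {}).items():
        deploy = service.get("deploy") or {}
        limits = (deploy.get("resources") or {}).get("limits") or {}
        if "cpus" not in limits or "memory" not in limits:
            missing.append(name)
    return missing


# Hypothetical stack contents, shaped the way a YAML loader would return them.
example_stack = {
    "services": {
        "billing": {
            "image": "example/billing",
            "deploy": {"resources": {"limits": {"cpus": "0.5", "memory": "4G"}}},
        },
        "gateway": {"image": "example/gateway"},
    }
}

print(services_without_limits(example_stack))  # ['gateway']
```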
Set Restart Policies:
The services lacked a restart policy, meaning containers would not restart automatically after failure. The expert recommended setting a restart policy for all services using:
docker service update --restart-condition any <service_name>
Alternatively, the following configuration can be added to each service in the stack file:
deploy:
  restart_policy:
    condition: any
    delay: 10s
    max_attempts: 5
    window: 120s
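The two approaches describe the same settings: each restart_policy key has a matching docker service update flag. As an illustration, this sketch translates the stack-file mapping into the equivalent command line; the helper name restart_policy_flags is hypothetical, and the service name is left as a placeholder.

```python
def restart_policy_flags(policy: dict) -> list[str]:
    """Translate a stack-file restart_policy mapping into docker service update flags."""
    flag_for_key = {
        "condition": "--restart-condition",
        "delay": "--restart-delay",
        "max_attempts": "--restart-max-attempts",
        "window": "--restart-window",
    }
    flags = []
    for key, flag in flag_for_key.items():
        if key in policy:
            flags += [flag, str(policy[key])]
    return flags


policy = {"condition": "any", "delay": "10s", "max_attempts": 5, "window": "120s"}
command = ["docker", "service", "update", *restart_policy_flags(policy), "<service_name>"]
print(" ".join(command))
```

The stack-file form is usually easier to keep under version control, while the command-line form changes a running service without a redeploy.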
Ensure Data Consistency with Persistent Volumes:
If services stored data, implementing persistent volumes was advised to prevent data loss:
volumes:
  db_data:
services:
  database:
    image: postgres
    volumes:
      - db_data:/var/lib/postgresql/data
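A common slip with this layout is mounting a named volume in a service without declaring it under the top-level volumes key. The sketch below checks a parsed stack dictionary for that case. It assumes that any mount source starting with "/", "." or "~" is a bind mount rather than a named volume; the function name and the cache service in the example are hypothetical.

```python
def undeclared_named_volumes(stack: dict) -> dict[str, list[str]]:
    """Map each service to the named volumes it mounts but the stack never declares."""
    declared = set(stack.get("volumes") or {})
    problems = {}
    for name, service in (stack.get("services") or {}).items():
        missing = []
        for mount in service.get("volumes") or []:
            if isinstance(mount, dict):  # long syntax: type / source / target
                source = mount.get("source") if mount.get("type") == "volume" else None
            else:  # short syntax: SOURCE:TARGET[:MODE]
                parts = str(mount).split(":")
                source = parts[0] if len(parts) > 1 else None
            # Sources that look like paths are treated as bind mounts and skipped.
            if source and not source.startswith(("/", ".", "~")) and source not in declared:
                missing.append(source)
        if missing:
            problems[name] = missing
    return problems


# Hypothetical stack: db_data is declared, cache_data is not.
example = {
    "volumes": {"db_data": None},
    "services": {
        "database": {"image": "postgres", "volumes": ["db_data:/var/lib/postgresql/data"]},
        "cache": {"image": "redis", "volumes": ["cache_data:/data"]},
    },
}
print(undeclared_named_volumes(example))  # {'cache': ['cache_data']}
```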
Address Logstash Overload Issues:
Service logs showed connection limitations and overload in Logstash. If using Filebeat, the expert suggested modifying filebeat.yml and restarting the service:
output.logstash:
  hosts: ["172.22.36.22:5044"]
  worker: 4
  bulk_max_size: 4096
  timeout: 60s
  backoff.init: 1s
  backoff.max: 60s
Restarting Filebeat:
systemctl restart filebeat
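To see what the backoff.init and backoff.max settings above imply when Logstash is unreachable, the reconnect waits can be tabulated. This is only an illustration: Filebeat's documentation describes the wait as growing exponentially up to backoff.max, and the sketch assumes plain doubling with no jitter, which may differ from the actual implementation.

```python
def backoff_schedule(init_s: float, max_s: float, attempts: int) -> list[float]:
    """Waits before each reconnect attempt, doubling from init_s and capped at max_s."""
    waits = []
    wait = init_s
    for _ in range(attempts):
        waits.append(wait)
        wait = min(wait * 2, max_s)
    return waits


# backoff.init: 1s, backoff.max: 60s, first eight attempts
print(backoff_schedule(1, 60, 8))  # [1, 2, 4, 8, 16, 32, 60, 60]
```

Under this assumption, Filebeat reaches the 60-second ceiling by the seventh failed attempt and then retries once per minute.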
Step 3: Implementation & Monitoring
The client was advised to apply the recommendations and monitor the cluster performance. Key focus areas included:
- Verifying restart policies were correctly applied.
- Monitoring CPU and memory usage after the resource limits were applied.
- Ensuring persistent volumes were correctly configured for data integrity.
- Checking Logstash performance after updating Filebeat configurations.
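The resource-usage check in the list above can be partly automated. As a sketch, the following reads the JSON lines produced by docker stats --no-stream --format '{{json .}}' and flags containers whose memory percentage exceeds a chosen threshold. The function name, the 80% threshold and the sample container names are hypothetical.

```python
import json


def containers_over(stats_lines, mem_threshold_pct: float) -> list[tuple[str, float]]:
    """Return (name, memory %) for containers whose MemPerc exceeds the threshold."""
    flagged = []
    for line in stats_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        try:
            mem_pct = float(row["MemPerc"].rstrip("%"))
        except ValueError:
            continue  # skip rows where docker reports no usable value
        if mem_pct > mem_threshold_pct:
            flagged.append((row["Name"], mem_pct))
    return flagged


# Hypothetical sample of two output lines.
sample = [
    '{"Name":"billing.1","MemPerc":"91.20%","CPUPerc":"12.00%"}',
    '{"Name":"gateway.1","MemPerc":"35.00%","CPUPerc":"3.10%"}',
]
print(containers_over(sample, 80))  # [('billing.1', 91.2)]
```

Once limits are in place, MemPerc is measured against each container's own limit, so a container that stays near 100% may need a larger limit or a closer look at its memory use.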
Solution:
Following the expert’s recommendations, the client implemented resource constraints and restart policies. These changes resulted in:
- Improved container recovery after unexpected restarts
- Better resource allocation and reduced contention
- Enhanced system stability and reduced downtime
Conclusion:
By optimizing the Docker Swarm configuration, the client successfully resolved the container restart issues and improved overall cluster reliability. This case highlights the importance of setting proper resource limits and restart policies to ensure smooth containerized application operations in a production environment.