Problem
The client encountered frequent master (manager) re-elections in their production Docker Swarm cluster, despite having set the dispatcher-heartbeat value to 2 minutes. These re-elections were happening within fractions of a second, raising concerns about Swarm stability and service availability. The client’s Docker environment was based on version 1.13.1 running on RHEL 7.9.
Key symptoms included:
- Frequent and rapid re-elections of the Swarm manager node, even under minor network fluctuations.
- No apparent way to change the heartbeat-tick and election-tick parameters via the Docker CLI or configuration files.
- Confusion between the purpose of dispatcher-heartbeat and the Raft election parameters.
Process
Step 1: Initial Investigation
The expert began by reviewing the Docker Swarm cluster’s Raft configuration using docker info, which revealed the following settings:
- Heartbeat Tick: 1
- Election Tick: 3
- Dispatcher Heartbeat Period: 5 seconds
This meant the effective election timeout was only 3 ticks, leading to unnecessary leader re-elections due to minor delays. However, since Docker 1.13.1 does not expose Raft internals for user modification, the expert identified that adjusting heartbeat-tick or election-tick was not feasible via the CLI or daemon.json.
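The settings above can be pulled out of saved docker info output programmatically. The following is a minimal Python sketch, assuming the text layout resembles the sample shown; the sample is illustrative and was not captured from the client's cluster, and parse_swarm_timing is a hypothetical helper name.

```python
import re

# Illustrative excerpt only (an assumption); the exact layout of
# `docker info` output can differ between Docker versions.
SAMPLE_DOCKER_INFO = """\
Swarm: active
 Is Manager: true
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
"""


def parse_swarm_timing(info_text):
    """Extract Raft tick settings and the dispatcher heartbeat from info text."""
    patterns = {
        "heartbeat_tick": r"^[ \t]*Heartbeat Tick:[ \t]*(\d+)[ \t]*$",
        "election_tick": r"^[ \t]*Election Tick:[ \t]*(\d+)[ \t]*$",
        "dispatcher_heartbeat": r"^[ \t]*Heartbeat Period:[ \t]*(.+?)[ \t]*$",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, info_text, flags=re.MULTILINE)
        if match is None:
            continue  # leave the key out when the label is not present
        value = match.group(1)
        # Tick counts are plain integers; the dispatcher period stays a string.
        result[key] = int(value) if value.isdigit() else value
    return result


print(parse_swarm_timing(SAMPLE_DOCKER_INFO))
```

Feeding the function the output of docker info from each manager makes it easy to confirm that all managers report the same tick values.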
To address the client’s concerns, the expert explored alternative mitigation strategies:
- Testing whether increasing dispatcher-heartbeat could indirectly reduce re-elections.
- Attempting to change Raft parameters via docker swarm update and daemon.json (which failed).
- Reviewing the Docker source code to confirm Raft values are hardcoded in version 1.13.1.
- Advising on upgrading Docker to a version where Raft parameters are configurable.
Step 2: Troubleshooting & Fixes
Attempt to Change Election Timeout via CLI
The client attempted the following command based on initial guidance:
docker swarm update --heartbeat-tick 1 --election-tick 10
This resulted in an error because Docker 1.13.1 does not support modifying these values.
Testing dispatcher-heartbeat
To explore mitigation, the expert suggested increasing dispatcher-heartbeat using:
docker swarm update --dispatcher-heartbeat 20s
This command worked and was verified by reviewing the updated docker info output. Further testing successfully increased the value to 30 seconds.
Clarification on Heartbeat vs. Dispatcher Heartbeat
The expert clarified that dispatcher-heartbeat only controls how often worker nodes report their health and does not affect manager re-election timing, which is governed by the Raft heartbeat-tick and election-tick settings.
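To make the distinction concrete: in Raft terms, heartbeat-tick is the number of ticks between leader heartbeats, and election-tick is the number of ticks a follower waits without hearing from the leader before it starts an election. The sketch below works through that arithmetic in Python. It assumes one tick lasts one second (an assumption; confirm it for the Docker version in use), and raft_timing is a hypothetical helper, not part of any Docker tooling.

```python
# Assumption: one Raft tick lasts one second.
TICK_SECONDS = 1.0


def raft_timing(heartbeat_tick, election_tick, tick_seconds=TICK_SECONDS):
    """Return nominal Raft timings derived from the two tick settings."""
    if election_tick <= heartbeat_tick:
        raise ValueError("election_tick must be greater than heartbeat_tick")
    return {
        # How often the leader sends heartbeats to followers.
        "heartbeat_interval_s": heartbeat_tick * tick_seconds,
        # How long a follower waits in silence before calling an election.
        "election_timeout_s": election_tick * tick_seconds,
        # How many heartbeat intervals fit inside one election timeout.
        "heartbeats_per_timeout": election_tick // heartbeat_tick,
    }


# Values reported by the client's cluster (Heartbeat Tick 1, Election Tick 3).
print(raft_timing(1, 3))
# Values the client tried to set with the rejected CLI command.
print(raft_timing(1, 10))
```

The dispatcher heartbeat does not appear anywhere in this calculation, which is why raising it to 20 or 30 seconds leaves the election timeout unchanged.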
Limitations of Docker 1.13.1
It was confirmed that Docker 1.13.1 does not support changing Raft parameters via CLI or configuration files. The expert validated this by examining the Docker source code and sharing relevant documentation links.
Recommendations for Resolution
The expert provided the following options:
- Upgrade Docker: Upgrade to Docker 20.10.4 or later, where Raft parameters are more configurable and stability improvements are included.
- Custom Build (Advanced Option): Modify Docker’s source code to increase the hardcoded Raft tick values. This would involve locating the Raft constants in the Go source code, recompiling Docker, and deploying in a test environment first. (This approach was suggested but not tested by the expert.)
- Improve Network Stability: Ensure a highly stable, low-latency connection between manager nodes to avoid false re-election triggers.
- Add More Manager Nodes: Increase the number of manager nodes to improve resilience and reduce the risk of leader loss due to transient issues.
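When weighing the last option, it may help to keep Raft's majority rule in mind: a cluster of N managers needs a majority (quorum) to elect a leader, so the number of manager failures it can tolerate grows only with odd manager counts. The Python sketch below works through that arithmetic; the function names are illustrative.

```python
def quorum(managers):
    """Smallest number of managers that forms a majority."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def failures_tolerated(managers):
    """Managers that can be lost while a majority remains available."""
    return managers - quorum(managers)


# Print the quorum size and fault tolerance for common manager counts.
for count in (1, 3, 4, 5, 7):
    print(count, quorum(count), failures_tolerated(count))
```

Going from 3 to 4 managers, for example, raises the quorum from 2 to 3 without allowing any additional failure, so odd manager counts are usually the more useful choice.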
Solution
Ultimately, the client acknowledged the limitations of Docker 1.13.1 and agreed to upgrade to a supported Docker version where election timeouts could be tuned. The expert recommended Docker version 20.10.4 or higher, citing improvements related to heartbeat handling and Raft configuration flexibility.
This solution addressed the root cause of rapid manager re-elections and enabled the client to apply more robust Swarm configurations moving forward.
Conclusion
The root issue stemmed from hardcoded Raft parameters in Docker 1.13.1, leading to rapid and frequent Swarm manager re-elections. Despite efforts to modify the configuration, these parameters were confirmed to be immutable in this version. The expert provided clear guidance and alternative mitigation steps, including:
- Clarifying the distinction between dispatcher-heartbeat and Raft parameters.
- Successfully adjusting dispatcher-heartbeat to improve worker node communication.
- Recommending an upgrade path to unlock full configurability of election settings.
This structured engagement enabled the client to move toward a stable and tunable Swarm configuration, aligned with best practices for production environments.