Stabilizing Docker Swarm Elections: Overcoming Raft Configuration Limitations in Version 1.13.1

Problem:

The client encountered frequent master (manager) re-elections in their production Docker Swarm cluster, despite having the dispatcher-heartbeat value set to 2 minutes. These re-elections were happening within fractions of a second, causing concerns around Swarm stability and service availability. The client’s Docker environment was based on version 1.13.1 running on RHEL 7.9.

Key symptoms included:

Frequent and rapid re-elections of the Swarm manager node, even under minor network fluctuations.
No apparent way to change the heartbeat-tick and election-tick parameters via Docker CLI or configuration files.
Confusion between the purpose of dispatcher-heartbeat and Raft election parameters.

Process:

Step 1: Initial Investigation

The expert began by reviewing the Docker Swarm cluster’s Raft configuration using docker info, which revealed the following settings:

Heartbeat Tick: 1
Election Tick: 3
Dispatcher Heartbeat Period: 5 seconds

This meant a follower would wait only 3 ticks (about 3 seconds, assuming the default tick of roughly one second) without hearing from the leader before starting an election, so minor delays between managers were enough to trigger unnecessary leader re-elections. However, since Docker 1.13.1 does not expose Raft internals for user modification, the expert identified that adjusting heartbeat-tick or election-tick was not feasible via CLI or daemon.json.
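The arithmetic behind that observation can be sketched in a few lines of Python. This is only an illustration: the sample text reuses the three values quoted above rather than real docker info output, the function names are hypothetical, and the one-second tick is an assumption that may not hold across Docker versions.

```python
import re

# Assumption: one Raft tick lasts about one second. This is not guaranteed
# and may differ between Docker versions.
ASSUMED_TICK_SECONDS = 1.0


def parse_raft_ticks(docker_info_text):
    """Return (heartbeat_tick, election_tick) found in docker info style text."""
    values = {}
    for key in ("Heartbeat Tick", "Election Tick"):
        match = re.search(rf"{key}:\s*(\d+)", docker_info_text)
        if match is None:
            raise ValueError(f"{key!r} not found in the supplied text")
        values[key] = int(match.group(1))
    return values["Heartbeat Tick"], values["Election Tick"]


def election_timeout_seconds(election_tick, tick_seconds=ASSUMED_TICK_SECONDS):
    """Baseline time a follower waits for the leader before calling an election."""
    return election_tick * tick_seconds


# Sample text built from the values reported in this write-up.
sample = """\
Heartbeat Tick: 1
Election Tick: 3
Dispatcher Heartbeat Period: 5 seconds
"""

heartbeat_tick, election_tick = parse_raft_ticks(sample)
timeout = election_timeout_seconds(election_tick)
print(f"heartbeat every {heartbeat_tick} tick(s), election after {timeout} s of silence")
```

In practice the text would come from the output of docker info on a manager node; the parsing is deliberately loose so that it tolerates indentation differences.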

To address the client’s concerns, the expert explored alternative mitigation strategies:

Testing whether increasing dispatcher-heartbeat could indirectly reduce re-elections.
Attempting to change Raft parameters via docker swarm update and daemon.json (which failed).
Reviewing the Docker source code to confirm Raft values are hardcoded in version 1.13.1.
Advising on upgrading Docker to a version where Raft parameters are configurable.

Step 2: Troubleshooting & Fixes

Attempt to Change Election Timeout via CLI

The client attempted the following command based on initial guidance:

docker swarm update --heartbeat-tick 1 --election-tick 10

This resulted in an error because docker swarm update in Docker 1.13.1 does not accept the --heartbeat-tick or --election-tick flags, so these values cannot be modified from the CLI.

Testing dispatcher-heartbeat

To explore mitigation, the expert suggested increasing dispatcher-heartbeat using:

docker swarm update --dispatcher-heartbeat 20s

This command worked and was verified by reviewing updated docker info output. Further testing increased the value to 30 seconds successfully.

Clarification on Heartbeat vs. Dispatcher Heartbeat

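The expert clarified that dispatcher-heartbeat only controls how often worker nodes report their health to the managers. It does not affect manager re-election timing, which is governed by the Raft settings: heartbeat-tick sets how many ticks pass between heartbeats sent by the leader, and election-tick sets how many ticks a follower waits without hearing from the leader before starting an election.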
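A small model may make the distinction concrete. The sketch below counts how many periods of leader silence would be long enough to trigger an election for a given election-tick. The list of silences is hypothetical, the function name is illustrative, and the one-second tick is an assumption. Note that dispatcher-heartbeat does not appear as an input at all, which is the point of the clarification.

```python
def elections_triggered(silences_seconds, election_tick, tick_seconds=1.0):
    """Count silences from the leader that reach the election timeout.

    Simplified model: an election starts whenever a follower hears nothing
    from the leader for election_tick * tick_seconds. The dispatcher
    heartbeat period is intentionally not a parameter, because it plays
    no role in this decision.
    """
    timeout = election_tick * tick_seconds
    return sum(1 for silence in silences_seconds if silence >= timeout)


# Hypothetical gaps (in seconds) during which followers heard nothing
# from the leader, for example because of brief network stalls.
observed_silences = [0.2, 1.5, 3.4, 4.0, 0.8]

for election_tick in (3, 10):
    count = elections_triggered(observed_silences, election_tick)
    print(f"election-tick={election_tick}: {count} election(s) triggered")
```

Under these assumptions, the same set of stalls that causes repeated elections with an election-tick of 3 would be absorbed by a larger value, whereas changing dispatcher-heartbeat leaves the result unchanged.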

Limitations of Docker 1.13.1

It was confirmed that Docker 1.13.1 does not support changing Raft parameters via CLI or configuration files. The expert validated this by examining the Docker source code and shared relevant documentation links with the client.

Recommendations for Resolution

The expert provided the following options:

Upgrade Docker: Recommended upgrading to Docker 20.10.4 or later, where Raft parameters are more configurable and include stability improvements.
Custom Build (Advanced Option): Modify Docker’s source code to increase the hardcoded Raft tick values. This would involve locating Raft constants in the Go source code, recompiling Docker, and deploying in a test environment first. (This approach was suggested but not tested by the expert.)
Improve Network Stability: Ensuring a highly stable, low-latency connection between manager nodes to avoid false re-election triggers.
Add More Manager Nodes: Increasing the number of manager nodes (keeping the total odd so that a clear majority is always possible) to improve resilience and reduce the risk of losing quorum due to transient issues.
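For the last option, it helps to see how manager count relates to fault tolerance. The following sketch uses the standard Raft majority rule (a quorum is more than half of the managers); the function names are illustrative. It shows why an even number of managers adds no extra tolerance over the odd number below it.

```python
def quorum_size(managers):
    """Smallest number of managers that forms a majority."""
    if managers < 1:
        raise ValueError("a swarm needs at least one manager")
    return managers // 2 + 1


def tolerated_failures(managers):
    """How many managers can be lost while a majority remains."""
    return managers - quorum_size(managers)


for managers in (1, 3, 4, 5, 7):
    print(
        f"{managers} manager(s): quorum={quorum_size(managers)}, "
        f"tolerates {tolerated_failures(managers)} failure(s)"
    )
```

More managers increase the number of failures the cluster survives, but they do not lengthen the election timeout itself, so this option complements rather than replaces the others.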

Solution:

Ultimately, the client acknowledged the limitations of Docker 1.13.1 and agreed to upgrade Docker Swarm to a supported version where election timeouts could be tuned. The expert recommended Docker version 20.10.4 or higher, citing improvements specifically related to heartbeat handling and Raft configuration flexibility.

This solution addressed the root cause of rapid manager re-elections and enabled the client to apply more robust Swarm configurations moving forward.

Conclusion:

The root issue stemmed from hardcoded Raft parameters in Docker 1.13.1, leading to rapid and frequent Swarm manager re-elections. Despite efforts to modify the configuration, these parameters were confirmed to be immutable in this version. The expert provided clear guidance and alternative mitigation steps, including:

Clarifying the distinction between dispatcher-heartbeat and Raft parameters.
Successfully adjusting dispatcher-heartbeat to improve worker node communication.
Recommending an upgrade path to unlock full configurability of election settings.

This structured engagement enabled the client to move toward a stable and tunable Swarm configuration, aligned with best practices for production environments.