Problem:

The client faced issues with frequent re-elections in a Docker Swarm cluster whenever there were brief server-level disruptions. They sought guidance on modifying the swarm election timeout to stabilize the cluster and prevent unnecessary re-elections. Additionally, they wanted to understand the relationship between election timeout, heartbeat, and dispatcher-heartbeat settings.

Process:

  1. Understanding Swarm Election Timeout Configuration:

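    The expert explained the underlying Raft consensus mechanism used by Docker Swarm, focusing on two parameters: heartbeat-tick and election-tick. Both are counted in Raft ticks, which Swarm advances once per second by default. heartbeat-tick is the number of ticks between heartbeats sent by the leader, and election-tick is the number of ticks a follower waits without hearing from the leader before it starts a new election. The election timeout is therefore election-tick ticks, and election-tick should be several times larger than heartbeat-tick so that a single delayed heartbeat does not trigger an election.

    For example, if heartbeat-tick is set to 1 and election-tick to 10, the leader sends a heartbeat every second and the election timeout is 10 seconds.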

    The client shared their current Docker Swarm configuration, where heartbeat-tick was set to 1 and election-tick to 3. This gives an election timeout of about 3 seconds, short enough that a brief network interruption could trigger a re-election.
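
    The current values can be read from docker info on a manager node. The sketch below is a minimal illustration, not the expert's procedure: the helper names raft_tick and election_timeout_seconds are hypothetical, the parsing assumes the Heartbeat Tick and Election Tick lines that docker info prints for an active swarm manager, and the one-second tick length is an assumption of the sketch.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric value of a Raft field
# ("Heartbeat Tick" or "Election Tick") found in `docker info` text on stdin.
raft_tick() {
    sed -n "s/^[[:space:]]*$1:[[:space:]]*\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Hypothetical helper: election timeout in seconds, computed as
# election-tick ticks times the tick length.
# $1 = election-tick, $2 = tick length in seconds (assumed to be 1 if omitted).
election_timeout_seconds() {
    echo $(( $1 * ${2:-1} ))
}

# Example, to be run on a manager node:
#   info=$(docker info 2>/dev/null)
#   el=$(printf '%s\n' "$info" | raft_tick "Election Tick")
#   hb=$(printf '%s\n' "$info" | raft_tick "Heartbeat Tick")
#   echo "heartbeat-tick=$hb election-tick=$el"
#   election_timeout_seconds "$el"
```

    With the client's values (heartbeat-tick 1, election-tick 3), the last call would print 3, consistent with the short timeout described above.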

  2. Exploring Changes to Election Timeout:

    The client initially attempted to change the election timeout using the command:

    docker swarm update --heartbeat-tick 1 --election-tick 10

    However, they encountered an error, because docker swarm update does not accept these flags in their version of Docker (1.13.1).

    The expert clarified that Docker intentionally restricts direct user modification of Raft parameters through the CLI to maintain cluster stability. While heartbeat-tick and election-tick cannot be modified this way in Docker 1.13.1, other parameters such as dispatcher-heartbeat can be adjusted.
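
    Before attempting an update, it can help to check which flags the installed docker swarm update actually lists. The following is a minimal sketch under that assumption; help_lists_flag is a hypothetical helper name, and it only inspects the help text rather than changing anything.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the help text on stdin lists the given flag
# as a whole word (for example "--dispatcher-heartbeat").
help_lists_flag() {
    grep -q -E -e "(^|[[:space:]])$1([[:space:]=]|\$)"
}

# Example, to be run where the docker CLI is installed:
#   for flag in --election-tick --heartbeat-tick --dispatcher-heartbeat; do
#       if docker swarm update --help | help_lists_flag "$flag"; then
#           echo "$flag: listed"
#       else
#           echo "$flag: not listed"
#       fi
#   done
```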

  3. Adjusting Dispatcher-Heartbeat:

    The expert suggested updating the dispatcher-heartbeat parameter to indirectly address the timeout issue:

    docker swarm update --dispatcher-heartbeat 20s

    This setting determines the interval at which worker nodes report their health status to the manager. It does not change the Raft election timeout itself, but increasing it can reduce how often nodes are marked as down because of brief network latency, which indirectly stabilizes the cluster.
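
    The value in effect can be checked before and after the update. The sketch below is illustrative only: dispatcher_heartbeat is a hypothetical helper name, and the parsing assumes that docker info on a manager prints a Heartbeat Period line under its Dispatcher section, in human-readable form.

```shell
#!/bin/sh
# Hypothetical helper: print the dispatcher heartbeat period as shown in
# `docker info` text on stdin (for example "5 seconds").
dispatcher_heartbeat() {
    sed -n 's/^[[:space:]]*Heartbeat Period:[[:space:]]*//p' | head -n 1
}

# Example, to be run on a manager node:
#   docker info 2>/dev/null | dispatcher_heartbeat     # value before the change
#   docker swarm update --dispatcher-heartbeat 20s
#   docker info 2>/dev/null | dispatcher_heartbeat     # value after the change
```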

    The client successfully tested this change and confirmed the updated configuration:

    • Initial dispatcher-heartbeat: 5 seconds
    • New dispatcher-heartbeat: 20 seconds

  4. Further Testing and Recommendations:

    The client inquired about increasing dispatcher-heartbeat to 5 minutes. The expert advised against such a drastic change, as it could delay detection of actual node failures. Instead, the expert recommended raising the value in increments, beginning with 120 seconds, and observing the effect on system behavior at each step.
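
    One way to keep such a staged change deliberate is to write the plan down before running anything. The sketch below only prints the commands; to_seconds and plan_steps are hypothetical helpers, and the 300-second ceiling is an assumption chosen to reflect the advice against a 5-minute value, not a limit enforced by Docker.

```shell
#!/bin/sh
# Hypothetical helper: convert a duration such as 20s or 2m to whole seconds.
to_seconds() {
    num=${1%[sm]}
    case "$num" in
        ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
    esac
    case "$1" in
        *s) echo "$num" ;;
        *m) echo $(( num * 60 )) ;;
        *)  echo "unsupported duration: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print (not run) one update command per candidate value,
# skipping any value at or above the ceiling given in seconds as $1.
plan_steps() {
    ceiling=$1
    shift
    for value in "$@"; do
        secs=$(to_seconds "$value") || return 1
        if [ "$secs" -ge "$ceiling" ]; then
            echo "# skip $value: at or above the ${ceiling}s ceiling"
        else
            echo "docker swarm update --dispatcher-heartbeat $value"
        fi
    done
}

# Example: plan_steps 300 20s 120s 5m
```

    Each printed command would then be applied by hand, with time left between steps to watch how quickly node failures are still detected.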

  5. Addressing Version Limitations:

    The expert highlighted the limitations of Docker 1.13.1, in particular its limited set of configuration options. To test solutions effectively, the expert obtained the client’s RPM packages and replicated the environment. The expert verified that the dispatcher-heartbeat changes worked in Docker 1.13.1 but reiterated the need to upgrade to a more recent Docker version for better configurability and stability.

Solution:

The expert resolved the issue by:

  • Explaining the relationship between heartbeat-tick, election-tick, and dispatcher-heartbeat.
  • Recommending and testing updates to dispatcher-heartbeat, starting with 20 seconds.
  • Advising gradual changes to the dispatcher-heartbeat value (e.g., 120 seconds) to balance stability and responsiveness.
  • Highlighting the limitations of Docker 1.13.1 and suggesting an upgrade to a newer version for long-term improvements.

Conclusion:

This case underscores the challenges of maintaining stability in Docker Swarm clusters, particularly in older versions with limited configuration options. By following the expert’s recommendations, the client stabilized their cluster and mitigated unnecessary re-elections. Incremental adjustments and thorough testing of parameters like dispatcher-heartbeat proved essential. For better flexibility and control, upgrading to a more recent Docker version remains a critical next step.