Problem:

The client faced issues with frequent re-elections in a Docker Swarm cluster whenever there were brief server-level disruptions. They sought guidance on modifying the swarm election timeout to stabilize the cluster and prevent unnecessary re-elections. Additionally, they wanted to understand the relationship between election timeout, heartbeat, and dispatcher-heartbeat settings.

Process:

Step 1: Initial Investigation

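The expert began by explaining the Raft consensus mechanism that Docker Swarm managers use to elect a leader. The focus was placed on two parameters, both counted in Raft ticks: heartbeat-tick, the number of ticks between leader heartbeats, and election-tick, the number of ticks a follower waits without hearing from the leader before it starts an election. The election timeout is therefore election-tick multiplied by the tick interval; with heartbeat-tick set to 1, this gives the same number as the product heartbeat-tick * election-tick.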

The client shared their current Docker Swarm configuration, in which heartbeat-tick was set to 1 and election-tick to 3. The expert pointed out that a follower in this setup tolerates only three ticks of silence from the leader before calling an election, so brief network disruptions could easily trigger re-elections.
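As a rough worked example of this arithmetic, the sketch below computes the election timeout from election-tick and shows one way to read the live values on a manager node. It assumes a one-second Raft tick interval, in which case the result equals heartbeat-tick * election-tick whenever heartbeat-tick is 1. The function names are hypothetical, and the --format field paths follow the Engine API swarm spec as best understood here, so they should be verified against the Docker version in use.

```shell
# Election timeout in seconds = election-tick * seconds per tick.
# The second argument defaults to 1 (an assumed tick interval, not taken from the case notes).
election_timeout_seconds() {
  echo $(( $1 * ${2:-1} ))
}

# Hypothetical helper: print the Raft tick settings of the local manager.
# Must be run on a manager node; the field paths are an assumption to verify.
show_raft_timing() {
  hb=$(docker info --format '{{.Swarm.Cluster.Spec.Raft.HeartbeatTick}}') || return 1
  el=$(docker info --format '{{.Swarm.Cluster.Spec.Raft.ElectionTick}}') || return 1
  echo "heartbeat-tick=$hb election-tick=$el timeout=$(election_timeout_seconds "$el")s"
}

election_timeout_seconds 3    # the client's setting; prints 3
election_timeout_seconds 10   # the value the client wanted to set; prints 10
```

Under that assumption the client's configuration leaves roughly a 3-second window, which is short enough for a brief server-level disruption to exceed.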

Step 2: Exploring Changes to Election Timeout

The client attempted to change the election timeout using the command:

docker swarm update --heartbeat-tick 1 --election-tick 10

However, the command failed with an error because docker swarm update in Docker 1.13.1 does not support these flags. The expert explained that Docker intentionally restricts direct modification of Raft parameters through the CLI for cluster stability reasons. Because heartbeat-tick and election-tick could not be adjusted, the expert suggested focusing on other settings, such as dispatcher-heartbeat.

Step 3: Adjusting Dispatcher-Heartbeat

To address the issue indirectly, the expert recommended updating the dispatcher-heartbeat parameter. This setting is separate from the Raft timers: it controls the interval at which nodes report their health status to the manager. Increasing dispatcher-heartbeat gives nodes more time to report before the manager marks them as down, so brief network latency is less likely to be treated as a node failure, which helps stabilize the cluster.

The client applied the recommended change:

docker swarm update --dispatcher-heartbeat 20s

This raised dispatcher-heartbeat from its initial 5 seconds to 20 seconds.
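A minimal sketch of applying the change from a manager node and then checking it follows. The wrapper names and the DOCKER override variable are hypothetical conveniences for dry runs. Only docker swarm update --dispatcher-heartbeat and docker info come from the Docker CLI, and the "Heartbeat Period" label in the docker info output is assumed to match the version in use.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to preview the command.
set_dispatcher_heartbeat() {
  case "$1" in
    ''|*[!0-9a-z]*)
      echo "usage: set_dispatcher_heartbeat <duration, e.g. 20s>" >&2
      return 2
      ;;
  esac
  ${DOCKER:-docker} swarm update --dispatcher-heartbeat "$1"
}

# Show the heartbeat period currently reported by the manager.
show_dispatcher_heartbeat() {
  ${DOCKER:-docker} info 2>/dev/null | grep -i 'heartbeat period'
}

# Dry run in a subshell: prints the arguments that would be passed to docker.
( DOCKER=echo; set_dispatcher_heartbeat 20s )
```

Running show_dispatcher_heartbeat before and after the update is a quick way to confirm that the new period was accepted.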

Step 4: Further Testing and Recommendations

The client considered increasing the dispatcher-heartbeat to 5 minutes but was advised against such a drastic change. The expert recommended incrementally adjusting the value, starting with 120 seconds, to evaluate its effect on the system’s performance and stability.
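One way to structure such incremental changes is sketched below. The doubling policy, the function names, and the SOAK_SECONDS pause are illustrative assumptions rather than part of the expert's advice. The example values reuse the 120-second starting point and the 5-minute figure the client originally proposed, and whether to go beyond 120 seconds at all depends on what testing shows.

```shell
# Print heartbeat values (in seconds) from START up to TARGET, doubling each step.
# The doubling policy is an illustrative assumption.
heartbeat_steps() {
  step=$1
  target=$2
  if [ "$step" -le 0 ] || [ "$target" -lt "$step" ]; then
    echo "usage: heartbeat_steps <start> <target>, with 0 < start <= target" >&2
    return 2
  fi
  while [ "$step" -lt "$target" ]; do
    echo "$step"
    step=$(( step * 2 ))
  done
  echo "$target"
}

# Apply each step, then pause so the cluster can be observed before the next one.
# SOAK_SECONDS is a hypothetical knob; one hour is an arbitrary default.
ramp_dispatcher_heartbeat() {
  for secs in $(heartbeat_steps "$1" "$2"); do
    ${DOCKER:-docker} swarm update --dispatcher-heartbeat "${secs}s" || return 1
    echo "applied ${secs}s; watch node status and manager logs before continuing"
    sleep "${SOAK_SECONDS:-3600}"
  done
}

heartbeat_steps 120 300   # prints 120, 240 and 300 on separate lines
```

Stopping the ramp at the first value that shows delayed failure detection keeps the trade-off between responsiveness and stability visible at each step.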

Step 5: Addressing Version Limitations

The expert also highlighted the limitations of Docker 1.13.1, noting its outdated configuration options. To better understand the client’s environment, the expert obtained their RPM packages and replicated the setup. Testing confirmed that the dispatcher-heartbeat changes worked in Docker 1.13.1, but the expert emphasized that upgrading to a more recent Docker version would offer better configurability and long-term stability.

Solution:

The solution included the following steps:

  1. Explaining the relationship between heartbeat-tick, election-tick, and dispatcher-heartbeat.
  2. Recommending an initial update to dispatcher-heartbeat, setting it to 20 seconds.
  3. Suggesting gradual increments of dispatcher-heartbeat (e.g., 120 seconds) to balance system responsiveness and stability.
  4. Advising an upgrade to a newer Docker version for improved configuration options and overall stability.

Conclusion:

This case illustrates the complexities of maintaining Docker Swarm stability, particularly when using older versions with limited configuration flexibility. By following the expert’s recommendations, the client successfully mitigated unnecessary re-elections, leading to a more stable cluster. The key to success was incremental adjustments to dispatcher-heartbeat, along with thorough testing. Moving forward, upgrading Docker to a newer version remains crucial for enhanced stability and control over the configuration.