Problem:

The customer is seeing frequent Docker Swarm leader re-elections, triggered even by brief server issues lasting only a few seconds. They want guidance on how to change the Swarm election timeout and on what side effects the change might have.

Process:

Step 1: Initial Investigation

The customer reported frequent leader re-elections in Docker Swarm whenever brief server issues occurred. They asked for advice on changing the election timeout and on the potential effects of the change. The expert explained that Docker Swarm uses the Raft consensus algorithm, which has two key timing settings: the heartbeat interval (how often the leader signals its followers) and the election timeout (how long a follower waits without hearing from the leader before starting an election).

The expert provided the output of the docker info command showing the current Raft settings and suggested the following command to adjust the election timeout:

docker swarm update --heartbeat-tick 4 --election-tick 60

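This command would set the heartbeat interval to 4 ticks and the election timeout to 60 ticks. Both values are counted in Raft ticks, and a tick currently defaults to one second in Docker, so if the follower nodes don’t receive a heartbeat from the leader within roughly 60 seconds, a new leader election would be initiated.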
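The conversion from tick values to wall-clock time can be sketched as below. This is a minimal illustration, assuming that both settings are counted in Raft ticks and that one tick is one second, which the Docker Engine API documentation describes as the current default but does not guarantee. The function name and the `tick_seconds` parameter are hypothetical and are not part of Docker.

```python
def raft_timings(heartbeat_tick: int, election_tick: int, tick_seconds: float = 1.0) -> dict:
    """Convert Raft tick counts into approximate seconds.

    tick_seconds is an assumption (one second per tick), not a value read from Docker.
    """
    # The Engine API documentation requires the election tick to exceed the heartbeat tick.
    if election_tick <= heartbeat_tick:
        raise ValueError("election_tick must be greater than heartbeat_tick")
    return {
        "heartbeat_interval_s": heartbeat_tick * tick_seconds,
        "election_timeout_s": election_tick * tick_seconds,
    }


# Values taken from the command suggested in the ticket.
print(raft_timings(heartbeat_tick=4, election_tick=60))
```

Under these assumptions the leader would send a heartbeat about every 4 seconds, and a follower would wait about 60 seconds before starting an election.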

Step 2: Customer Follow-up

The customer requested clarification on how to check the current election timeout and asked about the potential impact of changing it. The expert advised running the docker info command while in Swarm mode to check the current election and heartbeat tick values, which would be shown in the Raft section. The customer also inquired about the ideal election timeout for their configuration and any adverse effects.
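Reading the two values out of the Raft section can be scripted. The sketch below is one possible approach, assuming the human-readable `docker info` output labels the values `Heartbeat Tick:` and `Election Tick:`; the helper names are hypothetical, and the sample text is an illustrative excerpt with placeholder values, not the customer's output.

```python
import re
import subprocess

# Matches lines such as "  Election Tick: 10" in the Raft section.
_TICK_LINE = re.compile(r"^\s*(Heartbeat Tick|Election Tick):\s*(\d+)\s*$", re.MULTILINE)


def parse_raft_ticks(docker_info_text: str) -> dict:
    """Extract the Raft tick values from `docker info` text output."""
    return {label: int(value) for label, value in _TICK_LINE.findall(docker_info_text)}


def read_raft_ticks() -> dict:
    """Run `docker info` locally and parse it (needs Docker on a Swarm manager node)."""
    result = subprocess.run(["docker", "info"], capture_output=True, text=True, check=True)
    return parse_raft_ticks(result.stdout)


# Illustrative excerpt only; the values are placeholders.
SAMPLE = """\
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
"""

print(parse_raft_ticks(SAMPLE))
```

An empty result from `parse_raft_ticks` suggests the node is not a Swarm manager or Swarm mode is inactive, since the Raft section is only printed on managers.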

Step 3: Recommendation

The expert recommended increasing the election tick to 10 and keeping the heartbeat tick at 1 to reduce unnecessary re-elections caused by transient network issues. The change would take effect at the next leader election and would have no immediate impact such as service restarts. The trade-off is between outage tolerance and failover speed: a higher election tick suits environments that experience brief network outages, while a lower election tick gives faster failover but requires a stable network. The expert advised testing the change in a non-production environment before applying it in production.
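The trade-off between outage tolerance and failover speed can be illustrated with a rough model. This sketch assumes one tick is one second and treats the election tick as a fixed threshold; real Raft implementations typically randomize the timeout, so the result is an approximation. The function name, the outage durations, and the lower comparison value of 3 are hypothetical.

```python
def outage_triggers_election(outage_seconds: float, election_tick: int,
                             tick_seconds: float = 1.0) -> bool:
    """Return True if a leader outage lasts at least as long as the election timeout.

    Simplified model: ignores timeout randomization and network jitter.
    """
    return outage_seconds >= election_tick * tick_seconds


# Compare a hypothetical low election tick (3) with the recommended value (10).
for outage in (2, 5, 12):
    for election_tick in (3, 10):
        triggered = outage_triggers_election(outage, election_tick)
        print(f"outage={outage}s election_tick={election_tick} -> re-election: {triggered}")
```

In this model a 5-second outage causes a re-election with an election tick of 3 but not with 10, while a genuine 12-second leader loss is detected in both cases, only later with the higher value.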

Conclusion:

The customer experienced frequent Docker Swarm leader re-elections caused by brief server issues. After discussing the Raft timing settings, the expert suggested raising the election tick to prevent unnecessary re-elections; the change requires no service restarts, at the cost of slower failover when the leader is genuinely lost. The customer did not respond further, so the ticket was closed from the support side.