Diagnosing and Resolving ETCD Cluster Sync Issues

Problem:

The client encountered an issue with desynchronization of nodes in an ETCD cluster running on RHEL 8.8. Error logs indicated significant disk write delays (slow fdatasync), which caused the Patroni cluster to fail and become unavailable.

Process:

Step 1 – Initial Analysis

The expert asked the client to check the health of the cluster using etcdctl endpoint health to confirm node availability and rule out TLS or network-related issues.
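The health check from this step can also be scripted so that unhealthy members are listed explicitly. The following is a minimal sketch, not the expert's actual tooling. It assumes etcdctl (API v3) is on the PATH, that `-w json` output for `endpoint health` is a list of objects with `endpoint` and `health` fields, and that the JSON is still written to stdout when a member is unhealthy. The function names are hypothetical.

```python
import json
import subprocess


def parse_endpoint_health(raw_json):
    """Split `etcdctl endpoint health -w json` output into healthy/unhealthy lists."""
    healthy, unhealthy = [], []
    for entry in json.loads(raw_json):
        target = healthy if entry.get("health") else unhealthy
        target.append(entry.get("endpoint", "<unknown>"))
    return healthy, unhealthy


def check_cluster_health(extra_args=()):
    """Run the health check against every member known to the cluster.

    extra_args can carry connection flags such as --endpoints, --cacert,
    --cert and --key when TLS is enabled.
    """
    cmd = ["etcdctl", "endpoint", "health", "--cluster", "-w", "json", *extra_args]
    # etcdctl exits non-zero when a member is unhealthy, so do not use check=True.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if not result.stdout.strip():
        raise RuntimeError(f"etcdctl produced no output: {result.stderr.strip()}")
    return parse_endpoint_health(result.stdout)
```

Calling `check_cluster_health()` on one of the cluster nodes returns two lists. A member that appears in the second list, or an exception carrying a TLS or connection error, points to an availability problem rather than a storage problem.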

Step 2 – Disk Performance Analysis

Commands such as iostat -xz, vmstat, and dmesg were suggested to assess the performance of the disk subsystem and determine whether the delays were caused by I/O overload or hardware problems.
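Alongside these system tools, the sync latency that ETCD complains about can be measured directly. The sketch below is an illustrative probe, not part of the original engagement. It writes small records to a temporary file and times an fdatasync after each one, roughly the pattern of a write-ahead log. The function names, sample count, and record size are assumptions. The commonly cited ETCD guideline is a 99th percentile fdatasync latency below about 10 ms.

```python
import math
import os
import tempfile
import time


def measure_fdatasync_ms(directory, samples=200, write_size=2048):
    """Write small records, sync after each one, return sorted latencies in ms."""
    # os.fdatasync is not available on every platform; fall back to fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    payload = b"\0" * write_size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            os.write(fd, payload)
            start = time.perf_counter()
            sync(fd)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return sorted(latencies)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]
```

The probe should be pointed at a directory on the same filesystem as the ETCD data directory, for example `percentile(measure_fdatasync_ms("/path/on/etcd/disk"), 99)`, where the path is a placeholder. Results taken while the cluster is under load are the most representative.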

Step 3 – Disk Type Verification

The expert recommended verifying that the ETCD data directories were on SSD-backed storage (e.g., gp2/gp3 volumes in AWS or the equivalent in other environments), as the fdatasync delays could stem from slow or overloaded storage.
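On Linux, a first hint about the disk type is the kernel's rotational flag in sysfs. The sketch below reads it for every block device. The function name is hypothetical. Inside a virtual machine the flag describes the virtual disk and may not reflect the physical storage behind the datastore, so the result is only a starting point.

```python
from pathlib import Path


def rotational_flags(sys_block="/sys/block"):
    """Map each block device name to True (rotational) or False (non-rotational)."""
    flags = {}
    for dev in sorted(Path(sys_block).iterdir()):
        flag_file = dev / "queue" / "rotational"
        if flag_file.is_file():
            # The kernel reports "1" for spinning disks and "0" for SSD/NVMe.
            flags[dev.name] = flag_file.read_text().strip() == "1"
    return flags
```

A device reported as rotational under the ETCD data directory is worth following up with the storage or virtualization team.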

Step 4 – Fragmentation and Log Review

Logs revealed frequent slow fdatasync messages on nodes etcd04 and etcd05. Diagnostic output also showed a highly fragmented ETCD data store: the database file occupied about 370 MB on disk while holding only 20–25 KB of live data, which contributed to degraded write performance.
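The gap between physical and logical size can be computed per member from the JSON form of `etcdctl endpoint status`. The following is a minimal sketch. It assumes the `dbSize` and `dbSizeInUse` fields reported by etcd 3.4 and later, and the field names should be verified against the installed version. The function name is hypothetical.

```python
import json


def fragmentation_report(raw_json):
    """Summarise on-disk vs. in-use size from `etcdctl endpoint status -w json`."""
    report = []
    for entry in json.loads(raw_json):
        status = entry["Status"]
        db_size = status["dbSize"]                      # physical size on disk
        in_use = status.get("dbSizeInUse", db_size)     # logical size of live data
        reclaimable = db_size - in_use
        report.append({
            "endpoint": entry["Endpoint"],
            "db_size_bytes": db_size,
            "in_use_bytes": in_use,
            "reclaimable_bytes": reclaimable,
            # Share of the file that defragmentation could give back.
            "reclaimable_ratio": reclaimable / db_size if db_size else 0.0,
        })
    return report
```

With figures like those in this case (roughly 370 MB on disk against about 24 KB in use), the reclaimable ratio comes out above 0.99, which means almost the entire file is free pages left behind by old revisions.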

Solution:

To resolve the issue, the expert recommended performing regular compaction and defragmentation of the ETCD data store. Running etcdctl compact <revision> discards key history older than the given revision, and etcdctl defrag --cluster then releases the freed space on every member, which reduces fragmentation and improves write efficiency. Additionally, the expert advised checking the configuration of the VMware-based virtual machines: ensuring the use of a PVSCSI controller, confirming that VMware Tools were installed and running, and evaluating datastore performance to rule out underlying infrastructure issues. These steps helped restore ETCD cluster synchronization and prevent further failures.
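The compaction and defragmentation sequence can be sketched as a small helper that derives the compaction target from the current revision. This is an illustration under stated assumptions, not the exact procedure used in this case. The function names and the `keep_revisions` parameter are hypothetical, and the revision is read from the `header.revision` field of `etcdctl endpoint status -w json`.

```python
import json


def current_revision(raw_status_json):
    """Highest store revision reported by `etcdctl endpoint status -w json`."""
    entries = json.loads(raw_status_json)
    return max(e["Status"]["header"]["revision"] for e in entries)


def maintenance_commands(revision, keep_revisions=1000):
    """Build the etcdctl invocations for one maintenance pass.

    keep_revisions is an assumed safety margin of history to retain.
    Compaction is skipped when the store has fewer revisions than that.
    """
    commands = []
    if revision > keep_revisions:
        commands.append(["etcdctl", "compact", str(revision - keep_revisions)])
    # Compaction only marks space as free; defrag rewrites the file to release it.
    commands.append(["etcdctl", "defrag", "--cluster"])
    return commands
```

Each command in the returned list can be passed to `subprocess.run` in order, with connection and TLS flags appended as needed. Defragmentation blocks the member it is working on, so a quiet period is the safer choice for running it.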

Conclusion:

The issue was caused by data fragmentation and potential limitations in the disk subsystem. Through careful analysis of logs and system metrics, the expert provided effective optimization and maintenance steps, enabling the client to restore node synchronization and ensure the stable operation of the Patroni cluster.