Problem
The client experienced a failover event in their Patroni-managed PostgreSQL cluster between 01:00 and 02:00 AM on May 23, 2025.
Process
Step 1 – Initial Investigation
Initial logs from PostgreSQL (postgresql-Fri-00.log and postgresql-Fri-01.log) revealed regular query activity: frequent queries from monitoring tools (pg_stat_all_tables, pg_locks, etc.), checkpoint logging, and client connection messages. There were no immediate signs of crash or shutdown events (PANIC, FATAL, or "database system is shutting down"), nor evidence of role transitions.
This absence of failover indicators led to a request for additional data: Patroni logs, thread dumps, and logs from both the primary and standby nodes, including those surrounding the suspected failover window.
Step 2 – Detailed Analysis of Patroni and PostgreSQL Logs
Upon review of patroni_logs_node1.txt and postgresql-Fri-03.log, the issue became clear.
At 01:11:59, Node 1 (r1_prod_n1, the primary) abruptly closed its connection to PostgreSQL. This was immediately followed by:
- Multiple failed attempts to re-establish a connection to PostgreSQL (psycopg2.OperationalError: timeout expired; see the sketch after this list);
- Inability to update the leader lock in etcd (EtcdKeyNotFound);
- Failure of last_operation() checks inside Patroni, which ultimately triggered a self-demotion (immediate-nolock behavior);
- The node entered recovery and attempted to reconnect as a standby, but connection issues persisted for nearly an hour.
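For illustration, the sketch below shows the kind of connection health check that produces this error. It is not Patroni's actual implementation; the DSN and timeout values are placeholders.

```python
# Minimal sketch of a liveness check against the local PostgreSQL instance.
# psycopg2 raises OperationalError ("timeout expired") when the server cannot
# accept the connection within connect_timeout seconds, which is the same error
# that appears in Node 1's Patroni log.
import psycopg2

def local_postgres_is_healthy(dsn: str, timeout_s: int = 5) -> bool:
    try:
        conn = psycopg2.connect(dsn, connect_timeout=timeout_s)
    except psycopg2.OperationalError:
        return False  # e.g. "timeout expired" under CPU/IO saturation
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")  # trivial query to confirm the backend responds
            cur.fetchone()
        return True
    except psycopg2.OperationalError:
        return False
    finally:
        conn.close()
```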
Meanwhile, logs from Node 2 (r1_prod_n2) showed that at 01:12:31, the standby assumed leadership:
I am (r1_prod_n2), the leader with the lock.
This transition confirms that the cluster’s HA mechanism worked as intended: when the primary failed to maintain its lock, the standby took over.
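The same fact can be confirmed from outside the database: Patroni's REST API answers GET /leader with HTTP 200 only on the node that currently holds the leader lock. The sketch below assumes the default REST API port 8008, unauthenticated access, and placeholder host names.

```python
# Hedged sketch: ask each Patroni agent whether it holds the leader lock.
from typing import Optional
import urllib.error
import urllib.request

def find_leader(nodes: dict) -> Optional[str]:
    for name, base_url in nodes.items():
        try:
            with urllib.request.urlopen(f"{base_url}/leader", timeout=3) as resp:
                if resp.status == 200:   # 200 means this node is the leader
                    return name
        except urllib.error.HTTPError:
            continue                     # typically 503: not the leader
        except urllib.error.URLError:
            continue                     # agent unreachable
    return None

print(find_leader({"r1_prod_n1": "http://node1:8008",
                   "r1_prod_n2": "http://node2:8008"}))
```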
Step 3 – PostgreSQL Behavior Under Load and Replication Failures
PostgreSQL logs around the same time showed intense activity, including:
- Mass DELETE and INSERT operations;
- Continuous VACUUM processes (manual and autovacuum);
- CURSOR, BEGIN, and high-frequency UPDATE statements to control tables like APR1_DB2E_CTRL.
At 01:12:23, a duplicate key violation occurred during an INSERT, indicating a concurrency issue:
ERROR: duplicate key value violates unique constraint "apr1_db2e_ctrl_pkey"
At 01:59:55, a replication failure occurred on the standby:
ERROR: requested WAL segment 00000049000520D300000063 has already been removed
This means the replica lagged behind, and the required WAL segment had already been purged, likely due to the default wal_keep_size being too low or the replica being under high I/O pressure.
This did not impact service directly, but it indicates that replication was temporarily broken and would require reinitialization or recovery from archived WAL.
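A quick way to check whether the standby is still streaming after such an error is to query pg_stat_wal_receiver on the standby itself; an empty result (or a status other than streaming) is the cue to fall back to archived WAL or reinitialize the replica, for example with patronictl reinit. The connection details below are placeholders.

```python
# Hedged sketch: is the standby's WAL receiver still connected and streaming?
import psycopg2

def standby_is_streaming(standby_dsn: str) -> bool:
    conn = psycopg2.connect(standby_dsn, connect_timeout=5)
    try:
        with conn.cursor() as cur:
            # pg_stat_wal_receiver holds at most one row; it is empty when the
            # WAL receiver process is not running on this standby.
            cur.execute("SELECT status FROM pg_stat_wal_receiver")
            row = cur.fetchone()
        return row is not None and row[0] == "streaming"
    finally:
        conn.close()
```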
Solution
Based on the evidence, the failover was a result of connectivity issues between Patroni and PostgreSQL on Node 1. This led to a failure in updating the DCS (etcd) lock, triggering automatic demotion. The high volume of queries may have contributed to CPU or I/O saturation, worsening connection stability.
To prevent recurrence:
Tune Patroni and PostgreSQL Settings
- Increase connect_timeout, retry_timeout, and loop_wait in Patroni (a sketch of applying such changes through Patroni's dynamic configuration follows this list).
- Review Patroni's interaction with etcd: ensure TTLs and quorum are correctly set.
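One way to apply such changes cluster-wide is through Patroni's dynamic configuration, which can be edited with patronictl edit-config or patched over the REST API (PATCH /config). The sketch below assumes an unauthenticated REST API on the default port 8008; the numeric values are illustrative, not tuned recommendations for this cluster.

```python
# Hedged sketch: patch Patroni's dynamic configuration via its REST API.
import json
import urllib.request

def patch_patroni_config(base_url: str, changes: dict) -> None:
    req = urllib.request.Request(
        f"{base_url}/config",
        data=json.dumps(changes).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(resp.status, resp.read().decode("utf-8"))

# Illustrative values: keep ttl comfortably larger than loop_wait plus
# retry_timeout so the leader has enough time to renew its lock under load.
patch_patroni_config("http://node1:8008",
                     {"ttl": 60, "loop_wait": 10, "retry_timeout": 20})
```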
WAL Retention and Replica Stability
- Increase wal_keep_size or configure archive_command to retain WAL files longer (see the sketch after this list).
- Monitor replication lag and adjust max_slot_wal_keep_size if replication slots are used.
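A minimal sketch of raising these settings, assuming PostgreSQL 13 or later (where wal_keep_size and max_slot_wal_keep_size exist) and a placeholder admin DSN. In a Patroni-managed cluster the same parameters are usually set under postgresql.parameters in the dynamic configuration so they stay consistent across nodes; ALTER SYSTEM is shown here for brevity.

```python
# Hedged sketch: raise WAL retention so a briefly lagging standby can catch up.
import psycopg2

def raise_wal_retention(admin_dsn: str) -> None:
    conn = psycopg2.connect(admin_dsn)
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
    try:
        with conn.cursor() as cur:
            # Sizes are placeholders; derive them from observed lag and disk headroom.
            cur.execute("ALTER SYSTEM SET wal_keep_size = '2GB'")
            # Only relevant when replication slots are used; caps how much WAL a
            # stalled slot may pin on the primary.
            cur.execute("ALTER SYSTEM SET max_slot_wal_keep_size = '16GB'")
            cur.execute("SELECT pg_reload_conf()")  # both settings are reloadable
    finally:
        conn.close()
```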
Application-Level Improvements
- Address concurrency issues in application logic: consider UPSERT (ON CONFLICT) or advisory locking (an UPSERT sketch follows this list).
- Review batch jobs for potential bottlenecks or excessive update rates.
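A minimal UPSERT sketch for the pattern that triggered the 01:12:23 duplicate-key error. The column names (ctrl_id, status, updated_at) are illustrative stand-ins, not the client's actual APR1_DB2E_CTRL schema, and the ON CONFLICT target must match the columns behind apr1_db2e_ctrl_pkey.

```python
# Hedged sketch: insert-or-update a control row without risking a duplicate key.
import psycopg2

UPSERT_SQL = """
INSERT INTO apr1_db2e_ctrl (ctrl_id, status, updated_at)  -- illustrative columns
VALUES (%s, %s, now())
ON CONFLICT (ctrl_id)                                      -- must match the PK
DO UPDATE SET status = EXCLUDED.status,
              updated_at = EXCLUDED.updated_at
"""

def record_ctrl_state(dsn: str, ctrl_id: int, status: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(UPSERT_SQL, (ctrl_id, status))
        conn.commit()
    finally:
        conn.close()
```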
Monitoring Improvements
- Enhance monitoring to distinguish between real failover and replica lag (a probe sketch follows this list).
- Track Patroni state changes and alert specifically on demotion/promotion events.
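A hedged sketch of such a probe: it asks each node for its role with pg_is_in_recovery() and, on the primary, reports per-standby replay lag from pg_stat_replication. The DSNs are placeholders; a role flip between polls indicates a real failover, whereas a growing lag figure alone does not.

```python
# Hedged sketch: separate "which node is primary?" from "how far behind is the standby?".
import psycopg2

LAG_SQL = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
"""

def probe_cluster(node_dsns: dict) -> None:
    for name, dsn in node_dsns.items():
        conn = psycopg2.connect(dsn, connect_timeout=3)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery()")
                in_recovery = cur.fetchone()[0]
                print(f"{name}: {'standby' if in_recovery else 'primary'}")
                if not in_recovery:
                    cur.execute(LAG_SQL)
                    for app, lag_bytes in cur.fetchall():
                        print(f"  {app}: {lag_bytes} bytes behind")
        finally:
            conn.close()

probe_cluster({"r1_prod_n1": "host=node1 dbname=postgres user=monitor",
               "r1_prod_n2": "host=node2 dbname=postgres user=monitor"})
```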
Conclusion
The PostgreSQL cluster experienced a leader transition initiated by a loss of connectivity between Patroni and PostgreSQL on the original primary. The system correctly demoted the problematic node and promoted the standby, ensuring service continuity. However, the incident highlighted risks associated with high load, default configuration limits (especially WAL retention), and the need for fine-tuned HA settings in Patroni and PostgreSQL.
By addressing both infrastructure-level timeouts and application query patterns, the client can significantly improve the stability and resilience of their PostgreSQL HA setup.