Problem

The client experienced a failover event in their Patroni-managed PostgreSQL cluster between 01:00 AM and 02:00 AM on May 23, 2025.

Process

Step 1 – Initial Investigation

Initial PostgreSQL logs (postgresql-Fri-00.log and postgresql-Fri-01.log) showed regular query activity: frequent queries from monitoring tools (pg_stat_all_tables, pg_locks, etc.), checkpoint logging, and client connection messages. There were no immediate signs of a crash or shutdown (no PANIC or FATAL entries, no "database system is shutting down" message), nor any evidence of role transitions.
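
For reference, the absence of a role change can also be confirmed directly on each node with a minimal check such as the following (illustrative queries, not taken from the client's tooling; assuming PostgreSQL 10 or later):

  -- Returns false on a primary and true on a standby that is in recovery
  SELECT pg_is_in_recovery();

  -- An unexpected restart would show up as a recent postmaster start time
  SELECT pg_postmaster_start_time();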

This absence of failover indicators led to a request for additional data: Patroni logs, thread dumps, and logs from both the primary and standby nodes, including those surrounding the suspected failover window.

Step 2 – Detailed Analysis of Patroni and PostgreSQL Logs

Upon review of patroni_logs_node1.txt and postgresql-Fri-03.log, the issue became clear.

At 01:11:59, the connection between Patroni and PostgreSQL on Node 1 (r1_prod_n1, the primary) was abruptly closed. This was immediately followed by:

  • Multiple failed attempts to re-establish a connection to PostgreSQL (psycopg2.OperationalError: timeout expired);
  • Inability to update the leader lock in etcd: EtcdKeyNotFound;
  • Failure of last_operation() checks inside Patroni, which ultimately triggered a self-demotion (immediate-nolock behavior);
  • The node entered recovery and attempted to reconnect as a standby, but connection issues persisted for nearly an hour.

Meanwhile, logs from Node 2 (r1_prod_n2) showed that at 01:12:31, the standby assumed leadership:
I am (r1_prod_n2), the leader with the lock.

This transition confirms that the cluster’s HA mechanism worked as intended: when the primary failed to maintain its lock, the standby took over.

Step 3 – PostgreSQL Behavior Under Load and Replication Failures

PostgreSQL logs around the same time showed intense activity, including:

  • Mass DELETE and INSERT operations;
  • Continuous VACUUM processes (manual and autovacuum);
  • CURSOR and BEGIN statements, and high-frequency UPDATEs against control tables such as APR1_DB2E_CTRL.

At 01:12:23, a duplicate key violation occurred during an INSERT, indicating a concurrency issue:
ERROR: duplicate key value violates unique constraint "apr1_db2e_ctrl_pkey"

At 01:59:55, a replication failure occurred on the standby:
ERROR: requested WAL segment 00000049000520D300000063 has already been removed

This means the replica had fallen behind and the required WAL segment had already been removed from the primary, most likely because wal_keep_size was left at its low default or the replica was under heavy I/O pressure.

This did not impact service directly, but it shows that replication was broken at that point and the standby would need to be reinitialized or to recover the missing segments from a WAL archive.
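
Whether streaming is actually running can be verified on the standby itself; a minimal sketch, assuming PostgreSQL 9.6 or later:

  -- An empty result means the WAL receiver is not running, i.e. streaming replication is down
  SELECT status, slot_name FROM pg_stat_wal_receiver;

  -- How stale the replayed data is; a steadily growing value means the standby has stalled
  SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;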

Solution

Based on the evidence, the failover was a result of connectivity issues between Patroni and PostgreSQL on Node 1. This led to a failure in updating the DCS (etcd) lock, triggering automatic demotion. The high volume of queries may have contributed to CPU or I/O saturation, worsening connection stability.

To prevent recurrence:

Tune Patroni and PostgreSQL Settings

  • Increase connect_timeout, retry_timeout, and loop_wait in Patroni.
  • Review Patroni’s interaction with etcd — ensure TTLs and quorum are correctly set.

WAL Retention and Replica Stability

  • Increase wal_keep_size or configure archive_command to retain WAL files longer.
  • Monitor replication lag and adjust max_slot_wal_keep_size if replication slots are used (see the sketch after this list).
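
On a standalone PostgreSQL instance (13 or later, where wal_keep_size and max_slot_wal_keep_size exist) the settings above could be raised as sketched below; the sizes are placeholders to be tuned to the actual WAL volume and disk headroom. In a Patroni-managed cluster, the same parameters would normally be applied through Patroni's dynamic configuration rather than ALTER SYSTEM.

  -- Placeholder sizes, not recommendations
  ALTER SYSTEM SET wal_keep_size = '4GB';
  ALTER SYSTEM SET max_slot_wal_keep_size = '16GB';
  SELECT pg_reload_conf();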

Application-Level Improvements

  • Address concurrency issues in application logic, for example with UPSERT (INSERT ... ON CONFLICT) or advisory locking; see the sketch after this list.
  • Review batch jobs for potential bottlenecks or excessive update rates.
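
As an illustration of the UPSERT approach, a statement along the following lines would avoid the duplicate key error seen at 01:12:23. The column names (ctrl_id, last_processed_at) are hypothetical, since the actual definition of APR1_DB2E_CTRL was not part of the logs:

  -- Hypothetical columns; insert-or-update in a single atomic statement
  INSERT INTO apr1_db2e_ctrl (ctrl_id, last_processed_at)
  VALUES (42, now())
  ON CONFLICT ON CONSTRAINT apr1_db2e_ctrl_pkey
  DO UPDATE SET last_processed_at = EXCLUDED.last_processed_at;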

Monitoring Improvements

  • Enhance monitoring to distinguish a real failover from mere replica lag (see the query sketch after this list).
  • Track Patroni state changes and alert specifically on demotion and promotion events.
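
A simple way to make that distinction from the primary side is to watch pg_stat_replication (PostgreSQL 10 or later assumed): a missing row means replication is down or a failover has occurred, while a row with growing lag only means the replica is slow.

  -- Run on the current primary: per-standby state and replay lag
  SELECT application_name,
         client_addr,
         state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
         replay_lag
  FROM pg_stat_replication;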

Conclusion

The PostgreSQL cluster experienced a leader transition initiated by a loss of connectivity between Patroni and PostgreSQL on the original primary. The system correctly demoted the problematic node and promoted the standby, ensuring service continuity. However, the incident highlighted risks associated with high load, default configuration limits (especially WAL retention), and the need for fine-tuned HA settings in Patroni and PostgreSQL.

By addressing both infrastructure-level timeouts and application query patterns, the client can significantly improve the stability and resilience of their PostgreSQL HA setup.