PostgreSQL Patroni Failover - Proactive Insights and Support For Open-Source Applications

Problem:

The client experienced an unexpected failover in a production PostgreSQL Patroni cluster running PostgreSQL Community version 15.8 and Patroni version 2.1.4 configured in asynchronous replication mode.

Node A unexpectedly went offline and Patroni automatically promoted Node B to the primary node. The client needed to determine:

The reason for the failover.
Whether any transactions were lost due to asynchronous replication.
How to prevent the issue from occurring again.
How to ensure data integrity in a system that cannot use synchronous replication due to performance requirements.

The client also provided database logs and confirmation that no differences were detected between the data on both nodes after the failover.

Process:

Step 1: Initial Analysis

The expert reviewed the PostgreSQL logs provided by the client and identified multiple memory allocation errors prior to the crash of the PostgreSQL service on Node A.

It became clear from the logs that PostgreSQL experienced a severe Out of Memory condition, which prevented it from allocating even very small memory blocks, ultimately leading to a crash of the database process.

Patroni responded correctly by promoting Node B to primary in order to restore service availability.

Step 2: Root Cause Investigation

Further analysis revealed that the memory exhaustion occurred due to a combination of:

Lack of swap space or insufficient swap allocation.
Excessive PostgreSQL memory parameters.
A high number of simultaneously open database connections.
Complex application queries joining more than 50 tables.
No connection pooling mechanism in front of PostgreSQL.
No query timeouts or memory limitation policies in place.

Together, these factors created a scenario where PostgreSQL was unable to control total memory usage under load, causing the operating system to kill PostgreSQL processes.

Step 3: Data Integrity Assessment

Because the cluster was running with asynchronous replication, the expert explained that transaction loss is theoretically possible during failover.

The client performed a comparison between Node A and Node B and confirmed that no data differences were found. Based on this, the expert concluded that no transaction loss occurred during this failover event.

The expert also explained Patroni’s maximum_lag_on_failover setting and clarified that the configured 1 MB threshold limits how far a replica may lag behind before being promoted, but it does not provide a guarantee of zero data loss.

Step 4: Advanced Verification Offering

For stronger assurance, the expert recommended advanced WAL-level validation techniques:

Comparing WAL segments between the old primary and promoted node.
Using pg_waldump and pg_last_wal_replay_lsn().

To proceed with this deeper analysis, the expert requested confirmation regarding:

The operational status of the former primary node.
Access to WAL archives at the time of failover.

Solution:

The expert provided a comprehensive set of corrective and preventive measures:

Implement a connection pooler (such as PgBouncer).
Reduce max_connections to a controlled range.
Lower work_mem to approximately 4 MB.
Optimize application SQL queries.
Introduce query limits and execution timeouts.
Add monitoring and alerts for memory utilization and connection spikes.
Consider system memory upgrades if required.

For complete data safety, synchronous replication was recommended, with acknowledgment of possible performance trade-offs.

Conclusion:

The failover was caused by server-wide memory exhaustion on Node A, which resulted in PostgreSQL crashing and triggering Patroni’s automatic failover.

No data loss was detected based on post-event data comparison. However, due to the asynchronous replication model, absolute certainty could only be achieved through WAL-level analysis that the client chose not to pursue.

The expert delivered:

Root cause analysis.
Risk assessment.
Data integrity guidance.
Preventive architecture recommendations.

The issue was successfully diagnosed and mitigated, and the environment was left with a clear optimization roadmap for improving resilience and stability.