Problem:

PG Prod Node 1 failed over to Node 2, accompanied by high Write-Ahead Log (WAL) generation. The client requested an investigation into the cause of the failover on Node 1.

Process:

To conduct a thorough investigation, the following data and logs were requested:

  • PostgreSQL logs
  • Cluster logs
  • Database logs
  • WAL logs
  • Cluster configuration
  • Monitoring metrics (CPU usage, memory usage, disk I/O, and network activity)
  • Cluster status

Upon analyzing the provided monitoring logs, the following key observations were made:

  • Low Memory Availability: The instance had both free RAM and free swap at 0, indicating a severe memory shortage.
  • High CPU Usage: A PostgreSQL backup task was consuming 100% of the CPU, likely exacerbating the memory shortage and potentially triggering the failover. (Equivalent checks are sketched after this list.)
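
For context, the same signals can be reproduced directly on a node with standard tools; the commands below are a minimal sketch, and the exact name of the backup process will differ per environment:

    # Show free RAM and swap (both were effectively at 0 during the incident)
    free -h

    # List the top CPU consumers; the backup task would show up here at ~100%
    ps aux --sort=-%cpu | head -n 10

    # Sample overall memory pressure and swap activity every 5 seconds
    vmstat 5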

Given the circumstances, it was suspected that the PostgreSQL backup task caused the instance to run out of memory, leading to a failover to Node 2. To further investigate the potential root causes, the following scenarios were considered:

1. WAL Corruption:

If the WAL file was corrupted while being written to disk, the archived copy might still be intact; if the corruption happened earlier, for example in RAM or because of a software bug, the archive could be affected as well.
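
To test the corruption hypothesis, archived segments can be sanity-checked with pg_waldump, which aborts with an error when it hits a damaged record; if data checksums are enabled, pg_checksums can verify the data files as well. The paths and segment name below are placeholders, not values from the client environment:

    # Decode an archived WAL segment end to end; a clean pass suggests the
    # on-disk copy is readable, while corruption stops the run with an error.
    pg_waldump /var/lib/postgresql/wal_archive/000000010000000A000000FF

    # Optional: verify data-file checksums (requires checksums to be enabled
    # and the PostgreSQL instance to be stopped).
    pg_checksums --check -D /var/lib/postgresql/data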

2. Replicator-Induced Failover:

The logs revealed that the replicator started a job and triggered the failover, as indicated by the message "pganalyze standby failover". The underlying issue appeared to be that the OS was not releasing RAM after an application had consumed a large amount of it, possibly because the service did not terminate gracefully.

To address the situation, the following actions were recommended:

  • Clearing Cache and Reclaiming Memory:
    # Flush dirty pages, then drop the page cache, dentries, and inodes
    # (the redirect must run as root, hence sh -c rather than a bare sudo)
    sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
    # Drain swap back into RAM (requires enough free RAM to hold the swapped pages)
    sudo swapoff -a && sudo swapon -a

    These commands were advised to drop the cache and reclaim used memory. However, it was emphasized to make sure no critical services were running at the time, to avoid data loss.

  • Restarting PostgreSQL Service: After clearing the RAM, it was recommended to restart the PostgreSQL service. Additionally, it was suggested to identify which jobs were running during the incident and optimize them to prevent future issues (a restart and job-identification sketch follows this list).
  • Increasing System Resources: Given the memory and swap depletion, increasing the machine’s resources was recommended to avoid similar incidents in the future.
  • Monitoring and Alerts: To proactively manage memory resources, it was advised to configure alerts that trigger when memory usage reaches a certain threshold (e.g., 80%); an illustrative check is sketched after this list.
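
As a sketch of the restart and job-identification step, assuming the cluster is managed by Patroni and using placeholder cluster, member, and config-file names:

    # Restart PostgreSQL through Patroni rather than systemd, so the cluster
    # state machine stays in control (cluster and member names are placeholders)
    patronictl -c /etc/patroni/patroni.yml restart pg-prod node1

    # List active sessions to identify heavy jobs worth optimizing
    psql -c "SELECT pid, usename, state, query_start, left(query, 80) AS query
             FROM pg_stat_activity
             WHERE state <> 'idle'
             ORDER BY query_start;"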
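
The monitoring stack in use was not specified; purely as an illustration, a cron-driven check against an 80% threshold could look like the following (the notification command and address are placeholders):

    #!/bin/sh
    # Alert when used memory exceeds 80% of total; assumes a procps `free`
    # that reports an "available" column.
    THRESHOLD=80
    USED_PCT=$(free | awk '/^Mem:/ {printf "%d", ($2-$7)/$2*100}')
    if [ "$USED_PCT" -ge "$THRESHOLD" ]; then
        echo "Memory usage at ${USED_PCT}% on $(hostname)" \
            | mail -s "Memory alert: $(hostname)" dba-team@example.com
    fi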

Further analysis of the logs suggested potential issues with the Patroni configuration:

  • Network Issues: Logs indicated possible network problems between the two Patroni instances. Adjustments to Patroni's configuration were recommended to mitigate these issues.
  • Patroni and etcd Server Location: It was suggested that the Patroni servers and the etcd server be located on different physical machines. Heavy, long-running requests had been observed, which could cause timeouts on the etcd server, as referenced here: GitHub Issue 1362.

Recommended configuration adjustments (a sketch of applying them follows the list):

  • Set retry_timeout to 20 seconds.
  • Set loop_wait to 20 seconds.
  • Set ttl to 60 seconds.
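
These three values belong to Patroni's dynamic (DCS) configuration, so they can be applied once and propagated to every member, for example via patronictl. The config path and cluster name below are placeholders, and the exact option syntax should be checked against the installed Patroni version:

    patronictl -c /etc/patroni/patroni.yml edit-config pg-prod \
        --set ttl=60 \
        --set loop_wait=20 \
        --set retry_timeout=20

Because edit-config writes to the DCS, all nodes pick the new values up automatically; only settings kept in each node's local patroni.yml need to be copied around by hand, which is relevant to the synchronization advice further below.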

Additionally, it was important to check that Patroni had the rights to access /dev/watchdog, as indicated by the log message:

2024-07-03 00:40:01,337 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
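
As a sketch of granting that access, assuming the software watchdog (softdog) kernel module is acceptable and Patroni runs as the postgres OS user (adjust the user and device to the actual environment):

    # Load a software watchdog device now and on every boot
    sudo modprobe softdog
    echo softdog | sudo tee /etc/modules-load.d/softdog.conf

    # Let the Patroni OS user (assumed here to be postgres) open the device
    sudo chown postgres /dev/watchdog

    # A udev rule keeps the ownership when the device node is re-created
    echo 'KERNEL=="watchdog", OWNER="postgres"' \
        | sudo tee /etc/udev/rules.d/99-watchdog.rules

Patroni's own watchdog settings (mode and device) should then point at /dev/watchdog so the leader lock is protected if the Patroni process hangs.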

Full details on these settings are available in the official Patroni configuration documentation.

After applying the recommended changes, it was advised to synchronize the parameters across the Patroni nodes to ensure stability during future leader transitions.

Solution:

The investigation concluded that the failover was primarily caused by memory exhaustion driven by a CPU-intensive PostgreSQL backup task, which led to the crash of Node 1. Patroni configuration issues, in particular network delays and misconfigured timeout settings, also contributed to the failover.

The solution involved:

  • Clearing the cache and reclaiming memory.
  • Restarting the PostgreSQL service and optimizing the jobs running during the incident.
  • Adjusting the Patroni configuration to mitigate network and timeout issues.
  • Increasing system resources and setting up monitoring alerts.

Conclusion:

The failover of PG Prod Node 1 to Node 2 was traced back to memory and CPU exhaustion caused by a PostgreSQL backup task, compounded by configuration issues within Patroni. By implementing the recommended solutions, including memory management, service optimization, and Patroni configuration adjustments, the client can prevent similar incidents in the future and maintain a stable PostgreSQL cluster.