Problem:
The client reported a replication issue in their A1 BG Production environment, which runs a Patroni cluster with two PostgreSQL instances (Leader and Replica). Replication stopped, and WAL files accumulated in pg_wal until the leader’s /pgcluster file system was full. The client requested help to identify the root cause of the replication failure.
Process:
The expert analyzed more than 100,000 log entries from the three nodes involved: the primary leader, the primary replica, and the DR standby leader. The client initially provided log files and details about the Patroni and PostgreSQL versions. To investigate thoroughly, the expert also requested system logs, database activity, replication slot details, and disk usage figures.
Solution:
The expert found that the replication failure was caused by the premature removal of WAL segments on the primary node that the DR node still needed. A replication slot timeout made the problem worse and prevented replication from recovering. As a result, WAL files accumulated on the primary node until the file system was full. The expert made two recommendations:
1. Increase WAL Retention.
Adjust the WAL retention settings to ensure segments are retained longer, allowing the DR system to synchronize despite any potential delays.
2. Monitor Disk Usage.
Implement a monitoring system to alert the team before the file system fills up due to excess WAL files, preventing future issues.
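
The two recommendations above can be illustrated with a minimal Python sketch. It is not the expert's actual tooling. The sizing rule, the safety factor, the alert thresholds, and the function names are all assumptions made for illustration. The computed value would typically feed a PostgreSQL retention parameter such as wal_keep_size (PostgreSQL 13 and later), and in a Patroni cluster that setting is normally changed through the cluster configuration rather than by editing postgresql.conf directly. The /pgcluster mount point is taken from the case description.

```python
import math
import shutil


def wal_keep_size_mb(wal_rate_mb_per_min: float,
                     max_lag_min: float,
                     safety_factor: float = 1.5) -> int:
    """Estimate how much WAL (in MB) to retain for a lagging standby.

    Hypothetical sizing rule: WAL generated during the longest lag
    the standby is expected to recover from, times a safety margin.
    """
    return math.ceil(wal_rate_mb_per_min * max_lag_min * safety_factor)


def disk_usage_percent(path: str) -> float:
    """Return the used share of the file system containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(used_pct: float,
             warn_pct: float = 70.0,
             crit_pct: float = 85.0) -> str:
    """Map a usage percentage to an alert level (thresholds are assumptions)."""
    if used_pct >= crit_pct:
        return "CRITICAL"
    if used_pct >= warn_pct:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Example: 16 MB of WAL per minute, standby may lag up to 60 minutes.
    print("Suggested WAL retention (MB):", wal_keep_size_mb(16, 60))

    mount_point = "/pgcluster"  # file system named in the case description
    try:
        pct = disk_usage_percent(mount_point)
        print(f"{mount_point}: {pct:.1f}% used -> {classify(pct)}")
    except FileNotFoundError:
        print(f"{mount_point} not found on this host")
```

Run from cron or a monitoring agent, a check like this raises a warning well before the file system is full. The thresholds should be tuned to how fast WAL accumulates when replication stalls, so that the team has time to react.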
Conclusion:
After applying the expert’s recommendations, replication resumed successfully, and the system returned to normal operation. The client was advised to periodically review the settings and monitor disk usage to prevent similar issues in the future.