Problem:

The system is generating alerts indicating replication lag on the “productCatalog” PostgreSQL instance, specifically targeting the machine “pa3fnd02.” However, upon DBA investigation, there is no observable lag in the cluster: both the leader (pa3fnd02) and the sync standby (pa3fnd01) report no lag. The discrepancy raises the question of why the alerts persist despite the absence of actual lag.
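
As a first verification step, a lag check run directly on the leader can confirm what the DBAs observed. The query below is a minimal sketch assuming PostgreSQL 10 or later with streaming replication; the monitoring system's own lag metric may be computed differently.

```sql
-- Run on the leader (pa3fnd02): per-standby streaming status and lag in bytes.
SELECT application_name,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM   pg_stat_replication;
```

A replay_lag_bytes value at or near zero for pa3fnd01 would support the DBA finding that the cluster itself is healthy and shift suspicion to the alert rule or the exported metric.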

Solution:

  • Monitoring on Postgres:
    • Clarification that replication lag should be monitored from the leader node.
    • Agreement to run checks on the leader node to verify whether any lag is actually present.
  • Screen Sharing Issues:
    • Challenges in screen sharing due to bank environment restrictions.
    • Prometheus expertise is required, along with approval to bring in another expert.
  • Alert Rules and Database Monitoring:
    • Review and adjustment of alert rules for PG replication lag.
    • Recommendation to use a cluster control tool for SQL monitoring.
    • Confirmation that Prometheus is in use, but with limitations on installing new tools.
  • Query Execution and Logs:
    • Execution of specific queries on both primary and standby nodes (example queries follow this list).
    • Sharing of logs for further investigation.
  • Alerting System:
    • Discussion on implementing an alerting system for replication-lag issues on the primary node.
  • Communication Constraints:
    • Acknowledgment of communication constraints due to security policies.
    • Commitment to finding solutions within constraints.
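
For the standby-side portion of the queries mentioned above (the leader-side check is sketched under the Problem section), the following is a hedged example assuming PostgreSQL 10+ on pa3fnd01; the expert's actual queries may differ.

```sql
-- Run on the standby (pa3fnd01): time since the last replayed transaction and
-- the WAL positions received versus replayed.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay,
       pg_last_wal_receive_lsn()               AS received_lsn,
       pg_last_wal_replay_lsn()                AS replayed_lsn;
```

Note that replay_delay can grow on an idle cluster simply because no new transactions have arrived, which is a common cause of false lag alerts on quiet systems.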

Action Items:

  • Running queries provided by the expert on both primary and standby nodes.
  • Sharing feedback and logs for further investigation.
  • Scheduling a follow-up meeting if needed.

Additional Information Requested:

  • Network latency between primary and standby databases.
  • Disk I/O status and available disk space.
  • max_wal_senders parameter value.
  • Replication type (asynchronous or synchronous).
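
The last two items can be read straight from the database; the sketch below assumes standard streaming replication (network latency and disk I/O would need OS-level tools such as ping or iostat instead).

```sql
-- Configured WAL sender limit and how many senders are currently connected.
SHOW max_wal_senders;
SELECT count(*) AS active_wal_senders FROM pg_stat_replication;

-- Replication type: a standby with sync_state = 'sync' (and a non-empty
-- synchronous_standby_names) indicates synchronous replication; otherwise async.
SHOW synchronous_standby_names;
SELECT application_name, sync_state FROM pg_stat_replication;
```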

Tools and Resources Recommended:

  • Installation of pgwatch2 for monitoring long-running queries.
  • Reference to a guide on PostgreSQL replication lag troubleshooting.

Identifying Long-Running CRUD Operations:

Request to list all long-running CRUD operations along with how long each has been running.
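
A minimal sketch of such a listing based on pg_stat_activity; the 5-minute threshold is an arbitrary example and would be tuned to the team's definition of “long-running.”

```sql
-- Statements that have been running for more than 5 minutes, longest first.
SELECT pid,
       usename,
       datname,
       state,
       now() - query_start AS runtime,
       left(query, 120)    AS query_snippet
FROM   pg_stat_activity
WHERE  state <> 'idle'
  AND  query_start < now() - interval '5 minutes'
ORDER  BY runtime DESC;
```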

Suggestions for Configuration Changes:

  • Increase the max_wal_senders value to at least 100 to better handle concurrent replication (WAL sender) connections.
  • Consider switching from synchronous to asynchronous replication to reduce replication lag.
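
A hedged sketch of both changes follows; if the cluster is managed by Patroni or a similar tool (as the leader/sync-standby terminology suggests), these parameters should be changed through that tool's configuration rather than directly.

```sql
-- Raise the WAL sender limit; max_wal_senders is a postmaster-level parameter,
-- so the new value only takes effect after a server restart.
ALTER SYSTEM SET max_wal_senders = 100;

-- Switch to asynchronous replication: clearing synchronous_standby_names stops
-- commits on the primary from waiting for the standby to confirm.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
```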

Gradual Adjustment of max_wal_senders:

Recommendation to increase the max_wal_senders value gradually until the issue is resolved, while monitoring server memory and load to ensure the change has no adverse impact.
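
Since each increase only takes effect after a restart, it is worth verifying between steps which value is actually live; for example:

```sql
-- Check the value currently in effect and whether a restart is still pending.
SELECT name, setting, pending_restart
FROM   pg_settings
WHERE  name = 'max_wal_senders';
```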

Conclusion:

To address the Postgres replication issue, the team has undertaken a multifaceted approach: refining how replication lag is monitored on the leader node, working around screen-sharing restrictions in the secure banking environment, and fine-tuning alert rules within the existing communication constraints. Key action items include running the agreed queries on the primary and standby, sharing logs, and evaluating tools such as pgwatch2, along with gradually raising max_wal_senders toward 100 for optimal performance. Together these steps ultimately led to a successful resolution of the replication problem.