Problem:
Patroni output was not showing for the PostgreSQL DR database: the Patroni status was visible on the production (PROD) server but not on the disaster recovery (DR) server. The discrepancy hindered the team's ability to monitor the DR environment effectively.
Process:
- Request Patroni Configuration and Logs:
  - Obtain `patroni.yml` from both PROD and DR servers to review Patroni and PostgreSQL settings.
  - Collect recent Patroni logs from the DR server to identify any specific errors or warnings.
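  For example, a minimal collection sketch, assuming Patroni runs under systemd and the configuration lives at `/etc/patroni.yml` (adjust paths and hostnames for your environment):

  ```bash
  # Copy the Patroni config from each node for side-by-side review
  scp postgres@cescdbpr1:/etc/patroni.yml ./patroni-prod.yml
  scp postgres@cescdbdr1:/etc/patroni.yml ./patroni-dr.yml

  # On the DR node: capture recent Patroni logs from the systemd journal
  journalctl -u patroni --since "2 hours ago" > patroni-dr.log

  # A quick diff highlights configuration drift between the two files
  diff patroni-prod.yml patroni-dr.yml
  ```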
- Execute and gather outputs from both servers:

  ```sql
  SELECT * FROM pg_stat_replication;
  SELECT * FROM pg_stat_activity WHERE state = 'idle' OR state = 'active';
  ```
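  These can be run non-interactively with `psql`; a sketch assuming local peer authentication as the `postgres` OS user:

  ```bash
  # On the leader, pg_stat_replication should return one row per connected
  # replica; an empty result there suggests replication is not established
  sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;"

  # Session overview on both servers
  sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE state = 'idle' OR state = 'active';"
  ```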
- Check the status of the Patroni service on both servers using `systemctl status patroni`.
- Verify Patroni and PostgreSQL versions on both servers.
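  One way to capture the versions on each node (a sketch; assumes `patronictl` and `psql` are on the PATH):

  ```bash
  # Patroni / patronictl version
  patronictl version

  # PostgreSQL client and server versions
  psql --version
  sudo -u postgres psql -c "SELECT version();"
  ```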
- Initial Findings:
  - Shared `patroni.yml` files and logs revealed a lack of recent logs on the DR server.
  - Outputs from the given commands indicated potential issues with replication visibility.
- Patroni service statuses were collected:
  - DR:

    ```
    [postgres@cescdbdr1 pgarchive]$ patronictl -c /etc/patroni.yml list
    + Cluster: r1_pg_dr_pr (7117601140351769230) ---+---------+----+-----------+-----------------+
    |     Member     |     Host    |      Role      |  State  | TL | Lag in MB | Pending restart |
    +----------------+-------------+----------------+---------+----+-----------+-----------------+
    | r1_pg_dr_pr_n1 | 172.22.8.52 | Standby Leader | running | 67 |           |        *        |
    +----------------+-------------+----------------+---------+----+-----------+-----------------+
    ```

  - PR:

    ```
    [postgres@cescdbpr1 pgarchive]$ patronictl -c /etc/patroni.yml list
    + Cluster: r1_prod (7117601140351769230) -------+----+-----------+
    |   Member   |     Host     |   Role  |  State  | TL | Lag in MB |
    +------------+--------------+---------+---------+----+-----------+
    | r1_prod_n1 | 172.22.33.52 | Leader  | running | 67 |           |
    | r1_prod_n2 | 172.22.33.53 | Replica | running | 67 |     0     |
    +------------+--------------+---------+---------+----+-----------+
    ```

  - Note that both nodes report the same system identifier (7117601140351769230) but different cluster scopes (`r1_pg_dr_pr` on DR vs `r1_prod` on PROD), which pointed toward a configuration mismatch.
Solution:
- Verify Patroni Service Status:
  - Ensure the Patroni service is running on the DR server using `sudo systemctl status patroni`.
  - Start the service if it’s not running: `sudo systemctl start patroni`.
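  Combined into one check-and-start sequence (enabling the unit at boot is an extra precaution, not part of the original steps):

  ```bash
  # Inspect the current state of the Patroni unit on the DR node
  sudo systemctl status patroni

  # Start it if inactive, and optionally enable it to start on boot
  sudo systemctl start patroni
  sudo systemctl enable patroni
  ```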
- Verify PostgreSQL Replication Status:
  - Use `sudo -u postgres patronictl list` to check replication health.
- Check Connectivity and DNS Resolution:
  - Test connectivity to the primary server from the DR server using `ping 172.22.33.52` and `nslookup <PRIMARY_SERVER_HOSTNAME>`.
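  Beyond `ping` and `nslookup`, `pg_isready` (shipped with PostgreSQL) confirms that the primary actually accepts connections on the database port:

  ```bash
  # Basic reachability and name resolution from the DR node
  ping -c 3 172.22.33.52
  nslookup <PRIMARY_SERVER_HOSTNAME>

  # Verify the primary accepts PostgreSQL connections on its port (5432 here)
  pg_isready -h 172.22.33.52 -p 5432
  ```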
- Restart Patroni Service:
  - Restart the Patroni service on the DR server: `sudo systemctl restart patroni`.
- Manually Check Patroni API:
  - Use a web browser or `curl` to access the Patroni REST API, e.g. `http://172.22.33.52:8008/patroni` (the API listens on the port configured under `restapi` in `patroni.yml`; 8008 is the default, not the PostgreSQL port 5432).
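  For example, from the DR node (the port must match your `restapi.listen` setting; the default 8008 is assumed here):

  ```bash
  # The /patroni endpoint returns the node's role, state, and timeline as JSON
  curl -s http://172.22.33.52:8008/patroni

  # Each node exposes its own API; the DR node can also be checked locally
  curl -s http://172.22.8.52:8008/patroni
  ```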
- Update Configuration:
  - Update the `patroni.yml` file on the DR server to match the scope and namespace settings of the production file.
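  A sketch of the relevant keys in the DR server's `patroni.yml`; the actual values must be copied from the production file, and the namespace shown is Patroni's default, used here only as an illustration:

  ```yaml
  # Top of /etc/patroni.yml on the DR node: scope and namespace must match
  # the production configuration for the cluster to be reported consistently
  scope: r1_prod         # previously r1_pg_dr_pr, per the patronictl output above
  namespace: /service/   # Patroni's default, shown as an illustration; must match production
  name: r1_pg_dr_pr_n1   # the member name stays unique per node
  ```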
- The issue was due to a namespace mismatch between the main production server and the disaster recovery (DR) server. Our experts updated the DR configuration file to use the same scope and namespace as the production file, which resolved the problem, and shared relevant documentation links for further reference.
Conclusion:
The root cause of the issue was identified as a namespace mismatch between the production and disaster recovery configurations. By aligning the namespace and scope settings in the DR configuration to match those of the production environment, the visibility issue was resolved.
Follow-Up Actions:
- Confirmed the recommendation to update the namespace and scope settings.
- Monitored the cluster behavior post-implementation to ensure stability and consistent display of replica status.
- Provided additional resources and documentation for reference.
The solution effectively resolved the Patroni output visibility issue in the DR database. The steps taken ensured that the DR environment was properly aligned with the production settings, thereby restoring full monitoring capabilities.