Problem:
Patroni output was not showing for the PostgreSQL DR database: the Patroni status was visible on the production (PROD) server but not on the disaster recovery (DR) server. The discrepancy hindered the team's ability to monitor the DR environment effectively.
Process:
- Request Patroni Configuration and Logs:
  - Obtain `patroni.yml` from both PROD and DR servers to review Patroni and PostgreSQL settings.
  - Collect recent Patroni logs from the DR server to identify any specific errors or warnings.
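  For example, a minimal collection sketch, assuming Patroni runs under systemd and the configuration lives at `/etc/patroni.yml` (adjust paths and hostnames for your environment):

  ```bash
  # Copy the Patroni config from each node for side-by-side review
  scp postgres@cescdbpr1:/etc/patroni.yml ./patroni-prod.yml
  scp postgres@cescdbdr1:/etc/patroni.yml ./patroni-dr.yml

  # On the DR node: capture recent Patroni logs from the systemd journal
  journalctl -u patroni --since "2 hours ago" > patroni-dr.log

  # A quick diff highlights configuration drift between the two files
  diff patroni-prod.yml patroni-dr.yml
  ```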
- Execute and gather outputs from both servers:

  ```sql
  SELECT * FROM pg_stat_replication;
  SELECT * FROM pg_stat_activity WHERE state = 'idle' OR state = 'active';
  ```
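  These can be run non-interactively with `psql`; a sketch assuming local peer authentication as the `postgres` OS user:

  ```bash
  # On the leader, pg_stat_replication should return one row per connected
  # replica; an empty result there suggests replication is not established
  sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;"

  # Session overview on both servers
  sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE state = 'idle' OR state = 'active';"
  ```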
- Check the status of the Patroni service on both servers using `systemctl status patroni`.
- Verify Patroni and PostgreSQL versions on both servers.
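  One way to capture the versions on each node (a sketch; assumes `patronictl` and `psql` are on the PATH):

  ```bash
  # Patroni / patronictl version
  patronictl version

  # PostgreSQL client and server versions
  psql --version
  sudo -u postgres psql -c "SELECT version();"
  ```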
- Initial Findings:
  - Shared `patroni.yml` files and logs revealed a lack of recent logs on the DR server.
  - Outputs from the given commands indicated potential issues with replication visibility.
- Patroni service statuses were collected:
  - DR:

    ```
    [postgres@cescdbdr1 pgarchive]$ patronictl -c /etc/patroni.yml list
    + Cluster: r1_pg_dr_pr (7117601140351769230) ---+---------+----+-----------+-----------------+
    |     Member     |     Host    |      Role      |  State  | TL | Lag in MB | Pending restart |
    +----------------+-------------+----------------+---------+----+-----------+-----------------+
    | r1_pg_dr_pr_n1 | 172.22.8.52 | Standby Leader | running | 67 |           |        *        |
    +----------------+-------------+----------------+---------+----+-----------+-----------------+
    ```

  - PR:

    ```
    [postgres@cescdbpr1 pgarchive]$ patronictl -c /etc/patroni.yml list
    + Cluster: r1_prod (7117601140351769230) -------+----+-----------+
    |   Member   |     Host     |   Role  |  State  | TL | Lag in MB |
    +------------+--------------+---------+---------+----+-----------+
    | r1_prod_n1 | 172.22.33.52 | Leader  | running | 67 |           |
    | r1_prod_n2 | 172.22.33.53 | Replica | running | 67 |     0     |
    +------------+--------------+---------+---------+----+-----------+
    ```

  - Note that both nodes report the same system identifier (7117601140351769230) but different cluster scopes (`r1_pg_dr_pr` on DR vs `r1_prod` on PROD), which pointed toward a configuration mismatch.
Solution:
- Verify Patroni Service Status:
  - Ensure the Patroni service is running on the DR server using `sudo systemctl status patroni`.
  - Start the service if it’s not running: `sudo systemctl start patroni`.
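  Combined into one check-and-start sequence (enabling the unit at boot is an extra precaution, not part of the original steps):

  ```bash
  # Inspect the current state of the Patroni unit on the DR node
  sudo systemctl status patroni

  # Start it if inactive, and optionally enable it to start on boot
  sudo systemctl start patroni
  sudo systemctl enable patroni
  ```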
- Verify PostgreSQL Replication Status:
  - Use `sudo -u postgres patronictl list` to check replication health.
- Check Connectivity and DNS Resolution:
  - Test connectivity to the primary server from the DR server using `ping 172.22.33.52` and `nslookup <PRIMARY_SERVER_HOSTNAME>`.
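  Beyond `ping` and `nslookup`, `pg_isready` (shipped with PostgreSQL) confirms that the primary actually accepts connections on the database port:

  ```bash
  # Basic reachability and name resolution from the DR node
  ping -c 3 172.22.33.52
  nslookup <PRIMARY_SERVER_HOSTNAME>

  # Verify the primary accepts PostgreSQL connections on its port (5432 here)
  pg_isready -h 172.22.33.52 -p 5432
  ```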
- Restart Patroni Service:
  - Restart the Patroni service on the DR server: `sudo systemctl restart patroni`.
- Manually Check Patroni API:
  - Use a web browser or `curl` to access the Patroni REST API, e.g. `http://172.22.33.52:8008/patroni` (the API listens on the port configured under `restapi` in `patroni.yml`; 8008 is the default, not the PostgreSQL port 5432).
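  For example, from the DR node (the port must match your `restapi.listen` setting; the default 8008 is assumed here):

  ```bash
  # The /patroni endpoint returns the node's role, state, and timeline as JSON
  curl -s http://172.22.33.52:8008/patroni

  # Each node exposes its own API; the DR node can also be checked locally
  curl -s http://172.22.8.52:8008/patroni
  ```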
- Update Configuration:
  - Update the `patroni.yml` file on the DR server to match the scope and namespace settings of the production file.
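  A sketch of the relevant keys in the DR server's `patroni.yml`; the actual values must be copied from the production file, and the namespace shown is Patroni's default, used here only as an illustration:

  ```yaml
  # Top of /etc/patroni.yml on the DR node: scope and namespace must match
  # the production configuration for the cluster to be reported consistently
  scope: r1_prod         # previously r1_pg_dr_pr, per the patronictl output above
  namespace: /service/   # Patroni's default, shown as an illustration; must match production
  name: r1_pg_dr_pr_n1   # the member name stays unique per node
  ```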
- The issue was due to a namespace mismatch between the main production server and the disaster recovery (DR) server. Our experts updated the DR configuration file to use the same scope and namespace as the production file, which resolved the problem, and shared relevant documentation links for further reference.
Conclusion:
The root cause of the issue was identified as a namespace mismatch between the production and disaster recovery configurations. By aligning the namespace and scope settings in the DR configuration to match those of the production environment, the visibility issue was resolved.
Follow-Up Actions:
- Confirmed the recommendation to update the namespace and scope settings.
- Monitored the cluster behavior post-implementation to ensure stability and consistent display of replica status.
- Provided additional resources and documentation for reference.
The solution effectively resolved the Patroni output visibility issue in the DR database. The steps taken ensured that the DR environment was properly aligned with the production settings, thereby restoring full monitoring capabilities.