Problem:
The client’s billing system ran around 10 billing cycles each month. During each billing cycle, the billing process started and connected to the database cluster with 20 concurrent streams. However, once the process started, both replication and the billing process failed with the following error: “Could not send data to the client: Connection reset by peer.” To troubleshoot, the customer reduced the number of streams from 20 to 10, but the issue persisted. As a temporary workaround, the customer has been shutting down other daemon processes that normally run in parallel with the billing process.
Solution:
Step 1: Initial Investigation
To address this issue, the experts requested the PostgreSQL and Pgpool logs for further investigation.
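A first pass over such logs can be sketched as below. This is a minimal illustration, not the tooling the experts used: it assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp (this depends on the configured `log_line_prefix`), and the sample lines and the function name `count_resets_per_minute` are hypothetical.

```python
import re
from collections import Counter

# Assumption: log lines begin with "YYYY-MM-DD HH:MM:SS".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}")
RESET_TEXT = "Connection reset by peer"


def count_resets_per_minute(lines):
    """Map each minute ('YYYY-MM-DD HH:MM') to the number of reset errors logged."""
    counts = Counter()
    for line in lines:
        if RESET_TEXT not in line:
            continue
        match = TIMESTAMP.match(line)
        # Lines without a recognizable timestamp are grouped under "unknown".
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample_log = [
    "2024-01-05 02:00:01 UTC [1234] LOG:  could not send data to client: Connection reset by peer",
    "2024-01-05 02:00:45 UTC [1240] LOG:  could not send data to client: Connection reset by peer",
    "2024-01-05 02:01:10 UTC [1251] LOG:  checkpoint starting: time",
    "2024-01-05 02:03:02 UTC [1302] LOG:  could not send data to client: Connection reset by peer",
]

for minute, total in sorted(count_resets_per_minute(sample_log).items()):
    print(minute, total)
```

Grouping resets by minute makes it easier to line them up with the start of a billing cycle or with network monitoring graphs.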
Step 2: Further Investigation
For further investigation, a meeting was held in which the following topics were discussed:
- Connection Reset Error: The expert team investigated the connection resets that occurred during bulk database processing and affected both logical and physical replication. External checks with Red Hat, VMware, and network tools found no issues, which pointed toward a database problem.
- Database Issues: Although other components had been ruled out, the exact cause within the database remained unidentified.
- Network Utilization: Grafana monitoring during an incident showed no significant network anomalies, leaving the error cause unclear.
- Joint Call with Network Team: The network team was to be included in future troubleshooting of the connection resets during bulk processing in the PostgreSQL databases.
- Database Connection Issues: Missing database parameters were suspected of causing dropped connections, and Pgpool issues were suspected of affecting replication.
- Network Configuration: The database and Pgpool were connected through a router, not on the same subnet.
- Performance Degradation: Performance improved after ANALYZE and VACUUM were run during a process run.
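The write-up does not say which database parameters were suspected to be missing. As one hedged illustration, the sketch below checks a configuration text for PostgreSQL’s TCP keepalive settings (`tcp_keepalives_idle`, `tcp_keepalives_interval`, `tcp_keepalives_count`), which are a common first thing to review when connections drop. Choosing these parameters is an assumption, the helper names are hypothetical, and the parser is simplified: it handles only `name = value` lines.

```python
KEEPALIVE_PARAMS = (
    "tcp_keepalives_idle",
    "tcp_keepalives_interval",
    "tcp_keepalives_count",
)


def read_settings(conf_text):
    """Parse simple 'name = value' lines, ignoring comments and blank lines."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition("=")
        if not sep:
            continue  # simplification: lines without '=' are skipped
        settings[name.strip()] = value.strip().strip("'")
    return settings


def unset_keepalives(conf_text):
    """Return keepalive parameters that are absent or left at 0.

    In PostgreSQL, 0 means "use the operating system default".
    """
    settings = read_settings(conf_text)
    return [p for p in KEEPALIVE_PARAMS if settings.get(p, "0") == "0"]


# Hypothetical configuration fragment, for illustration only.
sample_conf = """
tcp_keepalives_idle = 60
#tcp_keepalives_interval = 10
tcp_keepalives_count = 0
"""

print(unset_keepalives(sample_conf))
```

Parameters reported here are not necessarily wrong; they simply fall back to operating system defaults, which may be worth comparing against the timeouts of any devices between Pgpool and the database.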
Step 3: Deeper Investigation
To plan the next steps, a second meeting was held in which the following topics were discussed:
- Connection Reset Issue: The team discussed connection resets during bulk operations, concluding it’s likely a network issue. Our experts suggested extended packet loss monitoring to diagnose the problem.
- Network Analysis: More in-depth network analysis with tools like tcpdump was recommended to uncover unusual parameters that might cause connection resets.
- Subnet Configuration: It was confirmed that the Pgpool and P1-time database are on the same subnet, but further network expert analysis is needed.
- Evidence Gathering: To definitively determine whether the issue was network or PostgreSQL-related, comprehensive evidence collection from both sides was necessary.
- TCP Zero Packet Issue: tcpdump captured TCP zero packets, which suggested potential issues with the source or destination machines rather than with PostgreSQL.
- Network Issue: Network problems were impacting application performance, logical and physical replication, and streaming replication, with complex virtual network architecture being a factor.
- Database Size: The database size was 500 GB with 32 CPUs.
- Replication Process: Data was replicated from the production database to standby databases using streaming and logical replication.
- IP Addresses: Different IP addresses were used for processes such as Pgpool VIP and connections to data sites.
- Connection Resets: Network issues caused connection resets that affected databases, applications, and replication processes.
- Virtual IP: Discussions covered virtual IPs for database failover scenarios.
- Upgrading Network Drivers: Upgrading network drivers was planned to address connection resets.
- Operating System Tuning: It was suggested to tune the operating system for better network, memory, and routing performance.
- Database Tuning: Pgpool load configurations from both database renters were shared to identify necessary fine-tuning.
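The kind of evidence gathering discussed above can be sketched as a small summary of tcpdump text output. This is a minimal illustration under stated assumptions: it expects the one-line-per-packet format that tcpdump prints for TCP over IPv4, it reads “TCP zero packets” as packets advertising a zero receive window (the write-up does not define the term), and the addresses, sample lines, and function name `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumed line format, for example:
# "12:00:01.000001 IP 10.0.0.5.5432 > 10.0.0.9.41000: Flags [.], ack 1, win 0, length 0"
PACKET = re.compile(
    r"IP (?P<src>\S+) > \S+: Flags \[(?P<flags>[^\]]*)\].*?\bwin (?P<win>\d+)"
)


def summarize(lines):
    """Count RST packets and zero-window packets per source address.port."""
    resets = Counter()
    zero_windows = Counter()
    for line in lines:
        match = PACKET.search(line)
        if not match:
            continue
        src = match.group("src")
        if "R" in match.group("flags"):
            # RST packets often carry win 0, so count them only as resets.
            resets[src] += 1
        elif int(match.group("win")) == 0:
            zero_windows[src] += 1
    return resets, zero_windows


# Hypothetical capture lines, for illustration only.
sample_capture = [
    "12:00:01.000001 IP 10.0.0.5.5432 > 10.0.0.9.41000: Flags [.], ack 1, win 0, length 0",
    "12:00:01.500000 IP 10.0.0.5.5432 > 10.0.0.9.41000: Flags [R.], seq 1, ack 1, win 0, length 0",
    "12:00:02.000000 IP 10.0.0.9.41000 > 10.0.0.5.5432: Flags [P.], seq 1:100, ack 1, win 501, length 99",
]

resets, zero_windows = summarize(sample_capture)
print("resets:", dict(resets))
print("zero windows:", dict(zero_windows))
```

Seeing which endpoint sends the resets or advertises a zero window, and whether replication connections show the same pattern as client connections, helps separate a host or network problem from a PostgreSQL one.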
Solution:
In conclusion, the experts found that the issue was not related to the database. The logs pointed to a network issue, because not only client connections but also streaming replication connections were being dropped.
Conclusion:
After investigating, the team of experts concluded that the problem wasn’t related to the database. The main issue was with the network, and the client needed to upgrade their system, adjust the network connection, and update the drivers.