Problem:
The client ran 10 bill cycles per month, and on bill cycle days the billing process connected to the DB cluster in 20 streams. After the process had started replication, the billing process failed with the “could not send data to the client: Connection reset by peer” error. The client reduced the number of streams to 10 but still faced the same issue. As a workaround, the client then shut down another daemon process that normally runs in parallel with the billing process.
Process:
Step 1 – initial investigation and the meeting with the client:
In the meeting, our experts discussed the following topics:
Connection Reset Troubleshooting:
Database Issues:
Network Utilization:
Database Connection Issues:
Network Configuration:
PG Pool Issue:
The team scheduled another meeting with network experts to address the connection drops and performance issues.
Step 2 – follow-up meeting with the client:
At this meeting, our expert team discussed the following topics with the client:
1. Connection Reset Issue:
2. Network Analysis:
3. Subnet Configuration:
4. Evidence Gathering:
5. TCP Zero Packet Issue:
6. Network Issue:
7. Database Size and Replication:
8. IP Addresses:
9. Network Issues Recap:
10. Virtual IP:
11. Network Drivers and OS Tuning:
12. Database Tuning:
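For the evidence-gathering and network-analysis topics above, one possible way to summarize a capture is to count TCP reset (RST) packets per sender, since "Connection reset by peer" is normally reported when a RST arrives on an established connection. The sketch below is only an illustration, not the procedure the team used: it assumes the capture has been exported as text in the default tcpdump line format, and the function name and sample addresses are hypothetical.

```python
import re
from collections import Counter

# Assumption: lines follow tcpdump's default text output, e.g.
# "02:13:45.120000 IP 10.0.0.5.5432 > 10.0.0.9.51514: Flags [R.], seq 1, ack 1, win 0, length 0"
PACKET = re.compile(r"IP6? (\S+) > (\S+): Flags \[([^\]]*)\]")


def count_resets(lines):
    """Count packets carrying the RST flag, keyed by (sender, receiver)."""
    resets = Counter()
    for line in lines:
        match = PACKET.search(line)
        if match and "R" in match.group(3):
            resets[(match.group(1), match.group(2))] += 1
    return resets


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "02:13:45.120000 IP 10.0.0.5.5432 > 10.0.0.9.51514: Flags [R.], seq 1, ack 1, win 0, length 0",
        "02:13:45.130000 IP 10.0.0.9.51514 > 10.0.0.5.5432: Flags [P.], seq 1:9, ack 1, win 501, length 8",
    ]
    for (sender, receiver), count in count_resets(sample).most_common():
        print(f"{sender} -> {receiver}: {count} RST")
```

Seeing which side sends the resets (database host, client host, or neither, which would hint at a device in between) can help narrow down where the connection is being torn down.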
Solution:
After troubleshooting and reviewing the earlier logs, the expert team suggested that the issue was not related to the database but to the network, because not only the client’s connections but also the streaming replication connections were being dropped.
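The reasoning above, that client and replication connections drop together, can be checked against the PostgreSQL server logs. The following is a minimal sketch under stated assumptions, not the team's actual tooling: it assumes each log line begins with a `YYYY-MM-DD HH:MM` timestamp (as with a `%m`-style `log_line_prefix`), that replication drops show up as "could not receive data from WAL stream" or "terminating walsender process" messages (exact wording can vary by PostgreSQL version), and all function names and sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: the client-side error appears in the server log in this form
# (with or without "the", to cover both spellings).
CLIENT_RESET = re.compile(
    r"could not send data to (?:the )?client: Connection reset by peer"
)
# Assumption: these messages mark a dropped streaming replication connection.
REPLICATION_DROP = re.compile(
    r"could not receive data from WAL stream|terminating walsender process"
)
# Assumption: every log line starts with "YYYY-MM-DD HH:MM".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def drops_per_minute(lines):
    """Count client resets and replication drops for each minute."""
    buckets = defaultdict(lambda: {"client": 0, "replication": 0})
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue
        minute = match.group(1)
        if CLIENT_RESET.search(line):
            buckets[minute]["client"] += 1
        elif REPLICATION_DROP.search(line):
            buckets[minute]["replication"] += 1
    return dict(buckets)


def minutes_with_both(buckets):
    """Return the minutes in which both kinds of drop were logged."""
    return sorted(
        minute
        for minute, counts in buckets.items()
        if counts["client"] and counts["replication"]
    )


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "2024-01-15 02:13:45.120 UTC [4101] LOG:  could not send data to client: Connection reset by peer",
        "2024-01-15 02:13:47.902 UTC [3977] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly",
    ]
    print(minutes_with_both(drops_per_minute(sample)))
```

If the same minutes show both kinds of drop, that supports a cause shared by all connections, such as the network, rather than something specific to the billing sessions.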
Conclusion:
The client experienced connection resets during monthly billing cycles when connected to the database cluster, even after reducing the number of streams and shutting down a parallel daemon process. Initial investigations and meetings identified the issue as a network problem rather than a database issue, even though the TCP dumps showed no clear network spikes. Further analysis revealed that both client and replication connections were affected, which points to a complex network issue rather than a database fault. The team plans to upgrade network drivers, tune the OS, and conduct a deeper network analysis to resolve the problem.