Problem:

The client ran 10 monthly bill cycles, and on bill-cycle days the billing process connected to the DB cluster over 20 parallel streams. After replication had started, the billing process failed with the “could not send data to the client: Connection reset by peer” error. The client reduced the number of streams to 10 but still faced the same issue. As a workaround, the client then shut down another daemon process that normally runs in parallel with the billing process.
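
For context, the “Connection reset by peer” message is what a client sees when the other side (or something in between, such as a firewall or connection pooler) aborts the TCP session with an RST segment. The following is a minimal, self-contained sketch that reproduces the symptom locally; the server here is a stand-in, not the client's actual billing stack:

```python
import socket
import struct
import threading

# A local server that aborts its connection with a TCP RST, which the
# client then observes as "Connection reset by peer".
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def abort_first_connection():
    conn, _ = srv.accept()
    # SO_LINGER with a zero timeout makes close() send RST instead of FIN
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()

t = threading.Thread(target=abort_first_connection)
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
t.join()  # make sure the RST has been sent before we touch the socket
try:
    cli.sendall(b"billing query")  # may fail immediately ...
    cli.recv(1024)                 # ... or fail here once the RST is seen
    outcome = "connection survived"
except (ConnectionResetError, BrokenPipeError):
    outcome = "connection reset by peer"
finally:
    cli.close()
    srv.close()
print(outcome)
```

In the client's case the RST came from somewhere on the network path rather than from the database itself, which is exactly what the investigation below had to establish.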

Process:

Step 1 – initial investigation and the meeting with the client:

In the meeting, our experts discussed the following topics:

Connection Reset Troubleshooting:

  • Discussed the connection reset error seen during bulk database processing.
  • The error impacted both logical and physical replication and terminated some connections.
  • Support cases opened with Red Hat and VMware found no external issues; a TCP dump also showed no network problems.

Database Issues:

  • The team suspected the database itself was causing the issue, as all other components functioned properly.

Network Utilization:

  • Grafana monitoring on April 15th showed no network spikes correlating with the errors.
  • Planned a joint call with the network team as a next step to address the connection resets during bulk processing in the Postgres databases.

Database Connection Issues:

  • Suspected missing database parameters and Pgpool dropping connections, affecting replication.

Network Configuration:

  • The database and Pgpool were not on the same subnet but were connected through an intermediate host.
  • Performance improved after running ANALYZE and VACUUM commands.

Pgpool Issue:

  • Debated whether Pgpool was causing idle session termination; no solution was found yet.
  • Scheduled another meeting with network experts to address the connection drops and performance issues.
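
The suspicion that Pgpool was terminating idle sessions can be cross-checked against a few well-known settings. The values below are an illustrative sketch only, not the client's actual configuration:

```ini
# pgpool.conf (illustrative values only)
client_idle_limit = 0        # 0 disables Pgpool's idle-client disconnects
connection_life_time = 0     # 0 keeps cached backend connections open
child_life_time = 300        # recycle idle pgpool children after 5 minutes

# postgresql.conf -- keep TCP sessions alive across quiet periods
tcp_keepalives_idle = 60     # seconds before the first keepalive probe
tcp_keepalives_interval = 10 # seconds between probes
tcp_keepalives_count = 5     # probes before the connection is declared dead
```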

Step 2 – follow-up meeting with the client:

In this meeting, our expert team discussed the following topics with the client:

1. Connection Reset Issue:

  • Bulk processing on the production system caused connection resets.
  • Concluded it was a network issue and suggested extended packet-loss monitoring.

2. Network Analysis:

  • A deeper network analysis with tools like tcpdump was recommended.
  • Checked parameters such as TCP keepalives and timeouts for anomalies.

3. Subnet Configuration:

  • Pgpool and the primary database were confirmed to be on the same subnet; further expert analysis was needed.

4. Evidence Gathering:

  • Collected evidence to determine whether the issue was network-related or Postgres-related.

5. TCP Zero-Window Packets:

  • Tcpdump showed TCP zero-window packets, indicating possible machine-level issues rather than Postgres problems.

6. Network Issue:

  • The complex network setup affected the application, replication, and connections, impacting both master and slave servers.

7. Database Size and Replication:

  • The database was about 500 GB in size, running on 32 CPUs.
  • Replication ran from production to standby using both streaming and logical methods.

8. IP Addresses:

  • Discussed the various IPs used for the Pgpool VIP and for connections to the data sites.

9. Network Issues Recap:

  • Network issues caused connection resets for both databases and replication processes.

10. Virtual IP:

  • The virtual IP for database 14 correlated with database 19, affecting primary database roles.

11. Network Drivers and OS Tuning:

  • Planned to upgrade network drivers and tune the operating systems for better performance.

12. Database Tuning:

  • The client shared the load configuration for fine-tuning the Pgpool settings.
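
The OS tuning planned in topic 11 typically starts with kernel TCP settings such as the ones below. This is a hedged sketch for /etc/sysctl.conf; the actual values would have to come out of the network analysis:

```ini
# /etc/sysctl.conf (illustrative values only)
net.ipv4.tcp_keepalive_time = 300    # idle seconds before keepalive probes start
net.ipv4.tcp_keepalive_intvl = 30    # seconds between keepalive probes
net.ipv4.tcp_keepalive_probes = 5    # failed probes before the peer is dropped
net.core.netdev_max_backlog = 5000   # queue for packets arriving faster than the kernel drains them
```
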

Solution:

After troubleshooting and reviewing the earlier logs, the expert team concluded that the issue was not related to the database but to the network: not only were client connections dropped, but streaming replication connections were dropped as well.
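
The RST segments and zero-window advertisements discussed in the meetings can be counted from a plain-text tcpdump capture. A minimal sketch; the capture lines below are hypothetical stand-ins for real output from something like `tcpdump -nn 'tcp port 5432'`:

```python
import re

# Hypothetical lines in tcpdump's default text format (not real capture data)
sample = """\
12:00:01.000001 IP 10.0.0.5.5432 > 10.0.0.9.41000: Flags [R.], seq 1, ack 1, win 0, length 0
12:00:01.000200 IP 10.0.0.9.41000 > 10.0.0.5.5432: Flags [.], ack 1, win 0, length 0
12:00:02.000300 IP 10.0.0.9.41002 > 10.0.0.5.5432: Flags [P.], seq 1:101, ack 1, win 512, length 100
"""

resets = 0        # RST segments: the "reset by peer" side of the story
zero_windows = 0  # zero-window advertisements: a receiver that cannot keep up

for line in sample.splitlines():
    flags = re.search(r"Flags \[([^\]]*)\]", line)
    win = re.search(r"\bwin (\d+)", line)
    if flags and "R" in flags.group(1):
        resets += 1
    elif win and win.group(1) == "0":
        zero_windows += 1

print(resets, "resets,", zero_windows, "zero-window segments")
```

Separating the two symptoms matters here: resets point at something actively aborting sessions on the path, while zero windows point at a host that cannot drain its receive buffer.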

Conclusion:

The client experienced connection resets during monthly billing cycles when connected to the database cluster, even after reducing the number of streams and shutting down parallel processes. Initial investigations and meetings identified the issue as a network problem rather than a database issue, despite TCP dumps showing no clear network spikes. Further analysis revealed that both client and replication connections were affected, indicating a complex network issue rather than a database fault. The team plans to upgrade network drivers, tune the OS, and conduct a deeper network analysis to resolve the problem.