Problem:

The client experienced connection issues on the PostgreSQL database server, with an abnormally high number of connections, reaching around 20,000 at a time. The client asked the expert team for help identifying the possible causes. According to the client’s own analysis, there were multiple wait events on HAProxy, and the maximum connection limit was set to approximately 10,000.

Process:

The expert team began by investigating the logs and working through the troubleshooting process below.

Step 1:

The team’s investigation and recommendations included:

  • HAProxy Version Inquiry: Asked which version of HAProxy was in use.
  • Application Inspection: Suggested scrutinizing the application sending requests to PostgreSQL via HAProxy.
  • Suspected Issue: A potential failure to close connections properly.
  • Operating System Information: Asked which operating system HAProxy and PostgreSQL were deployed on.
  • Netstat Analysis: Proposed analyzing netstat output on the PostgreSQL server (see the netstat sketch after this list).
  • TCP Parameter Adjustment: Recommended adjusting core TCP parameters, particularly TCP connection timeouts, since the defaults may be too large for high-load systems (see the sysctl sketch after this list).
  • PgBouncer Consideration: Suggested placing PgBouncer in front of PostgreSQL, depending on application logic and requirements (see the configuration sketch after this list).
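
As a reference for the netstat analysis, the sketch below shows the kind of commands that can be run on the PostgreSQL server, assuming a Linux host and the default PostgreSQL port 5432:

    # Count PostgreSQL connections by TCP state (field 6 of netstat output is the state)
    netstat -ant | awk '$4 ~ /:5432$/ {print $6}' | sort | uniq -c | sort -rn

    # List the remote addresses holding the most connections to PostgreSQL
    netstat -ant | awk '$4 ~ /:5432$/ {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head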
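
For the TCP parameter adjustment, the following Linux kernel settings are the usual candidates; the values shown are illustrative assumptions, not the client’s final numbers:

    # Shorten how long sockets linger in FIN-WAIT-2 and probe idle connections sooner
    sudo sysctl -w net.ipv4.tcp_fin_timeout=30
    sudo sysctl -w net.ipv4.tcp_keepalive_time=300
    sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30
    sudo sysctl -w net.ipv4.tcp_keepalive_probes=5
    # Persist the settings in /etc/sysctl.d/ and apply them with: sudo sysctl --system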
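
For the PgBouncer option, a minimal pgbouncer.ini sketch is shown below; the database name, addresses, and pool sizes are placeholders to be adapted to the application:

    ; Minimal pgbouncer.ini sketch (names, addresses, and sizes are placeholders)
    [databases]
    appdb = host=127.0.0.1 port=5432 dbname=appdb

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    max_client_conn = 10000
    default_pool_size = 50
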
Step 2:

During the client’s live session, our team, alongside an expert, tackled several critical issues:

  • HAProxy Connection Closure: Our experts delved into why HAProxy wasn’t closing connections as expected. This involved scrutinizing netstat output and seeking guidance from an HAProxy specialist.
  • Persistent Connections: Our experts investigated why connections persisted even when the application was offline. By sifting through logs, they identified the root cause and implemented the necessary fixes.
  • Configuration Adjustment: Updates were made to the Kubernetes or direct edge proxy settings on the server to ensure smooth operation.
  • Client Machine Troubleshooting: To address issues with specific client machines, our experts introduced session timeouts and checked thoroughly for any additional running processes that might be causing disruptions.
  • Timeout Reduction: To optimize performance, our experts reduced FIN timeouts from 1500 minutes to a more efficient range of 1-10 minutes (see the timeout sketch after this list).
  • HAProxy Weight Connections: Our experts explored whether the issue stemmed from an HAProxy bug or from the application code, ensuring a comprehensive approach to problem-solving.
  • Addressing Unique Cases: Specific connection problems with certain client machines were resolved, with solutions tailored to their unique circumstances.
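
The timeout reduction above corresponds to settings of roughly the following shape in haproxy.cfg; the directives and values here are an illustrative sketch, not the client’s final configuration:

    # Illustrative timeout settings in the defaults section of haproxy.cfg
    defaults
        timeout connect     5s
        timeout client      5m    # close idle client-side connections after 5 minutes
        timeout server      5m    # close idle server-side connections after 5 minutes
        timeout client-fin  1m    # shorten the window for half-closed client connections
        timeout server-fin  1m
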
Step 3:

    1. Modified HAProxy Logging Level

  • SSH’ed into the HAProxy server and ensured permissions.
  • Located and opened the HAProxy configuration file, typically haproxy.cfg.
  • Found the relevant logging configuration (commonly in the defaults section).
  • Changed the logging level to debug for detailed information (see the example settings after this list).
  • Saved the changes and exited the text editor.
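
A sketch of what the change can look like in haproxy.cfg is shown below; the syslog target and facility are assumptions:

    # Example logging settings in haproxy.cfg; the last field of the log directive is the maximum level
    global
        log 127.0.0.1 local0 debug

    defaults
        log global
        option tcplog    # PostgreSQL traffic is plain TCP
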
    2. Reloaded HAProxy

  • Checked the configuration for syntax errors using sudo haproxy -c -f /etc/haproxy/haproxy.cfg.
  • If no errors were found, reloaded HAProxy gracefully with sudo haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid) or did a full restart with sudo systemctl restart haproxy (the full sequence is collected below).
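
Gathered into one sequence, the commands from this step look like the following (the paths are those named above):

    # Validate the configuration file before touching the running process
    sudo haproxy -c -f /etc/haproxy/haproxy.cfg

    # Graceful reload: start a new process and ask the old one to finish its connections and exit
    sudo haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)

    # Or, where HAProxy runs under systemd, a full restart
    sudo systemctl restart haproxy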

    3. Evaluated Impact on HAProxy Performance

  • Recognized that debug-level logging could impact HAProxy’s performance due to increased log volume.
  • Understood that detailed logs required more processing and disk I/O, potentially affecting CPU and disk usage.
  • Monitored system performance after enabling debug logging (see the monitoring sketch after this list) and considered reverting to a lower level (e.g., warning) if performance issues arose.
  • After gathering troubleshooting data, returned to the default logging level for routine operations.
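
One way to keep an eye on the extra load while debug logging is enabled is sketched below; the log path and sampling intervals are assumptions:

    # Watch CPU and disk I/O while debug logging is on
    vmstat 5
    iostat -x 5

    # Track how quickly the HAProxy log is growing
    watch -n 60 'du -h /var/log/haproxy.log'
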
Solution:

After the investigation, the expert team connected to the HAProxy server, edited the haproxy.cfg file, and set the logging level to debug. They verified the configuration for syntax errors and reloaded HAProxy gracefully, restarting it where necessary. They monitored system performance because of the increased log volume and reverted to the default logging level once troubleshooting was complete.

Conclusion:

The client faced connection issues on the PostgreSQL database server, with connections surging to about 20,000 at a time. The expert team investigated, focusing on HAProxy settings, possible application faults, and system configurations. They recommended steps including querying the HAProxy version, checking the application’s connection management, examining OS specifics, using netstat, tuning TCP parameters, and considering PgBouncer. The team changed the HAProxy logging level to debug, reloaded the HAProxy configuration, and assessed the performance impact. They discussed findings and solutions with the client, planned follow-up meetings, and aimed to return to standard logging once troubleshooting was complete.