Problem:

The client reported Keepalived crashes that disrupted high availability and caused application errors, most notably connection timeouts against the PostgreSQL server. Initial investigation pointed to suspected network instability and outdated software versions as contributing factors.

Process:

  1. Requesting initial information to investigate the problem
    • The number of servers in the HAProxy and Keepalived pool.
    • HAProxy and Keepalived configurations from all servers.
    • Full, unfiltered /var/log/messages files from all servers.
    • “dmesg” logs from all servers.
    • Details on the deployment of HAProxy, Keepalived, and Postgres.
    • “iptables” configurations from all servers.
    • Screenshots or instructions for enabling the HAProxy admin interface.
    • Installation and running of the dstat utility on all servers.
    • A code snippet to enable the HAProxy admin interface (sketched below).
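
    For reference, a minimal sketch of the kind of snippet requested for the HAProxy admin/stats interface is shown below; the bind port, URI, and credentials are illustrative placeholders, not the client's actual values.

      listen stats
          mode http
          bind *:8404                  # placeholder port
          stats enable
          stats uri /stats
          stats refresh 10s
          stats admin if TRUE          # allows enabling/disabling servers from the page
          stats auth admin:changeme    # placeholder credentials
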
  2. Investigation of Network Issues

    The investigation began with the suspicion that network issues were causing intermittent problems for an application hosted across different VMs. The team checked the interfaces for errors using commands such as “ip a” and “ifconfig ens224”, and examined connection logs from the backend servers to correlate them with application failures. The relevant logs were gathered during the client’s live session with the expert.
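
    The commands below illustrate the kind of checks performed; the interface name ens224 comes from the session, while the log paths and filters are assumptions for a typical RHEL-family host.

      # Per-interface RX/TX error and drop counters
      ip -s link show ens224

      # Kernel messages about link flaps or NIC resets
      dmesg -T | grep -iE 'ens224|link is (up|down)'

      # Correlate HAProxy/Keepalived log entries with application failures
      grep -iE 'keepalived|haproxy' /var/log/messages | tail -n 100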

  3. Troubleshooting Keepalived Version 2.1.5

    The discussion then shifted to the Keepalived version in use, 2.1.5, an older release affected by timer-expiration issues that can lead to split-brain scenarios. The expert recommended upgrading to a stable release (2.2.8). A GitHub issue (https://github.com/acassen/keepalived/issues/2066) was referenced for further context on the timer-expiration problem and its potential solutions.
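
    A sketch of the version check and upgrade path discussed, assuming a RHEL-family system with dnf; whether 2.2.8 is available as a package depends on the distribution and its repositories.

      # Confirm the running Keepalived version on each node
      keepalived -v

      # See which versions the repositories provide (RHEL-family assumption)
      dnf list --showduplicates keepalived

      # Upgrade from the repository if a fixed release is available, then restart
      sudo dnf update keepalived
      sudo systemctl restart keepalived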

  4. System Configuration Review

    The review then focused on system configuration, particularly the parameters specific to Keepalived. The client asked which default parameters require tuning; the expert noted that Keepalived typically needs little tuning out of the box because it runs the relatively simple VRRP protocol directly over IP.
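
    To illustrate why little tuning is normally needed, a minimal VRRP block in keepalived.conf looks like the sketch below; every value shown is a placeholder rather than the client's configuration.

      vrrp_instance VI_1 {
          state BACKUP              # both nodes start as BACKUP; priority decides the MASTER
          interface ens224          # interface name taken from the session; adjust per host
          virtual_router_id 51      # placeholder; must match on all nodes in the group
          priority 100              # higher value wins the election
          advert_int 1              # VRRP advertisement interval in seconds
          authentication {
              auth_type PASS
              auth_pass changeme    # placeholder secret (8 characters max)
          }
          virtual_ipaddress {
              192.0.2.10/24         # placeholder VIP (documentation address range)
          }
      }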

  5. Bug Impact and Resolution

    A bug in Keepalived was suspected as the cause of the issue. Updating Keepalived was recommended as the immediate fix, and upgrading HAProxy to version 2.6 was suggested for better stability. The expert asked whether restarting the services would be acceptable as a quick workaround; the client pointed out the potential impact on existing connections, and changing the VM options if needed was also raised as a way to alleviate the issue with the older Keepalived version.
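
    A sketch of the restart workaround that was discussed, shown here as a service restart; restarting the BACKUP node first is an assumption about the client's failover preference, not something confirmed in the session.

      # Restart Keepalived on the affected node and verify it comes back healthy
      sudo systemctl restart keepalived
      sudo systemctl status keepalived --no-pager

      # Confirm which node currently holds the virtual IP
      ip addr show ens224 | grep -w inet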

  6. System Compatibility and Recommendations

    The expert recommended that the client upgrade HAProxy to version 2.6 because of security issues in the version currently installed. Addressing minor issues related to Red Hat updates was also emphasized.
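
    Illustrative pre-upgrade checks, again assuming a RHEL-family system and the default configuration path /etc/haproxy/haproxy.cfg.

      # Check the installed HAProxy version against the 2.6 target
      haproxy -v

      # List the versions the repositories offer
      dnf list --showduplicates haproxy

      # Validate the existing configuration before and after any upgrade
      haproxy -c -f /etc/haproxy/haproxy.cfg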

  7. Impact of Restarting Services

    The discussion then turned to how restarting services would affect existing connections, particularly at the database level. Scenarios involving active connections, in-flight transactions, and connection pooling on both the application and database sides were discussed.
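
    A query of this kind, against the standard pg_stat_activity view, shows which connections and in-flight transactions a restart would affect; it is an illustrative example rather than a step taken during the session.

      -- Active client connections and long-running transactions
      SELECT pid, usename, client_addr, state, xact_start, query_start
      FROM pg_stat_activity
      WHERE state <> 'idle'
      ORDER BY xact_start;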

Solution:

Keepalived and HAProxy Upgrades:
The client proceeded with the upgrade of Keepalived to the latest stable version, addressing timer expiration issues and improving the overall stability of the high availability configuration. Simultaneously, HAProxy was migrated to version 2.6, mitigating security risks and leveraging performance optimizations to enhance load balancing capabilities.
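
A simple post-upgrade sanity check of this kind can confirm the new versions on each node; the expected version numbers follow the recommendations above.

    keepalived -v                            # expect a 2.2.x release
    haproxy -v                               # expect a 2.6.x release
    systemctl is-active keepalived haproxy   # both should report "active"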

Network and System Configuration Optimization:
Configuration adjustments were implemented based on the recommendations to optimize network settings, CPU and memory allocation, and load balancing algorithms. These optimizations aimed to improve resource utilization, reduce latency, and enhance the overall resilience of the infrastructure.
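
The exact adjustments were specific to the client's environment; the fragment below only sketches the kind of HAProxy directives involved (timeouts, balancing algorithm, health checks), with placeholder addresses and values.

    defaults
        mode tcp
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    backend pg_backend
        balance leastconn                # example load-balancing algorithm
        option tcp-check                 # basic TCP health check
        server pg1 10.0.0.11:5432 check  # placeholder backend addresses
        server pg2 10.0.0.12:5432 check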

VMware Snapshot Adjustments:
Changes were made to the VMware snapshot schedules for the affected virtual machines to minimize the impact on system performance and prevent potential freezes during snapshot operations. Staggered snapshot schedules were introduced to spread resource-intensive tasks more evenly, ensuring smoother operation and reducing the risk of snapshot-related disruptions.

Conclusion:

The collaborative effort and thorough analysis successfully identified and addressed the root causes of the Keepalived crashes and application errors. The proposed adjustments to the VMware snapshot schedule and the exploration of vSphere backup mechanisms lay a foundation for improved system stability and the prevention of future disruptions. Continued vigilance and adherence to the recommended best practices will be essential for sustained system reliability and performance.