Case Studies Archives - Proactive Insights and Support For Open-Source Applications

27 Jun 2025 PostgreSQL Failover Analysis

Problem: The client experienced a failover event in their Patroni-managed PostgreSQL cluster between 01:00 AM and 02:00 AM on May 23, 2025. Process: Step 1 – Initial Investigation. Initial logs from PostgreSQL (postgresql-Fri-00.log and postgresql-Fri-01.log) revealed regular query activity. This included frequent queries from monitoring tools (pg_stat_all_tables, pg_locks, etc.), checkpoint logging, and client […]

Database

16 Jun 2025 Resolving File Descriptor Exhaustion in PostgreSQL with Patroni HA Cluster

Problem: The client encountered a persistent issue related to file descriptor exhaustion on their PostgreSQL version 15 database, running on a Patroni High Availability cluster with RHEL 8.10. The PostgreSQL logs frequently reported the error: “out of file descriptors: Too many open files; release and retry” during database operations. Although the client had significantly increased […]

Database

13 Jun 2025 Implementing Quorum-Based Semi-Synchronous Replication in PostgreSQL with Patroni

Problem: The client implemented Patroni High Availability PostgreSQL clusters using PostgreSQL version 15. The current configuration consisted of two-node clusters: one primary node and one replica node, with asynchronous replication between them. The client requested a more durable replication setup that would be resilient to operating system crashes, while maintaining high database performance, as the […]

Database

2 Jun 2025 Implementing SSL and High Availability in a Multi-Node OpenSearch Cluster

Problem: A financial services client encountered critical SSL-related errors while deploying a two-node OpenSearch 1.3.6 cluster for high availability. Although both nodes appeared operational, accessing indices or interacting with the cluster through a Java application resulted in errors such as: “SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment”; “IBMCertPathBuilderException: unable to find valid certification path […]

Data Analytics

30 May 2025 Troubleshooting and Securing Access to Cassandra from SRE and EC2 Nodes

Problem: The client was operating a 5-node Apache Cassandra cluster (version 4.1.5) and needed to establish secure access to the database from both an SRE server and an EC2 server. While basic connectivity (e.g., telnet) between the source and Cassandra target nodes was verified, direct access to Cassandra using cqlsh was unsuccessful. The client sought […]

Database

28 May 2025 Renaming a Kubernetes Control-Plane Node in a Production Cluster

Problem: A telecommunications customer operating a production Kubernetes cluster deployed via Kubespray encountered an infrastructure challenge. One of their original control-plane nodes (kz-bss-k8om01) had previously failed and was replaced with a new node named kz-bss-k8om04. Later, the client asked for this node to be renamed to the original FQDN (kz-bss-k8om01) and ideally retain the original IP […]

Developer Tools

26 May 2025 Resolving Cassandra Backup Failures Due to Priam Incompatibility

Problem: The client encountered a failure while attempting to run a Cassandra backup using Commvault on their QAT cluster. The backup process failed with a 500 HTTP error originating from the local Priam REST endpoint: “HTTP ERROR 500 Problem accessing /REST/v1/cassadmin/info. Reason:” Commvault support traced the issue to the Priam service and advised the client […]

Database

23 May 2025 Optimizing Nodetool Cleanup Performance in a Large-Scale Apache Cassandra 4.1.5 Cluster During Node Addition

Problem: The client faced performance challenges while running nodetool cleanup on an Apache Cassandra 4.1.5 cluster during a node addition activity in a production environment. Specifically, the cleanup process was taking an unexpectedly long time on nodes with over 600GiB of data load, raising concerns about the overall timeline and impact on production workflows. The […]

Database

23 May 2025 Optimizing Apache Cassandra Repair: Reducing CPU Utilization from 90% to 30%

Problem: The client reported high CPU utilization (up to 90%) across all nodes in their 3-node Apache Cassandra 4.1.3 cluster during full or incremental repair operations initiated from any single node. Despite relatively low data volumes (~25 GB per node), the CPU spike raised concerns about system performance, stability, and potential downtime during repairs. Process: […]

Database