Problem: A customer running Patroni-managed PostgreSQL (PostgreSQL 15.17, Patroni v3.3) had production databases ranging from >2-3 TB up to a ~52 TB physical replica. Rebuilding the empty DR site from production relied on manual pg_basebackup executions and manual post-backup recovery steps, which was too slow, operationally intensive, and prone to human error for such large volumes. The […]
Knowledge Base Case Studies
22 Jun 2026
PostgreSQL LWLock LockManager Contention caused by unpruned partition scans and high concurrency
Problem: An OLTP system using Patroni-managed PostgreSQL 15 experienced global slowdowns under load: CPU climbed to ~80%, with large numbers of concurrent sessions and application timeouts. The client reported bursts of traffic (hundreds of workers in parallel) routed through HAProxy to the leader, and provided PostgreSQL and Patroni logs, auto_explain output, and query samples. Key database […]
Knowledge Base Case Studies
1 Jun 2026
Controlling heavy queries and resource usage on a Patroni PostgreSQL cluster
Problem: A production Patroni-managed PostgreSQL 15 cluster experienced periodic heavy queries that threatened availability. An example slow job ran for ~84 seconds and performed a full scan of a 1.8 TB partitioned table (arbor.CDR_DATA) that uses daily partitions starting in early April. Most clients connect through generic application users rather than distinct personal accounts. The […]
Knowledge Base Database Case Studies
28 May 2026
Rebalancing uneven disk usage in a 5-node Apache Cassandra 2.2.5 cluster
Problem: A 5-node Apache Cassandra 2.2.5 cluster (two data centers) reported severe per-node disk imbalance: each node was configured with five data_file_directories (e.g. /cassandra/data1/data … /cassandra/data5/data), but some mount points on individual nodes were nearly full (examples showed mounts at 93% and 95% used). On one node a particular keyspace (jessi) had large SSTable directories […]
Knowledge Base Case Studies
22 May 2026
Root cause analysis: PostgreSQL primary crashed from system-wide file-descriptor exhaustion
Problem: A production Patroni-managed PostgreSQL cluster (PostgreSQL v15.17, Patroni 3.3.2) experienced a primary process abort with SIGABRT during normal operation. Server logs reported that a server process was terminated by signal 6 (Aborted) and that the failed process was executing a COMMIT when the postmaster began terminating other server processes. Subsequent messages showed PostgreSQL could […]
Data Management and Analytics Database Case Studies
20 May 2026
Resolving a Complex PostgreSQL Patroni Replica Failure
Problem: The database architecture reached its limit when a critical node stopped responding. A standby node experienced a PostgreSQL Patroni replica failure, repeatedly logging "incorrect resource manager data checksum" errors. The system stopped working because corrupted Write-Ahead Log segments completely broke the replication stream. A dangerous shortcut would involve running continuous base backups to force […]
Database
20 May 2026
Transitioning Zabbix LLD to an External Dynamic Source
Problem: The client needed to modernize their Zabbix v7.4.8 setup by migrating a Low-Level Discovery rule away from a static, local JSON file. They required best practices and an architectural recommendation for dynamically triggering and feeding the discovery rule from an external Windows server, while ensuring reliable and efficient updates.
Process: Step 1: Architecture Review […]
Monitoring
11 May 2026
Zabbix TLS Handshake Fix: Host & SNI Mismatch
Problem: During migration to Zabbix 7.4, hundreds of HTTPS health checks failed because the HTTP Host header differed from the SNI hostname used for the TLS handshake. In this environment, monitoring items connect to a load-balancer VIP using a custom Host header to reach specific virtual sites. However, the target servers detected the discrepancy between […]
Monitoring Knowledge Base Case Studies DevOps
8 May 2026
Enabling WAL archiving on a DR Patroni standby to allow backups from the replica
Problem: A customer running Patroni-managed PostgreSQL v15.17 (Patroni 3.3.2) asked whether Commvault backups can be taken from the DR site's Standby Leader. The DR cluster is a replicating standby of Production. The customer asked specifically whether WAL file generation can be started on the DR Standby Leader and whether Commvault's option to delete WALs after […]
Knowledge Base Case Studies Data Management and Analytics Database