Problem:
A production Patroni-managed PostgreSQL (community 15.17, Patroni 3.3.2) cluster experienced recurrent nightly performance degradation. Symptoms included large numbers of LWLock wait events and high query latency during night operations reported by the application and visible in the database analytical report. Patroni and PostgreSQL logs were provided for analysis.
Operational context: PgBouncer was present as a front-end pooler handling roughly 1.6k client connections (configured listen_port 6432, auth_type=md5, pool_mode=transaction, max_client_conn=5000, default_pool_size originally 1000 then locally reduced to 50). The PostgreSQL servers were set with max_connections=4000 (not reducible for the application), the environment had ~1.5k direct DB connections in addition to pooled traffic, and periodic spikes of >2k server-side connections were observed at the database level. Client proposals under consideration included increasing max_wal_size (32GB → 64GB), increasing checkpoint_timeout (15min → 30min), and raising work_mem (2MB → 16MB), with concerns about recovery time and RAM usage.
Process:
Step 1: Correlate nightly latency with lock/wal events
PostgreSQL server logs and the supplied nightly analytical report were reviewed to align timestamps. A clear correlation was found between nightly job windows and peaks in LWLock wait events and checkpoint activity; this established that lock contention and IO bursts were temporally linked to the night workload and checkpoint/WAL cycles, making checkpoint tuning and workload scheduling prime investigation targets.
Step 2: Map connection topology and pooling behavior
PgBouncer configuration and runtime connection counts were inspected alongside pg_stat_activity. The pooler was running in transaction mode with a very large default_pool_size configuration previously set extremely high, and the database showed both many direct connections and elevated server-side backends during peaks. This indicated pooling configuration was permitting bursts of server backends that increased per-backend memory and contention, and that client-side connection storms could translate into sudden backend growth on PostgreSQL.
Step 3: Identify LWLock types and hotspots
pg_stat_activity, pg_locks, and relevant LWLock counters (from server logs and the diagnostic snapshots provided) were examined to break down the lock waits. The majority of waits aligned with buffer-manager and WAL-related LWLocks during checkpoint and heavy write windows rather than short-term SQL-level lock waits, pointing to I/O-driven contention rather than application-level locking bugs.
Step 4: Compute memory exposure for proposed work_mem change
Configuration math was performed using current max_connections and observed connection patterns to estimate worst-case and realistic memory footprints. With the observed peak backend count and available RAM, raising work_mem from 2MB to 16MB reduced on-disk temp usage for sorts/joins without exceeding safe memory limits in practice. The calculation justified the increase while keeping max_connections unchanged.
Step 5: Evaluate checkpoint/WAL parameter trade-offs
Checkpoint frequency and WAL growth were modeled for the nightly load using existing max_wal_size and checkpoint_timeout. Increasing max_wal_size to 64GB and extending checkpoint_timeout to 30 minutes was predicted to reduce checkpoint frequency and smooth I/O, lowering the incidence of WAL-related LWLock spikes. Recovery-time trade-offs were quantified and accepted as within the operational SLA after discussion with the client.
Step 6: Apply changes, validate under production-like windows
PgBouncer pool sizing was adjusted to limit server-side backend growth (default_pool_size reduced to a conservative value consistent with application connection patterns while keeping pool_mode=transaction and max_client_conn sufficient). PostgreSQL parameters were updated: max_wal_size=64GB, checkpoint_timeout=30min, checkpoint_completion_target remained 0.9, and work_mem increased to 16MB. Nightly jobs were rescheduled/staggered where feasible to avoid coincidence with remaining checkpoint peaks. Post-change monitoring during subsequent night windows showed marked reductions in LWLock wait counts and improved query responsiveness, confirming the interaction between connection bursts, checkpoint I/O, and lock contention and transitioning into the implemented Solution.
Solution:
PostgreSQL was tuned and the pooling layer reconfigured: max_wal_size was raised to 64GB, checkpoint_timeout extended to 30 minutes, checkpoint_completion_target retained at 0.9, work_mem increased to 16MB, and PgBouncer pool sizing was reduced to prevent large, simultaneous server-side backend growth. Nightly batch windows were staggered away from heavy checkpoint activity.
These changes work because they (1) reduce checkpoint frequency and spread write I/O over longer intervals in PostgreSQL, lowering WAL- and buffer-manager LWLock contention, (2) limit the number of active server backends created by pooling bursts so per-backend memory and lock pressure are constrained, and (3) reduce on-disk temp usage by giving queries more memory for sorts/joins, cutting additional I/O spikes.
Conclusion:
After the tuning and pooler adjustments, nightly LWLock wait events and query latency dropped significantly during validation windows. The environment achieved more stable night-time performance without reducing max_connections, with quantified trade-offs on recovery time accepted into the SLA and continued monitoring in place to detect regressions.