Problem:
Seven identical Patroni-managed PostgreSQL 15.15 clusters (Patroni 3.3.2) showed divergent behaviour for the same SELECT query: clusters labelled Cust 1–5 returned results quickly while Cust 6–7 returned the same query much slower. Dataset sizes, indexes, constraints and foreign keys were reported as equivalent across clusters. One temporary mitigation (set enable_mergejoin = false) had been applied on affected hosts to avoid slowness, but the issue persisted on at least one cluster. Execution plans and runtime timings were provided for comparison.
Additional signals included table analyze metadata showing varying last_analyze timestamps and non-zero modification counters on affected tables, and execution plans indicating different join orders and much higher processed-row estimates on slow clusters.
Process:
Step 1: Confirm symptom scope and compare execution plans
Observed the client-provided execution plans and runtime timings from fast (Cust 5) and slow (Cust 6/7) clusters. The slow plans showed a different join order and dramatically higher estimated/actual row processing than the fast plan. This established that the performance divergence was planner-driven (plan selection) rather than an I/O or CPU bottleneck specific to a host.
Step 2: Inspect planner inputs — statistics and configuration
Reviewed pg_stat_all_tables / pg_stat_user_tables outputs supplied by the client and noted differences in last_analyze/last_autoanalyze timestamps and n_mod_since_analyze counts for the tables used by the query. One key relation had a high n_mod_since_analyze value on the affected cluster, indicating potentially stale statistics. Also validated that enable_mergejoin had been manually disabled on some nodes, producing a configuration discrepancy relevant to join algorithm availability.
Step 3: Run consolidated diagnostics on three reference clusters
Requested and received outputs from a read-only diagnostic script executed on a representative fast cluster (Cust 5) and the two slow clusters (Cust 6 and Cust 7). The script captured planner-related settings (planner_cost_constants, enable_* flags), per-relation pg_class/pg_stat statistics and actual EXPLAIN (ANALYZE, BUFFERS) runs. Comparison confirmed that the slow cluster with worst performance had both a stale statistics snapshot for the primary join table and the planner choosing a nested loop/poor join order because of incorrect row-count estimates.
Step 4: Validate effect of statistics refresh and planner configuration
Compared the client’s post-refresh behavior and historical timeline: after a manual global ANALYZE run reported by the client on April 1, query execution times improved across the previously affected clusters. This correlated the timing of performance recovery with statistics refresh rather than the mergejoin flag change, demonstrating that outdated statistics were the primary driver of incorrect plan selection.
Step 5: Implement fix in the affected environment and verify
Requested and observed a controlled refresh of planner statistics on the affected cluster(s) (ANALYZE on the involved schema/tables). Post-ANALYZE EXPLAIN (ANALYZE, BUFFERS) showed the planner switching to the same join order and physical operations as the fast reference cluster with substantially lower processed-row counts. With statistics current, enable_mergejoin was returned to its default (enabled) to allow the planner full algorithm options; no regressions appeared in verification runs. This step transitions to the Solution below.
Solution:
Applied targeted PostgreSQL maintenance: refreshed table statistics (ANALYZE) for the involved schema and related relations on the affected clusters, and restored the enable_mergejoin setting to its default (enabled). Additionally, a small diagnostic script was provided for periodic capture of planner inputs (pg_stat_user_tables, relevant planner GUCs, and EXPLAIN outputs) to assist quicker troubleshooting if similar regressions occur.
Accurate PostgreSQL table statistics enable the planner to estimate row counts and choose an efficient join order and join algorithm. Refreshing statistics corrected row estimates so the planner selected the same low-cost physical plan used by the fast clusters; re-enabling merge join allowed the planner to use the best available join implementation.
Conclusion:
After refreshing statistics and restoring planner configuration, the problematic SELECT returned to expected performance parity with the reference clusters. Query plans stabilized, processed-row counts dropped to match the fast cluster profile, and the environment regained predictable query performance. Ongoing recommendation: schedule targeted ANALYZE for high-write relations and include the provided diagnostic captures in runbooks to reduce time-to-diagnosis for future planner-related regressions.