Problem:

A Patroni-managed PostgreSQL cluster (PostgreSQL 15.17, Patroni 3.3.2) running asynchronous replication intermittently failed to rejoin an ex-master after a physical host reboot performed as part of high‑availability testing.

Test pattern: hard shutdown of the former primary for ~5 minutes, then restart.

Symptom observed on restart: the node failed to rejoin with the error “requested starting point … on timeline … is not in this server’s history”.

Configuration context: Patroni had use_pg_rewind: true, wal_log_hints: true, remove_data_directory_on_rewind_failure: true, archive_mode: true with an archive_command that copied WAL files to a local archive directory (copy_wal.sh). wal_keep_size was 5GB while max_wal_size was 64GB. The environment’s backup system removes archived WAL files from the archive location after they are saved to the backup server.

Process:

Step 1: Observe failure mode and error signal

Observed the rejoin failure and captured the Patroni and PostgreSQL startup logs that contained the timeline/LSN mismatch message. This identified the immediate technical trigger: the restarted node’s WAL had diverged from the timeline of the newly promoted primary. That mattered because the message is produced when the restarted node asks to resume streaming from a position on its old timeline that lies beyond the point where the new primary’s history forked, that is, when the ex‑master holds WAL that never reached the node that was promoted.
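To make the mismatch concrete, the sketch below pulls the requested start position and the fork point out of a log excerpt and checks whether the request lies past the fork. The sample log text is hypothetical (timeline numbers and LSNs are invented for illustration); the DETAIL wording is what PostgreSQL normally emits alongside this error, but the patterns may need adjusting for a different log format.

```python
import re

# Error line sent by the upstream server; accepts a straight or curly apostrophe.
ERROR_RE = re.compile(
    r"requested starting point (?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+) "
    r"on timeline (?P<tli>\d+) is not in this server['’]s history"
)
# DETAIL line that names the fork point in the upstream server's history.
FORK_RE = re.compile(
    r"history forked from timeline (?P<tli>\d+) at (?P<lsn>[0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/5000060' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_mismatch(log_text: str):
    """Return requested/fork positions, or None if either line is missing."""
    err = ERROR_RE.search(log_text)
    fork = FORK_RE.search(log_text)
    if err is None or fork is None:
        return None
    requested = lsn_to_int(err["lsn"])
    forked_at = lsn_to_int(fork["lsn"])
    return {
        "requested_timeline": int(err["tli"]),
        "requested_lsn": requested,
        "fork_timeline": int(fork["tli"]),
        "fork_lsn": forked_at,
        "request_is_past_fork": requested > forked_at,
    }


# Hypothetical log excerpt, for illustration only.
SAMPLE_LOG = (
    "FATAL:  could not start WAL streaming: ERROR:  requested starting point "
    "0/6000000 on timeline 2 is not in this server's history\n"
    "DETAIL:  This server's history forked from timeline 2 at 0/5000060.\n"
)

result = parse_mismatch(SAMPLE_LOG)
```

A request that lies past the fork point on the same timeline is the signature of an ex‑master that kept writing WAL after the last position the promoted node had received.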

Step 2: Review cluster replication and recovery settings

Reviewed the Patroni YAML and PostgreSQL parameters: asynchronous replication, use_pg_rewind: true, wal_log_hints: true, remove_data_directory_on_rewind_failure: true, archive_mode enabled, an archive_command copying WAL to a local archive path, wal_keep_size=5GB, and max_wal_size=64GB. pg_rewind was enabled, but no restore_command was configured, so a rewinding node had no way to fetch WAL that was no longer in its own pg_wal directory. wal_keep_size was also small relative to the workload, which suggested that required WAL could be recycled during the outage window. These settings pointed to two failure classes: (A) true data divergence, because asynchronous replication acknowledged commits that had not reached any replica, and (B) rewind failure, because required historic WAL had been recycled or removed from the archive.
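As a rough way to reason about failure class (B), the sketch below estimates how long a given wal_keep_size covers at a steady WAL generation rate and compares that with the 5-minute outage from the test pattern. The WAL rates are hypothetical, since the workload's actual rate is not stated here; PostgreSQL interprets the GB unit as 1024^3 bytes.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def retention_seconds(wal_keep_size_bytes: int, wal_rate_bytes_per_s: float) -> float:
    """Seconds of WAL history that wal_keep_size holds at a constant write rate."""
    return wal_keep_size_bytes / wal_rate_bytes_per_s


def covers_outage(wal_keep_size_bytes: int, wal_rate_bytes_per_s: float,
                  outage_seconds: float) -> bool:
    """True if the retained WAL spans at least the outage duration."""
    return retention_seconds(wal_keep_size_bytes, wal_rate_bytes_per_s) >= outage_seconds


OUTAGE_SECONDS = 5 * 60  # hard shutdown of roughly 5 minutes, per the test pattern

# Hypothetical WAL generation rates in MiB/s; measure the real rate before relying on this.
for rate_mib in (5, 20, 50):
    for keep_gib in (5, 64):
        ok = covers_outage(keep_gib * GIB, rate_mib * MIB, OUTAGE_SECONDS)
        print(f"wal_keep_size={keep_gib}GB at {rate_mib} MiB/s covers the outage: {ok}")
```

This is only a lower-bound estimate: it ignores bursts and segments held back by replication slots, but it shows why a 5GB window can be shorter than the outage under a write-heavy load.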

Step 3: Confirm backup/archival behavior and its impact

Validated that the site backup system removes archived WAL files from the archive directory after ingesting them. This established a concrete operational constraint: a restore_command would only succeed if the backup system left the required WAL segments in the archive long enough for pg_rewind to fetch them. Because copy_wal.sh populated a local archive (not a shared store), the rewinding node could not always access the needed segments after recycle/cleanup.
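One way to check whether the archive still holds what a rewind would need is to map LSNs to WAL segment file names and look for those files in the archive directory. The sketch below assumes the default 16MB segment size; the archive path is hypothetical and stands in for whatever directory copy_wal.sh writes to.

```python
from pathlib import Path

DEFAULT_SEGMENT_SIZE = 16 * 1024 * 1024  # default wal_segment_size


def wal_file_name(timeline: int, lsn: str, seg_size: int = DEFAULT_SEGMENT_SIZE) -> str:
    """Name of the WAL segment file that contains the given LSN on the given timeline."""
    hi, lo = lsn.split("/")
    position = (int(hi, 16) << 32) | int(lo, 16)
    segno = position // seg_size
    segs_per_xlogid = 0x100000000 // seg_size
    return f"{timeline:08X}{segno // segs_per_xlogid:08X}{segno % segs_per_xlogid:08X}"


def missing_segments(archive_dir, names):
    """Return the segment names that are not present as files in archive_dir."""
    base = Path(archive_dir)
    return [name for name in names if not (base / name).is_file()]


# Hypothetical archive location and LSNs, for illustration only.
ARCHIVE_DIR = "/var/lib/pgsql/wal_archive"
needed = [wal_file_name(2, "0/5000060"), wal_file_name(2, "0/6000000")]
absent = missing_segments(ARCHIVE_DIR, needed)
```

If the backup system has already swept the directory, the list of absent segments is non-empty and any restore_command pointed at that directory would fail for those files.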

Step 4: Inspect replication slots and WAL retention state

Queried the new primary for slot and WAL status (for example, SELECT slot_name, active, restart_lsn, wal_status FROM pg_replication_slots). Where the required WAL still existed, the slots showed normal reservation behavior; where the ex‑master could not reconnect, the output showed that the WAL it needed had already been recycled or lost. This confirmed that WAL retention policy and archive availability were the proximate cause of the rewind failures.
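The wal_status column is the quickest signal in that query. The sketch below classifies rows shaped like the query's output; the rows are hypothetical, and fetching them from the database (with whichever driver is in use) is deliberately left out.

```python
# Meaning of pg_replication_slots.wal_status values (PostgreSQL 13 and later).
WAL_STATUS_MEANING = {
    "reserved": "required WAL is within max_wal_size",
    "extended": "max_wal_size exceeded, but WAL is still retained",
    "unreserved": "required WAL may be removed at the next checkpoint",
    "lost": "required WAL has been removed; the slot is no longer usable",
}


def slots_at_risk(rows):
    """Return names of slots whose required WAL is gone or about to go.

    rows: iterable of (slot_name, active, restart_lsn, wal_status) tuples,
    matching the column order of the query in the text.
    """
    return [
        slot_name
        for slot_name, _active, _restart_lsn, wal_status in rows
        if wal_status in ("unreserved", "lost")
    ]


# Hypothetical query output, for illustration only.
sample_rows = [
    ("node_a", True, "0/6000000", "reserved"),
    ("node_b", False, "0/5000060", "lost"),
    ("node_c", False, None, None),  # slot that has never reserved WAL
]

at_risk = slots_at_risk(sample_rows)
```

A slot reported as lost cannot be used to resume streaming, so the node behind it has to obtain the missing WAL from an archive or be re-initialised.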

Step 5: Analyze pg_rewind role and limitations

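Analyzed how pg_rewind operates in this environment: it uses timeline history to locate the point where the ex‑master and the current primary diverged, scans the ex‑master’s WAL from the last checkpoint before that point to find every block changed since, and copies those blocks from the current primary. The divergent local commits are discarded in the process. That means pg_rewind does not recover lost transactions; instead it requires the ex‑master’s WAL from the checkpoint preceding the divergence point onward, either in pg_wal or retrievable through a restore_command. This clarified expectations for test results: ex‑master rejoin via rewind is limited by WAL availability, not by a Patroni bug.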
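The divergence check can be illustrated with a timeline history file, which lists each parent timeline and the LSN at which the next timeline branched off. The sketch below is a simplified illustration, not pg_rewind's or Patroni's actual logic; the history content and LSNs are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/5000060' to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history(text: str):
    """Parse a timeline history file into (parent_timeline, switch_lsn) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 2)  # timeline, switch LSN, free-text reason
        entries.append((int(fields[0]), lsn_to_int(fields[1])))
    return entries


def needs_rewind(history, old_timeline: int, old_end_lsn: int) -> bool:
    """True if the old primary wrote WAL past the point where its timeline was left."""
    for timeline, switch_lsn in history:
        if timeline == old_timeline:
            return old_end_lsn > switch_lsn
    raise ValueError(f"timeline {old_timeline} is not in the history")


# Hypothetical content of the new primary's history file, for illustration only.
SAMPLE_HISTORY = (
    "1\t0/3000000\tno recovery target specified\n"
    "2\t0/5000060\tno recovery target specified\n"
)

history = parse_history(SAMPLE_HISTORY)
diverged = needs_rewind(history, 2, lsn_to_int("0/5000100"))
```

When the ex‑master's last WAL position is at or before the switch point, nothing diverged and it can follow the new timeline directly; anything past the switch point has to be discarded by a rewind.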

Step 6: Implement practical mitigations and verify behavior

Introduced a set of configuration changes to reduce fallbacks to full restores: added a restore_command that lets the rewinding node fetch archived WAL for pg_rewind (the reverse of copy_wal.sh), and increased wal_keep_size from 5GB to 64GB to reduce the chance of WAL being recycled during short outages. After applying these, repeated the hard reboot test and observed successful pg_rewind-based rejoin in cases where the required WAL segments were still present.

Solution:

PostgreSQL configuration changes applied: (1) configured postgresql.restore_command so that archived WAL can be fetched back for rewind (the script mirrors the archive-side copy operation), and (2) increased wal_keep_size from 5GB to 64GB to widen on-disk retention of recent WAL. No changes were made to replication mode (asynchronous replication remained). Architecturally, these changes work because pg_rewind needs the ex‑master’s WAL around the divergence point to work out which blocks changed after the fork; a restore_command lets the rewinding node retrieve missing WAL segments from the archive, and a larger wal_keep_size reduces the probability that required segments are recycled before a reconnect.
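The restore script itself is not reproduced here, so the following is only a minimal sketch of what a restore-side counterpart to copy_wal.sh could look like. The archive path and the script name restore_wal.py are hypothetical. The one firm requirement is that the command exits non-zero when the requested file is not in the archive, so that PostgreSQL can tell a missing segment from a successful restore.

```python
import shutil
import sys
from pathlib import Path

# Hypothetical archive location; use the directory the archive_command writes to.
ARCHIVE_DIR = Path("/var/lib/pgsql/wal_archive")


def restore(archive_dir, wal_name: str, dest_path) -> int:
    """Copy one archived WAL file to dest_path; return a process exit code."""
    src = Path(archive_dir) / wal_name
    if not src.is_file():
        return 1  # not in the archive: must be reported as a failure
    dest = Path(dest_path)
    tmp = dest.with_name(dest.name + ".tmp")
    shutil.copyfile(src, tmp)  # copy under a temporary name first
    tmp.replace(dest)          # then rename, so a partial file is never picked up
    return 0


# Invoked as: restore_wal.py %f %p  (file name to fetch, destination path)
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(restore(ARCHIVE_DIR, sys.argv[1], sys.argv[2]))
```

Such a script would be wired in with a setting along the lines of restore_command = 'python3 /path/to/restore_wal.py %f %p', where %f and %p are the placeholders PostgreSQL substitutes for the file name and destination path. It only helps while the backup system has not yet removed the segment from the archive directory.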

Conclusion:

Operational outcome: most hard-shutdown rejoin attempts that had previously fallen back to a full base backup now completed via pg_rewind, provided the required WAL remained available. System stability improved because long recovery windows and expensive full restores became less frequent. The risk of data divergence remains inherent to asynchronous replication; it was managed operationally by improving WAL availability and retention policies.