Automating DR site rebuild for a large Patroni PostgreSQL cluster

by the Hossted team

22.06.2026

Problem:

A customer running Patroni-managed PostgreSQL (PostgreSQL 15.17, Patroni v3.3) had production databases ranging from 2-3 TB upward, including a ~52 TB physical replica. Rebuilding the empty DR site from production relied on manual pg_basebackup runs followed by manual post-backup recovery steps. At these volumes the procedure was too slow, labor-intensive, and prone to human error. The client requested an automated "standby recovery bootstrap" in which Patroni itself initiates recovery when it starts on an empty data directory. They also asked whether primary_slot_name is mandatory and whether third-party backup tools (Commvault or pgBackRest) could be used instead of pg_basebackup.

Process:

Step 1: Replication Slot Analysis

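We analyzed whether primary_slot_name is required for the DR site and concluded that a named physical slot is not mandatory. Omitting it is a deliberate and sound architectural choice: a permanent physical slot pins WAL files on the primary, so if the DR site is offline or falls far behind, the retained WAL can fill the production disk. The trade-off is that, without a slot, the standby must be able to obtain any WAL the primary has already recycled, for example from a WAL archive via restore_command.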
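The WAL-pinning risk described above can be observed directly on a primary that does have physical slots. The following is a minimal sketch, not the customer's tooling: the function names and the byte limit are illustrative assumptions, and only standard PostgreSQL objects (pg_replication_slots, pg_current_wal_lsn(), pg_wal_lsn_diff()) are queried.

```shell
#!/usr/bin/env bash
# Sketch: report physical replication slots that pin a lot of WAL on the primary.
# Function names and thresholds are illustrative, not part of Patroni or PostgreSQL.

# Print the query. It must run on the primary, because pg_current_wal_lsn()
# is not available while a server is in recovery.
slot_retention_sql() {
  cat <<'SQL'
SELECT slot_name,
       active,
       COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn), 0)::bigint
         AS retained_bytes
FROM pg_replication_slots
WHERE slot_type = 'physical'
ORDER BY retained_bytes DESC;
SQL
}

# Succeed (exit 0) when the retained bytes exceed the limit.
exceeds_limit() {
  local retained_bytes=$1 limit_bytes=$2
  [ "$retained_bytes" -gt "$limit_bytes" ]
}

# Run the query through psql and warn about every slot above the limit.
# Extra arguments (for example -h and -U) are passed to psql unchanged.
report_slots() {
  local limit_bytes=$1; shift
  local name active retained
  slot_retention_sql | psql -X -At -F ' ' "$@" |
    while read -r name active retained; do
      if exceeds_limit "$retained" "$limit_bytes"; then
        echo "WARNING: slot $name (active=$active) retains $retained bytes of WAL"
      fi
    done
}
```

A call such as report_slots $((100 * 1024 * 1024 * 1024)) -h primary.example -U postgres (host and user are placeholders) would warn about any physical slot holding back more than 100 GiB of WAL.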

Step 2: Selecting a Scalable Restore Method

pg_basebackup is not the only option, and it is poorly suited to a ~52 TB database: it copies the cluster as a single stream and cannot resume after an interruption. We therefore proposed integrating dedicated backup systems. pgBackRest was recommended as the preferred solution for the 2-3 TB databases, and Commvault was identified for the ~52 TB replica.

Step 3: Configuring Patroni Orchestration

Patroni supports custom replica-creation methods via the create_replica_methods parameter. This generic mechanism allows integration with any utility capable of producing a valid physical PostgreSQL data directory. We designed a process in which Patroni invokes a dedicated bash script (e.g., /etc/patroni/commvault_restore.sh) instead of its native tools.
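When Patroni calls a custom method, it passes its context to the command as --name=value arguments (scope, role, datadir and connstring, unless parameter passing is disabled for that method). A minimal sketch of how the script could read them follows; the function and variable names are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: read the --name=value arguments that Patroni passes to a custom
# replica-creation command. Function and variable names are illustrative.

parse_patroni_args() {
  SCOPE="" ROLE="" DATADIR="" CONNSTRING=""
  local arg
  for arg in "$@"; do
    case "$arg" in
      --scope=*)      SCOPE=${arg#*=} ;;       # cluster name
      --role=*)       ROLE=${arg#*=} ;;        # role of the node being created
      --datadir=*)    DATADIR=${arg#*=} ;;     # target data directory
      --connstring=*) CONNSTRING=${arg#*=} ;;  # connection string of the source node
      *) ;;                                    # ignore any method-specific options
    esac
  done
  # The data directory is the one argument the restore cannot work without.
  [ -n "$DATADIR" ]
}
```

The ${arg#*=} expansion strips only the text up to the first equals sign, so a connection string that itself contains key=value pairs is preserved intact.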

Step 4: Designing the Custom Restore Pipeline

We prepared a skeleton for the custom bash script. When called by Patroni, the script receives parameters such as --datadir and drives Commvault to restore the physical data directory directly into $PGDATA while PostgreSQL remains stopped, making sure that tablespaces are also restored and their paths mapped. We strongly recommended a Proof of Concept (PoC) to validate this workflow at the 52 TB scale.
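The restore flow described above could be outlined as follows. This is a sketch under stated assumptions rather than the script delivered to the customer: restore_backup is a hypothetical placeholder for the site-specific Commvault invocation, which is not shown here, and the surrounding checks are our own illustrative safeguards.

```shell
#!/usr/bin/env bash
# Sketch of a restore wrapper for a Patroni custom replica method.

# Hypothetical placeholder: replace the body with the site-specific Commvault
# (or other backup tool) command that restores into the directory given as $1.
restore_backup() {
  echo "restore_backup: no restore tool configured for $1" >&2
  return 1
}

restore_into_datadir() {
  local datadir=$1

  # PostgreSQL must stay stopped while files are restored underneath it.
  if [ -e "$datadir/postmaster.pid" ]; then
    echo "refusing to restore: postmaster.pid found in $datadir" >&2
    return 1
  fi

  # Only restore into an empty (or not yet existing) directory.
  if [ -d "$datadir" ] && [ -n "$(ls -A "$datadir")" ]; then
    echo "refusing to restore: $datadir is not empty" >&2
    return 1
  fi

  # PostgreSQL rejects data directories that are group- or world-accessible.
  mkdir -p "$datadir" || return 1
  chmod 700 "$datadir" || return 1

  restore_backup "$datadir" || return 1

  # Basic sanity check: every valid data directory contains PG_VERSION.
  if [ ! -f "$datadir/PG_VERSION" ]; then
    echo "restore finished but $datadir/PG_VERSION is missing" >&2
    return 1
  fi
  return 0
}
```

A non-zero return from any step should become the script's exit code, so that Patroni treats the method as failed and moves on to the next entry in create_replica_methods.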

Solution:

The DR Patroni configuration was updated at the design level. The primary_slot_name parameter was deliberately omitted. The create_replica_methods list within standby_cluster now puts the backup-tool integrations (commvault or pgbackrest) first and leaves basebackup only as a fallback.
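To make the shape of that configuration concrete, the sketch below prints the two YAML fragments involved. It is an illustration under assumptions, not the customer's actual file: the function names, the host and port arguments, and the choice of commvault as the first method are ours, while standby_cluster, create_replica_methods and the per-method command key are Patroni settings. In a typical setup the standby_cluster section lives in the dynamic (DCS) configuration and the method definition in the node's local Patroni YAML file.

```shell
#!/usr/bin/env bash
# Sketch: print the DR-side Patroni configuration fragments.
# Function names and arguments are illustrative.

# Dynamic configuration: where to stream from and how to build an empty node.
render_standby_cluster() {
  local primary_host=$1 primary_port=$2 method=$3
  cat <<EOF
standby_cluster:
  host: ${primary_host}
  port: ${primary_port}
  # the slot name setting is deliberately left out, so no slot is used
  create_replica_methods:
    - ${method}
    - basebackup
EOF
}

# Local configuration: definition of the custom method named above.
render_method() {
  local method=$1 script=$2
  cat <<EOF
postgresql:
  ${method}:
    command: ${script}
EOF
}
```

For example, render_standby_cluster 10.0.0.10 5432 commvault followed by render_method commvault /etc/patroni/commvault_restore.sh would print a fragment that tries the Commvault script first and falls back to basebackup (the IP address is a placeholder).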

The process is now fully automated: when Patroni starts on an empty node, it selects the custom method and calls the restore script. After a successful restore (exit code 0), Patroni automatically writes standby.signal and primary_conninfo and starts PostgreSQL. The database then replays WAL until it reaches a consistent state and begins streaming from the production primary without needing a slot.
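After such a bootstrap, the resulting state can be checked on the DR node. The following is a minimal verification sketch with illustrative function names; it relies only on the standby.signal file, pg_is_in_recovery(), the primary_slot_name setting and the pg_stat_wal_receiver view, all of which are standard in PostgreSQL 12 and later.

```shell
#!/usr/bin/env bash
# Sketch: verify that a freshly bootstrapped node is a slotless streaming standby.
# Function names are illustrative.

# Succeed when the data directory is marked as a standby.
has_standby_signal() {
  [ -f "$1/standby.signal" ]
}

# Print the query that reports recovery state, configured slot and receiver status.
standby_status_sql() {
  cat <<'SQL'
SELECT pg_is_in_recovery()                       AS in_recovery,
       current_setting('primary_slot_name')      AS slot_name,
       (SELECT status FROM pg_stat_wal_receiver) AS receiver_status;
SQL
}

# Check the marker file, then run the query; extra arguments go to psql.
verify_standby() {
  local datadir=$1; shift
  if ! has_standby_signal "$datadir"; then
    echo "standby.signal missing in $datadir" >&2
    return 1
  fi
  standby_status_sql | psql -X -At "$@"
}
```

On a healthy standby configured as described, one would expect in_recovery to be true, slot_name to be empty and receiver_status to read streaming once the node has caught up.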

Conclusion:

This architectural shift replaced a slow and risky manual rebuild process. With Patroni on a wiped DR node automatically triggering recovery through Commvault or pgBackRest, the organization substantially reduced its recovery time objective (RTO) for multi-terabyte databases. The updated configuration also removes the need for manual database startup steps and, by not using a replication slot, protects the production primary from disk-full events caused by retained WAL.