Problem:

The client had implemented Patroni-managed high-availability PostgreSQL clusters running PostgreSQL 15. The existing configuration consisted of two-node clusters: one primary node and one replica node, with asynchronous replication between them.

The client requested a more durable replication setup that would be resilient to operating system crashes, while maintaining high database performance, as the system is critical for online charging and billing operations.

The client was specifically interested in exploring quorum-based semi-synchronous replication, which has been available in PostgreSQL since version 10 and offers a balance between performance and data durability.

Process:

Step 1 – Initial Analysis

The expert confirmed that quorum-based semi-synchronous replication is fully supported in PostgreSQL 15 (the ANY quorum syntax has been available since PostgreSQL 10) and can be configured in Patroni-managed clusters. In this mode, the primary node waits for acknowledgment from at least one replica before confirming a transaction to the client, which significantly improves data durability.
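
At the PostgreSQL level, this quorum behaviour is expressed through the synchronous_standby_names parameter. The fragment below is a minimal illustration only: the node names are hypothetical, and when Patroni's synchronous mode is enabled, Patroni maintains this parameter on its own.

    # Illustration of the quorum syntax (hypothetical node names).
    # "ANY 1 (...)" means the primary waits for acknowledgment from any one
    # of the listed replicas before confirming the commit.
    postgresql:
      parameters:
        synchronous_standby_names: "ANY 1 (node2, node3)"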

The expert analyzed the client’s existing setup, which consisted of only two nodes — a primary and a single replica — and explained that while quorum-based replication is technically possible, its effectiveness is limited in two-node clusters. If the replica is unavailable, the primary node may block write operations unless specific fallback options are configured.

Step 2 – Evaluation of Replication Options

The expert provided a detailed explanation of the different synchronous_commit settings available in PostgreSQL and their impact on latency and data durability.

With synchronous_commit set to off (effectively asynchronous behaviour), a commit does not even wait for the local write-ahead log (WAL) flush. This offers the best performance with virtually zero added latency, but the most recent transactions can be lost if the primary crashes, and there is no guarantee that the data has reached the replica.

The local option waits only for the local WAL flush. It adds slightly more latency than off but provides no additional safety on the replica side.

The remote_write setting waits for the replica to receive the WAL and write it to the operating system, without waiting for it to be flushed to disk or applied. This adds approximately one to five milliseconds of latency and provides much higher durability against a crash of the primary, including an operating system crash, because the data has already reached the replica.

The on setting waits for the replica to flush the WAL to durable storage, which increases latency to approximately five to fifteen milliseconds and can significantly affect high-performance systems such as online charging.

The remote_apply option waits until the replica has also applied the WAL, so the transaction is visible to queries on the standby. It offers the strongest consistency but typically introduces the highest latency, in the range of five to twenty milliseconds.
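
For reference, the options discussed above can be summarised as they would appear under the postgresql parameters section of a Patroni configuration. Only one value is active at a time, and the latency figures are the rough estimates quoted in this assessment, not measurements.

    # Summary of the synchronous_commit options (approximate added latency as
    # quoted above; actual values depend on network and storage).
    postgresql:
      parameters:
        # synchronous_commit: off           # ~0 ms; recent commits can be lost on any crash
        # synchronous_commit: local         # waits for local WAL flush only
        # synchronous_commit: on            # ~5-15 ms; replica has flushed WAL to disk
        # synchronous_commit: remote_apply  # ~5-20 ms; replica has applied WAL
        synchronous_commit: remote_write    # ~1-5 ms; replica has received and written WAL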

Step 3 – Providing Solution and Recommendations

The expert recommended moving to a three-node cluster to fully leverage the benefits of quorum-based replication. In a three-node setup, the primary can wait for acknowledgment from any one replica, ensuring both durability and high availability, even if one replica becomes unavailable.

The proposed configuration for a three-node cluster includes the following (a configuration sketch follows this list):

  • Setting synchronous_commit to remote_write, which provides a good balance between performance and durability.
  • Using synchronous_standby_names set to ANY 1 (*), allowing Patroni to dynamically manage the list of eligible replicas.
  • Enabling synchronous mode in Patroni, with synchronous_mode_strict set to false, which ensures that write operations are not permanently blocked if a replica is temporarily unavailable.
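
A minimal sketch of how these settings could look in the Patroni dynamic (DCS) configuration is shown below. Everything beyond the recommended parameters is a placeholder, and the fragment would need to be merged into the cluster's actual configuration (for example with patronictl edit-config); note that Patroni maintains synchronous_standby_names itself once synchronous mode is enabled.

    # Sketch only: relevant Patroni dynamic configuration for the proposed
    # three-node cluster. Merge into the real cluster configuration.
    synchronous_mode: true
    synchronous_mode_strict: false   # allow fallback to async if no replica is available
    postgresql:
      parameters:
        synchronous_commit: remote_write
        # Quorum expression from the recommendation; Patroni manages this
        # parameter itself while synchronous_mode is enabled.
        synchronous_standby_names: "ANY 1 (*)"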

For the client’s current two-node setup, the expert confirmed that the same configuration is technically possible, but with some limitations. In particular, if the replica goes down, the primary node will block writes unless the synchronous_mode_strict parameter is disabled, which allows the system to temporarily fall back to asynchronous replication.

The expert emphasized that while this configuration improves data durability compared to pure asynchronous replication, it does not provide full fault tolerance in a two-node setup. The system would still be vulnerable if the replica becomes unavailable for an extended period.

Step 4 – Latency Impact Clarification

After the client specifically requested a latency estimate for each synchronous_commit setting, the expert provided a detailed written assessment.

Asynchronous replication (off) introduces virtually no additional latency but carries a high risk of data loss if the primary crashes.

The remote_write option introduces approximately one to five milliseconds of additional latency and is considered a good compromise between performance and durability.

Using on or remote_apply would introduce significantly more latency, typically ranging from five to twenty milliseconds, and could negatively impact performance-sensitive applications like online charging and billing systems.

The expert confirmed that the recommended remote_write setting provides sufficient protection against an operating system crash on the primary while maintaining a minimal impact on performance.

Solution:

The expert recommended the following configuration (a consolidated sketch follows the list):

  • Set synchronous_commit to remote_write.
  • Use synchronous_standby_names as ANY 1 (*).
  • Enable synchronous_mode in Patroni.
  • Set synchronous_mode_strict to false to prevent write blockage if the replica is temporarily unavailable.
  • Enable use_pg_rewind: true so that a failed former primary can be reattached as a replica quickly after failover.
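
The bullet points above are consolidated in the sketch below, written as the bootstrap/DCS section of a patroni.yml file. It is a sketch under the assumptions already noted: the placement of individual keys may differ between Patroni versions, and Patroni manages synchronous_standby_names itself once synchronous mode is active.

    # Consolidated sketch of the recommended settings (adapt and verify
    # against the Patroni version in use before applying).
    bootstrap:
      dcs:
        synchronous_mode: true
        synchronous_mode_strict: false       # do not block writes if the replica is down
        postgresql:
          use_pg_rewind: true                # faster reattachment of a failed primary
          parameters:
            synchronous_commit: remote_write
            synchronous_standby_names: "ANY 1 (*)"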

This setup improves durability in the event of operating system crashes while introducing only minimal latency, typically around one to five milliseconds.

The configuration also supports fallback to asynchronous replication when necessary, which is crucial for maintaining write availability if the replica becomes temporarily unreachable.

Conclusion:

The expert confirmed that quorum-based semi-synchronous replication can be successfully implemented in PostgreSQL 15 with Patroni clusters to improve data durability while maintaining acceptable performance levels.

The most reliable and resilient configuration would require scaling the cluster to three nodes. However, even within the existing two-node setup, significant durability improvements can be achieved by using the recommended remote_write configuration with fallback options.

The client was provided with clear, detailed information about the trade-offs of each replication mode and the estimated latency impacts, allowing them to make an informed decision that balances system performance with data protection requirements.

This collaborative process ensured that the client’s system remains both high-performing and resilient, aligned with the operational demands of their critical online charging and billing infrastructure.