Problem:
The client reported high memory consumption on both the leader (Node 1) and the replica (Node 2) of a PostgreSQL 13 cluster. On Node 1, the elevated memory usage was associated with PostgreSQL processes such as the checkpointer and background writer, while Node 2 was in recovery, applying replicated WAL. The client sought assistance in diagnosing and resolving the issue to ensure system stability.
Process:
Step 1 – Initial Analysis:
The expert team began by requesting system logs, PostgreSQL version information, memory usage statistics, and relevant database monitoring metrics. The client provided detailed logs, configuration files, memory usage reports, and outputs from various PostgreSQL commands, including pg_stat_activity and pg_stat_all_tables.
Step 2 – Memory Consumption Diagnosis:
From the provided logs, the expert team identified that memory consumption was unusually high due to several PostgreSQL processes running on both nodes. Notably, processes related to database replication and query execution were consuming significant amounts of memory. On Node 1, queries against pg_stat_database and pg_stat_all_tables were flagged as potential contributors to the high memory usage, likely because of the large database schema and the volume of statistics being aggregated. On Node 2, the recovery (WAL replay) process was also identified as a contributor.
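For context, the kind of statistics query that becomes expensive on a schema of this size looks like the following (illustrative only; the client's actual monitoring queries are not reproduced here). It reads the statistics row for every user table, so on schemas with many thousands of relations it aggregates and sorts a large amount of data:
SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_all_tables ORDER BY n_dead_tup DESC;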
Step 3 – Query Analysis and Optimization:
To further investigate, the experts suggested running several diagnostic queries to monitor active processes, check for long-running queries, and identify any potential table or index bloat:
- To monitor and analyze running processes on the leader Node:
SELECT pid, usename, application_name, client_addr, state, query FROM pg_stat_activity WHERE state <> 'idle';
- On the Replica Node:
SELECT * FROM pg_stat_wal_receiver;
SELECT * FROM pg_stat_replication;
- To identify and optimize long-running queries that might be consuming a lot of memory:
SELECT pid, age(clock_timestamp(), query_start) AS age, usename, query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY age DESC;
- To identify and address bloat, install the pgstattuple extension (a usage sketch follows this list):
CREATE EXTENSION pgstattuple;
- To get a rough, size-based indication of where bloat may be accumulating (note that this figure also includes index and TOAST size; pgstattuple gives precise dead-tuple numbers):
SELECT schemaname, relname, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS bloat_estimate FROM pg_stat_user_tables ORDER BY pg_total_relation_size(relid) - pg_relation_size(relid) DESC;
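Once pgstattuple is installed, exact dead-tuple figures can be pulled for a suspect table. The schema and table name below are illustrative, and pgstattuple performs a full scan of the relation, so it is best run outside peak hours:
SELECT * FROM pgstattuple('public.ac1_control_p1');
The dead_tuple_percent and free_percent columns in the output indicate how much of the table consists of dead tuples and free space, i.e. bloat that VACUUM (or VACUUM FULL) can address.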
The client was asked to monitor and optimize long-running queries and to ensure that transactions were closed promptly, since PostgreSQL holds on to locks, snapshots, and per-backend memory for as long as a transaction remains open.
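One way to spot such sessions is to list connections that are sitting idle inside an open transaction; long-lived entries here typically point to application code that forgets to commit or roll back:
SELECT pid, usename, age(clock_timestamp(), xact_start) AS xact_age, query FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_age DESC;
Setting idle_in_transaction_session_timeout is a common safeguard against sessions that remain in this state indefinitely.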
Step 4 – Error Identification and Fixes:
During the analysis, the expert team found frequent occurrences of two specific errors in the logs:
- ERROR: could not obtain lock on row in relation 'ac1_control_p1': This indicated a locking conflict, with concurrent queries attempting to access the same row. Such conflicts can indirectly increase memory usage, because waiting transactions are held open in memory until they acquire the necessary locks.
- ERROR: duplicate key value violates unique constraint 'service_monitoring_pk': This pointed to an application-level issue with duplicate data insertion, which can add memory pressure through frequent rollbacks and retries.
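For the locking error in particular, the following query (using pg_blocking_pids(), available since PostgreSQL 9.6 and therefore on this version 13 cluster) shows which sessions are waiting on locks and which sessions are blocking them:
SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, state, query FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;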
The experts recommended analyzing and resolving these errors by optimizing the query logic and adjusting the database’s locking behavior:
- If the ac1_control_p1 table is very large, partition it: partitioning splits the table into smaller physical chunks so that queries and maintenance touch only the relevant portion of the data (an illustrative sketch follows this list).
- Optimize the queries that frequently access or modify the ac1_control_p1 table to reduce their execution time.
- Implement proper concurrency control mechanisms: in applications with high concurrency, an appropriate locking strategy prevents deadlocks and improves performance.
- Adjust the PostgreSQL memory parameters shared_buffers and work_mem to match the workload, and apply values appropriate for the available RAM (see the configuration review in Step 5).
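As a sketch of the partitioning suggestion, the DDL below assumes a hypothetical layout with a timestamp column named created_at and a hypothetical parent table name; the real column list and partition key would have to come from the client's schema:
CREATE TABLE ac1_control (
    id bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload text
) PARTITION BY RANGE (created_at);
CREATE TABLE ac1_control_p1 PARTITION OF ac1_control
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
With range partitioning in place, queries that filter on the partition key touch only the relevant partitions, and maintenance operations such as VACUUM run per partition rather than over one very large heap.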
Step 5 – Configuration Review and Resource Tuning:
The PostgreSQL configuration files were reviewed, and the expert team suggested several optimizations:
- Reducing the maximum number of connections (max_connections) from 5000 to a more reasonable value (e.g., 3000), unless that level of concurrency is genuinely required.
- Adjusting memory-related parameters such as shared_buffers, work_mem, and maintenance_work_mem to ensure optimal memory utilization (an illustrative example follows this list).
- Regularly running VACUUM and REINDEX to reclaim space and reduce table and index bloat.
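As an illustration of applying these settings (the values are placeholders rather than recommendations; appropriate numbers depend on available RAM and the workload):
ALTER SYSTEM SET max_connections = 3000;
ALTER SYSTEM SET shared_buffers = '8GB';
ALTER SYSTEM SET work_mem = '16MB';
ALTER SYSTEM SET maintenance_work_mem = '512MB';
SELECT pg_reload_conf();
ALTER SYSTEM writes these values to postgresql.auto.conf; work_mem and maintenance_work_mem take effect on reload, while max_connections and shared_buffers require a server restart. Keep in mind that work_mem is allocated per sort or hash operation, so a high setting combined with thousands of connections multiplies quickly.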
The client was also advised to implement proper concurrency control mechanisms and to partition large tables, such as ac1_control_p1, to improve performance and reduce memory consumption.
Solution:
By addressing the identified errors, optimizing query behavior, and adjusting PostgreSQL configuration settings, the expert team helped the client manage memory consumption more effectively. Key actions included:
- Closing transactions properly to reduce memory retention by PostgreSQL.
- Monitoring and optimizing long-running queries.
- Resolving locking issues by reviewing and adjusting the locking strategy for specific queries.
- Reducing the maximum number of connections to alleviate memory pressure.
Conclusion:
The expert team’s thorough investigation and targeted optimizations successfully resolved the memory consumption issues in the client’s PostgreSQL cluster. By addressing query behavior, optimizing resource usage, and implementing better concurrency control, the system’s overall performance was stabilized. The client can now operate the PostgreSQL cluster efficiently with reduced memory usage and improved reliability.