Problem:
The client encountered a critical issue where LW Locking on the PostgreSQL DB level, specifically the MultiXactOffsetControlLock
, led to a complete outage of the production system. This resulted in 255 sessions being locked from 18:30 GMT to 22:30 GMT, causing a significant impact on the system’s availability.
Solution:
-
PostgreSQL Version and Configuration:
The PostgreSQL version in use is 12.5. Relevant configurations, especially those related to locks and concurrency, need to be provided.
PostgreSQL 12.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
-
Logs:
Extracted PostgreSQL error logs for the hang period (18:30 GMT – 22:30 GMT) have been provided in an attached zip file.
-
PostgreSQL Session Information:
A file named
DB_locks
contains information about the 255 locked sessions. -
System Resource Metrics:
CPU usage during the hang period was 5%, with 95% idle.
Memory consumption metrics: total – 188, used – 82, free – 2, shared – 1, buff/cache – 103, available – 103.
-
PostgreSQL Locking Information:
No locks were found on the database during the investigation.
Upon review, it was observed that long-running updates were a contributing factor to the system outage. To address this and enhance system efficiency, a comprehensive database optimization effort was proposed. This included Performance Analysis, Query Tuning, Index Optimization, and Database Configuration.
The client raised concerns about the impact on SELECT queries and the unexpected blocking of transactions due to MultiXactOffsetControlLock
. The lock occurs when multiple processes attempt to work with the same row simultaneously, resulting in slow query performance. Enabling lock logging (log_lock_waits = on
) in PostgreSQL configuration to analyze the situation more thoroughly.
Monitoring through Grafana was suggested to visualize system performance and identify patterns during critical times. Provided information on NUM_MXACTOFFSET_BUFFERS
and NUM_MXACTMEMBER_BUFFERS
settings, recommending adjustments to address the issue. Customization and build instructions for PostgreSQL source code were provided to implement the recommended changes.
Conclusion:
The recommended changes will require to make the changes in the PostgreSQL source code and then build it. The changes are in the file /src/include/access/multixact.h line 32:
Number of SLRU buffers to use for multixact:
- NUM_MXACTOFFSET_BUFFERS: Change from 8 to 32
- NUM_MXACTMEMBER_BUFFERS: Change from 16 to 64