Problem:
The client encountered a significant issue with their PostgreSQL database (PGDB). They reported that filesystem (FS) utilization suddenly jumped from 74% to 94% without any new objects being created. Despite recreating the replication slot and restarting PGPool, the filesystem stayed at 94%. The logs showed a termination error indicating that another server process had exited abnormally and possibly corrupted shared memory.
Key details of the problem included:
- PGDB FS utilization unexpectedly reached 94%.
- Replication slot got stuck.
- PGPool went down.
- Large files associated with a specific relation (787479913) consumed over 400 GB.
- A canceled SQL query did not clean up properly, leaving large leftover files on the filesystem.
The client requested assistance on safely cleaning up the file (787479913) without impacting the production database.
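For context, this kind of consumption is usually confirmed from inside PostgreSQL before anything is touched on disk. The following is a minimal sketch, not the exact queries used in this case, that lists the largest relations and shows where the reported file node would live under the data directory:

  -- Ten largest relations in the current database, by total on-disk size
  SELECT c.oid::regclass AS relation,
         pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
  FROM   pg_class c
  WHERE  c.relkind IN ('r', 'i', 't', 'm')
  ORDER  BY pg_total_relation_size(c.oid) DESC
  LIMIT  10;

  -- Path of the reported file node, relative to the data directory
  -- (returns NULL if no relation in the default tablespace owns it)
  SELECT pg_relation_filepath(pg_filenode_relation(0, 787479913));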
Process:
Upon receiving the request, the expert team asked for specific logs and database details to support a thorough analysis. The following logs and data were needed for diagnostics (example queries for several of these items are sketched after the list):
- PostgreSQL Logs: to analyze the database’s behavior leading up to the issue.
- System Logs: to check for any system-level errors that could have impacted the database.
- Database Activity: to determine if any active queries or transactions were causing the FS bloat.
- Table and Relation Details: to identify the file associated with relation 787479913.
- Replication Slot Details: to check if replication was functioning as expected.
- Disk Usage Details: focusing on the PostgreSQL data directory.
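Several of these items can be pulled straight from the catalog when log files are incomplete. Below is a hedged sketch of the kind of queries involved, not necessarily the ones the expert team ran:

  -- Database activity: sessions, their state, and transaction age
  SELECT pid, usename, state, wait_event_type, wait_event,
         now() - xact_start AS xact_age, query
  FROM   pg_stat_activity
  WHERE  datname = current_database()
  ORDER  BY xact_start NULLS LAST;

  -- Replication slot details: an inactive slot that retains WAL is a common cause of disk growth
  SELECT slot_name, slot_type, active, restart_lsn
  FROM   pg_replication_slots;

  -- Disk usage as seen from inside the database
  SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;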
After receiving partial data from the client, the expert team analyzed the issue based on the information provided. The logs showed that autovacuum and autoanalyze operations were running against various tables. The vacuum process, however, appeared unable to keep up, most likely because of long-running transactions or a high volume of dead tuples.
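The "vacuum unable to keep up" hypothesis can be checked directly. A minimal sketch (the 30-minute threshold is illustrative):

  -- Long-running transactions that keep vacuum from reclaiming dead tuples
  SELECT pid, usename, now() - xact_start AS xact_age, state, query
  FROM   pg_stat_activity
  WHERE  xact_start IS NOT NULL
    AND  now() - xact_start > interval '30 minutes'
  ORDER  BY xact_start;

  -- Tables carrying the most dead tuples, with their last (auto)vacuum times
  SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM   pg_stat_user_tables
  ORDER  BY n_dead_tup DESC
  LIMIT  10;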
Several steps were then outlined for the client to safely clean up the orphaned file (787479913) without affecting production; they are detailed in the Solution section below.
Solution:
The expert provided the following steps to safely clean up the file system and ensure that deleting the 787479913 file would not impact the database:
- Identify the Relation: The file 787479913 was traced to a specific relation (table or index) using a query on pg_class (see the first sketch after this list).
- Check Integrity: The integrity of the associated table was checked by querying its record count.
- Backup the Database: A full backup of the database was recommended before making any changes.
- Check for Orphaned Files: If the file was found to be orphaned (not linked to any active table), it was safe to delete (a cross-check query is sketched after this list).
- File Removal: Once verified, the file was deleted, and the system was monitored for errors.
- Monitor Performance: Post-deletion, the database was monitored to ensure no unexpected issues arose.
- Rebuild Indexes (If Necessary): If the file was linked to an index, the index was rebuilt to maintain performance.
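A hedged SQL sketch of steps 1 and 2 above (the table name in the count query is a placeholder for whatever the lookup returns):

  -- Step 1: does any relation still own file node 787479913?
  SELECT oid::regclass AS relation, relkind, relfilenode
  FROM   pg_class
  WHERE  relfilenode = 787479913;

  -- Step 2: if a table was returned, confirm it is readable and count its rows
  -- (some_table is a placeholder for the relation found above)
  SELECT count(*) FROM some_table;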
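If step 1 returns no rows, the file is a candidate orphan. The sketch below is one way to cross-check from inside the database before removing anything at the OS level; it assumes superuser access (pg_ls_dir reads the data directory) and skips low OIDs, since mapped system catalogs would otherwise appear as false positives. The REINDEX at the end covers step 7 with a placeholder index name:

  -- File names in the current database's directory that no pg_class entry points to
  -- (heuristic only: segment suffixes such as .1 and fork suffixes such as _fsm are stripped,
  --  and OIDs below 16384 are skipped to ignore system catalogs)
  SELECT f AS file_name
  FROM   pg_database d,
         LATERAL pg_ls_dir('base/' || d.oid) AS f
  WHERE  d.datname = current_database()
    AND  f ~ '^[0-9]+'
    AND  split_part(split_part(f, '.', 1), '_', 1)::oid >= 16384
    AND  split_part(split_part(f, '.', 1), '_', 1)::oid NOT IN
         (SELECT relfilenode FROM pg_class WHERE relfilenode <> 0);

  -- Step 7: if the removed file had belonged to an index, rebuild it
  -- (some_index is a placeholder)
  REINDEX INDEX some_index;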
The client successfully removed the orphaned objects from the file system without any issues.
Conclusion:
The immediate issue was resolved by identifying and removing the orphaned files that were consuming space. The underlying cause, however, was a canceled SQL query that had generated large files, and this required further attention. The expert recommended tuning PostgreSQL's autovacuum parameters and switching from PGPool to PgBouncer to improve resource efficiency and avoid similar issues in the future. Regular maintenance, including VACUUM ANALYZE operations and monitoring of filesystem usage, was advised to prevent recurrence.
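The exact parameter values were not recorded in the engagement notes; the following is a hedged sketch of the kind of autovacuum tuning and routine maintenance the recommendation points at (values are illustrative and depend on the workload):

  -- Let autovacuum trigger earlier and do more work per cycle
  ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;  -- default 0.2
  ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;    -- default -1, i.e. inherit vacuum_cost_limit (200)
  ALTER SYSTEM SET autovacuum_max_workers = 5;             -- default 3; takes effect only after a restart
  SELECT pg_reload_conf();

  -- Routine maintenance on the most heavily updated tables
  -- (table name is a placeholder)
  VACUUM (ANALYZE, VERBOSE) some_heavily_updated_table;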