Problem:
After a manual failover in PostgreSQL, the client encountered the following error when querying the partitioned table 'ac1_control':
ERROR: could not access status of transaction 613182547
DETAIL: Could not open file "pg_xact/0248": No such file or directory.
Despite a restart of the PostgreSQL instance, the issue persisted. The client was running a multi-node PostgreSQL setup; after the failover, node1 (the previous leader) became the replica and node2 (the previous replica) became the leader. The problem appeared only after the failover and caused the client significant downtime.
Process:
Step 1: Gathering Initial Data:
The client provided PostgreSQL log files, 'df -h' output from both nodes, and system information:
- OS version: RHEL 8.4
- PostgreSQL version: 13
The expert reviewed the logs and found that the error pointed to a missing transaction status (commit log) file in the 'pg_xact' directory, indicating potential file system corruption, disk problems, or an inconsistent transaction state left behind by the failover.
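The segment name in the error message follows directly from the transaction ID. Each pg_xact (commit log) segment stores two status bits per transaction: an 8 kB page holds 32,768 statuses, a segment spans 32 pages (1,048,576 transactions), and the file name is the segment number in four-digit hex. A quick sketch confirms the failed XID maps to the missing file:

```shell
# Map a transaction ID to the pg_xact segment file that stores its status.
# 2 bits per transaction -> 32768 statuses per 8 kB page, 32 pages per segment.
XID=613182547
XACTS_PER_SEGMENT=$((32 * 32768))                 # 1,048,576
SEG=$(printf '%04X' $(( XID / XACTS_PER_SEGMENT )))
echo "transaction $XID -> pg_xact/$SEG"           # -> pg_xact/0248
```

This is why the error names segment 0248: the status bits for transaction 613182547 live in that file, and PostgreSQL cannot check the transaction's outcome without it.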
Step 2: Diagnostics and Analysis:
The expert requested additional information to diagnose the problem:
- Database Information: The client was asked to connect to the database via 'psql' and run the following commands to retrieve the data directory path, the table definition, and the table size:
  - show data_directory;
  - \d+ ac1_control;
  - select pg_size_pretty(pg_total_relation_size('ac1_control'));
- File System Check: The expert advised stopping PostgreSQL, unmounting the partition holding the data directory, and running 'fsck -y' on it to detect and repair any file system corruption. Remounting the partition and restarting PostgreSQL afterwards would confirm the file system was back in a stable state.
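The check itself can be sketched as follows. The device, mount point, and service name are assumptions and must be adjusted to the actual layout; with DRY_RUN set, the script only prints the commands for review:

```shell
# Offline file-system check sketch. /dev/sdb1, /var/lib/pgsql and the
# postgresql-13 service name are assumptions -- adjust to your layout.
DRY_RUN=1                               # unset to actually execute (as root)
run() { echo "+ $*"; if [ -z "$DRY_RUN" ]; then "$@"; fi; }

run systemctl stop postgresql-13        # stop PostgreSQL first
run umount /var/lib/pgsql               # fsck needs the partition unmounted
run fsck -y /dev/sdb1                   # -y: answer yes to all repair prompts
run mount /dev/sdb1 /var/lib/pgsql
run systemctl start postgresql-13
```

Running fsck on a mounted partition risks further damage, which is why the unmount step sits between stopping the service and the check.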
- Vacuuming the Database: If the file system check did not resolve the issue, the expert recommended running 'VACUUM (FULL, VERBOSE)' on the problematic table to reclaim dead rows, rewrite the table, and surface any remaining page-level damage. (Note that VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock, blocking all access to it while it runs.)
Step 3: Client’s Actions:
The client followed the expert’s guidance:
- They connected to PostgreSQL and executed the recommended commands.
- Logs and additional data were provided, confirming the presence of a partitioned table structure.
- The client stopped PostgreSQL, ran 'fsck -y' on the partition holding the data directory, and remounted it as advised.
Step 4: Expert’s Follow-Up:
Upon reviewing the latest logs, the expert concluded that the problem was most likely corruption at the file system or storage layer, and recommended a vacuum pass over the entire database:
- Vacuum: Running 'VACUUM (VERBOSE)' against the whole database to verify that no other table held corrupted data preventing normal operation.
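A database-wide pass is easiest with the bundled vacuumdb wrapper, which issues VACUUM over every table. The commands below are printed rather than executed so they can be reviewed first; 'appdb' is a hypothetical database name:

```shell
# vacuumdb ships with PostgreSQL and wraps the VACUUM SQL command.
DB=appdb                              # hypothetical database name
echo "vacuumdb --verbose $DB"         # one database
echo "vacuumdb --verbose --all"       # or every database in the cluster
```

Because VACUUM reads every heap page of every table it visits, it will error out on a corrupted page, making it a crude but effective way to sweep for further damage.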
Step 5: Final Recommendation:
As the error persisted after the file system check and the vacuum runs, the expert recommended restoring the database from the most recent backup. When transaction status files or the underlying disk are corrupted, restoring from backup is the most reliable way to return the cluster to a consistent state and let PostgreSQL operate normally.
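Because this was a leader/replica pair, an alternative to restoring from an archived backup is to re-seed the damaged node from the healthy leader with pg_basebackup. The host name, replication role, and data directory below are assumptions; the command is printed for review and would be run against an empty data directory:

```shell
# Re-seed a damaged node from the current leader. node2, 'replicator' and
# the data directory path are assumptions -- adjust to your setup.
echo pg_basebackup -h node2 -U replicator -D /var/lib/pgsql/13/data -X stream -P
```

Here -X stream includes the WAL generated during the copy and -P reports progress; the result is a fresh, consistent copy of the leader's data directory.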
Solution:
After the file system check and vacuum failed to clear the error, the client restored the database from a backup. This restored data integrity and allowed PostgreSQL to resume normal operation. The client was also advised to monitor the health of their storage system and replace any failing disks to prevent a recurrence.
Conclusion:
This case illustrates the importance of regular database and file system maintenance, particularly in failover scenarios where data corruption can occur. Through a structured diagnostic process and expert guidance, the client was able to restore their PostgreSQL service with minimal data loss. Ensuring regular backups and monitoring disk health are key practices for avoiding similar issues in the future.