Problem:

A production Patroni-managed PostgreSQL cluster (PostgreSQL v15.17, Patroni 3.3.2) had a server process on the primary abort with SIGABRT during normal operation. Server logs reported that a server process was terminated by signal 6 (Aborted), that the failed process had been executing a COMMIT, and that the postmaster then began terminating the other server processes. Subsequent messages showed that PostgreSQL could not accept new connections, failing with the error “Too many open files in system.” The database was configured with max_connections = 3000 and max_files_per_process = 4096. The host-level file-descriptor ceiling (fs.file-max) was 1,500,000 at the time of the incident; the operator raised it to 5,000,000 as an immediate post-incident action.

Process:

Step 1: Incident intake and initial evidence

Received the error context and configuration files from the operator, including the PostgreSQL and Patroni configuration and extracted server logs. Key observations: PostgreSQL could not accept new connections because system file descriptors were exhausted, and a backend aborted during a COMMIT. The message “Too many open files in system” corresponds to ENFILE, the host-wide limit, rather than the per-process EMFILE (“Too many open files”). This pointed to OS-enforced resource exhaustion rather than a pure PostgreSQL software bug and focused the investigation on file-descriptor usage and connection behavior.

Step 2: Inspect process and kernel limits

Checked the supplied sysctl and service limits files to compare host-wide and per-process ceilings. fs.file-max was recorded at 1,500,000 at incident time (the operator had already increased it to 5,000,000). Compared the service unit and limits.d settings against the effective /proc/<pid>/limits values to confirm that PostgreSQL processes would inherit the higher nofile limits. The kernel ceiling had been tight relative to the database configuration, making exhaustion plausible under load.
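
The host-wide check can be sketched as follows. This is a minimal illustration, not the tooling used during the incident: it assumes a Linux host where /proc/sys/fs/file-nr holds three whitespace-separated integers (allocated handles, allocated-but-unused handles, and the fs.file-max ceiling). The function names are hypothetical.

```python
from pathlib import Path


def parse_file_nr(text: str) -> tuple[int, int, int]:
    """Parse /proc/sys/fs/file-nr into (allocated, unused, maximum)."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def fd_utilization(text: str) -> float:
    """Fraction of the host-wide ceiling currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


if __name__ == "__main__":
    file_nr = Path("/proc/sys/fs/file-nr")
    if file_nr.exists():  # only present on Linux
        raw = file_nr.read_text()
        allocated, unused, maximum = parse_file_nr(raw)
        print(f"allocated={allocated} max={maximum} "
              f"utilization={fd_utilization(raw):.1%}")
```

Sampling this value periodically and alerting well below 100% gives warning before the kernel starts refusing new handles.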

Step 3: Attribute file-descriptor ownership

Analyzed open-handle distribution using process-level fd counts (via /proc/*/fd and lsof summaries provided by the operator). Findings: PostgreSQL processes accounted for the majority (~93%) of open descriptors; system services consumed a small fraction. Multiple client backends and background workers were each maintaining hundreds of file descriptors (open table files, WAL segments, temp files, sockets), explaining why total consumption approached the kernel ceiling. Pinpointing PostgreSQL as the primary consumer directed remediation toward database connection behavior and descriptor headroom.
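
The attribution step can be approximated with a short script over /proc. This is a sketch under assumptions: it groups by the command name in /proc/<pid>/comm, needs enough privilege to list other users' fd directories, and silently skips processes that exit mid-scan. The function names are hypothetical. Per-process descriptor counts are also not identical to the kernel's allocated-handle count, since descriptors inherited across fork or duplicated can share one handle, so the result is best read as a relative share.

```python
import os
from collections import Counter


def fd_counts_by_command(proc_root: str = "/proc") -> Counter:
    """Sum open descriptors per command name across all visible processes."""
    counts: Counter = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "comm")) as fh:
                command = fh.read().strip()
            n_fds = len(os.listdir(os.path.join(pid_dir, "fd")))
        except OSError:
            continue  # process exited, or permission denied
        counts[command] += n_fds
    return counts


def share(counts: Counter, command: str) -> float:
    """Fraction of all counted descriptors held by one command."""
    total = sum(counts.values())
    return counts[command] / total if total else 0.0


if __name__ == "__main__":
    counts = fd_counts_by_command()
    for command, n_fds in counts.most_common(5):
        print(f"{command:<20} {n_fds:>10} {share(counts, command):.1%}")
```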

Step 4: Correlate workload patterns and lock behavior

Reviewed database session and lock traces from the time window around the crash. Observed long waits for row-level locks on a frequently updated control table (one session waited several minutes), which kept connections and their file handles alive longer than expected. This explained part of the sustained rise in FD usage and why the server did not recover by closing idle handles quickly.

Step 5: Quantify sizing and safety margins

Used observed per-backend FD averages to model required fs.file-max under current configuration: with 3,000 client connections and background processes, projected peak consumption exceeded the prior 1.5M and remained close to 5M. Calculations showed a safer operational ceiling around 8,000,000 without architectural changes, and a much lower ceiling (≈3,000,000) once connection pooling reduced real database connections. This sizing exercise balanced headroom against fault containment to avoid masking leaks.
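
The sizing arithmetic can be sketched as below. Only max_connections = 3000 and max_files_per_process = 4096 come from the incident configuration. The per-backend average (600) and the 50% headroom are placeholder values chosen for illustration, not the measured figures behind the 8,000,000 and 3,000,000 recommendations, and the function names are hypothetical.

```python
def worst_case_fds(max_connections: int, max_files_per_process: int) -> int:
    """Rough upper bound if every client backend hit its per-process file cap."""
    return max_connections * max_files_per_process


def projected_fds(connections: int, fds_per_backend: int,
                  other_fds: int = 0) -> int:
    """Projected consumption from an observed per-backend average."""
    return connections * fds_per_backend + other_fds


def ceiling_with_headroom(projected: int, headroom_pct: int) -> int:
    """Add a percentage safety margin on top of the projection."""
    return projected * (100 + headroom_pct) // 100


if __name__ == "__main__":
    # Configured values from the incident.
    print(worst_case_fds(3000, 4096))            # 12288000
    # Hypothetical per-backend average of 600 descriptors.
    estimate = projected_fds(3000, 600)
    print(estimate)                              # 1800000
    print(ceiling_with_headroom(estimate, 50))   # 2700000
```

The worst-case figure shows why the ceiling cannot simply be sized from the configuration: at 3,000 connections the configured per-process cap alone allows more than twelve million descriptors, so the practical target has to come from measured averages plus headroom.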

Step 6: Recommend and apply immediate mitigations

Advised raising the kernel ceiling to an intermediate operational value and making the per-process limits persistent so that PostgreSQL and Patroni inherit them. The operator applied an increased fs.file-max and added a limits.d entry for the postgres user. Recommended a controlled restart sequence (replica first, then primary) so that the new limits propagate safely. These actions cleared the immediate exhaustion condition and reduced the risk of recurrence while longer-term mitigation was planned.
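
Whether a restarted process actually inherited the new limit can be checked from /proc/<pid>/limits, regardless of whether the value came from a limits.d entry or a service unit. The following is a minimal sketch with hypothetical function names; it assumes the standard Linux layout in which the "Max open files" row carries the soft and hard values as its fourth and fifth whitespace-separated fields.

```python
from typing import Optional, Tuple


def _to_limit(value: str) -> Optional[int]:
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def parse_nofile(limits_text: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (soft, hard) 'Max open files' from /proc/<pid>/limits content."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return _to_limit(soft), _to_limit(hard)
    raise ValueError("no 'Max open files' row found")


def meets_minimum(limits_text: str, required: int) -> bool:
    """True if the soft nofile limit is at least `required`."""
    soft, _hard = parse_nofile(limits_text)
    return soft is None or soft >= required


if __name__ == "__main__":
    import sys
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    with open(f"/proc/{pid}/limits") as fh:
        print(parse_nofile(fh.read()))
```

Running it against the postmaster PID after each restart in the replica-then-primary sequence confirms the limit before moving to the next node.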

Solution:

Implemented changes: PostgreSQL hosts had fs.file-max increased to an operational ceiling, persistent nofile limits were added for the postgres user, and a controlled Patroni restart was performed (replica then primary) to ensure processes inherited the new limits. Architecturally, raising the kernel file-descriptor ceiling and ensuring service-level limits prevents OS-level rejection of new sockets and file handles; this gives PostgreSQL the necessary headroom to maintain active connections and background workers without immediate failure.

Conclusion:

Post-implementation monitoring showed the cluster remained stable with no further “Too many open files” rejections. The changes removed the immediate failure mode, reduced production incident risk, and created capacity for the planned architecture work (connection pooling and WAL sizing adjustments), which is expected to materially lower descriptor usage and improve operational containment.