Problem:
The client experienced repeated failures in two Apache Airflow DAGs responsible for launching Kubernetes Jobs.
Each DAG followed a delete-and-recreate pattern using a fixed Kubernetes Job name.
Although the Airflow task responsible for deleting the Job reported success, the subsequent Job creation
consistently failed with a Kubernetes conflict error indicating that the Job already existed and was still
being deleted. As a result:
- No new Job or Pod was created
- No execution logs were available
- Pipelines stalled and required manual intervention
The expected behavior was that the Job would be fully removed before a new Job with the same name was created.
Process:
Step 1: Error Review
Initial analysis showed that Kubernetes was rejecting Job creation requests with an HTTP 409 (Conflict) error.
The error message explicitly stated that the Job object was still being deleted, even though the delete task
in Airflow had completed successfully.
This indicated a discrepancy between Airflow task completion and the actual Kubernetes object lifecycle.
Step 2: Kubernetes State Investigation
Inspection of the Kubernetes Job metadata revealed that the Job had metadata.deletionTimestamp set and
still listed the orphan finalizer in metadata.finalizers. The delete request had been accepted, but
Kubernetes could not remove the Job object while that finalizer remained.
Deletion in Kubernetes is asynchronous: the API server first marks the object with a deletionTimestamp
and removes it only after every finalizer has been cleared. Because the orphan finalizer was never
cleared, the Job remained in a terminating state indefinitely. While in this state, Kubernetes does not
allow a new Job with the same name to be created.
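The check described in this step can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the investigation: it assumes the Job has already been fetched as a plain dictionary (for example, the JSON form of the object), and the function name and the sample Job name "nightly-export" are hypothetical.

```python
from typing import Any, Dict, List


def pending_deletion_finalizers(job: Dict[str, Any]) -> List[str]:
    """Return the finalizers holding back a Job that is marked for deletion.

    An empty list means the Job is not being deleted, or nothing blocks its removal.
    """
    metadata = job.get("metadata", {})
    if not metadata.get("deletionTimestamp"):
        return []
    return list(metadata.get("finalizers") or [])


# Hypothetical sample data shaped like the metadata observed in this case.
stuck_job = {
    "metadata": {
        "name": "nightly-export",
        "deletionTimestamp": "2024-01-01T00:00:00Z",
        "finalizers": ["orphan"],
    }
}
print(pending_deletion_finalizers(stuck_job))
```

A non-empty result on a Job whose delete call already returned is the signature of the stuck state described above.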
Step 3: Root Cause Identification
The root cause was identified as the orphan finalizer being applied to the Job on every deletion and
never being cleared. As a result, Jobs remained permanently stuck in a deleting state, even when
foreground deletion was configured. This had three consequences:
- The Job name remained reserved
- New Jobs could not be created
- Airflow proceeded without verifying that deletion had fully completed
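The missing verification in the last point could look roughly like the polling helper below. It is a sketch under stated assumptions, not the client's code: `wait_until_gone` and its parameters are hypothetical, and `job_exists` stands in for whatever lookup the DAG would use (for instance, a wrapper around the Kubernetes Python client's `BatchV1Api.read_namespaced_job` that treats a 404 as "gone"). In this incident the Job never disappeared, so such a check would not have fixed the problem, but it would have failed the delete task itself instead of letting the create task hit the conflict.

```python
import time
from typing import Callable


def wait_until_gone(
    job_exists: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until job_exists() returns False; return False if the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if not job_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake lookup: the Job is visible for two polls, then gone.
answers = iter([True, True, False])
gone = wait_until_gone(lambda: next(answers), sleep=lambda _seconds: None)
print(gone)
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting; a delete task would raise an error when the function returns False.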
Solution:
As an immediate remediation, the stuck Jobs were manually cleaned up by removing the finalizer and deleting
the Job, allowing pipelines to resume.
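The exact commands used for the manual cleanup are not recorded here. One common way to clear a finalizer is a JSON merge patch against the Job's metadata, and the sketch below only builds that patch body. The function name and the sample Job are hypothetical; sending the patch (with kubectl or a client library) is left out.

```python
import json
from typing import Any, Dict


def finalizer_removal_patch(job: Dict[str, Any], finalizer: str = "orphan") -> Dict[str, Any]:
    """Build a JSON merge patch that drops one finalizer from a Job."""
    current = job.get("metadata", {}).get("finalizers") or []
    remaining = [f for f in current if f != finalizer]
    # In a JSON merge patch, a list replaces the old list and null deletes the field.
    return {"metadata": {"finalizers": remaining or None}}


# Hypothetical sample data: a Job held only by the orphan finalizer.
stuck_job = {"metadata": {"name": "nightly-export", "finalizers": ["orphan"]}}
print(json.dumps(finalizer_removal_patch(stuck_job)))
```

Any other finalizers on the object are preserved, so the patch removes only the one that is blocking deletion.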
For a permanent resolution, the delete-and-recreate dependency was eliminated entirely by:
- Using unique Kubernetes Job names per execution
- Allowing Kubernetes to handle cleanup automatically
- Configuring a TTL (Time-To-Live) for completed Jobs
This approach involved appending a timestamp to each Job name and setting ttlSecondsAfterFinished in the
Job spec so that Kubernetes deletes each Job automatically once it has finished. It removed the need for
explicit deletion logic and prevented future name collisions.
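As an illustration of this pattern, the sketch below builds a Job manifest with a timestamp suffix and a TTL. It is an assumption-laden example rather than the client's actual manifest: the function name, the base name, the image, the one-hour TTL, and the container settings are all hypothetical. Only `ttlSecondsAfterFinished` and the timestamped name come from the resolution described above.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_job_manifest(
    base_name: str,
    image: str,
    ttl_seconds: int = 3600,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Return a batch/v1 Job manifest with a unique name and automatic cleanup."""
    now = now or datetime.now(timezone.utc)
    suffix = now.strftime("%Y%m%d%H%M%S")
    # Keep the full name within 63 characters so it stays DNS-label safe.
    name = f"{base_name[: 63 - len(suffix) - 1]}-{suffix}"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Kubernetes deletes the Job this many seconds after it finishes.
            "ttlSecondsAfterFinished": ttl_seconds,
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "main", "image": image}],
                }
            },
        },
    }


manifest = build_job_manifest("nightly-export", "busybox:1.36")
print(manifest["metadata"]["name"])
```

A second-resolution timestamp can still collide if two runs start within the same second; where that is possible, a run identifier or random suffix would be a safer choice.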
Outcome:
- Job creation failures were eliminated
- Pipelines executed reliably without manual intervention
- Kubernetes cleanup became predictable and self-managed
- Operational risk from stuck Jobs and race conditions was removed
The client confirmed that the issue was fully resolved and requested closure of the case.
Conclusion:
The repeated Kubernetes Job creation failures were caused by Jobs remaining permanently in a deleting state
due to an orphan finalizer, despite successful delete task execution in Apache Airflow.
Kubernetes treated these Jobs as still existing, which blocked recreation with the same name and resulted
in conflict errors, missing Pods, and stalled pipelines.
By identifying the finalizer-related deletion behavior and eliminating the dependency on delete-and-recreate
logic, the client stabilized Job execution. Adopting unique Job names per run, combined with TTL-based
cleanup, let Kubernetes remove finished Jobs on its own, eliminated the name-collision race, and ensured
consistent pipeline execution without manual intervention or forced cleanup operations.