Problem:
The client reported recurring heartbeat failures in the Airflow scheduler that prevented controller DAGs from being generated in a production environment. This critical issue impacted job execution, especially when multiple jobs were triggered simultaneously, leading to timeouts and job failures.
Process:
Step 1: Initial Identification
The error message displayed in the logs indicated that the DagFileProcessorManager had failed to send a heartbeat within the expected interval:
[2024-11-26T10:35:29.082+0000] {manager.py:302} ERROR - DagFileProcessorManager (PID=539) last sent a heartbeat 50.35 seconds ago! Restarting it.
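Before changing anything, it can help to quantify how often this restart occurs. Below is a minimal sketch that scans log files for the error pattern above; the log directory and file pattern are placeholders and depend on how scheduler logs are collected in the cluster.

    import re
    from pathlib import Path

    # Hypothetical location of collected scheduler logs; adjust to the real deployment.
    LOG_DIR = Path("logs/scheduler")

    # Matches lines such as:
    # ... ERROR - DagFileProcessorManager (PID=539) last sent a heartbeat 50.35 seconds ago! Restarting it.
    PATTERN = re.compile(
        r"DagFileProcessorManager \(PID=\d+\) last sent a heartbeat ([\d.]+) seconds ago"
    )

    delays = []
    for log_file in LOG_DIR.glob("*.log"):
        for line in log_file.read_text(errors="ignore").splitlines():
            match = PATTERN.search(line)
            if match:
                delays.append(float(match.group(1)))

    if delays:
        print(f"{len(delays)} heartbeat restarts, worst delay {max(delays):.2f}s")
    else:
        print("No heartbeat restarts found in the scanned logs.")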
Step 2: Analysis by the Expert
The expert began by asking specific questions about the environment and setup:
- Whether Airflow was running on Kubernetes.
- How the cleanup of older Spark applications was performed.
- Where the scheduler logs were stored.
- Details about the deployment and resource allocation.
The client confirmed that Airflow was running on Kubernetes, with a cleanup script that deleted completed Spark applications older than 15 days. However, resource limitations and the use of a single replica for the scheduler seemed to exacerbate the issue.
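The client's cleanup script itself was not shared; the sketch below illustrates one way such a job could be implemented with the official kubernetes Python client against the Spark Operator's SparkApplication custom resource. The namespace, CRD group/version/plural, and the completion-state check are assumptions based on the description above and a typical Spark Operator installation.

    from datetime import datetime, timedelta, timezone

    from kubernetes import client, config

    # Assumed settings; adjust to the actual cluster and Spark Operator installation.
    NAMESPACE = "spark-jobs"            # hypothetical namespace
    GROUP, VERSION, PLURAL = "sparkoperator.k8s.io", "v1beta2", "sparkapplications"
    MAX_AGE = timedelta(days=15)        # retention window described by the client

    def purge_completed_spark_applications() -> None:
        """Delete completed SparkApplication resources older than MAX_AGE."""
        config.load_incluster_config()  # use load_kube_config() outside the cluster
        api = client.CustomObjectsApi()
        cutoff = datetime.now(timezone.utc) - MAX_AGE

        apps = api.list_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL)
        for app in apps.get("items", []):
            meta = app["metadata"]
            state = app.get("status", {}).get("applicationState", {}).get("state")
            created = datetime.strptime(
                meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
            if state == "COMPLETED" and created < cutoff:
                api.delete_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, meta["name"])
                print(f"Deleted SparkApplication {meta['name']} (created {created:%Y-%m-%d})")

    if __name__ == "__main__":
        purge_completed_spark_applications()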
Step 3: Recommendations for Resource Adjustment
The expert suggested:
- Allocating more resources (CPU and memory) to the scheduler.
- Adding more replicas to improve resilience.
- Reviewing the Kubernetes executor configuration.
Step 4: Root Cause Analysis and Solution Proposal
The expert’s root cause analysis indicated that the issue was primarily caused by the accumulation of historical Spark applications combined with insufficient resources for the scheduler. The cleanup script temporarily resolved the issue by freeing up resources, but a more permanent solution was needed.
Key factors contributing to the problem:
- Excessive Historical Spark Applications: Completed Spark applications were consuming Kubernetes cluster resources (disk space, memory, and CPU for log handling), putting pressure on the node running the Airflow scheduler. This accumulation led to CPU and memory contention, which delayed heartbeats.
- Scheduler Resource Limitations: The scheduler was running with minimal resources, with requests of only 500m CPU and 1 GB of RAM, which was insufficient for the load. With a single scheduler replica, any resource contention or failure disrupted scheduling and triggered heartbeat issues (the current allocation and replica count can be verified as sketched after this list).
- Inefficient Resource Allocation in Kubernetes: The Kubernetes cluster had 54 nodes, but the scheduler’s CPU and memory allocation was not optimized for the workload, and the DAG parallelism and task concurrency settings were too high, which put additional pressure on the scheduler.
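A minimal sketch for that verification with the kubernetes Python client follows; the deployment name airflow-scheduler and the namespace airflow are assumptions and should be replaced with the names of the actual release.

    from kubernetes import client, config

    # Assumed names; adjust to the actual Helm release and namespace.
    NAMESPACE = "airflow"
    DEPLOYMENT = "airflow-scheduler"

    config.load_kube_config()           # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()

    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    print(f"replicas: {dep.spec.replicas}")
    for container in dep.spec.template.spec.containers:
        print(f"{container.name}: requests={container.resources.requests} "
              f"limits={container.resources.limits}")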
Solution:
To address the issue, the expert recommended increasing the CPU and memory allocated to the scheduler, with a suggested configuration of 4 CPU cores and 4Gi of RAM for requests and up to 6 CPU cores and 12Gi of RAM for limits. Increasing the number of scheduler replicas from 1 to 2 was also advised to improve availability and load distribution. Airflow’s configuration should be tuned by adjusting settings such as PARALLELISM, MAX_ACTIVE_TASKS_PER_DAG, and MAX_ACTIVE_RUNS_PER_DAG, and switching from LocalExecutor to KubernetesExecutor would improve scalability. Finally, the cleanup pipeline should run more frequently to avoid the resource pressure caused by old Spark applications.
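A minimal sketch of how these recommendations could be applied with the kubernetes Python client is shown below. The deployment, namespace, and container names are assumptions typical of a Helm-based install, the resource figures follow the recommendation above, and the PARALLELISM / MAX_ACTIVE_* values are illustrative placeholders rather than figures from the engagement; in practice the same changes would normally be made through the Helm values file.

    from kubernetes import client, config

    # Assumed names for a Helm-based Airflow install; adjust to the actual release.
    NAMESPACE = "airflow"
    DEPLOYMENT = "airflow-scheduler"
    CONTAINER = "scheduler"

    # Airflow settings from the recommendation; the numeric values are illustrative.
    airflow_env = [
        {"name": "AIRFLOW__CORE__EXECUTOR", "value": "KubernetesExecutor"},
        {"name": "AIRFLOW__CORE__PARALLELISM", "value": "32"},
        {"name": "AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG", "value": "16"},
        {"name": "AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG", "value": "4"},
    ]

    # Strategic merge patch: two replicas plus the recommended requests/limits.
    patch = {
        "spec": {
            "replicas": 2,
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": CONTAINER,
                            "env": airflow_env,
                            "resources": {
                                "requests": {"cpu": "4", "memory": "4Gi"},
                                "limits": {"cpu": "6", "memory": "12Gi"},
                            },
                        }
                    ]
                }
            },
        }
    }

    config.load_kube_config()
    client.AppsV1Api().patch_namespaced_deployment(DEPLOYMENT, NAMESPACE, patch)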
Conclusion:
The Airflow heartbeat issue was caused by a combination of resource exhaustion, a single scheduler replica, and the resource demands of accumulated historical Spark applications. By increasing the scheduler’s resource allocation, adding a second replica, and optimizing Airflow’s configuration, the client can significantly reduce the likelihood of this issue recurring.
The client also expressed interest in understanding whether the spark-application-cleanup pipeline was solely responsible for resolving the issue. The expert clarified that while the cleanup pipeline temporarily alleviated the resource pressure, the root cause lay in the scheduler’s insufficient resources, which required a more robust solution. To prevent the issue from recurring, it was recommended to increase the frequency of the cleanup job to every 12 hours, monitor resource trends, and adjust as necessary.
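If the cleanup pipeline runs as an Airflow DAG, the 12-hour cadence could be expressed as a cron schedule, as in the sketch below. The DAG id and the purge_completed_spark_applications callable (for example, the function from the cleanup sketch earlier) are hypothetical.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def purge_completed_spark_applications() -> None:
        # Placeholder for the cleanup routine sketched earlier; it would delete
        # completed SparkApplication resources older than the retention window.
        ...

    # Hypothetical DAG id; runs the cleanup every 12 hours.
    with DAG(
        dag_id="spark_application_cleanup",
        start_date=datetime(2024, 11, 1),
        schedule="0 */12 * * *",   # Airflow 2.4+ parameter; older versions use schedule_interval
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="purge_completed_spark_applications",
            python_callable=purge_completed_spark_applications,
        )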