Problem:
The client’s operations team reported problems triggering jobs through Apache Airflow via a custom component, the dag_factory. Jobs triggered outside the dag_factory ran without issue, but those initiated through it were not being processed. The Airflow UI showed no log entries: the triggering DAG generated by the dag_factory completed without errors, yet the schema-level DAG never picked up the task. The client shared comprehensive Airflow service logs and asked for an effective solution.
Process:
The expert conducted a root cause analysis of the Airflow logs provided by the client. The analysis traced the likely cause to the “DagFileProcessorManager,” the component of the Airflow scheduler responsible for processing and parsing DAG files. This process was stopping frequently, disrupting the entire DAG-triggering workflow.
The expert noted that as the number of DAG files grew, the scheduler’s efficiency dropped under Airflow’s default configuration. The scheduler parses each DAG file in a separate subprocess using Python’s multiprocessing library, so with thousands of DAG files the default settings can fall badly behind. Two primary configuration changes were recommended to address these limitations.
Solution:
The expert proposed two configuration changes to improve the Airflow scheduler’s capacity and reliability in handling large volumes of DAGs:
- Modifying Airflow Configuration Parameters:
- min_file_process_interval: This parameter, 30 seconds by default, sets the minimum interval before a DAG file is re-parsed. With a high volume of DAG files, the expert recommended raising it to around 500 seconds to reduce the parsing load and prevent timeouts.
- scheduler_health_check_threshold: This parameter, also 30 seconds by default, sets how long the scheduler may go without sending a heartbeat before it is considered unhealthy. Raising it to around 240 seconds gave the scheduler more time to recover under load, promoting stability.
- Enabling Multiple Schedulers: For greater scalability, especially in production environments, the expert suggested configuring multiple schedulers. This setup would distribute the processing load, reducing the risk of overloading a single scheduler.
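As a sketch, the two parameter changes above would look like this in airflow.cfg (the 500-second and 240-second values are the ones recommended here; tune them for your deployment):

```ini
[scheduler]
# Minimum seconds before a DAG file is re-parsed (default: 30)
min_file_process_interval = 500

# Seconds without a scheduler heartbeat before it is considered unhealthy (default: 30)
scheduler_health_check_threshold = 240
```

The same settings can alternatively be supplied as environment variables (AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL and AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD). For the multi-scheduler setup, Airflow 2.x supports simply starting additional `airflow scheduler` processes against the same metadata database, provided the database backend supports it (e.g. PostgreSQL 10+ or MySQL 8+).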
Following these adjustments, the client observed that the issue had been resolved, confirming that the DAG was now successfully triggered from the dag_factory, picked up by the scheduler, and processed as expected.
Conclusion:
The recommendations provided by the expert addressed the client’s issues effectively. By adjusting critical configuration parameters and enabling multiple schedulers, the Airflow scheduler could handle a higher volume of DAGs without encountering timeouts or processing disruptions. The client’s operations team reported that all jobs were successfully triggered, and no further heartbeat issues appeared in the Airflow UI.