Problem:
The client reported that all DAGs triggered via dag-factory were failing, with associated symptoms such as pods not coming up when triggered and errors in the collector pods. Furthermore, the Airflow pod logs contained no error messages, making it difficult to identify the root cause.
Process:
To investigate the reported problems, the following data was requested for review:
- Airflow Configuration: relevant details of the Apache Airflow configuration.
- Server Stats: CPU usage, RAM utilization, and free disk space for the servers involved (a collection sketch follows below).
- Monitoring Screenshots: screenshots of system metrics and Airflow-specific metrics from the monitoring tools.
- Logs: logs from both the web server and the scheduler service.
The client provided the Airflow version (1.6.0), the scheduler and web server parameters, server stats, monitoring screenshots, and logs.
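Where these figures are not already exported by a monitoring tool, a minimal sketch of how the server stats could be captured on each host, assuming Python with the psutil package is available (the function name and output format are illustrative):

```python
import psutil

def server_stats() -> dict:
    """Snapshot the CPU, RAM, and free-disk figures requested above."""
    mem = psutil.virtual_memory()
    disk = psutil.disk_usage("/")
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),  # sampled over 1 second
        "ram_used_percent": mem.percent,
        "ram_available_gb": round(mem.available / 1024**3, 2),
        "disk_free_gb": round(disk.free / 1024**3, 2),
    }

if __name__ == "__main__":
    print(server_stats())
```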
Solution:
After a thorough analysis, the following findings and recommendations were made:
- OCI Runtime Execution Failure:
- Error Description: OCI runtime exec failed: exec failed: unable to start container process: exec /bin/sh: argument list too long: unknown.
- Analysis: the container runtime could not exec the container's entrypoint because the argument list and environment passed to /bin/sh exceeded the kernel's ARG_MAX limit; this is a system-level error raised during container execution rather than an application failure.
- Configuration Adjustment: the core parallelism setting was reduced from 64 to 32 to limit the number of concurrently running tasks and alleviate stress on Airflow (see the snippet below).
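For reference, a minimal sketch of the adjustment, assuming the standard airflow.cfg layout; the same value can also be set via the AIRFLOW__CORE__PARALLELISM environment variable:

```ini
# airflow.cfg -- caps how many task instances may run concurrently
# across the entire Airflow installation.
[core]
parallelism = 32
```

Halving parallelism trades peak throughput for stability by limiting how many tasks Airflow launches at once.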
- DevOps and Dev Team Actions:
- Identified and fixed issues with the “aia-spark-application-cleanup” scheduled job.
- Triggered the job to clean up the Spark applications that had accumulated over the last 15 days (a sketch of such a job follows this list).
- Restarted (bounced) the airflow-scheduler and airflow-webserver so that the changes took effect.
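The implementation of the "aia-spark-application-cleanup" job was not shared, so the following is only a minimal sketch of what such a cleanup could look like, assuming the applications are SparkApplication custom resources managed by the Kubernetes Spark Operator and that the kubernetes Python client is available; the namespace, retention period, and selection criteria are illustrative:

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

# Illustrative values -- the real job's namespace and retention were not shared.
NAMESPACE = "spark-jobs"
RETENTION = timedelta(days=15)

def cleanup_spark_applications() -> None:
    """Delete SparkApplication custom resources older than RETENTION."""
    config.load_incluster_config()  # use config.load_kube_config() outside a cluster
    api = client.CustomObjectsApi()

    apps = api.list_namespaced_custom_object(
        group="sparkoperator.k8s.io",
        version="v1beta2",
        namespace=NAMESPACE,
        plural="sparkapplications",
    )

    cutoff = datetime.now(timezone.utc) - RETENTION
    for app in apps["items"]:
        # creationTimestamp is an RFC 3339 string, e.g. "2023-11-01T12:00:00Z".
        created = app["metadata"]["creationTimestamp"].replace("Z", "+00:00")
        if datetime.fromisoformat(created) < cutoff:
            api.delete_namespaced_custom_object(
                group="sparkoperator.k8s.io",
                version="v1beta2",
                namespace=NAMESPACE,
                plural="sparkapplications",
                name=app["metadata"]["name"],
            )

if __name__ == "__main__":
    cleanup_spark_applications()
```

Run on a schedule (for example as a Kubernetes CronJob), a job like this keeps completed applications from accumulating indefinitely.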
Conclusion:
The cleanup of long-standing Spark applications, the fix to the scheduled cleanup job, and the adjustment to the Airflow parallelism configuration resolved the reported problems. The collaboration between the development and operations teams was crucial in restoring a stable and functional Apache Airflow environment.