Problem:
The client reported that all DAGs triggered via dag-factory were failing, with associated symptoms such as pods not coming up when triggered and errors in the collector pods. Furthermore, the Airflow pod logs contained no error messages, making it difficult to identify the root cause.
Process:
To investigate the reported problems, the following data was requested for review:
- Airflow Configuration: relevant details of the Apache Airflow configuration.
- Server Stats: CPU usage, RAM utilization, and free disk space for the servers involved (a collection sketch follows below).
- Monitoring Screenshots: screenshots of system metrics and Airflow-specific metrics from the monitoring tools.
- Logs: logs from both the web server and the scheduler service.
The client provided the Airflow version (1.6.0), the scheduler and web server parameters, server stats, monitoring screenshots, and logs.
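Where these figures are not already exported by a monitoring tool, a minimal sketch of how the server stats could be captured on each host, assuming Python with the psutil package is available (the function name and output format are illustrative):

```python
import psutil

def server_stats() -> dict:
    """Snapshot the CPU, RAM, and free-disk figures requested above."""
    mem = psutil.virtual_memory()
    disk = psutil.disk_usage("/")
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),  # sampled over 1 second
        "ram_used_percent": mem.percent,
        "ram_available_gb": round(mem.available / 1024**3, 2),
        "disk_free_gb": round(disk.free / 1024**3, 2),
    }

if __name__ == "__main__":
    print(server_stats())
```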
Solution:
After a thorough analysis, the following findings and recommendations were made:
- OCI Runtime Execution Failure:
- Error Description: OCI runtime exec failed: exec failed: unable to start container process: exec /bin/sh: argument list too long: unknown.
- Analysis: the container runtime could not exec the container's entrypoint because the argument list and environment passed to /bin/sh exceeded the kernel's ARG_MAX limit; this is a system-level error raised during container execution rather than an application failure.
- Configuration Adjustment: the core parallelism setting was reduced from 64 to 32 to limit the number of concurrently running tasks and alleviate stress on Airflow (see the snippet below).
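For reference, a minimal sketch of the adjustment, assuming the standard airflow.cfg layout; the same value can also be set via the AIRFLOW__CORE__PARALLELISM environment variable:

```ini
# airflow.cfg -- caps how many task instances may run concurrently
# across the entire Airflow installation.
[core]
parallelism = 32
```

Halving parallelism trades peak throughput for stability by limiting how many tasks Airflow launches at once.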
- DevOps and Dev Team Actions:
- Identified and fixed issues with the “aia-spark-application-cleanup” scheduled job.
- Triggered the job to clean up the Spark applications that had accumulated over the last 15 days (a sketch of such a job follows this list).
- Restarted (bounced) the airflow-scheduler and airflow-webserver so that the changes took effect.
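The implementation of the "aia-spark-application-cleanup" job was not shared, so the following is only a minimal sketch of what such a cleanup could look like, assuming the applications are SparkApplication custom resources managed by the Kubernetes Spark Operator and that the kubernetes Python client is available; the namespace, retention period, and selection criteria are illustrative:

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

# Illustrative values -- the real job's namespace and retention were not shared.
NAMESPACE = "spark-jobs"
RETENTION = timedelta(days=15)

def cleanup_spark_applications() -> None:
    """Delete SparkApplication custom resources older than RETENTION."""
    config.load_incluster_config()  # use config.load_kube_config() outside a cluster
    api = client.CustomObjectsApi()

    apps = api.list_namespaced_custom_object(
        group="sparkoperator.k8s.io",
        version="v1beta2",
        namespace=NAMESPACE,
        plural="sparkapplications",
    )

    cutoff = datetime.now(timezone.utc) - RETENTION
    for app in apps["items"]:
        # creationTimestamp is an RFC 3339 string, e.g. "2023-11-01T12:00:00Z".
        created = app["metadata"]["creationTimestamp"].replace("Z", "+00:00")
        if datetime.fromisoformat(created) < cutoff:
            api.delete_namespaced_custom_object(
                group="sparkoperator.k8s.io",
                version="v1beta2",
                namespace=NAMESPACE,
                plural="sparkapplications",
                name=app["metadata"]["name"],
            )

if __name__ == "__main__":
    cleanup_spark_applications()
```

Run on a schedule (for example as a Kubernetes CronJob), a job like this keeps completed applications from accumulating indefinitely.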
Conclusion:
The cleanup of long-standing Spark applications, the fix to the scheduled cleanup job, and the adjustment to the Airflow parallelism configuration resolved the reported problems. The collaboration between the development and operations teams was crucial in restoring a stable and functional Apache Airflow environment.