Problem:
A scheduled job did not trigger at its designated time, and when the client attempted to run the job manually, the backend pods failed to come up. A review of the scheduler logs revealed no specific errors.
Process:
Our experts investigated the following logs and data:
- Airflow Configuration: Relevant details of the Apache Airflow configuration.
- Server Stats: CPU usage, RAM utilization, and free disk space statistics for the servers involved.
- Monitoring Screenshots: System and Airflow metrics from monitoring tools.
- Logs: Logs from both the web server and scheduler service.
Solution:
Our experts proposed the following solutions:
- Temporarily remove all DAGs from the Airflow DAG folder.
- Clear the logs and add a minimal test DAG containing a single dummy operator (see the DAG sketch after this list).
- Restart the services (web server / scheduler).
- Provide screenshots of CPU metrics for the backend DB and all worker and frontend machines.
- Review the airflow.cfg configuration file and adjust any timeout-related parameters that are present.
- Double the value of parameters such as `dag_file_processor_timeout` and restart the services (see the configuration check sketched after this list).
- Provide screenshots of stats and monitoring metrics, including CPU metrics, for the last 48 hours.
- Based on the observed scheduler CPU usage, double the CPU allocation for the scheduler machine.
- To address concerns about the provided script for clearing DAGs, preserve existing data by taking a dump of the Airflow metadata tables in PostgreSQL before running the script (see the backup sketch after this list).
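
For the second step above, a minimal sketch of the kind of test DAG that could be dropped into the DAG folder is shown below. It assumes Airflow 2.x; the DAG id, schedule, and task id are hypothetical placeholders, and on older 1.10.x installs the operator import path would differ.

```python
# Minimal test DAG with a single dummy operator, assuming Airflow 2.x.
# On 1.10.x the import would be airflow.operators.dummy_operator instead.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Hypothetical DAG id and schedule; the only purpose of this DAG is to confirm
# that the scheduler can parse and trigger a trivial DAG once the original
# DAGs have been moved out of the DAG folder.
with DAG(
    dag_id="scheduler_smoke_test",
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/5 * * * *",  # every 5 minutes; adjust as needed
    catchup=False,
) as dag:
    DummyOperator(task_id="noop")
```

If this DAG schedules and runs cleanly while the real DAGs are removed, the problem is more likely in DAG parsing or resource limits than in the scheduler service itself.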
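
Before doubling timeout values in airflow.cfg, it can help to confirm what the running installation currently uses. The sketch below assumes Airflow 2.x and a handful of commonly present timeout keys; the exact parameter names and sections vary between versions, so they should be verified against the actual airflow.cfg.

```python
# Print the current values of a few timeout-related parameters, assuming
# Airflow 2.x; keys that do not exist in this version are reported as unset.
from airflow.configuration import conf

TIMEOUT_KEYS = [
    ("core", "dag_file_processor_timeout"),
    ("core", "dagbag_import_timeout"),
    ("webserver", "web_server_worker_timeout"),
]

for section, key in TIMEOUT_KEYS:
    try:
        print(f"[{section}] {key} = {conf.get(section, key)}")
    except Exception:
        print(f"[{section}] {key} is not set in this version")
```

Doubling `dag_file_processor_timeout` gives slow-parsing DAG files more time before the file processor is killed, which matches the symptom of jobs silently failing to schedule without explicit errors.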
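
For the final step, a simple way to preserve the metadata before running the DAG-clearing script is a full dump of the Airflow database. The sketch below assumes a PostgreSQL backend with `pg_dump` available on the machine; the host, database, and user names are placeholders, and credentials would come from PGPASSWORD or a .pgpass file.

```python
# Take a timestamped dump of the Airflow metadata database before running any
# script that clears DAG state. Connection details below are hypothetical.
import subprocess
from datetime import datetime

backup_file = f"airflow_meta_{datetime.now():%Y%m%d_%H%M%S}.sql"

subprocess.run(
    [
        "pg_dump",
        "--host", "airflow-db.example.internal",  # placeholder DB host
        "--username", "airflow",                   # placeholder DB user
        "--dbname", "airflow",                     # placeholder DB name
        "--file", backup_file,
    ],
    check=True,  # raise if pg_dump fails, so the clearing script is not run
)
print(f"Metadata dump written to {backup_file}")
```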
Conclusion:
The project underscores the importance of meticulous data gathering, systematic troubleshooting, and proactive solution implementation in resolving complex issues in an Apache Airflow environment.