Problem:

Triggering DAGs from the Dag_factory for the initial load was not successful: the Dag_factory reported success, but the Collectors did not trigger. As a workaround, when the client triggered the DAG directly from the SQL collector, the drivers came up and ran as expected.
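
For context, a minimal sketch of the pattern described, a factory DAG triggering a collector DAG, is shown below. The DAG ids, schedule, and dates are illustrative placeholders rather than the client's actual identifiers, and the sketch assumes Airflow 2.x.

```python
# Hypothetical sketch of the factory-to-collector trigger pattern; DAG ids,
# schedule, and dates are illustrative, not the client's configuration.
from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="dag_factory_initial_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered manually for the initial load
    catchup=False,
) as dag:
    # The factory task is marked successful once the trigger request is queued;
    # wait_for_completion=True makes it fail if the collector run never happens.
    trigger_sql_collector = TriggerDagRunOperator(
        task_id="trigger_sql_collector",
        trigger_dag_id="sql_collector",
        wait_for_completion=True,
    )
```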

Solution:

In a series of meetings, the team worked through Airflow issues spanning the DAG file processor manager, the scheduler, and the log file system. The client and our experts troubleshot manual DAG triggering, local Airflow runs, and file cleanup. Technical challenges around file-system accessibility and insufficient logging were identified, leading to consideration of options such as scheduler restarts, Airflow redeployment, and reinstallation in a new Kubernetes namespace. Concerns about data loss and switching file storage were also discussed.

In the second meeting, a non-responsive Airflow scheduler was identified. The client made changes to the concurrency and max-active settings but then faced system unavailability, potentially due to autoscaling, and raised a CSRE ticket with T-Mobile for assistance.
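
The exact values changed were not captured here; purely as an illustration, the DAG-level knobs usually involved are max_active_runs and max_active_tasks (named concurrency before Airflow 2.2), alongside the cluster-wide [core] parallelism setting. All values below are examples only.

```python
# Illustrative DAG-level concurrency settings (values are examples only).
# Cluster-wide limits come from [core] parallelism and
# [core] max_active_tasks_per_dag in airflow.cfg.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator before Airflow 2.3

with DAG(
    dag_id="sql_collector",            # hypothetical collector DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=1,                 # at most one concurrent run of this DAG
    max_active_tasks=8,                # cap concurrent tasks within a run (Airflow 2.2+)
) as dag:
    collect = EmptyOperator(task_id="collect")
```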

The third meeting involved exploring solutions, updating Airflow versions, and running test simulations. Difficulties in the interaction between the scheduler and the trigger mechanism came to light, leading to continued work on checking logs, reviewing metrics, and debugging.

The fourth meeting focused on debugging and scheduling the dynamic DAGs created by a primary DAG. An issue was identified: the generated DAGs were not appearing in the Airflow UI. Emphasis was placed on timely information sharing ahead of upcoming client updates. However, attempts to simulate the client's system were hindered because the DAG factory code could not be shared for internal reasons on the client's side.
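
Because the client's DAG factory code could not be shared, our simulation assumed the common dynamic-generation pattern sketched below, in which generated DAGs appear in the UI only if the DAG objects are bound at module level in a file the DAG file processor parses. All names are illustrative.

```python
# Illustrative dynamic DAG generation; collector names are examples only.
# Generated DAGs show up in the Airflow UI only if the DAG objects are bound
# at module level (e.g. via globals()) in a file the DAG file processor parses.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

COLLECTORS = ["sql_collector", "api_collector"]  # hypothetical list


def build_collector_dag(name: str) -> DAG:
    with DAG(
        dag_id=f"{name}_generated",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        EmptyOperator(task_id="run_collector")
    return dag


for collector in COLLECTORS:
    # Without this module-level binding the DAG object is created but never
    # registered, so it silently fails to appear in the UI.
    globals()[f"{collector}_generated"] = build_collector_dag(collector)
```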

In the fifth meeting, the efforts to address performance issues were summarized. Various errors, including script and namespace issues, were identified and addressed. Solutions included increasing the number of scheduler instances and tuning the Airflow web server. Recommendations for configuration updates and monitoring were provided to the client.
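
The exact settings depend on the client's deployment; as a rough illustration, the configuration areas covered by the recommendations map to Airflow environment-variable overrides such as the following, with example values only.

```python
# Example configuration overrides (values are illustrative only). Each key maps
# to an AIRFLOW__{SECTION}__{OPTION} environment variable that can be set on the
# scheduler/webserver pods or in a Helm values file; the number of scheduler
# replicas itself is a deployment setting (e.g. scheduler.replicas in the chart).
recommended_overrides = {
    "AIRFLOW__SCHEDULER__PARSING_PROCESSES": "4",            # parallel DAG file parsing
    "AIRFLOW__CORE__PARALLELISM": "64",                      # cluster-wide running-task cap
    "AIRFLOW__WEBSERVER__WORKERS": "4",                      # gunicorn workers for the UI
    "AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT": "120",  # gunicorn worker timeout (seconds)
}

for key, value in recommended_overrides.items():
    print(f"{key}={value}")
```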

As a result, the client executed the planned steps to clear the DAGs using the Airflow API, which resolved the DAG issue during testing. The initial problem could no longer be reproduced, and the cleanup procedures were shared with the client's operational teams to prevent similar occurrences in the future.
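
The precise cleanup calls are not reproduced here; as a sketch, assuming the Airflow 2.x stable REST API with basic authentication enabled, removing the generated DAGs' metadata could look roughly like this. The host, credentials, and the sql_collector prefix are placeholders, and the real procedure used by the operational teams may differ.

```python
# Sketch of DAG cleanup through the Airflow 2.x stable REST API; host,
# credentials, and the DAG id prefix are placeholders, and the real cleanup
# procedure used by the operational teams may differ.
import requests

AIRFLOW_URL = "http://airflow-webserver:8080/api/v1"  # placeholder host
AUTH = ("admin", "admin")                              # placeholder credentials


def list_dag_ids(prefix: str) -> list:
    """Return DAG ids starting with the given prefix (first page only)."""
    resp = requests.get(f"{AIRFLOW_URL}/dags", params={"limit": 100}, auth=AUTH)
    resp.raise_for_status()
    return [d["dag_id"] for d in resp.json()["dags"] if d["dag_id"].startswith(prefix)]


def delete_dag(dag_id: str) -> None:
    """Delete all metadata (runs, task instances) for a DAG whose file has
    already been removed from the DAGs folder."""
    resp = requests.delete(f"{AIRFLOW_URL}/dags/{dag_id}", auth=AUTH)
    resp.raise_for_status()


if __name__ == "__main__":
    for dag_id in list_dag_ids("sql_collector"):
        delete_dag(dag_id)
```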

Conclusion:

Addressing the Apache Airflow issue involved a thorough investigation and exploration of the client's system, supplemented by simulations in our own environment to understand the intricacies of the issue.

The team devoted effort to debugging and troubleshooting the dynamic DAGs issue, weighing multiple solutions. Recommendations were offered to the client, covering configuration updates, monitoring, and resolution strategies. A structured plan for future operations was also provided, outlining steps to prevent and address similar issues and ensuring a more resilient, optimized system going forward.