An Airflow and Apache Spark installation can be fine-tuned for optimal
performance by adjusting over 150 environment variables, maximizing the
number of concurrently running DAGs and fully utilizing allocated resources.
During recent hosted support sessions with an ISV that develops software for
telecommunications companies, we encountered multiple challenges in
effectively scaling their Airflow deployment. Specifically, we focused on
optimizing the number of Directed Acyclic Graphs (DAGs) to make the most of
their Kubernetes cluster and allocated resources.
Our client was limited to executing no more than 15 DAGs at the same time,
and several subsystems were identified as problematic. To address the
multifaceted challenges, we enlisted the help of two of our experts: one
specialized in Data and Airflow mechanics, and the other highly experienced in
Spark running on Kubernetes.
The project encountered several challenges that impeded operational
efficiency and real-time monitoring. Spark struggled to execute all DAGs
entirely, leading to bottlenecks in the workflow. The Airflow UI displayed only a
fraction of the expected DAGRuns (25 out of 115), making it difficult to monitor
the system effectively. Additionally, jobs were often marred by incorrect pod
names, causing confusion and inefficiency outdated ConfigMap settings forced
the system to use the less efficient “LocalExecutor” instead of the using the
“KubernetesExecutor” tapping into Kubernetes’s ability to schedule the spark-
based DAGS optimally
To address these challenges, our experts recommended reevaluating the
Airflow installation: the Helm chart was reinstalled on the development cluster,
and the ConfigMap was configured with "KubernetesExecutor" to lay the
groundwork for optimal use of the allocated Kubernetes resources.
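Airflow reads configuration overrides from environment variables named
`AIRFLOW__{SECTION}__{KEY}`, so a ConfigMap entry that sets
`AIRFLOW__CORE__EXECUTOR` selects the executor. A minimal sketch of that
mapping (the client's actual ConfigMap keys are not reproduced here):

```python
import os

# A ConfigMap entry setting this variable overrides [core] executor
# in airflow.cfg, selecting the KubernetesExecutor.
os.environ["AIRFLOW__CORE__EXECUTOR"] = "KubernetesExecutor"

def airflow_env_to_cfg(name: str) -> tuple:
    """Translate an AIRFLOW__SECTION__KEY env-var name into the
    (section, key) pair it overrides in airflow.cfg."""
    _, section, key = name.split("__", 2)
    return section.lower(), key.lower()

section, key = airflow_env_to_cfg("AIRFLOW__CORE__EXECUTOR")
print(section, key, "=", os.environ["AIRFLOW__CORE__EXECUTOR"])
# core executor = KubernetesExecutor
```

Keeping all such overrides in a single ConfigMap (a lesson repeated later in
this write-up) makes it obvious which executor the deployment will actually use.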
A granular analysis of job instances was carried out to identify gaps in UI
visibility and their root causes. Our Hossted experts helped uncover tasks that
were available but not registered in the Airflow UI or database. They also
identified chronic issues with the Persistent Volume Claim (PVC)
"spark-history-server-pvc", particularly pods stuck in "Init:0/1" status: Spark
needed to write its history to a shared volume, and the current PVC turned out
to be in an inconsistent state.
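A pod showing "Init:0/1" has an init container that never terminated, which is
visible in the pod's status object. A hedged sketch of how such pods can be
flagged, using illustrative status dictionaries shaped like the Kubernetes API
rather than a live cluster connection:

```python
def stuck_in_init(pod_status: dict) -> bool:
    """Return True when any init container has not terminated,
    i.e. the pod would display a status like 'Init:0/1'."""
    init_states = pod_status.get("initContainerStatuses", [])
    return any("terminated" not in s.get("state", {}) for s in init_states)

# Illustrative pod statuses (field names mirror the Kubernetes API).
pods = {
    "spark-history-server": {
        "initContainerStatuses": [
            {"state": {"waiting": {"reason": "PodInitializing"}}}
        ],
    },
    "airflow-scheduler": {
        "initContainerStatuses": [{"state": {"terminated": {"exitCode": 0}}}],
    },
}

print([name for name, status in pods.items() if stuck_in_init(status)])
# ['spark-history-server']
```

The same check, pointed at live pod statuses, is what the later recommendation
to proactively terminate "Init:0/1" pods would build on.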
Our experts explored the Azure Files CSI driver’s functionality to overcome
shared mounting obstacles. Finally, a careful examination of version
compatibility among Apache Spark, Airflow, and Kubernetes components
ensured seamless integration.
Data collection was a critical aspect of the project. The team analyzed how
Spark was utilized in the client’s environment, scrutinized Airflow and
Kubernetes configuration files, and evaluated the job submission code for best
practices. Version details for Apache Spark, Airflow, and Kubernetes were
gathered to verify compatibility. Logs from both successful and failed Spark
jobs were analyzed to identify error messages, exceptions, and warnings. Deep
dives into logs for failed Spark runs helped uncover specific failures related to
Spark applications. Airflow configuration file changes were reviewed to confirm
they were accurately reflected in the Kubernetes executor and to address DAG
visibility issues.
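The log analysis described above boils down to filtering driver and executor
logs for error, warning, and exception lines. A minimal sketch with illustrative
log lines (not the client's actual logs):

```python
import re

# Log markers of interest in Spark driver/executor output.
LEVELS = re.compile(r"\b(ERROR|WARN|Exception)\b")

def interesting_lines(log_text: str) -> list:
    """Keep only lines carrying errors, warnings, or exceptions."""
    return [line for line in log_text.splitlines() if LEVELS.search(line)]

sample = """\
INFO  SparkContext: Running Spark version 3.4.1
WARN  TaskSetManager: Lost task 0.0 in stage 1.0
ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread
INFO  DAGScheduler: Job 0 finished
"""
for line in interesting_lines(sample):
    print(line)
```

Comparing the filtered output of successful and failed runs side by side is
what surfaced the failures specific to the Spark applications.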
The client had an extensive Grafana- and Prometheus-based observability
platform in place, but we found that the Spark executors were not showing up
in the existing monitoring dashboards. Our experts therefore suggested that
the client install the needed exporters and a Grafana dashboard to close the
gap in Spark observability, which ultimately let us pinpoint the reasoning
behind the Spark executor failures.
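A quick way to confirm such a gap is to inspect a Prometheus scrape and check
whether executor-level series exist at all. A hedged sketch (the metric names
are illustrative; actual names depend on the exporter installed):

```python
def metric_names(exposition: str) -> set:
    """Extract metric names from Prometheus text exposition format."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        names.add(line.split("{")[0].split(" ")[0])
    return names

# Illustrative scrape: driver metrics present, executor metrics absent.
scrape = """\
# HELP spark_driver_memory_used Driver memory in use
spark_driver_memory_used{app="etl"} 512.0
airflow_scheduler_heartbeat 1.0
"""
names = metric_names(scrape)
print("executor metrics present:",
      any(n.startswith("spark_executor") for n in names))
# executor metrics present: False
```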
During the investigation, two key runs were observed. In the first run, 36 out of
40 triggered DAGs were visible in the UI, with one task failing under the name
“DeleteSparkApplicationIfExists.” In the second run, 33 out of 40 triggered
DAGs were visible, with one task failing named “init_custom_variables.”
Several solutions were implemented to address the challenges. The
max_active_runs parameter was adjusted to resolve conflicts with global
settings. The pod template file was updated to avoid hard-coded values,
enabling the use of the desired “KubernetesExecutor.” Stale ConfigMap issues
were resolved by the client’s internal team, ensuring correct executor usage.
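In DAG code, `max_active_runs` is set per DAG and, when present, overrides the
global `core.max_active_runs_per_dag` default, which is how the conflict with
global settings was resolved. A dependency-free sketch of that capping logic
(an illustration, not Airflow's actual scheduler code):

```python
from typing import Optional

def effective_max_active_runs(dag_value: Optional[int],
                              global_default: int) -> int:
    """A per-DAG max_active_runs overrides the global
    core.max_active_runs_per_dag default when set."""
    return dag_value if dag_value is not None else global_default

def can_start_new_run(active_runs: int, dag_value: Optional[int],
                      global_default: int = 16) -> bool:
    """Gate mirroring how the scheduler caps concurrent DAGRuns."""
    return active_runs < effective_max_active_runs(dag_value, global_default)

# A DAG pinned to 5 concurrent runs stops admitting a 6th,
# regardless of the looser global default.
print(can_start_new_run(active_runs=4, dag_value=5))   # True
print(can_start_new_run(active_runs=5, dag_value=5))   # False
```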
Implementation & Results
As a result of these interventions, RAM and CPU limits were increased for
improved performance. While an investigation into SparkSession invocation
issues related to PVC mounting is still ongoing, the executor and Airflow job
visibility issues were successfully resolved through the ConfigMap and pod
template file updates.
For future improvements, the team recommended proactive intervention to
manually terminate pods stuck in "Init:0/1" status and to consider removing
the spark-history-server feature for better cluster efficiency. Incremental
testing with 20 DAGRuns and performance optimization through code
adjustments are also advised. Finally, to optimize performance, tweak the
'max_active_runs' and 'concurrency' settings in the DAG code, then run 40
instances of the workflow (DAGRuns) to assess improvements in both user
interface responsiveness and execution speed on Kubernetes.
This experience underscored the importance of maintaining all configurations
within a single ConfigMap and handling exceptional scenarios within the DAG
Python code. It also highlighted the need to avoid hard-coding values in
configuration files to prevent unexpected issues. Rigorous testing and
monitoring were found to be crucial for the early detection and resolution of
integration issues, ensuring smooth operations.
By addressing these challenges head-on and implementing targeted solutions,
our client was able to significantly improve the performance and reliability of its
Spark and Airflow deployments, setting a strong precedent for open-source
interoperability in complex, hosted environments.