Problem:
The client faced Spark issues during job submissions through Airflow. Of the maximum of 15 DAGs triggered at a time, 3-4 consistently remained in the Initial state and never moved to the Running state. A few DAGs also got stuck in the ContainerCreating state and never came up.
Process:
Step 1 – Initial investigation and troubleshooting
The expert team initiated the preliminary investigation and promptly requested information from the client. The client provided the following information for further investigation:
1. Job Submission Statistics:
- Number of jobs submitted: Not more than 50.
- Average job duration: Typically around 4 hours.
- Concurrent jobs: Up to 15 jobs, as mentioned.
2. Kubernetes Resource Usage:
- SQL Collector Configuration (see the SparkApplication sketch after this list):
– Driver:
- Cores: 2
- Core Limit: 2000m
- Memory: 8096m
– Executor:
- Cores: 1
- Instances: 1
- Memory: 8096m
- Core Limit: 2000m
3. Error and Failure Details:
Client-side (Airflow) and Kubernetes-side issues:
Common Errors:
- Webhook connection issues – “no route to host” and “context deadline exceeded.”
Job Failures:
- Failures during driver pod setup caused by webhook requests.
- Resource mount failures.
4. Logs and Configuration:
- Complete logs for one successful and one failed job.
– Configuration Files:
- Airflow and Kubernetes configuration (YAML, Airflow variables).
– Job Submission Code:
- Available in the logs.
– Component Versions:
- Spark Operator: Upgraded from 1.1.19 to 1.1.25.
– Kubernetes:
- Client Version: 1.22.0
- Server Version: 1.21.12
5. Cluster Access and Issues:
- No direct access to the cluster.
- Specific logs needed for further diagnosis.
6. Diagnosed Problems:
- Webhook request failures.
- Routing issues potentially related to the Calico component in Kubernetes.
- Resource mounting failures.
7. Ongoing Diagnosis:
- Checking for potential issues between the Spark Operator and Kubernetes.
- Requesting specific logs for further investigation.
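For reference, the driver and executor sizing reported above maps onto a SparkApplication manifest roughly as in the sketch below, assuming the spark-on-k8s-operator v1beta2 CRD; the application name, namespace, image, main application file, and Spark version are placeholders rather than the client's actual values.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: sql-collector          # hypothetical name
  namespace: spark-jobs        # hypothetical namespace
spec:
  type: Java                   # the client's job appears to be Java-based
  mode: cluster
  image: "example.azurecr.io/spark:3.1.1"          # placeholder image
  mainApplicationFile: "local:///opt/app/job.jar"  # placeholder jar
  sparkVersion: "3.1.1"                            # placeholder version
  driver:
    cores: 2
    coreLimit: "2000m"
    memory: "8096m"
  executor:
    cores: 1
    instances: 1
    coreLimit: "2000m"
    memory: "8096m"

Each driver and executor pod therefore requests roughly 8 GiB of memory, which is useful context for judging how many of the up-to-15 concurrent jobs the cluster can schedule at once.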
Step 2 – Further investigation
After the investigation, the experts found that Airflow reads its configuration from three places, in the following order:
- Configuration defined in the DAG Python code.
- Configuration defined in Airflow environment variables.
- Lastly, configuration from the Airflow config file / ConfigMap.
The experts concluded that the standard practice is to define everything in the config file / ConfigMap at a persistent location and copy it into the Airflow containers; exceptional scenarios can be handled in the DAG Python code. The experts suggested defining everything in a single ConfigMap and handling it during the Airflow deployment pipeline.
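As a minimal sketch of that suggestion, the single ConfigMap could hold the Airflow settings as AIRFLOW__<SECTION>__<KEY> environment variables; the name, namespace, and concrete keys below are illustrative, not the client's actual configuration.

apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-config         # hypothetical name
  namespace: airflow           # hypothetical namespace
data:
  # Airflow reads AIRFLOW__<SECTION>__<KEY> environment variables ahead of
  # the values in airflow.cfg, so loading this ConfigMap via envFrom keeps
  # one source of truth for the deployment pipeline.
  AIRFLOW__CORE__EXECUTOR: "KubernetesExecutor"
  AIRFLOW__CORE__PARALLELISM: "32"
  AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG: "15"

Only truly exceptional, job-specific settings would then remain in the DAG Python code.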
Step 3 – Meeting with the client
At the meeting, the team and the experts discussed the following topics:
- Configuring Airflow Executor: Discussing the need to change Airflow’s config map manually and integrate it into the installation itself, our experts proceeded to configure it as a Kubernetes executor in the Helm chart.
- Updating Helm Chart: Identifying a bug in Airflow’s Helm chart where a value is hardcoded, the team discussed opening a bug with Airflow to address it in future releases.
- Maintaining Local Workaround: Deciding to implement a local workaround and continue with it until the issue is resolved in future Airflow releases, the team discussed the necessity of managing this locally if not fixed in upcoming versions.
- Local executor and Kubernetes executor: Discussing the need for Kubernetes executors and managing them in the helm chart, including exposing variables in the values.yaml file.
- Issue with Airflow Helm chart: Identifying a bug in the Airflow Helm chart related to local executors, the team agreed to open a bug report and temporarily hardcode a solution.
- Defining environment variables: Discussing the definition of environment variables in the values.yaml file and potential conflicts with dynamically generated sections.
- Updating the Helm chart: Suggesting an update to the Helm chart to reference the executor variable in the values.yaml file instead of hardcoding it.
- Sharing files and adding to the watch list: Agreeing to share relevant files and add team members to a watch list for updates on the bug report.
Step 4 – Identified problems
After the meetings with the client, the expert team found the following issues:
1. Global Airflow config
When the job failed, the SparkSession was never invoked. The Java job reads its config file from a local path that is mounted as part of the YAML (see the volume-mount sketch after this list); in many cases, the job started before the PVC was mounted, or the mount failed.
2. Spark-history PVC
The PVC “spark-history-server-pvc” (with the related PV “pvc-81398e51-ca4d-4708-aa96-7d2aecdaf4a3”) did not mount on driver pods, which caused the “Init:0/1” status. This PVC was RWX and backed by the Azure Files CSI driver.
3. Airflow issue with DAGs not visible in the UI
Not all DAGRuns were shown correctly in the Airflow UI, and some jobs randomly used incorrect pod names.
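The pattern behind issues 1 and 2 can be illustrated with a short, hypothetical SparkApplication fragment; the volume names, the config PVC name, and the mount paths below are placeholders. With the spark-on-k8s-operator, driver and executor volume mounts are injected by the operator's mutating webhook, so the webhook failures reported earlier are one plausible reason for pods starting without the expected mounts.

spec:
  volumes:
    - name: global-config                          # hypothetical volume name
      persistentVolumeClaim:
        claimName: global-airflow-config-pvc       # hypothetical PVC name
    - name: spark-history
      persistentVolumeClaim:
        claimName: spark-history-server-pvc        # the PVC from issue 2
  driver:
    volumeMounts:
      - name: global-config
        mountPath: /etc/spark-app/conf             # hypothetical path the Java job reads from
      - name: spark-history
        mountPath: /opt/spark/history              # hypothetical mount path

If either claim cannot be attached, the driver pod stays in “Init:0/1” and the job never reaches the point of creating a SparkSession.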
Solution:
Issue 1 – The PVC mount failure matched an open upstream issue (and a previously opened and closed one) that was tracked as part of the current issue:
(Open Issue) https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/1619.
(Previously Closed) https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/784.
Issue 2 – Experts suggested reviewing all pods in the cluster across all namespaces and manually terminating those in the “Init:0/1” status (as well as in other statuses that could lock this PVC).
As an alternative, they recommended considering the removal of the spark-history-server feature (and its associated PVC) and asked the client to decide whether it was really necessary:
https://github.com/mesosphere/kudo-spark-operator/blob/master/kudo-spark-operator/docs/latest/history-server.md.
Removing this optional PVC would have significantly improved the cluster’s performance, especially given the large workloads.
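For context, an RWX claim backed by Azure Files typically looks like the sketch below; the storage class name and requested size are assumptions rather than values taken from the client's cluster.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-history-server-pvc
spec:
  accessModes:
    - ReadWriteMany                  # RWX, shared across driver pods
  storageClassName: azurefile-csi    # assumed class for the Azure Files CSI driver
  resources:
    requests:
      storage: 100Gi                 # illustrative size

Dropping the history-server feature would remove this claim (and any driver volumeMounts referencing it), eliminating the Azure Files attach step that was leaving pods in “Init:0/1”.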
Issue 3 – The expert team reviewed the received data and made the change below as a workaround (WA) in the Helm chart, so the suggested fix could be adopted through the pipeline:
Hard-coded value:
- name: AIRFLOW__CORE__EXECUTOR
  value: LocalExecutor
changed to:
- name: AIRFLOW__CORE__EXECUTOR
  value: {{ .Values.executor }}
The change was made in the file “airflow-1.6.0\airflow\files\pod-template-file.kubernetes-helm-yaml” inside the Airflow Helm chart.
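The templated value only works if an executor key is exposed in values.yaml, as discussed in Step 3; a minimal sketch of that entry is shown below (the default shown here is an assumption).

# values.yaml (sketch) – the key referenced by {{ .Values.executor }}
executor: KubernetesExecutor

With this in place, the executor can be switched per environment through the deployment pipeline's value overrides instead of editing the pod template by hand.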
The team also identified that an Airflow job would not show up in the UI even though it was visible in Task Details. The experts created two issues on GitHub to bring this to the maintainers’ attention:
- Airflow Discussions: #32493 (https://github.com/apache/airflow/discussions/32493).
- Airflow Issues: #32485 (https://github.com/apache/airflow/issues/32485).
Conclusion:
The client experienced Spark issues during job submissions through Airflow, with several DAGs stuck in the Initial or ContainerCreating states. The initial investigation revealed a limited number of job submissions and common errors such as webhook connection failures and resource mount failures. The expert team recommended standardizing configuration management through a single ConfigMap and changed the Airflow Helm chart to replace the hardcoded executor value with a variable. In a meeting, they discussed configuring the Airflow executor, updating the Helm chart, and maintaining local workarounds, and opened GitHub issues to address the visibility and configuration problems.