
Conversation

@bin-lian
Contributor

@bin-lian bin-lian commented Feb 17, 2025

SparkSubmitOperator on Kubernetes: regardless of cluster or client mode, only the final status of the subprocess needs to be monitored. The final status of the subprocess is the final status of the Spark program.

closes: #44810


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg

boring-cyborg bot commented Feb 17, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@jscheffl
Contributor

I assume the referenced bug ticket is not the right one; it refers to a PR.

@bin-lian
Contributor Author

I have modified it.

@bin-lian
Contributor Author

Hello, if you have any questions about the approach, I can modify it.

Contributor

@nevcohen nevcohen left a comment

In our cluster, we sometimes get a returncode of 0 while the Spark K8s driver exit code is 1.

This is a normal scenario that happens from time to time, and we want the process to be marked as failed in Airflow!

Even more, I would change the log message so that it is more indicative, because the spark-submit command did execute, but the Spark app failed.

@nevcohen
Contributor

I think I found your problem. Could you send the logs of your task that runs Spark in Airflow (the Airflow worker logs)?

I assume that the phrase "exit code" appears in your logs several times, but those occurrences do not belong to the driver, so the wrong value is loaded here.

Then, to fix your bug, you'll have to fix the regex.
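
For illustration, here is a minimal sketch of the kind of log scan involved; the regex, the extract_exit_code helper, and the sample log lines are all made up for the example and are not the hook's actual code:

import re

# Illustrative pattern: pull an "Exit code: N" value out of spark-submit output.
EXIT_CODE_RE = re.compile(r"[Ee]xit code: (\d+)")

def extract_exit_code(log_lines):
    """Return the last 'exit code' value seen in the log, or None if absent."""
    exit_code = None
    for line in log_lines:
        match = EXIT_CODE_RE.search(line)
        if match:
            exit_code = int(match.group(1))  # a later match overwrites an earlier one
    return exit_code

# Made-up sample lines: an executor line and a driver line both mention an
# exit code, so the value that wins depends purely on ordering in the log.
sample = [
    "... executor 1 terminated, Exit code: 1",
    "... driver container final state: Exit code: 0",
]
print(extract_exit_code(sample))  # -> 0

This is why a scan like this can latch onto an exit code that does not belong to the driver.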

@bin-lian
Contributor Author

bin-lian commented Feb 18, 2025

If the return code is 1, then the spark-submit subprocess exited abnormally.
(screenshot: worker log)
First, I am using the Kubernetes Executor with Spark client mode. This is the worker log. An exit code is displayed in the log, but it does not mean that the Spark program really exited with an error.

@bin-lian
Contributor Author

Cluster mode is similar. The spark-submit process is a child process, and an OOM exit code may appear occasionally.

@nevcohen
Contributor

Cluster mode is similar. The spark-submit process is a child process, and an OOM exit code may appear occasionally.

The exit code in your logs belongs to the executor, not the driver.

@nevcohen
Contributor

If the return code is 1, then the spark-submit subprocess exited abnormally.
First, I am using the Kubernetes Executor with Spark client mode. This is the worker log. An exit code is displayed in the log, but it does not mean that the Spark program really exited with an error.

Do you run the Spark driver on the worker itself?

@bin-lian
Contributor Author

bin-lian commented Feb 21, 2025

I did some tests here (Spark on Kubernetes).
Cluster mode: we get a returncode of 0, but the Spark K8s driver exit code is 1.

import subprocess

if __name__ == '__main__':
    spark_submit_cmd = ["/usr/hdp/3.1.5.0-152/spark3/bin/spark-submit",
                        "--master", "k8s://https://kubernetes.default.svc.cluster.local" ,
                        "--deploy-mode", "cluster" ,
                        "--conf" ,"spark.executor.memoryOverhead=1g" ,
                        "--conf" ,"spark.kubernetes.container.image=patsnap-us.tencentcloudcr.com/data/spark:v3.2.4.16" ,
                        "--conf" ,"spark.kubernetes.authenticate.driver.serviceAccountName=spark" ,
                        "--conf" ,"spark.kubernetes.file.upload.path=s3a://testpatsnapus-1251949819/patent2/spark/upload" ,
                        "--conf" ,"spark.hadoop.fs.s3a.endpoint=http://na-ashburn.lan.s3-proxy.info",
                        "--conf" ,"spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" ,
                        "--conf" ,"spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" ,
                        "--conf" ,"spark.hadoop.fs.s3a.fast.upload=true",
                        "--conf" ,"spark.hadoop.fs.s3a.access.key=***" ,
                        "--conf" ,"spark.hadoop.fs.s3a.secret.key='***'" ,
                        "--conf", "spark.hadoop.fs.ofs.user.appid=1250000000",
                        "--conf", "spark.hadoop.fs.ofs.tmp.cache.dir=/tmp/hadoop_cos",
                        "--conf" ,"spark.hadoop.fs.cosn.credentials.provider=org.apache.hadoop.fs.auth.SimpleCredentialProvider",
                        "--conf", "spark.hadoop.fs.cosn.impl=org.apache.hadoop.fs.CosFileSystem",
                        "--conf", "spark.hadoop.fs.cosn.bucket.region=na-ashburn",
                        "--conf", "spark.hadoop.fs.cosn.bucket.endpoint_suffix=cos.na-ashburn.myqcloud.com" ,
                        "--conf", "spark.hadoop.fs.cosn.userinfo.secretId=***",
                        "--conf", "spark.hadoop.fs.cosn.userinfo.secretKey=***",
                        "--conf", "spark.hadoop.fs.AbstractFileSystem.cosn.impl=org.apache.hadoop.fs.CosN",
                        "--conf", "spark.kubernetes.driver.podTemplateFile=/usr/hdp/3.1.5.0-152/spark3/kubernetes/template/driver-template.yml",
                        "--conf", "spark.kubernetes.executor.podTemplateFile=/usr/hdp/3.1.5.0-152/spark3/kubernetes/template/executor-template.yml",
                        "--conf", "spark.history.fs.logDirectory=cosn://testpatsnapus-1251949819/patent2/spark3/share_log/spark_history_server/",
                        "--conf", "spark.eventLog.dir=cosn://testpatsnapus-1251949819/patent2/spark3/share_log/spark_history_server/",
                        "--conf", "spark.memory.fraction=0.1",
                        "--conf" ,"spark.eventLog.enabled=True" ,
                        "--conf" ,"spark.kubernetes.namespace=dm-poc" ,
                        "--num-executors" ,"2" ,
                        "--executor-cores" ,"10" ,
                        "--executor-memory" ,"512m" ,
                        "--driver-memory" ,"512m" ,
                        "--name" ,"test_code",
                        "--class","org.apache.spark.examples.SparkPi" ,
                        "/usr/hdp/3.1.5.0-152/spark3/examples/jars/spark-examples_2.12-3.2.4.jar" ,
                        "100000000"]
    _submit_sp = subprocess.Popen(
        spark_submit_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        bufsize=-1,
        universal_newlines=True,
    )

    # Stream the spark-submit output, then capture the subprocess return code.
    for line in _submit_sp.stdout:
        print(line)

    returncode = _submit_sp.wait()
    print(returncode)


(screenshot: output of the cluster-mode test run)

Client mode: when the driver program runs on the worker, an exit code appears in the log, but it cannot be used to judge the program status; the return code can be used directly instead.
The same SparkPi program shows an exit code in its log, yet the final calculation is successful.
(screenshots: client-mode worker logs)

I made the corresponding adjustments to the code.
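
For reference, here is a minimal sketch of the decision logic described above; spark_app_failed is a hypothetical helper name for illustration, not the actual SparkSubmitHook code:

def spark_app_failed(returncode, deploy_mode, driver_exit_code):
    """Hypothetical helper mirroring the behaviour discussed in this thread.

    returncode       -- exit status of the spark-submit subprocess
    deploy_mode      -- "cluster" or "client" (Spark on Kubernetes)
    driver_exit_code -- exit code parsed from the driver logs, or None
    """
    if returncode != 0:
        # spark-submit itself exited abnormally: always a failure.
        return True
    if deploy_mode == "cluster":
        # In cluster mode the subprocess can return 0 even though the driver
        # failed, so the exit code parsed from the logs is the signal to use.
        return driver_exit_code is not None and driver_exit_code != 0
    # In client mode the driver runs inside spark-submit, so the subprocess
    # return code already reflects the application's final status.
    return False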

@nevcohen
Contributor

I did some tests here (Spark on Kubernetes).
Cluster mode: we get a returncode of 0, but the Spark K8s driver exit code is 1.


Excellent! Now that makes sense! Looks great!

Contributor

@bugraoz93 bugraoz93 left a comment

Looks good! Thanks for the changes, and thanks to everyone for testing and getting involved!
Could you please also add/update the unit test for this new case?
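
For example, a pytest-style sketch for the new case might look like this; it exercises the hypothetical spark_app_failed helper from the earlier sketch rather than the real SparkSubmitHook:

import pytest

# Assumes spark_app_failed from the earlier illustrative sketch is in scope.
@pytest.mark.parametrize(
    "returncode, deploy_mode, driver_exit_code, expected",
    [
        (0, "cluster", 1, True),    # subprocess ok, driver failed -> task should fail
        (0, "cluster", 0, False),   # both ok -> task succeeds
        (0, "client", 1, False),    # client mode ignores exit codes found in logs
        (1, "client", None, True),  # spark-submit itself failed -> task fails
    ],
)
def test_spark_app_failed(returncode, deploy_mode, driver_exit_code, expected):
    assert spark_app_failed(returncode, deploy_mode, driver_exit_code) == expected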

Contributor

@nevcohen nevcohen left a comment

On second thought, I think it's better to approach this from another direction: instead of changing the if as you did, it's better to add an if here.

Only in cluster mode should the exit code be processed from the logs.
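
In other words, something along these lines (a sketch only, reusing the illustrative extract_exit_code helper from the earlier example rather than the hook's real internals):

def parse_driver_exit_code(deploy_mode, log_lines):
    # Only cluster mode trusts an "Exit code: N" value found in the
    # spark-submit output; client mode relies on the subprocess return code.
    if deploy_mode == "cluster":
        return extract_exit_code(log_lines)  # illustrative helper defined above
    return None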

@bin-lian
Contributor Author

OK, I'll make the adjustment and add/update a unit test accordingly.

@bin-lian
Contributor Author

I have made the change.

Contributor

@nevcohen nevcohen left a comment

Now that's great!

Contributor

@bugraoz93 bugraoz93 left a comment

This looks a lot cleaner! Thanks for including the tests!

@potiuk potiuk merged commit 8d0895b into apache:main Feb 24, 2025
61 checks passed
@boring-cyborg

boring-cyborg bot commented Feb 24, 2025

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

pull bot pushed a commit to aliavni/airflow that referenced this pull request Feb 24, 2025
* spark on kubernetes removes dependency on Spark Exit code

* spark on kubernetes removes dependency on Spark Exit code

* spark on kubernetes removes dependency on Spark Exit code

* spark on kubernetes client mode remove dependency on Spark Exit code

* spark on kubernetes client mode remove dependency on Spark Exit code

* spark on kubernetes client mode remove dependency on Spark Exit code

* spark on kubernetes client mode remove dependency on Spark Exit code

---------

Co-authored-by: Bin Lian <lianbin@patsnap.com>
@bin-lian bin-lian deleted the fix-spark-submit-kubernetes-exit-code branch February 25, 2025 02:12
@bin-lian
Copy link
Contributor Author

bin-lian commented Feb 25, 2025

Thanks everyone for the assistance!

@bugraoz93
Contributor

Congrats @bin-lian!

insomnes pushed a commit to insomnes/airflow that referenced this pull request Feb 26, 2025
potiuk pushed a commit that referenced this pull request Feb 26, 2025
potiuk pushed a commit that referenced this pull request Feb 26, 2025
potiuk pushed a commit that referenced this pull request Feb 26, 2025
insomnes pushed a commit to insomnes/airflow that referenced this pull request Feb 27, 2025
ambika-garg pushed a commit to ambika-garg/airflow that referenced this pull request Feb 28, 2025
ambika-garg pushed a commit to ambika-garg/airflow that referenced this pull request Feb 28, 2025
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025