Conversation

HyukjinKwon (Member) commented on Jan 7, 2023

What changes were proposed in this pull request?

This PR proposes a local mode for Spark Connect: when the remote address is a `local*` master string, it automatically starts a Spark session that launches the Spark Connect server. This introduces the two user-facing changes below.

Note that local mode follows the regular PySpark session's stop behavior exactly by terminating the server (whereas non-local mode closes neither the server nor other sessions). See also the newly added comments on pyspark.sql.connect.SparkSession.stop.
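A minimal sketch of that stop behavior, based on the description above (the session usage itself is ordinary PySpark; only the local-mode server shutdown is new):

```python
from pyspark.sql import SparkSession

# Local mode: a remote "local[*]" session launches the Spark Connect
# server behind the scenes.
spark = SparkSession.builder.remote("local[*]").getOrCreate()
spark.range(10).count()

# In local mode, stop() also terminates the launched Spark Connect
# server, matching the regular PySpark session's stop behavior.
spark.stop()
```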

Local build of Apache Spark (for developers)

Automatically finds the jars for Spark Connect (because the jars for Spark Connect are not bundled in the regular Apache Spark release).

  • PySpark shell

    pyspark --remote local
  • PySpark application submission

    spark-submit --remote "local[4]" app.py
  • Use it as a Python library

    from pyspark.sql import SparkSession
    SparkSession.builder.remote("local[*]").getOrCreate()

Official release of Apache Spark (for end-users)

Users must specify the jars or packages explicitly; they are not searched for automatically.

  • PySpark shell

    pyspark --packages org.apache.spark:spark-connect_2.12:3.4.0 --remote local
  • PySpark application submission

    spark-submit --packages org.apache.spark:spark-connect_2.12:3.4.0 --remote "local[4]" app.py
  • Use it as a Python library

    from pyspark.sql import SparkSession
    SparkSession.builder.config(
        "spark.jars.packages", "org.apache.spark:spark-connect_2.12:3.4.0"
    ).remote("local[*]").getOrCreate()

Why are the changes needed?

To provide an easier way to try Spark Connect for both developers and end users.

Does this PR introduce any user-facing change?

No for end users, because Spark Connect has not been released yet.
For developers, yes; see the examples above.

How was this patch tested?

Unit tests were refactored to use and test this feature (which also deduplicated the code).

@HyukjinKwon HyukjinKwon marked this pull request as draft January 7, 2023 06:28
@HyukjinKwon HyukjinKwon force-pushed the SPARK-41933 branch 2 times, most recently from 5943010 to b1de5cc Compare January 7, 2023 07:16
HyukjinKwon (Member, Author) commented:
Got this logic from my own (unmerged) PR from a long time ago: #19643.
This code path will only be used in development, and cannot be used in production. I will clean it up with some docs later.

Contributor commented:

Question though:

Does this util help if we want to support the application_jar in spark-submit if we enable Connect there?

HyukjinKwon (Member, Author) commented on Jan 7, 2023:

Regardless, this logic would have to stay here because we automatically search for the jars on the server side (dev mode only). The jars are passed to the server side (the remote side doesn't need them).
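A rough sketch of what that dev-mode jar search could look like (the function name and lookup logic are assumptions for illustration, not the PR's actual code; the assembly-jar path matches the one used in the manual tests quoted later in this thread):

```python
import glob
import os

def find_connect_assembly_jar(spark_home):
    """Hypothetical helper: locate the Spark Connect assembly jar in a
    local development build of Spark."""
    pattern = os.path.join(
        spark_home, "connector", "connect", "server", "target",
        "scala-*", "spark-connect-assembly-*.jar",
    )
    jars = glob.glob(pattern)
    if not jars:
        raise RuntimeError(
            "Spark Connect assembly jar not found; build Spark with the "
            "connect module first.")
    return jars[0]
```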

Contributor commented:

Ack. This class is private to the sql package, so defining this function seems safe for internal use only (`private[sql] object PythonSQLUtils`).

HyukjinKwon (Member, Author) commented:

This wasn't actually testing Spark Connect (because `self.df` was created from the regular Spark session).

@HyukjinKwon HyukjinKwon force-pushed the SPARK-41933 branch 6 times, most recently from 889d31a to 2f31685 Compare January 7, 2023 10:41
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-41933][CONNECT] Provide local mode that automatically starts the server [SPARK-41933][CONNECT] Provide local mode that automatically starts the server Jan 7, 2023
@HyukjinKwon HyukjinKwon marked this pull request as ready for review January 7, 2023 10:56
@HyukjinKwon HyukjinKwon force-pushed the SPARK-41933 branch 2 times, most recently from a295e33 to 7241127 Compare January 7, 2023 11:21
HyukjinKwon (Member, Author) commented:

I piggybacked this fix so that garbage-collected instances close the connection, avoiding a resource leak.
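A minimal sketch of that close-on-garbage-collection pattern, assuming a wrapper around the client's connection (the class and attribute names are illustrative, not the PR's actual code):

```python
class ConnectionHolder:
    """Hypothetical wrapper illustrating best-effort cleanup on GC."""

    def __init__(self, channel):
        self._channel = channel

    def __del__(self):
        # Best effort: close the underlying connection when this
        # instance is garbage-collected to avoid leaking it.
        try:
            self._channel.close()
        except Exception:
            pass  # never raise from a finalizer
```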

Contributor commented:

Question: in the case of an exception, might it still leak the resource?

HyukjinKwon (Member, Author) commented:

Yup, it's just best-effort for now.

Contributor commented:

I see, thanks for the clarification.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-41933 branch 2 times, most recently from 0bcee0c to e655299 Compare January 8, 2023 01:35
# Remove the "--remote" option if specified, and use plain arguments.
# NOTE that this is not used in regular PySpark application
# submission because the JVM at this point is already running.
os.environ["PYSPARK_SUBMIT_ARGS"] = '"--name" "PySparkShell" "pyspark-shell"'
HyukjinKwon (Member, Author) commented:

I think I should take another look to make sure all extra arguments are passed through here (although Spark Connect doesn't support all of them ...). Let me revisit this after merging this PR.

HyukjinKwon (Member, Author) commented:

All related tests passed.

Merged to master.

HyukjinKwon added a commit that referenced this pull request Jan 9, 2023
### What changes were proposed in this pull request?

This PR mainly proposes to pass user-specified configurations to the local remote mode.

Previously, all user-specified configurations were ignored in the PySpark shell (`./bin/pyspark`) and the plain Python interpreter; the PySpark application submission case was fine.

Now, configurations are properly passed to the server side; for example, with `./bin/pyspark --remote local --conf aaa=bbb`, `aaa=bbb` is properly passed to the server.

For `spark.master` and `spark.plugins`, user-specified configurations are respected. If they are unset, they are set automatically (e.g., to `org.apache.spark.sql.connect.SparkConnectPlugin`). If users set them, they have to provide the proper values to overwrite the defaults, meaning either (see the sketch after these examples):

```bash
./bin/pyspark --remote local --conf spark.plugins="other.Plugin,org.apache.spark.sql.connect.SparkConnectPlugin"
```

or

```bash
./bin/pyspark --remote local
```
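A small sketch of that respect-user-values-or-set-defaults rule as described above (the helper and dictionary are illustrative; the actual logic lives in the launcher/submit code):

```python
CONNECT_PLUGIN = "org.apache.spark.sql.connect.SparkConnectPlugin"

def effective_conf(user_conf):
    """Hypothetical helper mirroring the rule above: fill in
    spark.master/spark.plugins only when the user did not set them."""
    conf = dict(user_conf)
    # The concrete default master value here is an assumption; the PR
    # derives it from the --remote local string.
    conf.setdefault("spark.master", "local[*]")
    # If the user set spark.plugins, it is respected as-is; they must
    # include the Connect plugin themselves to keep Connect enabled.
    conf.setdefault("spark.plugins", CONNECT_PLUGIN)
    return conf
```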

In addition, this PR fixes the related code as below (a sketch of the validation follows this list):
- Adds a `spark.local.connect` internal configuration to be used in Spark Submit (so we don't have to parse and manipulate user-specified arguments in Python in order to remove `--remote` or the `spark.remote` configuration).
- Adds more validation of arguments in `SparkSubmitCommandBuilder` so that invalid combinations fail fast (e.g., setting both remote and master, such as `--master ...` together with `--conf spark.remote=...`).
- In dev mode, no longer sets `spark.jars`, since the jars are already added to the JVM's class path through `addJarToCurrentClassLoader`.
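A hedged sketch of that fail-fast validation (one plausible reading of the rule; the real check is in `SparkSubmitCommandBuilder`, in Java, and the exact allowed combinations may differ):

```python
def validate_remote_and_master(master, remote):
    """Hypothetical fail-fast check for conflicting --master/--remote.
    The manual tests below combine --master with a *local* --remote,
    so only non-local conflicts are rejected in this reading."""
    if (master is not None and remote is not None
            and not remote.startswith("local")):
        raise ValueError(
            "Setting both --master and a non-local --remote is invalid; "
            "specify only one.")
```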

### Why are the changes needed?

To correctly pass the configurations specified by users.

### Does this PR introduce _any_ user-facing change?

No, Spark Connect has not been released yet.
This is a kind of follow-up to #39441 to complete its support.

### How was this patch tested?

Manually tested all combinations such as:

```bash
./bin/pyspark --conf spark.remote=local
./bin/pyspark --conf spark.remote=local --conf spark.jars=a
./bin/pyspark --conf spark.remote=local --jars /.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar
./bin/spark-submit --conf spark.remote=local --jars /.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar app.py
./bin/pyspark --conf spark.remote=local --conf spark.jars=/.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar
./bin/pyspark --master "local[*]" --remote "local"
./bin/spark-submit --conf spark.remote=local app.py
./bin/spark-submit --master="local[*]" --conf spark.remote=local app.py
./bin/spark-submit --master="local[*]" --remote=local app.py
./bin/pyspark --master "local[*]" --conf spark.remote="local"
./bin/pyspark --remote local
```

Closes #39463 from HyukjinKwon/SPARK-41933-conf.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 22, 2023
### What changes were proposed in this pull request?

This PR follows up #39441 to fix a wrong error message.

### Why are the changes needed?

Error message correction.

### Does this PR introduce _any_ user-facing change?

No; it only touches an error message.

### How was this patch tested?

The existing CI should pass

Closes #40112 from itholic/SPARK-41933-followup.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Feb 22, 2023

(Same commit message as above; cherry picked from commit 58efc4b and signed off by Hyukjin Kwon.)

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023

(Same cherry-picked commit, mirrored in a fork.)
@HyukjinKwon HyukjinKwon deleted the SPARK-41933 branch January 15, 2024 00:52