
Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR mainly proposes to pass the user-specified configurations to local remote mode.

Previously, all user-specified configurations were ignored in the PySpark shell case, such as `./bin/pyspark` or a plain Python interpreter; the PySpark application submission case was fine.

Now, configurations are properly passed to the server side; e.g., with `./bin/pyspark --remote local --conf aaa=bbb`, `aaa=bbb` reaches the server.
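
As a quick smoke test (my addition, not part of the PR), the conf should now be visible from inside the shell; whether an arbitrary key like `aaa` is exposed through the runtime config is an assumption here:

```python
# Inside a shell started with: ./bin/pyspark --remote local --conf aaa=bbb
# Hedged check: assumes an arbitrary key set via --conf is readable
# through the runtime config once it reaches the server side.
spark.conf.get("aaa")  # expected to return 'bbb' after this PR
```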

For `spark.master` and `spark.plugins`, user-specified configurations are respected. If they are unset, they are set automatically (e.g., `spark.plugins` to `org.apache.spark.sql.connect.SparkConnectPlugin`). If users set them, they have to provide the proper values themselves to overwrite the defaults, meaning either:

```bash
./bin/pyspark --remote local --conf spark.plugins="other.Plugin,org.apache.spark.sql.connect.SparkConnectPlugin"
```

or

```bash
./bin/pyspark --remote local
```
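
To make the default-vs-overwrite rule concrete, here is a minimal Python sketch of the behavior described above. It is illustrative only (the actual logic lives in Spark's launcher/server code), and the function name, the dict-based conf, and `local[*]` as the default master are assumptions for this sketch:

```python
# Illustrative sketch only; not Spark's actual code. Models how
# spark.master and spark.plugins are respected when user-set and
# defaulted only when absent.
CONNECT_PLUGIN = "org.apache.spark.sql.connect.SparkConnectPlugin"

def apply_local_remote_defaults(user_conf: dict) -> dict:
    conf = dict(user_conf)
    # Defaulted only when the user did not set them; a user-supplied
    # spark.plugins must itself include the Connect plugin (see the
    # two commands above). 'local[*]' is an assumed default here.
    conf.setdefault("spark.master", "local[*]")
    conf.setdefault("spark.plugins", CONNECT_PLUGIN)
    return conf

print(apply_local_remote_defaults({"aaa": "bbb"}))
```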

In addition, this PR fixes related code as follows:

  • Adds a `spark.local.connect` internal configuration to be used in Spark Submit (so we don't have to parse and manipulate user-specified arguments in Python to strip `--remote` or the `spark.remote` configuration).
  • Adds more validation of arguments in `SparkSubmitCommandBuilder` so invalid combinations fail fast, e.g., setting both remote and master via `--master ...` and `--conf spark.remote=...` (a sketch of this check follows the list).
  • In dev mode, no longer sets `spark.jars`, since `addJarToCurrentClassLoader` already adds the jars to the JVM's class path.
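
The fail-fast check can be pictured with a short Python analogue. The real validation is in the Java `SparkSubmitCommandBuilder`; this function and its names are mine, not Spark's:

```python
# Python analogue of the fail-fast argument validation; illustrative
# only -- the real check lives in the Java SparkSubmitCommandBuilder.
def validate_remote_and_master(master: str | None, remote: str | None) -> None:
    # `master` may come from --master or --conf spark.master=...;
    # `remote` from --remote or --conf spark.remote=...
    if master is not None and remote is not None:
        raise ValueError(
            "Remote cannot be specified with master; set only one of "
            "--remote/spark.remote or --master/spark.master."
        )

validate_remote_and_master("local[*]", None)       # OK
# validate_remote_and_master("local[*]", "local")  # fails fast
```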

Why are the changes needed?

To correctly pass the configurations specified by users.

Does this PR introduce any user-facing change?

No, Spark Connect has not been released yet.
This is a followup of #39441 to complete its support.

How was this patch tested?

Manually tested combinations such as:

```bash
./bin/pyspark --conf spark.remote=local
./bin/pyspark --conf spark.remote=local --conf spark.jars=a
./bin/pyspark --conf spark.remote=local --jars /.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar
./bin/spark-submit --conf spark.remote=local --jars /.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar app.py
./bin/pyspark --conf spark.remote=local --conf spark.jars=/.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar
./bin/pyspark --master "local[*]" --remote "local"
./bin/spark-submit --conf spark.remote=local app.py
./bin/spark-submit --master="local[*]" --conf spark.remote=local app.py
./bin/spark-submit --master="local[*]" --remote=local app.py
./bin/pyspark --master "local[*]" --conf spark.remote="local"
./bin/pyspark --remote local
```

@HyukjinKwon
Member Author

cc @zhengruifeng @amaliujia FYI

@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon deleted the SPARK-41933-conf branch January 15, 2024 00:52
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
Closes apache#39463 from HyukjinKwon/SPARK-41933-conf.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

(cherry picked from commit 55fe522)