[SPARK-4924] Add a library for launching Spark jobs programmatically. #3916
Conversation
This change encapsulates all of the logic involved in launching a Spark job into a small Java library that can be easily embedded into other applications. Only the `SparkLauncher` class is meant to be public in the new launcher lib, but some visibility modifiers were relaxed so that other parts of Spark can use these classes; that also lets unit tests automate checks that previously relied on an easily missed comment. A subsequent commit will change Spark core and all the shell scripts to use this library instead of custom code that needs to be replicated for different OSes and, sometimes, also within Spark code.
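For context, a minimal sketch of how an embedding application might use the new `SparkLauncher` class. The Spark home, application jar, main class, and config values below are hypothetical, and the builder-style calls reflect the intended usage described in this PR rather than a definitive API reference:

```java
import org.apache.spark.launcher.SparkLauncher;

public class LauncherExample {
  public static void main(String[] args) throws Exception {
    // Build the spark-submit invocation programmatically instead of
    // shelling out to bin/spark-submit by hand.
    Process spark = new SparkLauncher()
        .setSparkHome("/opt/spark")                 // hypothetical install path
        .setAppResource("/path/to/my-app.jar")      // hypothetical application jar
        .setMainClass("com.example.MyApp")          // hypothetical main class
        .setMaster("local[2]")
        .setConf("spark.driver.memory", "1g")
        .addAppArgs("arg1", "arg2")
        .launch();                                  // returns a java.lang.Process

    int exitCode = spark.waitFor();
    System.out.println("Spark application finished with exit code " + exitCode);
  }
}
```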
Change the existing scripts under bin/ to use the launcher library, to avoid code duplication and to reduce the coupling between the scripts and Spark code. Also change some Spark core code to use the library instead of relying on scripts (either by calling them or via comments saying they should be kept in sync). While the library is now included in the assembly (by way of the spark-core dependency), it is still packaged directly into the final lib/ directory, because loading a small jar is much faster than loading the huge assembly jar, which noticeably improves the start-up time of Spark jobs.
Use a common base class to parse SparkSubmit command line arguments. This forces anyone who wants to add new arguments to modify the shared parser, updating all code that needs to know about SparkSubmit options in the process. Also create some constants to avoid copy & pasting strings around to actually process the options.
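As a rough illustration of that idea (the class and method names below are hypothetical, not the actual parser added by this PR), a shared base class can own both the option-name constants and the parsing loop, while each caller plugs in its own handling:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch of a shared spark-submit option parser: the base class
 * owns the option names and the parsing loop, and every code path that needs
 * to understand spark-submit options subclasses it instead of re-implementing
 * the string handling.
 */
abstract class SubmitOptionParser {

  // Constants instead of copy & pasted option strings.
  static final String MASTER = "--master";
  static final String DEPLOY_MODE = "--deploy-mode";
  static final String CLASS = "--class";
  static final String CONF = "--conf";

  /** Called once per recognized option; return false to stop parsing. */
  protected abstract boolean handle(String opt, String value);

  /** Called with the first non-option argument and everything after it. */
  protected abstract void handleExtraArgs(List<String> extra);

  void parse(List<String> args) {
    for (int i = 0; i < args.size(); i++) {
      String arg = args.get(i);
      if (arg.startsWith("--")) {
        String value = (i + 1 < args.size()) ? args.get(++i) : null;
        if (!handle(arg, value)) {
          return;
        }
      } else {
        handleExtraArgs(args.subList(i, args.size()));
        return;
      }
    }
  }
}

/** Example subclass that just prints what it sees. */
class PrintingParser extends SubmitOptionParser {
  @Override protected boolean handle(String opt, String value) {
    System.out.println(opt + " = " + value);
    return true;
  }

  @Override protected void handleExtraArgs(List<String> extra) {
    System.out.println("app args: " + extra);
  }

  public static void main(String[] args) {
    new PrintingParser().parse(
        Arrays.asList(MASTER, "yarn", CLASS, "com.example.MyApp", "app.jar", "arg1"));
  }
}
```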
For new-style launchers, do the launching using SparkSubmit; hopefully this will be the preferred method of launching new daemons (if any). Currently it handles the thrift server daemon.
pyspark (at least) relies on SPARK_HOME (the env variable) to be set for things to work properly. The launcher wasn't making sure that variable was set in all cases, so do that. Also, separately, the Yarn backend didn't seem to propagate that variable to the AM for some reason, so do that too. (Not sure how things worked previously...) Extra: add ".pyo" files to .gitignore (these are generated by `python -O`).
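A hedged sketch of the kind of fix described above (not the actual launcher code; the helper name and paths are illustrative): a launcher can guarantee the variable is present in the child process environment before spawning it.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class EnsureSparkHome {

  /**
   * Illustrative helper: start a child process and make sure SPARK_HOME is
   * set in its environment even when the parent process did not export it.
   */
  static Process startWithSparkHome(List<String> command, String defaultSparkHome)
      throws IOException {
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.environment().putIfAbsent("SPARK_HOME", defaultSparkHome);
    return pb.inheritIO().start();
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical usage: run a trivial command that echoes the variable back.
    Process p = startWithSparkHome(
        Arrays.asList("printenv", "SPARK_HOME"),
        new File("/opt/spark").getAbsolutePath());
    System.exit(p.waitFor());
  }
}
```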
Test build #25118 has started for PR 3916 at commit
A few pre-emptive comments:
The yarn and standalone tests used our internal test harness, which was unmodified, so this shows the new scripts are backwards compatible with the old ones (to the extent our tests exercise them).
Test build #25118 has finished for PR 3916 at commit
Test FAILed.
Test build #25138 has started for PR 3916 at commit
Test build #25138 has finished for PR 3916 at commit
Test PASSed.
Conflicts: bin/spark-submit, bin/spark-submit2.cmd, yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
Test build #25265 has started for PR 3916 at commit
Test build #25265 has finished for PR 3916 at commit
Test FAILed.
Test build #25271 has started for PR 3916 at commit
Test build #25271 has finished for PR 3916 at commit
Test PASSed.
Conflicts: bin/compute-classpath.cmd, bin/compute-classpath.sh, make-distribution.sh
Also some minor tweaks for the maven build.
The issue is that SparkConf is not thread-safe; so it was possible for the executor thread to try to read the configuration while the context thread was modifying it. In my tests this caused the executor to consistently miss the "spark.driver.port" config and fail tests. Long term, it would probably be better to investigate using a concurrent map implementation in SparkConf (instead of a HashMap).
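To make the race concrete, here is a small standalone sketch (plain Java, not SparkConf itself) contrasting an unsynchronized HashMap with a concurrent map for the settings; the config key is the one mentioned above, the port value is made up:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConfMapSketch {

  // Reading a HashMap from one thread while another thread writes to it is a
  // data race: a reader may miss a freshly added entry (e.g. "spark.driver.port")
  // or even observe a corrupted internal table.
  static final Map<String, String> unsafeSettings = new HashMap<>();

  // A ConcurrentHashMap tolerates concurrent readers and writers: a reader sees
  // either "no value yet" or the complete new value, never a broken map.
  static final Map<String, String> safeSettings = new ConcurrentHashMap<>();

  public static void main(String[] args) throws InterruptedException {
    Thread writer = new Thread(() ->
        safeSettings.put("spark.driver.port", "50123"));   // made-up port value
    Thread reader = new Thread(() ->
        System.out.println("spark.driver.port = " + safeSettings.get("spark.driver.port")));

    writer.start();
    reader.start();
    writer.join();
    reader.join();
  }
}
```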
Conflicts: bin/spark-submit, bin/spark-submit2.cmd
Test build #25350 has started for PR 3916 at commit
Test build #25351 has started for PR 3916 at commit
Test build #25351 has finished for PR 3916 at commit
@@ -121,45 +121,63 @@ if [ "$SPARK_NICENESS" = "" ]; then
    export SPARK_NICENESS=0
fi

run_command() {
  mode=$1

Review suggestion: quote the expansion, i.e. mode="$1"
Sorry to spam the PR with comments about Bash word splitting, @vanzin. I think what you have Bash-wise looks good, but as a matter of consistency I went through and marked the places where Bash word splitting bugs are theoretically possible. As a matter of habit and good style, I think we should always quote Bash output (whether of variables, subshells, etc.), though I can see how in some cases you might feel it's unnecessary.
Even in places that weren't really changes I made.
Test build #28407 has started for PR 3916 at commit
Test build #28404 has finished for PR 3916 at commit
Test PASSed.
Test build #28407 has finished for PR 3916 at commit
Test PASSed.
@andrewor14 are you good with this one? I'd like to merge it soon!
Yeah, this LGTM. There are a few other minor comments I left regarding visibility, but they don't have to block this PR from going in. I haven't tested whether all the cases (Windows, YARN, PySpark...) behave as before, and I won't have the bandwidth to do so this week, but we can always do that separately.
FYI I ran our test suite with this change (covers standalone/client, yarn/both, spark-shell and pyspark, Linux only) and all looks OK. I also did some manual testing on Windows.
Test build #28447 has started for PR 3916 at commit
Test build #28447 has finished for PR 3916 at commit
Test PASSed.
Okay cool - LGTM, I will pull this in. I just did some local sanity tests, built with Maven and ran (a few) Maven tests. We'll need to keep an eye on the Maven build tomorrow to see if it succeeds there, since it's not caught by the PRB. Thanks @vanzin for all the work on this.
    if [[ $i =~ $whitespace ]]; then i=\"$i\"; fi
    PYSPARK_SUBMIT_ARGS="$PYSPARK_SUBMIT_ARGS $i"
  done
  export PYSPARK_SUBMIT_ARGS
Hi @vanzin, I'm now adding the Kafka Python unit test, and it requires adding the Kafka assembly jar with the `--jars` argument. I see you have removed this code, so how should jars be added for a pyspark unit test now? Using `PYSPARK_SUBMIT_ARGS`? Thanks a lot.
It depends on how the unit tests are run. Are they run through spark-submit? Then you just pass `--jars` to spark-submit. If they're run by the code under the `$SPARK_TESTING` check in this file, then setting that env variable will probably work.
Thanks @vanzin for your reply. The code actually runs under the `$SPARK_TESTING` check, so I tried to use `PYSPARK_SUBMIT_ARGS` to pass the jar through the environment, like this:
export PYSPARK_SUBMIT_ARGS="pyspark-shell --jars ${KAFKA_ASSEMBLY_JAR}"
I'm not sure whether that's the right way, but it seems the jar doesn't get added to the SparkContext. Is there anything else I should take care of?
It looks like `pyspark-shell` has to be at the end of `PYSPARK_SUBMIT_ARGS`; I changed the ordering to `--jars ${KAFKA_ASSEMBLY_JAR} pyspark-shell` and it works now. I'm not sure whether that's the intended behavior or just a bug?
Actually, `pyspark-shell` shouldn't be in `PYSPARK_SUBMIT_ARGS` at all; it's handled internally by the launcher library, so adding it at the end probably only works out of luck :-). (We should probably file a bug to remove this testing stuff from `bin/pyspark`; that would allow other cleanups in these scripts.)
This change encapsulates all the logic involved in launching a Spark job into a small Java library that can be easily embedded into other applications. The overall goal of this change is twofold, as described in the bug:
- provide a supported way to launch Spark applications programmatically; this is a common request from users and currently there's no good answer for it;
- reduce the duplication and coupling between the different parts of Spark that deal with launching processes.
A lot of the duplication was due to the different code needed to build an application's classpath (and the bootstrapper needed to run the driver in certain situations), and to the different code needed to parse spark-submit command line options in different contexts. The change centralizes those as much as possible, so that all code paths can rely on the library for handling them appropriately.
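To make that centralization concrete, here is a hypothetical sketch of the kind of command builder such a library can provide; the class name, jar name, and directory layout below are illustrative, not the actual launcher classes:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of a centralized command builder: one place that knows
 * how to assemble the classpath and the final java command line, so shell
 * scripts and Spark code no longer each reimplement that logic.
 */
class CommandBuilderSketch {

  static List<String> buildJavaCommand(String sparkHome, String mainClass, List<String> appArgs) {
    // Classpath entries derived from the Spark installation (illustrative layout).
    List<String> classpath = Arrays.asList(
        new File(sparkHome, "conf").getAbsolutePath(),
        new File(sparkHome, "lib/spark-assembly.jar").getAbsolutePath());

    List<String> cmd = new ArrayList<>();
    cmd.add(new File(System.getProperty("java.home"), "bin/java").getAbsolutePath());
    cmd.add("-cp");
    cmd.add(String.join(File.pathSeparator, classpath));
    cmd.add(mainClass);
    cmd.addAll(appArgs);
    return cmd;
  }

  public static void main(String[] args) {
    System.out.println(buildJavaCommand("/opt/spark",
        "org.apache.spark.deploy.SparkSubmit",
        Arrays.asList("--master", "local[*]")));
  }
}
```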