
[SPARK-4924] Add a library for launching Spark jobs programmatically. #3916

Closed
wants to merge 65 commits

Conversation

vanzin
Contributor

@vanzin vanzin commented Jan 6, 2015

This change encapsulates all the logic involved in launching a Spark job
into a small Java library that can be easily embedded into other applications.

The overall goal of this change is twofold, as described in the bug:

  • Provide a public API for launching Spark processes. This is a common request
    from users and currently there's no good answer for it.
  • Remove a lot of the duplicated code and other coupling that exists in the
    different parts of Spark that deal with launching processes.

A lot of the duplication was due to different code needed to build an
application's classpath (and the bootstrapper needed to run the driver in
certain situations), and also different code needed to parse spark-submit
command line options in different contexts. The change centralizes those
as much as possible so that all code paths can rely on the library for
handling those appropriately.
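
To make the first goal concrete, here is a minimal sketch of how the library could be embedded in another application. The Spark home, jar path, class name, and arguments below are hypothetical, and the snippet only approximates the `SparkLauncher` builder API this change introduces:

import org.apache.spark.launcher.SparkLauncher;

public class LauncherExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths and class names, used only for illustration.
    Process spark = new SparkLauncher()
        .setSparkHome("/path/to/spark")
        .setAppResource("/path/to/my-app.jar")
        .setMainClass("com.example.MyApp")
        .setMaster("yarn-cluster")
        .setConf("spark.executor.memory", "2g")
        .addAppArgs("--input", "/data/input")
        .launch();

    // launch() returns a plain java.lang.Process; the embedding application
    // decides how to consume its output and handle its exit code.
    int exitCode = spark.waitFor();
    System.out.println("Spark application finished with exit code " + exitCode);
  }
}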

Marcelo Vanzin added 7 commits January 6, 2015 13:25
This change encapsulates all the logic involved in launching a Spark job
into a small Java library that can be easily embedded into other applications.

Only the `SparkLauncher` class is supposed to be public in the new launcher
lib, but to be able to use those classes elsewhere in Spark, some of the
visibility modifiers were relaxed. This also allows us to automate some checks
in unit tests, where before there was just a comment that was easily missed.

A subsequent commit will change Spark core and all the shell scripts to use
this library instead of custom code that needs to be replicated for different
OSes and, sometimes, also in Spark code.
Change the existing scripts under bin/ to use the launcher library,
to avoid code duplication and reduce the amount of coupling between scripts
and Spark code.

Also change some Spark core code to use the library instead of relying on
scripts (either by calling them or with comments saying they should be
kept in sync).

While the library is now included in the assembly (by means of the spark-core
dependency), it's still packaged directly into the final lib/ directory,
because loading a small jar is much faster than loading the huge assembly jar,
and that makes the startup time of Spark jobs much better.
Use a common base class to parse SparkSubmit command line arguments. This forces
anyone who wants to add new arguments to modify the shared parser, updating all
code that needs to know about SparkSubmit options in the process.

Also create some constants to avoid copy & pasting strings around to actually
process the options.
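
As a rough, hypothetical sketch of what centralizing those strings might look like (the class and constant names below are illustrative, not necessarily the ones added by this commit):

// Illustrative only: spark-submit option strings defined once and shared,
// instead of being copy & pasted as literals across the code base.
class SparkSubmitOptions {
  static final String CLASS = "--class";
  static final String MASTER = "--master";
  static final String DEPLOY_MODE = "--deploy-mode";
  static final String JARS = "--jars";

  static boolean isKnownOption(String arg) {
    return CLASS.equals(arg) || MASTER.equals(arg)
        || DEPLOY_MODE.equals(arg) || JARS.equals(arg);
  }
}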
For new-style launchers, do the launching using SparkSubmit; hopefully
this will be the preferred method of launching new daemons (if any).
Currently it handles the thrift server daemon.
pyspark (at least) relies on SPARK_HOME (the env variable) to be set
for things to work properly. The launcher wasn't making sure that
variable was set in all cases, so do that. Also, separately, the
Yarn backend didn't seem to propagate that variable to the AM for
some reason, so do that too. (Not sure how things worked previously...)

Extra: add ".pyo" files to .gitignore (these are generated by `python -O`).
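
A rough sketch (illustrative only, not the actual patch code) of the kind of fix described above, i.e. making sure a launched child process always sees SPARK_HOME:

import java.util.Map;

class EnvExample {
  // Illustrative only: if the caller's environment doesn't define SPARK_HOME,
  // inject a value resolved elsewhere so the child (e.g. pyspark) works.
  static void ensureSparkHome(ProcessBuilder pb, String resolvedSparkHome) {
    Map<String, String> env = pb.environment();
    if (!env.containsKey("SPARK_HOME")) {
      env.put("SPARK_HOME", resolvedSparkHome);
    }
  }
}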
@SparkQA

SparkQA commented Jan 6, 2015

Test build #25118 has started for PR 3916 at commit 656374e.

  • This patch merges cleanly.

@vanzin
Contributor Author

vanzin commented Jan 6, 2015

A few pre-emptive comments:

  • I understand this change is a little large. I'd recommend looking at separate commits if you feel overwhelmed. But I don't think breaking it down would make sense; GitHub is not friendly to "incremental" patches, so it would slow down review a lot.
  • I tested the following combinations:
    • the unit tests
    • spark-submit, spark-shell and pyspark on Linux (local, yarn client/cluster, standalone)
    • spark-submit, spark-shell and pyspark on Windows (local)

The yarn and standalone tests used our internal test harness, which was unmodified, so this shows the new scripts are backwards compatible with the old ones (to the extent our tests exercise them).

@SparkQA

SparkQA commented Jan 7, 2015

Test build #25118 has finished for PR 3916 at commit 656374e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25118/
Test FAILed.

@SparkQA

SparkQA commented Jan 7, 2015

Test build #25138 has started for PR 3916 at commit a7936ef.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 7, 2015

Test build #25138 has finished for PR 3916 at commit a7936ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25138/
Test PASSed.

Conflicts:
	bin/spark-submit
	bin/spark-submit2.cmd
	yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
@SparkQA

SparkQA commented Jan 8, 2015

Test build #25265 has started for PR 3916 at commit 53faef1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 8, 2015

Test build #25265 has finished for PR 3916 at commit 53faef1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25265/
Test FAILed.

@SparkQA

SparkQA commented Jan 8, 2015

Test build #25271 has started for PR 3916 at commit bb5d324.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 9, 2015

Test build #25271 has finished for PR 3916 at commit bb5d324.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25271/
Test PASSed.

Marcelo Vanzin added 4 commits January 9, 2015 10:10
Conflicts:
	bin/compute-classpath.cmd
	bin/compute-classpath.sh
	make-distribution.sh
Also some minor tweaks for the maven build.
The issue is that SparkConf is not thread-safe; so it was possible
for the executor thread to try to read the configuration while the
context thread was modifying it. In my tests this caused the executor
to consistently miss the "spark.driver.port" config and fail tests.

Long term, it would probably be better to investigate using a concurrent
map implementation in SparkConf (instead of a HashMap).
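
A minimal sketch of that longer-term direction (purely illustrative; this is not the actual SparkConf code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: a config holder backed by a concurrent map, so one thread
// (e.g. the context) can keep setting values while another (e.g. the executor)
// reads them without missing entries such as "spark.driver.port".
class ThreadSafeConf {
  private final Map<String, String> settings = new ConcurrentHashMap<>();

  void set(String key, String value) {
    settings.put(key, value);
  }

  String get(String key, String defaultValue) {
    return settings.getOrDefault(key, defaultValue);
  }
}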
Conflicts:
	bin/spark-submit
	bin/spark-submit2.cmd
@SparkQA

SparkQA commented Jan 10, 2015

Test build #25350 has started for PR 3916 at commit f26556b.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Jan 10, 2015

Test build #25351 has started for PR 3916 at commit fc6a3e2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 10, 2015

Test build #25351 has finished for PR 3916 at commit fc6a3e2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -121,45 +121,63 @@ if [ "$SPARK_NICENESS" = "" ]; then
export SPARK_NICENESS=0
fi

run_command() {
mode=$1
Contributor

mode="$1"

@nchammas
Contributor

nchammas commented Mar 9, 2015

Sorry to spam the PR with comments about Bash word splitting, @vanzin. I think what you have Bash-wise looks good, but as a matter of consistency I just went through and marked the places where Bash word splitting bugs are theoretically possible.

As a matter of habit and good style, I think we should always quote Bash output (whether of variables, subshells, etc.), though I can see that in some cases you may feel it's unnecessary.

This applies even to places that weren't really changed in this PR.
@SparkQA

SparkQA commented Mar 9, 2015

Test build #28407 has started for PR 3916 at commit 2ce741f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 9, 2015

Test build #28404 has finished for PR 3916 at commit 3b28a75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28404/
Test PASSed.

@SparkQA

SparkQA commented Mar 9, 2015

Test build #28407 has finished for PR 3916 at commit 2ce741f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28407/
Test PASSed.

@pwendell
Contributor

@andrewor14 are you good with this one? I'd like to merge it soon!

@andrewor14
Contributor

Yeah this LGTM. There are a few other minor comments I left regarding visibility but they don't have to block this PR from going in. I haven't tested whether all the cases (Windows, YARN, PySpark...) are maintained as before and I won't have the bandwidth to do so this week, but we can always do that separately.

@vanzin
Contributor Author

vanzin commented Mar 10, 2015

FYI I ran our test suite with this change (covers standalone/client, yarn/both, spark-shell and pyspark, Linux only) and all looks ok. Also did some manual testing on Windows.

@SparkQA

SparkQA commented Mar 10, 2015

Test build #28447 has started for PR 3916 at commit 18c7e4d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 11, 2015

Test build #28447 has finished for PR 3916 at commit 18c7e4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28447/
Test PASSed.

@pwendell
Contributor

Okay cool - LGTM, I will pull this in. I just did some local sanity tests, built with Maven and ran (a few) Maven tests. We'll need to keep an eye on the Maven build tomorrow to see if this is succeeding there, since it's not caught by the PRB. Thanks @vanzin for all the work on this.

@asfgit asfgit closed this in 517975d Mar 11, 2015
@vanzin vanzin deleted the SPARK-4924 branch March 11, 2015 16:07
if [[ $i =~ $whitespace ]]; then i=\"$i\"; fi
PYSPARK_SUBMIT_ARGS="$PYSPARK_SUBMIT_ARGS $i"
done
export PYSPARK_SUBMIT_ARGS
Contributor

Hi @vanzin, I'm now adding the Kafka Python unit test, and it requires adding the Kafka assembly jar with the --jars argument. I see you removed this code, so how should we now add jars to pyspark for unit tests — using PYSPARK_SUBMIT_ARGS? Thanks a lot.

Contributor Author

It depends on how the unit tests are run. Are they run through spark-submit? Then you just pass --jars to spark-submit.

If they're run by the code under the $SPARK_TESTING check in this file, then setting that env variable will probably work.

Contributor

Thanks @vanzin for your reply. The code actually runs under $SPARK_TESTING, and I tried to use PYSPARK_SUBMIT_ARGS to pass the jar, like this:

export PYSPARK_SUBMIT_ARGS="pyspark-shell --jars ${KAFKA_ASSEMBLY_JAR}"

I'm not sure if that's the right way, but it seems the jar is not added to the SparkContext. Is there anything else I should take care of?

Contributor

Looks like pyspark-shell has to be at the end of PYSPARK_SUBMIT_ARGS; I changed the ordering to --jars ${KAFKA_ASSEMBLY_JAR} pyspark-shell, and it works. I'm not sure whether that's the intended behavior or just a bug?

Contributor Author

Actually, pyspark-shell shouldn't be in PYSPARK_SUBMIT_ARGS at all. It probably works by luck when you add it at the end :-), since it's handled internally by the launcher library.

(We should probably file a bug to remove this testing stuff from bin/pyspark. That would allow other cleanups in these scripts.)
