[SPARK-4924] Add a library for launching Spark jobs programmatically. #3916
Conversation
This change encapsulates all of the logic involved in launching a Spark job into a small Java library that can be easily embedded into other applications. Only the `SparkLauncher` class is meant to be public in the new launcher lib, but some visibility modifiers were relaxed so that other parts of Spark can use these classes; that also lets unit tests automate checks that previously relied on an easily missed comment. A subsequent commit will change Spark core and all the shell scripts to use this library instead of custom code that needs to be replicated for different OSes and, sometimes, also within Spark code.
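For context, a minimal sketch of how an embedding application might use the new `SparkLauncher` class. The Spark home, application jar, main class, and config values below are hypothetical, and the builder-style calls reflect the intended usage described in this PR rather than a definitive API reference:

```java
import org.apache.spark.launcher.SparkLauncher;

public class LauncherExample {
  public static void main(String[] args) throws Exception {
    // Build the spark-submit invocation programmatically instead of
    // shelling out to bin/spark-submit by hand.
    Process spark = new SparkLauncher()
        .setSparkHome("/opt/spark")                 // hypothetical install path
        .setAppResource("/path/to/my-app.jar")      // hypothetical application jar
        .setMainClass("com.example.MyApp")          // hypothetical main class
        .setMaster("local[2]")
        .setConf("spark.driver.memory", "1g")
        .addAppArgs("arg1", "arg2")
        .launch();                                  // returns a java.lang.Process

    int exitCode = spark.waitFor();
    System.out.println("Spark application finished with exit code " + exitCode);
  }
}
```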
Change the existing scripts under bin/ to use the launcher library, to avoid code duplication and to reduce the coupling between the scripts and Spark code. Also change some Spark core code to use the library instead of relying on scripts (either by calling them or via comments saying they should be kept in sync). While the library is now included in the assembly (by way of the spark-core dependency), it is still packaged directly into the final lib/ directory, because loading a small jar is much faster than loading the huge assembly jar, which noticeably improves the start-up time of Spark jobs.
Use a common base class to parse SparkSubmit command line arguments. This forces anyone who wants to add new arguments to modify the shared parser, updating all code that needs to know about SparkSubmit options in the process. Also create some constants to avoid copy & pasting strings around to actually process the options.
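As a rough illustration of that idea (the class and method names below are hypothetical, not the actual parser added by this PR), a shared base class can own both the option-name constants and the parsing loop, while each caller plugs in its own handling:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch of a shared spark-submit option parser: the base class
 * owns the option names and the parsing loop, and every code path that needs
 * to understand spark-submit options subclasses it instead of re-implementing
 * the string handling.
 */
abstract class SubmitOptionParser {

  // Constants instead of copy & pasted option strings.
  static final String MASTER = "--master";
  static final String DEPLOY_MODE = "--deploy-mode";
  static final String CLASS = "--class";
  static final String CONF = "--conf";

  /** Called once per recognized option; return false to stop parsing. */
  protected abstract boolean handle(String opt, String value);

  /** Called with the first non-option argument and everything after it. */
  protected abstract void handleExtraArgs(List<String> extra);

  void parse(List<String> args) {
    for (int i = 0; i < args.size(); i++) {
      String arg = args.get(i);
      if (arg.startsWith("--")) {
        String value = (i + 1 < args.size()) ? args.get(++i) : null;
        if (!handle(arg, value)) {
          return;
        }
      } else {
        handleExtraArgs(args.subList(i, args.size()));
        return;
      }
    }
  }
}

/** Example subclass that just prints what it sees. */
class PrintingParser extends SubmitOptionParser {
  @Override protected boolean handle(String opt, String value) {
    System.out.println(opt + " = " + value);
    return true;
  }

  @Override protected void handleExtraArgs(List<String> extra) {
    System.out.println("app args: " + extra);
  }

  public static void main(String[] args) {
    new PrintingParser().parse(
        Arrays.asList(MASTER, "yarn", CLASS, "com.example.MyApp", "app.jar", "arg1"));
  }
}
```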
For new-style launchers, do the launching using SparkSubmit; hopefully this will be the preferred method of launching new daemons (if any). Currently it handles the thrift server daemon.
pyspark (at least) relies on SPARK_HOME (the env variable) to be set for things to work properly. The launcher wasn't making sure that variable was set in all cases, so do that. Also, separately, the Yarn backend didn't seem to propagate that variable to the AM for some reason, so do that too. (Not sure how things worked previously...) Extra: add ".pyo" files to .gitignore (these are generated by `python -O`).
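A hedged sketch of the kind of fix described above (not the actual launcher code; the helper name and paths are illustrative): a launcher can guarantee the variable is present in the child process environment before spawning it.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class EnsureSparkHome {

  /**
   * Illustrative helper: start a child process and make sure SPARK_HOME is
   * set in its environment even when the parent process did not export it.
   */
  static Process startWithSparkHome(List<String> command, String defaultSparkHome)
      throws IOException {
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.environment().putIfAbsent("SPARK_HOME", defaultSparkHome);
    return pb.inheritIO().start();
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical usage: run a trivial command that echoes the variable back.
    Process p = startWithSparkHome(
        Arrays.asList("printenv", "SPARK_HOME"),
        new File("/opt/spark").getAbsolutePath());
    System.exit(p.waitFor());
  }
}
```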
Test build #25118 has started for PR 3916 at commit
A few pre-emptive comments:
The yarn and standalone tests used our internal test harness, which was unmodified, so this shows the new scripts are backwards compatible with the old ones (to the extent our tests exercise them).
Test build #25118 has finished for PR 3916 at commit
Test FAILed.
Test build #25138 has started for PR 3916 at commit
Test build #25138 has finished for PR 3916 at commit
Test PASSed.
Conflicts: bin/spark-submit, bin/spark-submit2.cmd, yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala
Test build #25265 has started for PR 3916 at commit
Test build #25265 has finished for PR 3916 at commit
Test FAILed.
Test build #25271 has started for PR 3916 at commit
Test build #25271 has finished for PR 3916 at commit
Test PASSed.
Conflicts: bin/compute-classpath.cmd, bin/compute-classpath.sh, make-distribution.sh
Also some minor tweaks for the maven build.
The issue is that SparkConf is not thread-safe; so it was possible for the executor thread to try to read the configuration while the context thread was modifying it. In my tests this caused the executor to consistently miss the "spark.driver.port" config and fail tests. Long term, it would probably be better to investigate using a concurrent map implementation in SparkConf (instead of a HashMap).
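To make the race concrete, here is a small standalone sketch (plain Java, not SparkConf itself) contrasting an unsynchronized HashMap with a concurrent map for the settings; the config key is the one mentioned above, the port value is made up:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConfMapSketch {

  // Reading a HashMap from one thread while another thread writes to it is a
  // data race: a reader may miss a freshly added entry (e.g. "spark.driver.port")
  // or even observe a corrupted internal table.
  static final Map<String, String> unsafeSettings = new HashMap<>();

  // A ConcurrentHashMap tolerates concurrent readers and writers: a reader sees
  // either "no value yet" or the complete new value, never a broken map.
  static final Map<String, String> safeSettings = new ConcurrentHashMap<>();

  public static void main(String[] args) throws InterruptedException {
    Thread writer = new Thread(() ->
        safeSettings.put("spark.driver.port", "50123"));   // made-up port value
    Thread reader = new Thread(() ->
        System.out.println("spark.driver.port = " + safeSettings.get("spark.driver.port")));

    writer.start();
    reader.start();
    writer.join();
    reader.join();
  }
}
```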
Conflicts: bin/spark-submit, bin/spark-submit2.cmd
Test build #25350 has started for PR 3916 at commit
Test build #25351 has started for PR 3916 at commit
Test build #25351 has finished for PR 3916 at commit
@@ -121,45 +121,63 @@ if [ "$SPARK_NICENESS" = "" ]; then
    export SPARK_NICENESS=0
fi

run_command() {
  mode=$1

Review suggestion: quote the expansion, i.e. mode="$1"
Sorry to spam the PR with comments about Bash word splitting, @vanzin. I think what you have Bash-wise looks good, but as a matter of consistency I went through and marked the places where Bash word splitting bugs are theoretically possible. As a matter of habit and good style, I think we should always quote Bash output (whether of variables, subshells, etc.), though I can see how in some cases you might feel it's unnecessary.
Even in places that weren't really changes I made.
Test build #28407 has started for PR 3916 at commit
Test build #28404 has finished for PR 3916 at commit
Test PASSed.
Test build #28407 has finished for PR 3916 at commit
Test PASSed.
@andrewor14 are you good with this one? I'd like to merge it soon!
Yeah, this LGTM. There are a few other minor comments I left regarding visibility, but they don't have to block this PR from going in. I haven't tested whether all the cases (Windows, YARN, PySpark...) behave as before, and I won't have the bandwidth to do so this week, but we can always do that separately.
FYI I ran our test suite with this change (covers standalone/client, yarn/both, spark-shell and pyspark, Linux only) and all looks OK. I also did some manual testing on Windows.
Test build #28447 has started for PR 3916 at commit
Test build #28447 has finished for PR 3916 at commit
Test PASSed.
Okay cool - LGTM, I will pull this in. I just did some local sanity tests, built with Maven and ran (a few) Maven tests. We'll need to keep an eye on the Maven build tomorrow to see if it succeeds there, since it's not caught by the PRB. Thanks @vanzin for all the work on this.
    if [[ $i =~ $whitespace ]]; then i=\"$i\"; fi
    PYSPARK_SUBMIT_ARGS="$PYSPARK_SUBMIT_ARGS $i"
  done
  export PYSPARK_SUBMIT_ARGS
Hi @vanzin, I'm now adding the Kafka Python unit test, and it requires adding the Kafka assembly jar with the `--jars` argument. I see you have removed this code, so how should jars be added for a pyspark unit test now? Using `PYSPARK_SUBMIT_ARGS`? Thanks a lot.
It depends on how the unit tests are run. Are they run through spark-submit? Then you just pass `--jars` to spark-submit. If they're run by the code under the `$SPARK_TESTING` check in this file, then setting that env variable will probably work.
Thanks @vanzin for your reply. The code actually runs under the `$SPARK_TESTING` check, so I tried to use `PYSPARK_SUBMIT_ARGS` to pass the jar through the environment, like this:
export PYSPARK_SUBMIT_ARGS="pyspark-shell --jars ${KAFKA_ASSEMBLY_JAR}"
I'm not sure whether that's the right way, but it seems the jar doesn't get added to the SparkContext. Is there anything else I should take care of?
It looks like `pyspark-shell` has to be at the end of `PYSPARK_SUBMIT_ARGS`; I changed the ordering to `--jars ${KAFKA_ASSEMBLY_JAR} pyspark-shell` and it works now. I'm not sure whether that's the intended behavior or just a bug?
Actually, `pyspark-shell` shouldn't be in `PYSPARK_SUBMIT_ARGS` at all; it's handled internally by the launcher library, so adding it at the end probably only works out of luck :-). (We should probably file a bug to remove this testing stuff from `bin/pyspark`; that would allow other cleanups in these scripts.)
This change encapsulates all the logic involved in launching a Spark job into a small Java library that can be easily embedded into other applications. The overall goal of this change is twofold, as described in the bug:
- provide a supported way to launch Spark applications programmatically; this is a common request from users and currently there's no good answer for it;
- reduce the duplication and coupling between the different parts of Spark that deal with launching processes.
A lot of the duplication was due to the different code needed to build an application's classpath (and the bootstrapper needed to run the driver in certain situations), and to the different code needed to parse spark-submit command line options in different contexts. The change centralizes those as much as possible, so that all code paths can rely on the library for handling them appropriately.
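To make that centralization concrete, here is a hypothetical sketch of the kind of command builder such a library can provide; the class name, jar name, and directory layout below are illustrative, not the actual launcher classes:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of a centralized command builder: one place that knows
 * how to assemble the classpath and the final java command line, so shell
 * scripts and Spark code no longer each reimplement that logic.
 */
class CommandBuilderSketch {

  static List<String> buildJavaCommand(String sparkHome, String mainClass, List<String> appArgs) {
    // Classpath entries derived from the Spark installation (illustrative layout).
    List<String> classpath = Arrays.asList(
        new File(sparkHome, "conf").getAbsolutePath(),
        new File(sparkHome, "lib/spark-assembly.jar").getAbsolutePath());

    List<String> cmd = new ArrayList<>();
    cmd.add(new File(System.getProperty("java.home"), "bin/java").getAbsolutePath());
    cmd.add("-cp");
    cmd.add(String.join(File.pathSeparator, classpath));
    cmd.add(mainClass);
    cmd.addAll(appArgs);
    return cmd;
  }

  public static void main(String[] args) {
    System.out.println(buildJavaCommand("/opt/spark",
        "org.apache.spark.deploy.SparkSubmit",
        Arrays.asList("--master", "local[*]")));
  }
}
```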