Conversation

@jerryshao (Contributor)

What changes were proposed in this pull request?

Currently, when running Spark on YARN, jars specified with --jars or --packages are added twice: once to Spark's own file server and once to YARN's distributed cache. This can be seen in the log. For example:

```
./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar
```

If the specified jar is the scopt jar, it is added twice:

```
...
16/07/14 15:06:48 INFO Server: Started @5603ms
16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637
16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM
16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container
16/07/14 15:06:49 INFO Client: Preparing resources for our AM container
16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip
16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar
16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip
...
```

So this change avoids adding such jars to Spark's file server unnecessarily.
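
To make the mechanism concrete, here is a minimal sketch of the idea, assuming the shell-on-YARN case keeps its jars in spark.yarn.dist.jars. The object and helper names below are illustrative, not from the patch; only the getUserJars signature matches the change reviewed later in this thread.

```scala
import org.apache.spark.SparkConf

object UserJarsSketch {
  // Collect the user jars from the configuration. For a shell on YARN the
  // --jars entries live in spark.yarn.dist.jars and are shipped by YARN's
  // distributed cache, so SparkContext does not also serve them itself.
  def getUserJars(conf: SparkConf, isShell: Boolean = false): Seq[String] = {
    def split(opt: Option[String]): Seq[String] =
      opt.toSeq.flatMap(_.split(",")).map(_.trim).filter(_.nonEmpty)

    val sparkJars = split(conf.getOption("spark.jars"))
    if (isShell && conf.getOption("spark.master").exists(_.startsWith("yarn"))) {
      (sparkJars ++ split(conf.getOption("spark.yarn.dist.jars"))).distinct
    } else {
      sparkJars
    }
  }
}
```

Whether the merged change takes the union in exactly this way may differ; the point is that the YARN path relies on the distributed cache rather than the spark:// file server shown in the log above.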

How was this patch tested?

Manually verified in both YARN client and cluster modes, as well as in standalone mode.

@SparkQA commented Jul 14, 2016

Test build #62302 has finished for PR 14196 at commit 86205fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Jul 14, 2016

I was gonna say the Utils method now doesn't need to be called from SparkContext, which would avoid the new argument, but it does the parsing from a String into a Set, so that avoids some duplication. LGTM, merging into master / 2.0.

(Mental note: eventually all these should be config constants so they can use the common parsing code in ConfigBuilder.scala...)
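
For illustration only, such a constant might look roughly like the sketch below, following the ConfigBuilder pattern; the object name is hypothetical and the exact builder methods may differ between Spark versions.

```scala
package org.apache.spark.internal.config

// Hypothetical config constant for spark.jars (not part of this patch):
// ConfigBuilder would then own the comma-separated parsing that is currently
// done by hand when reading the property.
private[spark] object UserJarsConfig {
  val JARS = ConfigBuilder("spark.jars")
    .stringConf
    .toSequence
    .createWithDefault(Nil)
}
```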

asfgit closed this in 91575ca Jul 14, 2016
asfgit pushed a commit that referenced this pull request Jul 14, 2016
…n yarn

Author: jerryshao <sshao@hortonworks.com>

Closes #14196 from jerryshao/SPARK-16540.

(cherry picked from commit 91575ca)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

The inline review comment below is on the getUserJars change (the Utils method referenced above):

```diff
 * only the "spark.jars" property.
 */
-def getUserJars(conf: SparkConf): Seq[String] = {
+def getUserJars(conf: SparkConf, isShell: Boolean = false): Seq[String] = {
```

A reviewer (Contributor) commented on this line:

hm can you document what this parameter does?

@jerryshao (Author) replied:
Do I still need to update the docs, or maybe this can be done later?
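
For what it's worth, documenting the flag could be as small as extending the scaladoc on getUserJars; the wording below is only a suggestion, not text from the patch.

```scala
/**
 * Return the jars listed in the "spark.jars" property. For an interactive
 * shell running on YARN, jars passed with --jars end up in
 * "spark.yarn.dist.jars" (distributed through YARN's cache), so when
 * `isShell` is true those jars are returned as well.
 *
 * @param conf    Spark configuration holding the jar lists.
 * @param isShell whether the caller is an interactive shell, in which case
 *                "spark.yarn.dist.jars" is also consulted on YARN.
 */
def getUserJars(conf: SparkConf, isShell: Boolean = false): Seq[String] = {
```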
