From fa93415e860cce590c3392079e93d3ae21ffc83c Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 8 Aug 2015 19:51:52 -0700 Subject: [PATCH 01/28] Added yarn-deploy-mode alternative --- docs/submitting-applications.md | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index e58645274e52..015cef946062 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,6 +48,20 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. * `application-arguments`: Arguments passed to the main method of your main class, if any +Alternatively, for submitting on yarn, + +{% highlight bash %} +./bin/spark-submit \ + --class + --master + --conf = \ + ... # other options + \ + [application-arguments] +{% endhighlight %} + +* `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` + A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). @@ -99,7 +113,7 @@ run it with `--help`. 
Here are a few examples of common options: /path/to/examples.jar \ 1000 -# Run on a YARN cluster +# Run on a YARN cluster without --deploy mode export HADOOP_CONF_DIR=XXX ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ From 437a4d451147f179617628a672eaa795b3b76ea0 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 8 Aug 2015 19:54:04 -0700 Subject: [PATCH 02/28] Moved Master URLs closer above before the examples --- docs/submitting-applications.md | 49 ++++++++++++++++----------------- 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 015cef946062..4a564ed0a765 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -62,6 +62,30 @@ Alternatively, for submitting on yarn, * `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` +# Master URLs + +The master URL passed to Spark can be in one of the following formats: + + + + + + + + + + +
Master URL Meaning
local Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone + cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. +
mesos://HOST:PORT Connect to the given Mesos cluster. + The port must be whichever one your Mesos master is configured to use, which is 5050 by default. + Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... +
yarn-client Connect to a YARN cluster in +client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. +
yarn-cluster Connect to a YARN cluster in +cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. +
+ A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). @@ -130,31 +154,6 @@ export HADOOP_CONF_DIR=XXX 1000 {% endhighlight %} -# Master URLs - -The master URL passed to Spark can be in one of the following formats: - - - - - - - - - - -
Master URL Meaning
local Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone - cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. -
mesos://HOST:PORT Connect to the given Mesos cluster. - The port must be whichever one your Mesos master is configured to use, which is 5050 by default. - Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... -
yarn-client Connect to a YARN cluster in -client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. -
yarn-cluster Connect to a YARN cluster in -cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. -
- - # Loading Configuration from a File The `spark-submit` script can load default [Spark configuration values](configuration.html) from a From 05fe708c24f07f9661a558dfbe51970aa940e4e5 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Mon, 10 Aug 2015 10:14:59 -0700 Subject: [PATCH 03/28] Removed the addition section --- docs/submitting-applications.md | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 4a564ed0a765..a15f83bde24b 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,20 +48,6 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. * `application-arguments`: Arguments passed to the main method of your main class, if any -Alternatively, for submitting on yarn, - -{% highlight bash %} -./bin/spark-submit \ - --class - --master - --conf = \ - ... # other options - \ - [application-arguments] -{% endhighlight %} - -* `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` - # Master URLs The master URL passed to Spark can be in one of the following formats: From 98624e89c6b303db4fc30408e14705df021ca591 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Mon, 10 Aug 2015 10:16:14 -0700 Subject: [PATCH 04/28] Added a section for alternative submission. 
Distinguished from the shifting of Master URLS --- docs/submitting-applications.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index a15f83bde24b..4a564ed0a765 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,6 +48,20 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. * `application-arguments`: Arguments passed to the main method of your main class, if any +Alternatively, for submitting on yarn, + +{% highlight bash %} +./bin/spark-submit \ + --class + --master + --conf = \ + ... # other options + \ + [application-arguments] +{% endhighlight %} + +* `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` + # Master URLs The master URL passed to Spark can be in one of the following formats: From b8fdd5cd1b11dd7954d1f05bb71b1a2ae740d065 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Tue, 11 Aug 2015 18:43:10 -0700 Subject: [PATCH 05/28] Added section for preferred yarn and kept the one with deploy-mode for generic submission to help clear up confusion --- docs/submitting-applications.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 4a564ed0a765..ae21a16d5ec5 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,7 +48,7 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. 
* `application-arguments`: Arguments passed to the main method of your main class, if any -Alternatively, for submitting on yarn, +For submitting application to YARN, the preferred options are: {% highlight bash %} ./bin/spark-submit \ From 8c65676a6b7a692d07face111d8e998f36ca0151 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Tue, 11 Aug 2015 18:44:36 -0700 Subject: [PATCH 06/28] Moved the Standalone examples together --- docs/submitting-applications.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index ae21a16d5ec5..8778f719e250 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -125,6 +125,12 @@ run it with `--help`. Here are a few examples of common options: --total-executor-cores 100 \ /path/to/examples.jar \ 1000 + +# Run a Python application on a Spark Standalone cluster +./bin/spark-submit \ + --master spark://207.184.161.138:7077 \ + examples/src/main/python/pi.py \ + 1000 # Run on a Spark Standalone cluster in cluster deploy mode with supervise ./bin/spark-submit \ @@ -146,12 +152,6 @@ export HADOOP_CONF_DIR=XXX --num-executors 50 \ /path/to/examples.jar \ 1000 - -# Run a Python application on a Spark Standalone cluster -./bin/spark-submit \ - --master spark://207.184.161.138:7077 \ - examples/src/main/python/pi.py \ - 1000 {% endhighlight %} # Loading Configuration from a File From 8a331d0444f58d3c14c1c12c4f087f1a02d5b8d1 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Wed, 12 Aug 2015 14:19:58 -0700 Subject: [PATCH 07/28] Moved Master URLs --- docs/submitting-applications.md | 48 ++++++++++++++++----------------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 8778f719e250..d864bc9f59ff 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -62,30 +62,6 @@ For submitting 
application to YARN, the preferred options are: * `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` -# Master URLs - -The master URL passed to Spark can be in one of the following formats: - - - - - - - - - - -
Master URL Meaning
local Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone - cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. -
mesos://HOST:PORT Connect to the given Mesos cluster. - The port must be whichever one your Mesos master is configured to use, which is 5050 by default. - Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... -
yarn-client Connect to a YARN cluster in -client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. -
yarn-cluster Connect to a YARN cluster in -cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. -
- A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). @@ -154,6 +130,30 @@ export HADOOP_CONF_DIR=XXX 1000 {% endhighlight %} +# Master URLs + +The master URL passed to Spark can be in one of the following formats: + + + + + + + + + + +
Master URL Meaning
local Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone + cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. +
mesos://HOST:PORT Connect to the given Mesos cluster. + The port must be whichever one your Mesos master is configured to use, which is 5050 by default. + Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... +
yarn-client Connect to a YARN cluster in +client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. +
yarn-cluster Connect to a YARN cluster in +cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. +
+ # Loading Configuration from a File The `spark-submit` script can load default [Spark configuration values](configuration.html) from a From 0fed23b8dc525f62197d1cd332260a0752d7d35c Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Thu, 13 Aug 2015 16:12:06 -0700 Subject: [PATCH 08/28] Added deploy-mode section to YARN submission --- docs/running-on-yarn.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index cac08a91b97d..0f77fca1a56c 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -40,6 +40,22 @@ The above starts a YARN client program which starts the default Application Mast To launch a Spark application in `yarn-client` mode, do the same, but replace `yarn-cluster` with `yarn-client`. To run spark-shell: $ ./bin/spark-shell --master yarn-client + +The alternative to launching a Spark application on YARN is to explicitly set the deployment mode for the YARN master + +For example: + + $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ + --master yarn \ + --deploy-mode client or cluster + --num-executors 3 \ + --driver-memory 4g \ + --executor-memory 2g \ + --executor-cores 1 \ + --queue thequeue \ + lib/spark-examples*.jar \ + +`--deploy-mode` can be either client or cluster. ## Adding Other JARs From 670d251db01306ecc6029abaf6fc7d0e7c30dc3f Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 8 Aug 2015 19:51:52 -0700 Subject: [PATCH 09/28] Added yarn-deploy-mode alternative --- docs/submitting-applications.md | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index e58645274e52..015cef946062 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,6 +48,20 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. 
The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. * `application-arguments`: Arguments passed to the main method of your main class, if any +Alternatively, for submitting on yarn, + +{% highlight bash %} +./bin/spark-submit \ + --class + --master + --conf = \ + ... # other options + \ + [application-arguments] +{% endhighlight %} + +* `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` + A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). @@ -99,7 +113,7 @@ run it with `--help`. Here are a few examples of common options: /path/to/examples.jar \ 1000 -# Run on a YARN cluster +# Run on a YARN cluster without --deploy mode export HADOOP_CONF_DIR=XXX ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ From 40d3b80012f2db351446f8f9d6049f8a9f00bf2b Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 8 Aug 2015 19:54:04 -0700 Subject: [PATCH 10/28] Moved Master URLs closer above before the examples --- docs/submitting-applications.md | 49 ++++++++++++++++----------------- 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 015cef946062..4a564ed0a765 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -62,6 +62,30 @@ Alternatively, for submitting on yarn, * `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` +# Master URLs + +The master URL passed to Spark can be in one of the following formats: + + + + + + + + + + +
Master URL Meaning
local Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone + cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. +
mesos://HOST:PORT Connect to the given Mesos cluster. + The port must be whichever one your Mesos master is configured to use, which is 5050 by default. + Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... +
yarn-client Connect to a YARN cluster in +client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. +
yarn-cluster Connect to a YARN cluster in +cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. +
+ A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). @@ -130,31 +154,6 @@ export HADOOP_CONF_DIR=XXX 1000 {% endhighlight %} -# Master URLs - -The master URL passed to Spark can be in one of the following formats: - - - - - - - - - - -
Master URL Meaning
local Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone - cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. -
mesos://HOST:PORT Connect to the given Mesos cluster. - The port must be whichever one your Mesos master is configured to use, which is 5050 by default. - Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... -
yarn-client Connect to a YARN cluster in -client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. -
yarn-cluster Connect to a YARN cluster in -cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. -
- - # Loading Configuration from a File The `spark-submit` script can load default [Spark configuration values](configuration.html) from a From 89d15bf63741e3c62017586df35508a6bde821c2 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Mon, 10 Aug 2015 10:14:59 -0700 Subject: [PATCH 11/28] Removed the addition section --- docs/submitting-applications.md | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 4a564ed0a765..a15f83bde24b 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,20 +48,6 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. * `application-arguments`: Arguments passed to the main method of your main class, if any -Alternatively, for submitting on yarn, - -{% highlight bash %} -./bin/spark-submit \ - --class - --master - --conf = \ - ... # other options - \ - [application-arguments] -{% endhighlight %} - -* `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` - # Master URLs The master URL passed to Spark can be in one of the following formats: From d2c212aa6e3a4537c0a4a7ad49e83412e47e60e7 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Mon, 10 Aug 2015 10:16:14 -0700 Subject: [PATCH 12/28] Added a section for alternative submission. 
Distinguished from the shifting of Master URLS --- docs/submitting-applications.md | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index a15f83bde24b..4a564ed0a765 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,6 +48,20 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. * `application-arguments`: Arguments passed to the main method of your main class, if any +Alternatively, for submitting on yarn, + +{% highlight bash %} +./bin/spark-submit \ + --class + --master + --conf = \ + ... # other options + \ + [application-arguments] +{% endhighlight %} + +* `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` + # Master URLs The master URL passed to Spark can be in one of the following formats: From 3f25500b5d39b2d6b247a8dca8147c8fd140c7c0 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Tue, 11 Aug 2015 18:43:10 -0700 Subject: [PATCH 13/28] Added section for preferred yarn and kept the one with deploy-mode for generic submission to help clear up confusion --- docs/submitting-applications.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 4a564ed0a765..ae21a16d5ec5 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,7 +48,7 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. 
* `application-arguments`: Arguments passed to the main method of your main class, if any -Alternatively, for submitting on yarn, +For submitting application to YARN, the preferred options are: {% highlight bash %} ./bin/spark-submit \ From 0766da66ccf16ab55c80614776c1f5a7a1877253 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Tue, 11 Aug 2015 18:44:36 -0700 Subject: [PATCH 14/28] Moved the Standalone examples together --- docs/submitting-applications.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index ae21a16d5ec5..8778f719e250 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -125,6 +125,12 @@ run it with `--help`. Here are a few examples of common options: --total-executor-cores 100 \ /path/to/examples.jar \ 1000 + +# Run a Python application on a Spark Standalone cluster +./bin/spark-submit \ + --master spark://207.184.161.138:7077 \ + examples/src/main/python/pi.py \ + 1000 # Run on a Spark Standalone cluster in cluster deploy mode with supervise ./bin/spark-submit \ @@ -146,12 +152,6 @@ export HADOOP_CONF_DIR=XXX --num-executors 50 \ /path/to/examples.jar \ 1000 - -# Run a Python application on a Spark Standalone cluster -./bin/spark-submit \ - --master spark://207.184.161.138:7077 \ - examples/src/main/python/pi.py \ - 1000 {% endhighlight %} # Loading Configuration from a File From 46a24d55ffe99431885b57fda50938289a0ed91b Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Wed, 12 Aug 2015 14:19:58 -0700 Subject: [PATCH 15/28] Moved Master URLs --- docs/submitting-applications.md | 48 ++++++++++++++++----------------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 8778f719e250..d864bc9f59ff 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -62,30 +62,6 @@ For submitting 
application to YARN, the preferred options are: * `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` -# Master URLs - -The master URL passed to Spark can be in one of the following formats: - - - - - - - - - - -
Master URL Meaning
local Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone - cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. -
mesos://HOST:PORT Connect to the given Mesos cluster. - The port must be whichever one your Mesos master is configured to use, which is 5050 by default. - Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... -
yarn-client Connect to a YARN cluster in -client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. -
yarn-cluster Connect to a YARN cluster in -cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. -
- A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). @@ -154,6 +130,30 @@ export HADOOP_CONF_DIR=XXX 1000 {% endhighlight %} +# Master URLs + +The master URL passed to Spark can be in one of the following formats: + + + + + + + + + + +
Master URL Meaning
local Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT Connect to the given Spark standalone + cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. +
mesos://HOST:PORT Connect to the given Mesos cluster. + The port must be whichever one your Mesos master is configured to use, which is 5050 by default. + Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... +
yarn-client Connect to a YARN cluster in +client mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. +
yarn-cluster Connect to a YARN cluster in +cluster mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable. +
+ # Loading Configuration from a File The `spark-submit` script can load default [Spark configuration values](configuration.html) from a From 91758072dbc954e2c31609dcd2b6232a09fbfdb3 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Thu, 13 Aug 2015 16:12:06 -0700 Subject: [PATCH 16/28] Added deploy-mode section to YARN submission --- docs/running-on-yarn.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index 5159ef9e3394..84167e5162d2 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -40,6 +40,22 @@ The above starts a YARN client program which starts the default Application Mast To launch a Spark application in `yarn-client` mode, do the same, but replace `yarn-cluster` with `yarn-client`. To run spark-shell: $ ./bin/spark-shell --master yarn-client + +The alternative to launching a Spark application on YARN is to explicitly set the deployment mode for the YARN master + +For example: + + $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ + --master yarn \ + --deploy-mode client or cluster + --num-executors 3 \ + --driver-memory 4g \ + --executor-memory 2g \ + --executor-cores 1 \ + --queue thequeue \ + lib/spark-examples*.jar \ + +`--deploy-mode` can be either client or cluster. 
## Adding Other JARs From c91073ef5ab7fa2e5a8cada89983422960b24a1a Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sun, 23 Aug 2015 08:11:55 -0700 Subject: [PATCH 17/28] Modified Running on YARN doc --- docs/running-on-yarn.md | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index 84167e5162d2..1400ae287dcb 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -21,48 +21,51 @@ There are two deploy modes that can be used to launch Spark applications on YARN Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn-client` or `yarn-cluster`. To launch a Spark application in `yarn-cluster` mode: - `$ ./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] [app options]` - + `$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode client/cluster [options] [app options]` + For example: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ - --master yarn-cluster \ + --master yarn \ + --deploy-mode cluster \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \ --queue thequeue \ lib/spark-examples*.jar \ 10 + +`--deploy-mode` can be either client or cluster. -The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the "Debugging your Application" section below for how to see driver and executor logs. +The above example starts a YARN client program which starts the default Application Master.
Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the "Debugging your Application" section below for how to see driver and executor logs. -To launch a Spark application in `yarn-client` mode, do the same, but replace `yarn-cluster` with `yarn-client`. To run spark-shell: +To launch a Spark application in `yarn-client` mode, do the same, but replace `cluster` with `client` in `--deploy-mode`. To run spark-shell: $ ./bin/spark-shell --master yarn-client -The alternative to launching a Spark application on YARN is to explicitly set the deployment mode for the YARN master - +Alternatively, the deploy mode can be set through the `--master` value itself. + For example: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ - --master yarn \ - --deploy-mode client or cluster + --master yarn-cluster \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \ --executor-cores 1 \ --queue thequeue \ lib/spark-examples*.jar \ + 10 -`--deploy-mode` can be either client or cluster. +`--master` can be `yarn-client` or `yarn-cluster`. ## Adding Other JARs In `yarn-cluster` mode, the driver runs on a different machine than the client, so `SparkContext.addJar` won't work out of the box with files that are local to the client. To make files on the client available to `SparkContext.addJar`, include them with the `--jars` option in the launch command.
$ ./bin/spark-submit --class my.main.Class \ - --master yarn-cluster \ + --master yarn + --deploy-mode cluster \ --jars my-other-jar.jar,my-other-other-jar.jar my-main-jar.jar app_arg1 app_arg2 @@ -402,6 +405,6 @@ If you need a reference to the proper location to put log files in the YARN so t # Important notes - Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured. -- In `yarn-cluster` mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `yarn-client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `yarn-client` mode, only the Spark executors do. +- In `--master yarn --deploy-mode cluster`, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `yarn-client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `yarn-client` mode, only the Spark executors do. - The `--files` and `--archives` options support specifying file names with the # similar to Hadoop. For example you can specify: `--files localtest.txt#appSees.txt` and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name `appSees.txt`, and your application should use the name as `appSees.txt` to reference it when running on YARN. 
- The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in `yarn-cluster` mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files. From 3dc79e2d24a76abd32779d09a044240e808ed9fc Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sun, 23 Aug 2015 14:21:33 -0700 Subject: [PATCH 18/28] Modified submitting applications --- docs/submitting-applications.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index d864bc9f59ff..b87d4093be25 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,7 +48,7 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. * `application-arguments`: Arguments passed to the main method of your main class, if any -For submitting application to YARN, the preferred options are: +For submitting application to YARN, the alternate options are: {% highlight bash %} ./bin/spark-submit \ From 67a4255f94e828fcfffc6039ddc4872acc2d717d Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sun, 23 Aug 2015 14:44:26 -0700 Subject: [PATCH 19/28] Removed extra YARN section, there is already a running without --deploy example --- docs/submitting-applications.md | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index b87d4093be25..b32d9c12cd7e 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -48,20 +48,6 @@ Some of the commonly used options are: * `application-jar`: Path to a bundled jar including your application and all dependencies. 
The URL must be globally visible inside of your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes. * `application-arguments`: Arguments passed to the main method of your main class, if any -For submitting application to YARN, the alternate options are: - -{% highlight bash %} -./bin/spark-submit \ - --class - --master - --conf = \ - ... # other options - \ - [application-arguments] -{% endhighlight %} - -* `--master`: The --master parameter is either `yarn-client` or `yarn-cluster`. Defaults to `yarn-client` - A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). From a8b67efb6a8bc28b69a87b4158156b1517e1475d Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sun, 23 Aug 2015 17:14:45 -0700 Subject: [PATCH 20/28] Added --deploy-mode flags to the yarn submission sections --- R/README.md | 2 ++ README.md | 4 ++-- docs/sql-programming-guide.md | 2 +- yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 2 +- 4 files changed, 6 insertions(+), 4 deletions(-) diff --git a/R/README.md b/R/README.md index 005f56da1670..d8e75ea75260 100644 --- a/R/README.md +++ b/R/README.md @@ -63,5 +63,7 @@ You can also run the unit-tests for SparkR by running (you need to install the [ The `./bin/spark-submit` and `./bin/sparkR` can also be used to submit jobs to YARN clusters. You will need to set YARN conf dir before doing so. For example on CDH you can run ``` export YARN_CONF_DIR=/etc/hadoop/conf +./bin/spark-submit --master yarn --deploy-mode cluster (or client) examples/src/main/r/dataframe.R +OR ./bin/spark-submit --master yarn examples/src/main/r/dataframe.R ``` diff --git a/README.md b/README.md index 380422ca00db..2d2d1e2e6b59 100644 --- a/README.md +++ b/README.md @@ -58,8 +58,8 @@ To run one of them, use `./bin/run-example [params]`. For example: will run the Pi example locally. 
You can set the MASTER environment variable when running examples to submit -examples to a cluster. This can be a mesos:// or spark:// URL, -"yarn-cluster" or "yarn-client" to run on YARN, and "local" to run +examples to a cluster. This can be a mesos:// or spark:// URL, to run on YARN; either --master yarn and set --deploy-mode (cluster or client) or simply set --master as +"yarn-cluster" or "yarn-client", and "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the `examples` package. For instance: diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 33e7893d7bd0..3af171f10b2f 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -1551,7 +1551,7 @@ on all of the worker nodes, as they will need access to the Hive serialization a (SerDes) in order to access data stored in Hive. Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running -the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory +the query on a YARN cluster (`--master yarn --deploy-mode cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory and `hive-site.xml` under `conf/` directory need to be available on the driver and all executors launched by the YARN cluster. The convenient way to do this is adding them through the `--jars` option and `--file` option of the `spark-submit` command. 
diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala index bff585b46cbb..c5877b6fc0d8 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala @@ -967,7 +967,7 @@ object Client extends Logging { def main(argStrings: Array[String]) { if (!sys.props.contains("SPARK_SUBMIT")) { logWarning("WARNING: This client is deprecated and will be removed in a " + - "future version of Spark. Use ./bin/spark-submit with \"--master yarn\"") + "future version of Spark. Use ./bin/spark-submit with \"--master yarn --deploy-mode cluster (or client) OR --master yarn-cluster (yarn-client)\"") } // Set an env variable indicating we are running in YARN mode. From d93d4bab5c425d3bc7a02b6a3afdef2e97e7ccf9 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 19 Sep 2015 11:23:02 -0700 Subject: [PATCH 21/28] Changed R/ReadME --- R/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/R/README.md b/R/README.md index d8e75ea75260..996039cf1f14 100644 --- a/R/README.md +++ b/R/README.md @@ -63,7 +63,7 @@ You can also run the unit-tests for SparkR by running (you need to install the [ The `./bin/spark-submit` and `./bin/sparkR` can also be used to submit jobs to YARN clusters. You will need to set YARN conf dir before doing so. 
For example on CDH you can run ``` export YARN_CONF_DIR=/etc/hadoop/conf -./bin/spark-submit --master yarn --deploy-mode cluster (or client) examples/src/main/r/dataframe.R +./bin/spark-submit --master yarn --deploy-mode client examples/src/main/r/dataframe.R OR ./bin/spark-submit --master yarn examples/src/main/r/dataframe.R ``` From 108caecd8dbaf2a09e52ffa5b5b289f2612fe2f3 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 19 Sep 2015 11:28:17 -0700 Subject: [PATCH 22/28] Changed parent/README --- README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/README.md b/README.md index 2d2d1e2e6b59..41cc1987de66 100644 --- a/README.md +++ b/README.md @@ -58,8 +58,7 @@ To run one of them, use `./bin/run-example [params]`. For example: will run the Pi example locally. You can set the MASTER environment variable when running examples to submit -examples to a cluster. This can be a mesos:// or spark:// URL, to run on YARN; either --master yarn and set --deploy-mode (cluster or client) or simply set --master as -"yarn-cluster" or "yarn-client", and "local" to run +examples to a cluster. This can be a mesos:// or spark:// URL, "yarn" to run on YARN and "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the `examples` package. For instance: From 12ecd4359142d25ed7a5c1e053461926b29a1389 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 19 Sep 2015 12:07:11 -0700 Subject: [PATCH 23/28] Modified Running on yarn --- docs/running-on-yarn.md | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index 1400ae287dcb..ed9e16943ada 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -16,12 +16,12 @@ containers used by the application use the same configuration. 
If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application's configuration (driver, executors, and the AM when running in client mode). -There are two deploy modes that can be used to launch Spark applications on YARN. In `yarn-cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `yarn-client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. +There are two deploy modes that can be used to launch Spark applications on YARN. In `cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. -Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn-client` or `yarn-cluster`. -To launch a Spark application in `yarn-cluster` mode: +Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn` and to specify the deployment `--deploy-mode` can be either `client` or `cluster`. 
+To launch a Spark application in YARN in `cluster` mode: - `$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode yarn-client/yarn-cluster [options] [app options]` + `$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] [app options]` For example: @@ -34,17 +34,17 @@ For example: --executor-cores 1 \ --queue thequeue \ lib/spark-examples*.jar \ - -`--deploy-mode` can be either client or cluster. The above example starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the "Debugging your Application" section below for how to see driver and executor logs. -To launch a Spark application in `yarn-client` mode, do the same, but replace `yarn-cluster` with `yarn-client` in the --deploy-mode. To run spark-shell: +To launch a Spark application in `client` mode, do the same, but replace `cluster` with `client` in the --deploy-mode. +To run spark-shell: - $ ./bin/spark-shell --master yarn-client + $ ./bin/spark-shell --master yarn --deploy-mode client The alternative to launching a Spark application on YARN is to set deployment mode for the YARN master in the `--master` itself. - +`--master` can be `yarn-client` or `yarn-cluster` + For example: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ @@ -57,8 +57,6 @@ For example: lib/spark-examples*.jar \ 10 -`--master` can be `yarn-client` or `yarn-cluster` - ## Adding Other JARs In `yarn-cluster` mode, the driver runs on a different machine than the client, so `SparkContext.addJar` won't work out of the box with files that are local to the client. To make files on the client available to `SparkContext.addJar`, include them with the `--jars` option in the launch command. 
@@ -405,6 +403,6 @@ If you need a reference to the proper location to put log files in the YARN so t # Important notes - Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured. -- In `--master yarn --deploy-mode cluster`, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `yarn-client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `yarn-client` mode, only the Spark executors do. +- In `yarn-cluster`, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `yarn-client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `yarn-client` mode, only the Spark executors do. - The `--files` and `--archives` options support specifying file names with the # similar to Hadoop. For example you can specify: `--files localtest.txt#appSees.txt` and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name `appSees.txt`, and your application should use the name as `appSees.txt` to reference it when running on YARN. - The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in `yarn-cluster` mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files. 
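By the end of PATCH 23 the documented YARN invocation has settled on `--master yarn` with an explicit `--deploy-mode`, replacing the deprecated `yarn-client`/`yarn-cluster` master URLs. As an illustrative aside (not part of any patch in this series — the SparkPi class and example jar path are the stock placeholders from the docs, and the script only composes the command strings rather than submitting anything to a cluster), the old and new forms can be compared like this:

```shell
#!/bin/sh
# Illustrative sketch only: contrasts the deprecated and current
# spark-submit syntax for YARN. Nothing is actually launched here.

# Deprecated form: the deploy mode is folded into the master URL.
OLD_CMD="./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster lib/spark-examples.jar 10"

# Current form: --master names the cluster manager, while --deploy-mode
# picks where the driver runs (cluster = inside the YARN
# ApplicationMaster, client = in the submitting process).
NEW_CMD="./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster lib/spark-examples.jar 10"

printf '%s\n' "$OLD_CMD"
printf '%s\n' "$NEW_CMD"
```

Either command line resolves to the same driver placement; the split form is what the remaining patches standardize on because it matches the standalone and Mesos submission syntax.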
From 0cd5d0b12d7f8a3e3e31c70a044e8345d7d3b05f Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 19 Sep 2015 12:12:44 -0700 Subject: [PATCH 24/28] Changed submitting-applications --- docs/submitting-applications.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index b32d9c12cd7e..8a367eeb4f34 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -87,12 +87,6 @@ run it with `--help`. Here are a few examples of common options: --total-executor-cores 100 \ /path/to/examples.jar \ 1000 - -# Run a Python application on a Spark Standalone cluster -./bin/spark-submit \ - --master spark://207.184.161.138:7077 \ - examples/src/main/python/pi.py \ - 1000 # Run on a Spark Standalone cluster in cluster deploy mode with supervise ./bin/spark-submit \ @@ -105,15 +99,22 @@ run it with `--help`. Here are a few examples of common options: /path/to/examples.jar \ 1000 -# Run on a YARN cluster without --deploy mode +# Run on a YARN cluster export HADOOP_CONF_DIR=XXX ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ - --master yarn-cluster \ # can also be `yarn-client` for client mode + --master yarn \ + --deploy-mode cluster \ --executor-memory 20G \ --num-executors 50 \ /path/to/examples.jar \ 1000 + +# Run a Python application on a Spark Standalone cluster +./bin/spark-submit \ + --master spark://207.184.161.138:7077 \ + examples/src/main/python/pi.py \ + 1000 {% endhighlight %} # Master URLs From 07ed32c6e9cf9f2a3d7f1e5b5b15e73d5db41a43 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 19 Sep 2015 12:16:16 -0700 Subject: [PATCH 25/28] Changed /deploy/yarn/Client.scala --- yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala 
b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala index c5877b6fc0d8..bff585b46cbb 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala @@ -967,7 +967,7 @@ object Client extends Logging { def main(argStrings: Array[String]) { if (!sys.props.contains("SPARK_SUBMIT")) { logWarning("WARNING: This client is deprecated and will be removed in a " + - "future version of Spark. Use ./bin/spark-submit with \"--master yarn --deploy-mode cluster (or client) OR --master yarn-cluster (yarn-client)\"") + "future version of Spark. Use ./bin/spark-submit with \"--master yarn\"") } // Set an env variable indicating we are running in YARN mode. From 1b86c3588ae3076cbc6b00efbd67975c334105c5 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Sat, 19 Sep 2015 14:30:09 -0700 Subject: [PATCH 26/28] Modified SparkSubmitSuite.scala --- .../scala/org/apache/spark/deploy/SparkSubmitSuite.scala | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala index 1110ca6051a4..3ae7616d9621 100644 --- a/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala +++ b/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala @@ -412,7 +412,8 @@ class SparkSubmitSuite // Test files and archives (Yarn) val clArgs2 = Seq( - "--master", "yarn-client", + "--master", "yarn", + "--deploy-mode","client", "--class", "org.SomeClass", "--files", files, "--archives", archives, @@ -470,7 +471,8 @@ class SparkSubmitSuite writer2.println("spark.yarn.dist.archives " + archives) writer2.close() val clArgs2 = Seq( - "--master", "yarn-client", + "--master", "yarn", + "--deploy-mode","client", "--class", "org.SomeClass", "--properties-file", f2.getPath, "thejar.jar" From 9be5993256e50010af212f52dc7cd0de667cf178 Mon Sep 17 00:00:00 2001 From: 
Neelesh Srinivas Salian Date: Sat, 26 Sep 2015 19:49:14 -0700 Subject: [PATCH 27/28] Recent Review changes --- R/README.md | 3 +-- docs/running-on-yarn.md | 11 ++++------- docs/sql-programming-guide.md | 2 +- 3 files changed, 6 insertions(+), 10 deletions(-) diff --git a/R/README.md b/R/README.md index 996039cf1f14..0e1c6c802742 100644 --- a/R/README.md +++ b/R/README.md @@ -64,6 +64,5 @@ The `./bin/spark-submit` and `./bin/sparkR` can also be used to submit jobs to Y ``` export YARN_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --master yarn --deploy-mode client examples/src/main/r/dataframe.R -OR -./bin/spark-submit --master yarn examples/src/main/r/dataframe.R + ``` diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index ed9e16943ada..56d732927a93 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -18,7 +18,7 @@ Spark application's configuration (driver, executors, and the AM when running in There are two deploy modes that can be used to launch Spark applications on YARN. In `cluster` mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In `client` mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN. -Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn` and to specify the deployment `--deploy-mode` can be either `client` or `cluster`. +Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn` and `--deploy-mode` can be `client` or `cluster` to select the YARN deployment mode. 
To launch a Spark application in YARN in `cluster` mode: `$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] [app options]` @@ -37,13 +37,10 @@ For example: The above example starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the "Debugging your Application" section below for how to see driver and executor logs. -To launch a Spark application in `client` mode, do the same, but replace `cluster` with `client` in the --deploy-mode. +To launch a Spark application in `client` mode, do the same, but replace `cluster` with `client` in the `--deploy-mode` argument. To run spark-shell: - $ ./bin/spark-shell --master yarn --deploy-mode client - -The alternative to launching a Spark application on YARN is to set deployment mode for the YARN master in the `--master` itself. -`--master` can be `yarn-client` or `yarn-cluster` + $ ./bin/spark-shell --master yarn --deploy-mode client For example: @@ -403,6 +400,6 @@ If you need a reference to the proper location to put log files in the YARN so t # Important notes - Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured. -- In `yarn-cluster`, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `yarn-client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `yarn-client` mode, only the Spark executors do. 
+- In yarn-cluster, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In yarn-client mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in yarn-client mode, only the Spark executors do. - The `--files` and `--archives` options support specifying file names with the # similar to Hadoop. For example you can specify: `--files localtest.txt#appSees.txt` and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name `appSees.txt`, and your application should use the name as `appSees.txt` to reference it when running on YARN. - The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in `yarn-cluster` mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files. diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 3af171f10b2f..845ca6850250 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -1551,7 +1551,7 @@ on all of the worker nodes, as they will need access to the Hive serialization a (SerDes) in order to access data stored in Hive. Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. Please note when running -the query on a YARN cluster (`--master yarn --deploy-mode cluster` mode), the `datanucleus` jars under the `lib_managed/jars` directory +the query on a YARN cluster (--master yarn --deploy-mode cluster mode), the `datanucleus` jars under the `lib_managed/jars` directory and `hive-site.xml` under `conf/` directory need to be available on the driver and all executors launched by the YARN cluster. 
The convenient way to do this is adding them through the `--jars` option and `--file` option of the `spark-submit` command. From 177146e58de8a46fc9b920d7587952db230b0f08 Mon Sep 17 00:00:00 2001 From: Neelesh Srinivas Salian Date: Thu, 1 Oct 2015 11:38:24 -0700 Subject: [PATCH 28/28] Review Changes --deploy-mode --- docs/running-on-yarn.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index 56d732927a93..e45ab11f0aa1 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -45,7 +45,8 @@ To run spark-shell: For example: $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ - --master yarn-cluster \ + --master yarn \ + --deploy-mode cluster \ --num-executors 3 \ --driver-memory 4g \ --executor-memory 2g \