From 204b2480028a1a4256ed248f4dbf689b60723ac3 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 20:05:19 -0700 Subject: [PATCH 01/16] Small fixes --- docs/index.md | 2 +- docs/quick-start.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/index.md b/docs/index.md index c9b10376cc80..fb75bc678c8a 100644 --- a/docs/index.md +++ b/docs/index.md @@ -5,7 +5,7 @@ title: Spark Overview Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. -It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). +It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) (SQL on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). # Downloading diff --git a/docs/quick-start.md b/docs/quick-start.md index 33a0df103642..20e17ebf703f 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -252,11 +252,11 @@ we initialize a SparkContext as part of the program. We pass the SparkContext constructor a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object which contains information about our -application. We also call sc.addJar to make sure that when our application is launched in cluster -mode, the jar file containing it will be shipped automatically to worker nodes. +application. -This file depends on the Spark API, so we'll also include an sbt configuration file, `simple.sbt` -which explains that Spark is a dependency. This file also adds a repository that Spark depends on: +Our application depends on the Spark API, so we'll also include an sbt configuration file, +`simple.sbt` which explains that Spark is a dependency. This file also adds a repository that +Spark depends on: {% highlight scala %} name := "Simple Project" From 4af9e07494b4de99e1e099ff9c04a74fa3f02951 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 20:25:58 -0700 Subject: [PATCH 02/16] Adding SPARK_LOCAL_DIRS docs --- docs/spark-standalone.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index eb3211b6b0e4..489c6e36400d 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -95,6 +95,14 @@ You can optionally configure the cluster further by setting environment variable SPARK_MASTER_OPTS Configuration properties that apply only to the master in the form "-Dx=y" (default: none). + + SPARK_LOCAL_DIRS + + Directory to use for "scratch" space in Spark, including map output files and RDDs that get + stored on disk. This should be on a fast, local disk in your system. It can also be a + comma-separated list of multiple directories on different disks. + + SPARK_WORKER_CORES Total number of cores to allow Spark applications to use on the machine (default: all available cores). 
From 2d719efd9f68563119be1f527e97a19df4aa7485 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 20:26:30 -0700 Subject: [PATCH 03/16] Small fix --- docs/configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 4d41c36e38e2..7e95aa69a3d2 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -582,7 +582,7 @@ Apart from these, the following properties are also available, and may be useful spark.logConf false - Whether to log the supplied SparkConf as INFO at start of spark context. + Whether to log the supplied SparkConf as INFO when a SparkContext is started. From 29b54461e07557d66cfa7128f6c222106ce5a5e8 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 20:56:05 -0700 Subject: [PATCH 04/16] Better discussion of spark-submit in configuration docs --- docs/configuration.md | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index 7e95aa69a3d2..97e648d356d1 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -30,22 +30,28 @@ val conf = new SparkConf() val sc = new SparkContext(conf) {% endhighlight %} -## Loading Default Configurations +## Dynamically Loading Spark Properties +In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For +instance, if you'd like to run the same applicaiton with different masters or different +amounts of memory. -In the case of `spark-shell`, a SparkContext has already been created for you, so you cannot control -the configuration properties through SparkConf. However, you can still set configuration properties -through a default configuration file. By default, `spark-shell` (and more generally `spark-submit`) -will read configuration options from `conf/spark-defaults.conf`, in which each line consists of a -key and a value separated by whitespace. For example, +The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support two ways to load configurations dynamically. +When a SparkConf is created, it will read configuration options from `conf/spark-defaults.conf`, +in which each line consists of a key and a value separated by whitespace. For example, spark.master spark://5.6.7.8:7077 spark.executor.memory 512m spark.eventLog.enabled true spark.serializer org.apache.spark.serializer.KryoSerializer -Any values specified in the file will be passed on to the application, and merged with those -specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf` -and SparkConf, then the latter will take precedence as it is the most application-specific. + +In addition, when launching programs with the [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool, certain options can be configured as flags. For instance, the +`--master` flag to `spark-submit` will automatically set the master. Run `./bin/spark-submit --help` to see the entire list of options. + +Any values specified as flags or in the properties file will be passed on to the application +and merged with those specified through SparkConf. Properties set directly on the SparkConf +take highest precedence, then flags passed to `spark-submit` or `spark-shell`, then options +in the `spark-defaults.conf` file. 
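To make the precedence order described above concrete, here is a minimal sketch (the application name, memory values, and jar name are illustrative, not taken from the patch):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Assume conf/spark-defaults.conf contains:   spark.executor.memory  512m
// and the app is launched with:               ./bin/spark-submit --master local[4] myApp.jar
val conf = new SparkConf()
  .setAppName("PrecedenceExample")
  .set("spark.executor.memory", "1g")  // set directly on the SparkConf, so 1g wins

val sc = new SparkContext(conf)
println(sc.getConf.toDebugString)      // prints the merged, effective configuration
{% endhighlight %}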
## Viewing Spark Properties From 592e94ac20f4d209c9e2334875f33d811f5e1a64 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 00:28:10 -0700 Subject: [PATCH 05/16] Stash --- docs/configuration.md | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index 97e648d356d1..bef75c58c362 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -16,10 +16,10 @@ Spark provides three locations to configure the system: # Spark Properties Spark properties control most application settings and are configured separately for each -application. The preferred way is to set them through -[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and passing it as an argument to your -SparkContext. SparkConf allows you to configure most of the common properties to initialize a -cluster (e.g. master URL and application name), as well as arbitrary key-value pairs through the +application. These properties can be set directly on a +[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and passed as an argument to your +SparkContext. SparkConf allows you to configure some of the common properties +(e.g. master URL and application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could initialize an application as follows: {% highlight scala %} @@ -32,8 +32,13 @@ val sc = new SparkContext(conf) ## Dynamically Loading Spark Properties In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For -instance, if you'd like to run the same applicaiton with different masters or different -amounts of memory. +instance, if you'd like to run the same application with different masters or different +amounts of memory. Spark allows you to omit this in your code: + +{% highlight scala %} +val conf = new SparkConf().setAppName("myApp") +{% endhighlight %} + The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support two ways to load configurations dynamically. When a SparkConf is created, it will read configuration options from `conf/spark-defaults.conf`, From 54b184d4a3c10386fd73cf8b8d0db7800d4ac560 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 21:40:10 -0700 Subject: [PATCH 06/16] Adding standalone configs to the standalone page --- docs/spark-standalone.md | 70 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 2 deletions(-) diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 489c6e36400d..cd3fbe8a9427 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -93,7 +93,7 @@ You can optionally configure the cluster further by setting environment variable SPARK_MASTER_OPTS - Configuration properties that apply only to the master in the form "-Dx=y" (default: none). + Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See below for a list of possible options. SPARK_LOCAL_DIRS @@ -134,7 +134,7 @@ You can optionally configure the cluster further by setting environment variable SPARK_WORKER_OPTS - Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). + Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options. 
SPARK_DAEMON_MEMORY @@ -152,6 +152,72 @@ You can optionally configure the cluster further by setting environment variable **Note:** The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand. +SPARK_MASTER_OPTS supports the following system properties: + + + + + + + + + + + + + + + + + + +
+<table class="table">
+  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr>
+    <td><code>spark.deploy.spreadOut</code></td>
+    <td>true</td>
+    <td>
+      Whether the standalone cluster manager should spread applications out across nodes or try
+      to consolidate them onto as few nodes as possible. Spreading out is usually better for
+      data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.deploy.defaultCores</code></td>
+    <td>(infinite)</td>
+    <td>
+      Default number of cores to give to applications in Spark's standalone mode if they don't
+      set <code>spark.cores.max</code>. If not set, applications always get all available
+      cores unless they configure <code>spark.cores.max</code> themselves.
+      Set this lower on a shared cluster to prevent users from grabbing
+      the whole cluster by default.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.timeout</code></td>
+    <td>60</td>
+    <td>
+      Number of seconds after which the standalone deploy master considers a worker lost if it
+      receives no heartbeats.
+    </td>
+  </tr>
+</table>
+
+SPARK_WORKER_OPTS supports the following system properties:
+
+<table class="table">
+  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr>
+    <td><code>spark.worker.cleanup.enabled</code></td>
+    <td>false</td>
+    <td>
+      Enable periodic cleanup of worker / application directories. Note that this only affects
+      standalone mode, as YARN works differently. Application directories are cleaned up
+      regardless of whether the application is still running.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.cleanup.interval</code></td>
+    <td>1800 (30 minutes)</td>
+    <td>
+      Controls the interval, in seconds, at which the worker cleans up old application work dirs
+      on the local machine.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.cleanup.appDataTtl</code></td>
+    <td>7 * 24 * 3600 (7 days)</td>
+    <td>
+      The number of seconds to retain application work directories on each worker. This is a Time To Live
+      and should depend on the amount of available disk space you have. Application logs and jars are
+      downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space,
+      especially if you run jobs very frequently.
+    </td>
+  </tr>
+</table>
# Connecting an Application to the Cluster To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext` From f7e79bc42c1635686c3af01eef147dae92de2529 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 21:43:11 -0700 Subject: [PATCH 07/16] Re-organizing config options. This uses the following categories: - Runtime Environment - Shuffle Behavior - Spark UI - Compression and Serialization - Execution Behavior - Networking - Scheduling - Security - Spark Streaming --- docs/configuration.md | 592 +++++++++++++++++++++--------------------- 1 file changed, 300 insertions(+), 292 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index bef75c58c362..9bb542482db5 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -33,26 +33,29 @@ val sc = new SparkContext(conf) ## Dynamically Loading Spark Properties In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For instance, if you'd like to run the same application with different masters or different -amounts of memory. Spark allows you to omit this in your code: +amounts of memory. Spark allows you to simply create an empty conf: {% highlight scala %} -val conf = new SparkConf().setAppName("myApp") +val sc = new SparkContext(new SparkConf()) {% endhighlight %} +Then, you can supply configuration values at runtime: +{% highlight bash %} +./bin/spark-submit --name "My fancy app" --master local[4] myApp.jar +{% endhighlight %} + +The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support +two ways to load configurations dynamically. The first are command line options, such as `--master`, as shown above. +Running `./bin/spark-submit --help` will show the entire list of options. -The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support two ways to load configurations dynamically. -When a SparkConf is created, it will read configuration options from `conf/spark-defaults.conf`, -in which each line consists of a key and a value separated by whitespace. For example, +`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in which each line consists +of a key and a value separated by whitespace. For example: spark.master spark://5.6.7.8:7077 spark.executor.memory 512m spark.eventLog.enabled true spark.serializer org.apache.spark.serializer.KryoSerializer - -In addition, when launching programs with the [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool, certain options can be configured as flags. For instance, the -`--master` flag to `spark-submit` will automatically set the master. Run `./bin/spark-submit --help` to see the entire list of options. - Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to `spark-submit` or `spark-shell`, then options @@ -67,16 +70,31 @@ appear. For all other configuration properties, you can assume the default value ## All Configuration Properties -Most of the properties that control internal settings have reasonable default values. However, -there are at least five properties that you will commonly want to control: +Most of the properties that control internal settings have reasonable default values. 
Some +of the most common options to set are: + + + + + + + + + + @@ -109,49 +127,94 @@ there are at least five properties that you will commonly want to control: list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or - LOCAL_DIRS (YARN) envrionment variables set by the cluster manager. + LOCAL_DIRS (YARN) environment variables set by the cluster manager. - - + +
Property NameDefaultMeaning
spark.app.name(none) + The name of your application. This will appear in the UI and in log data. +
spark.master(none) + The cluster manager to connect to. See the list of [allowed master URL's](scala-programming-guide.html#master-urls). +
spark.executor.memory 512m - Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). + Amount of memory to use per executor process, in the same format as JVM memory strings + (e.g. 512m, 2g).
spark.cores.max(not set)spark.logConffalse - When running on a standalone deploy cluster or a - Mesos cluster in "coarse-grained" - sharing mode, the maximum amount of CPU cores to request for the application from - across the cluster (not from each machine). If not set, the default will be - spark.deploy.defaultCores on Spark's standalone cluster manager, or - infinite (all available cores) on Mesos. + Logs the effective SparkConf as INFO when a SparkContext is started.
- Apart from these, the following properties are also available, and may be useful in some situations: +#### Runtime Environment - + + + + + + - - + + + + + + + + + + + + + +
Property NameDefaultMeaning
spark.default.parallelismspark.executor.memory512m -
    -
-    <ul>
-      <li>Local mode: number of cores on the local machine</li>
-      <li>Mesos fine grained mode: 8</li>
-      <li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
-    </ul>
+ Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).
spark.executor.extraJavaOptions(none) - Default number of tasks to use across the cluster for distributed shuffle operations (groupByKey, - reduceByKey, etc) when not set by user. + A string of extra JVM options to pass to executors. For instance, GC settings or other + logging. Note that it is illegal to set Spark properties or heap size settings with this + option. Spark properties should be set using a SparkConf object or the + spark-defaults.conf file used with the spark-submit script. Heap size settings can be set + with spark.executor.memory.
spark.storage.memoryFraction0.6spark.executor.extraClassPath(none) - Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" - generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase - it if you configure your own old generation size. + Extra classpath entries to append to the classpath of executors. This exists primarily + for backwards-compatibility with older versions of Spark. Users typically should not need + to set this option. +
spark.executor.extraLibraryPath(none) + Set a special library path to use when launching executor JVM's. +
spark.files.userClassPathFirstfalse + (Experimental) Whether to give user-added jars precedence over Spark's own jars when + loading classes in Executors. This feature can be used to mitigate conflicts between + Spark's dependencies and user dependencies. It is currently an experimental feature. +
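As a minimal sketch of how the executor options in this table might be set programmatically (the GC flags and paths below are placeholders, not recommendations):

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ExecutorOptionsExample")
  .set("spark.executor.memory", "2g")  // heap size goes here, not in extraJavaOptions
  .set("spark.executor.extraJavaOptions", "-verbose:gc -XX:+PrintGCDetails")  // illustrative GC logging flags
  .set("spark.executor.extraClassPath", "/opt/legacy/jars/*")                 // placeholder path
  .set("spark.executor.extraLibraryPath", "/opt/native/lib")                  // placeholder path
{% endhighlight %}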
+ +#### Shuffle Behavior + + + + + + + + + + + + + + + + @@ -166,40 +229,43 @@ Apart from these, the following properties are also available, and may be useful - - + + - - + + - - + + - - + + +
Property NameDefaultMeaning
spark.shuffle.consolidateFilesfalse + If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve + filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" + when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) + cores due to filesystem limitations. +
spark.shuffle.spilltrue + If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling + threshold is specified by spark.shuffle.memoryFraction. +
spark.shuffle.spill.compresstrue + Whether to compress data spilled during shuffles.
spark.storage.memoryMapThreshold8192spark.shuffle.compresstrue - Size of a block, in bytes, above which Spark memory maps when reading a block from disk. - This prevents Spark from memory mapping very small blocks. In general, memory - mapping has high overhead for blocks close to or below the page size of the operating system. + Whether to compress map output files. Generally a good idea.
spark.tachyonStore.baseDirSystem.getProperty("java.io.tmpdir")spark.shuffle.file.buffer.kb100 - Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by spark.tachyonStore.url. - It can also be a comma-separated list of multiple directories on Tachyon file system. + Size of the in-memory buffer for each shuffle file output stream, in kilobytes. These buffers + reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
spark.tachyonStore.urltachyon://localhost:19998spark.storage.memoryMapThreshold8192 - The URL of the underlying Tachyon file system in the TachyonStore. + Size of a block, in bytes, above which Spark memory maps when reading a block from disk. + This prevents Spark from memory mapping very small blocks. In general, memory + mapping has high overhead for blocks close to or below the page size of the operating system.
spark.mesos.coarsefalsespark.reducer.maxMbInFlight48 - If set to "true", runs over Mesos clusters in - "coarse-grained" sharing mode, - where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. - This gives lower-latency scheduling for short queries, but leaves resources in use for the whole - duration of the Spark job. + Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since + each output requires us to create a buffer to receive it, this represents a fixed memory overhead + per reduce task, so keep it small unless you have a large amount of memory.
+ +#### Spark UI + + @@ -215,54 +281,40 @@ Apart from these, the following properties are also available, and may be useful - - + + - + - - - - - - - - - - - - + + - - + + +
Property NameDefaultMeaning
spark.ui.port 4040
spark.ui.filtersNonespark.ui.killEnabledtrue - Comma separated list of filter class names to apply to the Spark web ui. The filter should be a - standard javax servlet Filter. Parameters to each filter can also be specified by setting a - java system property of spark.<class name of filter>.params='param1=value1,param2=value2' - (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing') + Allows stages and corresponding jobs to be killed from the web ui.
spark.ui.acls.enablespark.eventLog.enabled false - Whether spark web ui acls should are enabled. If enabled, this checks to see if the user has - access permissions to view the web ui. See spark.ui.view.acls for more details. - Also note this requires the user to be known, if the user comes across as null no checks - are done. Filters can be used to authenticate and set the user. -
spark.ui.view.aclsEmpty - Comma separated list of users that have view access to the spark web ui. By default only the - user that started the Spark job has view access. -
spark.ui.killEnabledtrue - Allows stages and corresponding jobs to be killed from the web ui. + Whether to log spark events, useful for reconstructing the Web UI after the application has finished.
spark.shuffle.compresstruespark.eventLog.compressfalse - Whether to compress map output files. Generally a good idea. + Whether to compress logged events, if spark.eventLog.enabled is true.
spark.shuffle.spill.compresstruespark.eventLog.dirfile:///tmp/spark-events - Whether to compress data spilled during shuffles. + Base directory in which spark events are logged, if spark.eventLog.enabled is true. + Within this base directory, Spark creates a sub-directory for each application, and logs the events + specific to the application in this directory.
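A short sketch of turning on event logging so a finished application can still be inspected in the UI (the HDFS directory is a placeholder):

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("EventLogExample")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.compress", "true")
  .set("spark.eventLog.dir", "hdfs://namenode:8021/user/spark/eventlogs") // placeholder location
{% endhighlight %}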
+ +#### Compression and Serialization + + @@ -294,36 +346,21 @@ Apart from these, the following properties are also available, and may be useful - - - - - - - - - - - - + + - - + + @@ -345,15 +382,23 @@ Apart from these, the following properties are also available, and may be useful exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker. +
Property NameDefaultMeaning
spark.broadcast.compress true
spark.scheduler.modeFIFO - The scheduling mode between - jobs submitted to the same SparkContext. Can be set to FAIR - to use fair sharing instead of queueing jobs one after another. Useful for - multi-user services. -
spark.scheduler.revive.interval1000 - The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds) -
spark.reducer.maxMbInFlight48spark.closure.serializerorg.apache.spark.serializer.
JavaSerializer
- Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since - each output requires us to create a buffer to receive it, this represents a fixed memory overhead - per reduce task, so keep it small unless you have a large amount of memory. + Serializer class to use for closures. Currently only the Java serializer is supported.
spark.closure.serializerorg.apache.spark.serializer.
JavaSerializer
spark.serializer.objectStreamReset10000 - Serializer class to use for closures. Currently only the Java serializer is supported. + When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches + objects to prevent writing redundant data, however that stops garbage collection of those + objects. By calling 'reset' you flush that info from the serializer, and allow old + objects to be collected. To turn off this periodic reset set it to a value <= 0. + By default it will reset the serializer every 10,000 objects.
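To make the serialization settings concrete, a minimal Kryo sketch in the style of the tuning guide (the Point class and registrator name are hypothetical):

{% highlight scala %}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class Point(x: Double, y: Double)   // hypothetical application class

class MyRegistrator extends KryoRegistrator {
  def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Point])
  }
}

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator") // fully-qualified name of the registrator class
{% endhighlight %}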
+ +#### Execution Behavior + + - - + + @@ -364,73 +409,70 @@ Apart from these, the following properties are also available, and may be useful - - + + - - + + - - + + - - + + - - + + - - + + +
Property NameDefaultMeaning
spark.serializer.objectStreamReset10000spark.default.parallelism - When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches - objects to prevent writing redundant data, however that stops garbage collection of those - objects. By calling 'reset' you flush that info from the serializer, and allow old - objects to be collected. To turn off this periodic reset set it to a value <= 0. - By default it will reset the serializer every 10,000 objects. +
    +
+    <ul>
+      <li>Local mode: number of cores on the local machine</li>
+      <li>Mesos fine grained mode: 8</li>
+      <li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
+    </ul>
+
+ Default number of tasks to use across the cluster for distributed shuffle operations (groupByKey, + reduceByKey, etc) when not set by user.
spark.locality.wait3000spark.broadcast.blockSize4096 - Number of milliseconds to wait to launch a data-local task before giving up and launching it - on a less-local node. The same wait will be used to step through multiple locality levels - (process-local, node-local, rack-local and then any). It is also possible to customize the - waiting time for each level by setting spark.locality.wait.node, etc. - You should increase this setting if your tasks are long and see poor locality, but the - default usually works well. + Size of each piece of a block in kilobytes for TorrentBroadcastFactory. + Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, + BlockManager might take a performance hit.
spark.locality.wait.processspark.locality.waitspark.files.overwritefalse - Customize the locality wait for process locality. This affects tasks that attempt to access - cached data in a particular executor process. + Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source.
spark.locality.wait.nodespark.locality.waitspark.files.fetchTimeoutfalse - Customize the locality wait for node locality. For example, you can set this to 0 to skip - node locality and search immediately for rack locality (if your cluster has rack information). + Communication timeout to use when fetching files added through SparkContext.addFile() from + the driver.
spark.locality.wait.rackspark.locality.waitspark.storage.memoryFraction0.6 - Customize the locality wait for rack locality. + Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" + generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase + it if you configure your own old generation size.
spark.worker.timeout60spark.tachyonStore.baseDirSystem.getProperty("java.io.tmpdir") - Number of seconds after which the standalone deploy master considers a worker lost if it - receives no heartbeats. + Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by spark.tachyonStore.url. + It can also be a comma-separated list of multiple directories on Tachyon file system.
spark.worker.cleanup.enabledfalsespark.tachyonStore.urltachyon://localhost:19998 - Enable periodic cleanup of worker / application directories. Note that this only affects standalone - mode, as YARN works differently. Applications directories are cleaned up regardless of whether - the application is still running. + The URL of the underlying Tachyon file system in the TachyonStore.
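As an example of how spark.default.parallelism interacts with per-operation arguments (the input path is a placeholder):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "64")   // used when no partition count is given

val sc = new SparkContext(conf)
val pairs = sc.textFile("hdfs://namenode:8021/data/input.txt")  // placeholder path
  .flatMap(_.split(" "))
  .map(word => (word, 1))

val counts = pairs.reduceByKey(_ + _)           // 64 reduce tasks, from spark.default.parallelism
val countsWide = pairs.reduceByKey(_ + _, 128)  // explicit argument overrides the default
{% endhighlight %}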
+ +#### Networking + + - - + + - - + + @@ -478,47 +520,16 @@ Apart from these, the following properties are also available, and may be useful This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensistive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those. +
Property NameDefaultMeaning
spark.worker.cleanup.interval1800 (30 minutes)spark.driver.host(local hostname) - Controls the interval, in seconds, at which the worker cleans up old application work dirs - on the local machine. + Hostname or IP address for the driver to listen on.
spark.worker.cleanup.appDataTtl7 * 24 * 3600 (7 days)spark.driver.port(random) - The number of seconds to retain application work directories on each worker. This is a Time To Live - and should depend on the amount of available disk space you have. Application logs and jars are - downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, - especially if you run jobs very frequently. + Port for the driver to listen on.
+ +#### Scheduling + + - - - - - - - - - - - - - - - - - - - - - - + + @@ -530,35 +541,36 @@ Apart from these, the following properties are also available, and may be useful - - - - - - - - + + - - + + - - + + @@ -590,83 +602,52 @@ Apart from these, the following properties are also available, and may be useful - - - - - - - - - - - - - - - - - - - - - - + + - - + + - - + + - - + + - - + + +
Property NameDefaultMeaning
spark.driver.host(local hostname) - Hostname or IP address for the driver to listen on. -
spark.driver.port(random) - Port for the driver to listen on. -
spark.cleaner.ttl(infinite) - Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.). - Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is - useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming - applications). Note that any RDD that persists in memory for more than this duration will be cleared as well. -
spark.streaming.blockInterval200 - Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced - into blocks of data before storing them in Spark. -
spark.streaming.unpersisttruespark.task.cpus1 - Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from - Spark's memory. The raw input data received by Spark Streaming is also automatically cleared. - Setting this to false will allow the raw data and persisted RDDs to be accessible outside the - streaming application as they will not be cleared automatically. But it comes at the cost of - higher memory usage in Spark. + Number of cores to allocate for each task.
spark.broadcast.blockSize4096 - Size of each piece of a block in kilobytes for TorrentBroadcastFactory. - Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, BlockManager might take a performance hit. -
spark.shuffle.consolidateFilesfalsespark.scheduler.modeFIFO - If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) cores due to filesystem limitations. + The scheduling mode between + jobs submitted to the same SparkContext. Can be set to FAIR + to use fair sharing instead of queueing jobs one after another. Useful for + multi-user services.
spark.shuffle.file.buffer.kb100spark.cores.max(not set) - Size of the in-memory buffer for each shuffle file output stream, in kilobytes. These buffers - reduce the number of disk seeks and system calls made in creating intermediate shuffle files. + When running on a standalone deploy cluster or a + Mesos cluster in "coarse-grained" + sharing mode, the maximum amount of CPU cores to request for the application from + across the cluster (not from each machine). If not set, the default will be + spark.deploy.defaultCores on Spark's standalone cluster manager, or + infinite (all available cores) on Mesos.
spark.shuffle.spilltruespark.mesos.coarsefalse - If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling - threshold is specified by spark.shuffle.memoryFraction. + If set to "true", runs over Mesos clusters in + "coarse-grained" sharing mode, + where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. + This gives lower-latency scheduling for short queries, but leaves resources in use for the whole + duration of the Spark job.
spark.logConffalse - Whether to log the supplied SparkConf as INFO when a SparkContext is started. -
spark.eventLog.enabledfalse - Whether to log spark events, useful for reconstructing the Web UI after the application has finished. -
spark.eventLog.compressfalse - Whether to compress logged events, if spark.eventLog.enabled is true. -
spark.eventLog.dirfile:///tmp/spark-events - Base directory in which spark events are logged, if spark.eventLog.enabled is true. - Within this base directory, Spark creates a sub-directory for each application, and logs the events - specific to the application in this directory. -
spark.deploy.spreadOuttruespark.locality.wait3000 - Whether the standalone cluster manager should spread applications out across nodes or try - to consolidate them onto as few nodes as possible. Spreading out is usually better for - data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
- Note: this setting needs to be configured in the standalone cluster master, not in individual - applications; you can set it through SPARK_MASTER_OPTS in spark-env.sh. + Number of milliseconds to wait to launch a data-local task before giving up and launching it + on a less-local node. The same wait will be used to step through multiple locality levels + (process-local, node-local, rack-local and then any). It is also possible to customize the + waiting time for each level by setting spark.locality.wait.node, etc. + You should increase this setting if your tasks are long and see poor locality, but the + default usually works well.
spark.deploy.defaultCores(infinite)spark.locality.wait.processspark.locality.wait - Default number of cores to give to applications in Spark's standalone mode if they don't - set spark.cores.max. If not set, applications always get all available - cores unless they configure spark.cores.max themselves. - Set this lower on a shared cluster to prevent users from grabbing - the whole cluster by default.
- Note: this setting needs to be configured in the standalone cluster master, not in individual - applications; you can set it through SPARK_MASTER_OPTS in spark-env.sh. + Customize the locality wait for process locality. This affects tasks that attempt to access + cached data in a particular executor process.
spark.files.overwritefalsespark.locality.wait.nodespark.locality.wait - Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source. + Customize the locality wait for node locality. For example, you can set this to 0 to skip + node locality and search immediately for rack locality (if your cluster has rack information).
spark.files.fetchTimeoutfalsespark.locality.wait.rackspark.locality.wait - Communication timeout to use when fetching files added through SparkContext.addFile() from - the driver. + Customize the locality wait for rack locality.
spark.files.userClassPathFirstfalsespark.scheduler.revive.interval1000 - (Experimental) Whether to give user-added jars precedence over Spark's own jars when - loading classes in Executors. This feature can be used to mitigate conflicts between - Spark's dependencies and user dependencies. It is currently an experimental feature. + The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
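A rough sketch combining a few of the scheduling properties above for an application on a shared cluster (values are illustrative):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SchedulingExample")
  .set("spark.scheduler.mode", "FAIR")  // fair sharing between jobs within this application
  .set("spark.cores.max", "32")         // cap the cores taken from a standalone/Mesos cluster
  .set("spark.locality.wait", "5000")   // wait up to 5 seconds for a data-local slot

val sc = new SparkContext(conf)
{% endhighlight %}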
+ +#### Security + + @@ -692,40 +673,67 @@ Apart from these, the following properties are also available, and may be useful - - + + - - + + - - + + +
Property NameDefaultMeaning
spark.authenticate false
spark.task.cpus1spark.ui.filtersNone - Number of cores to allocate for each task. + Comma separated list of filter class names to apply to the Spark web ui. The filter should be a + standard javax servlet Filter. Parameters to each filter can also be specified by setting a + java system property of spark.<class name of filter>.params='param1=value1,param2=value2' + (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
spark.executor.extraJavaOptions(none)spark.ui.acls.enablefalse - A string of extra JVM options to pass to executors. For instance, GC settings or other - logging. Note that it is illegal to set Spark properties or heap size settings with this - option. Spark properties should be set using a SparkConf object or the - spark-defaults.conf file used with the spark-submit script. Heap size settings can be set - with spark.executor.memory. + Whether Spark web UI ACLs are enabled. If enabled, this checks to see if the user has + access permissions to view the web ui. See spark.ui.view.acls for more details. + Also note this requires the user to be known; if the user comes across as null, no checks + are done. Filters can be used to authenticate and set the user.
spark.executor.extraClassPath(none)spark.ui.view.aclsEmpty - Extra classpath entries to append to the classpath of executors. This exists primarily - for backwards-compatibility with older versions of Spark. Users typically should not need - to set this option. + Comma separated list of users that have view access to the spark web ui. By default only the + user that started the Spark job has view access.
+ +#### Spark Streaming + + - - + + + + + + + + + + + + -
Property NameDefaultMeaning
spark.executor.extraLibraryPath(none)spark.cleaner.ttl(infinite) - Set a special library path to use when launching executor JVM's. + Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.). + Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is + useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming + applications). Note that any RDD that persists in memory for more than this duration will be cleared as well. +
spark.streaming.blockInterval200 + Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced + into blocks of data before storing them in Spark. +
spark.streaming.unpersisttrue + Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from + Spark's memory. The raw input data received by Spark Streaming is also automatically cleared. + Setting this to false will allow the raw data and persisted RDDs to be accessible outside the + streaming application as they will not be cleared automatically. But it comes at the cost of + higher memory usage in Spark.
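A brief sketch of how these streaming properties might be applied; the socket source and batch interval are placeholders:

{% highlight scala %}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("StreamingConfExample")
  .set("spark.streaming.blockInterval", "100") // coalesce received data into 100 ms blocks
  .set("spark.streaming.unpersist", "true")    // drop raw input and generated RDDs once processed
  .set("spark.cleaner.ttl", "3600")            // forget metadata older than one hour

val ssc = new StreamingContext(conf, Seconds(1))
ssc.socketTextStream("localhost", 9999).count().print()
ssc.start()
ssc.awaitTermination()
{% endhighlight %}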
# Environment Variables From 106ee312469824959ef301ed4899f91d97099fdd Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 22:33:18 -0700 Subject: [PATCH 08/16] Small link fix --- docs/configuration.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 9bb542482db5..c4b5c73e2df1 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -86,7 +86,8 @@ of the most common options to set are: spark.master (none) - The cluster manager to connect to. See the list of [allowed master URL's](scala-programming-guide.html#master-urls). + The cluster manager to connect to. See the list of + allowed master URL's. From 3289ea4f852408e440ca41056405265d80248089 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 23:14:00 -0700 Subject: [PATCH 09/16] Pulling in changes from #856 --- docs/configuration.md | 65 +++++++++++++++++------------ docs/spark-standalone.md | 88 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 127 insertions(+), 26 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index c4b5c73e2df1..f68f8d116c66 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -3,15 +3,8 @@ layout: global title: Spark Configuration --- -Spark provides three locations to configure the system: - -* [Spark properties](#spark-properties) control most application parameters and can be set by - passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, - or through the `conf/spark-defaults.conf` properties file. -* [Environment variables](#environment-variables) can be used to set per-machine settings, such as - the IP address, through the `conf/spark-env.sh` script on each node. -* [Logging](#configuring-logging) can be configured through `log4j.properties`. - +* This will become a table of contents (this text will be scraped). +{:toc} # Spark Properties @@ -149,7 +142,8 @@ Apart from these, the following properties are also available, and may be useful spark.executor.memory 512m - Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). + Amount of memory to use per executor process, in the same format as JVM memory strings + (e.g. 512m, 2g). @@ -422,7 +416,8 @@ Apart from these, the following properties are also available, and may be useful spark.files.overwrite false - Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source. + Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not + match those of the source. @@ -446,8 +441,9 @@ Apart from these, the following properties are also available, and may be useful spark.tachyonStore.baseDir System.getProperty("java.io.tmpdir") - Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by spark.tachyonStore.url. - It can also be a comma-separated list of multiple directories on Tachyon file system. + Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by + spark.tachyonStore.url. It can also be a comma-separated list of multiple directories + on Tachyon file system. @@ -504,21 +500,33 @@ Apart from these, the following properties are also available, and may be useful spark.akka.heartbeat.pauses 600 - This is set to a larger value to disable failure detector that comes inbuilt akka. 
It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and `spark.akka.failure-detector.threshold` if you need to. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you + plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to + control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and + `spark.akka.failure-detector.threshold` if you need to. spark.akka.failure-detector.threshold 300.0 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you + plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`. + Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to. spark.akka.heartbeat.interval 1000 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensistive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you + plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a + smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination + of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use + case for using failure detector can be, a sensistive failure detector can help evict rogue executors really + quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. + Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the + network with those. @@ -578,7 +586,8 @@ Apart from these, the following properties are also available, and may be useful spark.speculation false - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched. + If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a + stage, they will be re-launched. 
@@ -739,13 +748,13 @@ Apart from these, the following properties are also available, and may be useful # Environment Variables -Certain Spark settings can be configured through environment variables, which are read from the `conf/spark-env.sh` -script in the directory where Spark is installed (or `conf/spark-env.cmd` on Windows). In Standalone and Mesos modes, -this file can give machine specific information such as hostnames. It is also sourced when running local -Spark applications or submission scripts. +Certain Spark settings can be configured through environment variables, which are read from the +`conf/spark-env.sh` script in the directory where Spark is installed (or `conf/spark-env.cmd` on +Windows). In Standalone and Mesos modes, this file can give machine specific information such as +hostnames. It is also sourced when running local Spark applications or submission scripts. -Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can copy -`conf/spark-env.sh.template` to create it. Make sure you make the copy executable. +Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can +copy `conf/spark-env.sh.template` to create it. Make sure you make the copy executable. The following variables can be set in `spark-env.sh`: @@ -770,12 +779,16 @@ The following variables can be set in `spark-env.sh`: -In addition to the above, there are also options for setting up the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each machine and maximum memory. +In addition to the above, there are also options for setting up the Spark +[standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each +machine and maximum memory. Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface. # Configuring Logging -Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties` -file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there. +Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a +`log4j.properties` file in the `conf` directory. One way to start is to copy the existing +`log4j.properties.template` located there. + diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index cd3fbe8a9427..15c5816182a8 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -286,6 +286,94 @@ In addition, detailed log output for each job is also written to the work direct You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use a hdfs:// URL (typically `hdfs://:9000/path`, but you can find the right URL on your Hadoop Namenode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on). +# Configuring Ports for Network Security + +Spark makes heavy use of the network, and some environments have strict requirements for using tight +firewall settings. 
Below are the primary ports that Spark uses for its communication and how to +configure those ports. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+<table class="table">
+  <tr>
+    <th>From</th><th>To</th><th>Default Port</th><th>Purpose</th><th>Configuration
+    Setting</th><th>Notes</th>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>Standalone Cluster Master</td>
+    <td>8080</td>
+    <td>Web UI</td>
+    <td><code>master.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>Driver</td>
+    <td>4040</td>
+    <td>Web UI</td>
+    <td><code>spark.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>History Server</td>
+    <td>18080</td>
+    <td>Web UI</td>
+    <td><code>spark.history.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>Worker</td>
+    <td>8081</td>
+    <td>Web UI</td>
+    <td><code>worker.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Application</td>
+    <td>Standalone Cluster Master</td>
+    <td>7077</td>
+    <td>Submit job to cluster</td>
+    <td><code>spark.driver.port</code></td>
+    <td>Akka-based. Set to "0" to choose a port randomly</td>
+  </tr>
+  <tr>
+    <td>Worker</td>
+    <td>Standalone Cluster Master</td>
+    <td>7077</td>
+    <td>Join cluster</td>
+    <td><code>spark.driver.port</code></td>
+    <td>Akka-based. Set to "0" to choose a port randomly</td>
+  </tr>
+  <tr>
+    <td>Application</td>
+    <td>Worker</td>
+    <td>(random)</td>
+    <td>Join cluster</td>
+    <td><code>SPARK_WORKER_PORT</code> (standalone cluster)</td>
+    <td>Akka-based</td>
+  </tr>
+  <tr>
+    <td>Driver and other Workers</td>
+    <td>Worker</td>
+    <td>(random)</td>
+    <td>
+      <ul>
+        <li>File server for file and jars</li>
+        <li>Http Broadcast</li>
+        <li>Class file server (Spark Shell only)</li>
+      </ul>
+    </td>
+    <td>None</td>
+    <td>Jetty-based. Each of these services starts on a random port that cannot be configured</td>
+  </tr>
+</table>
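For the entries above that are controlled by Spark properties (as opposed to the standalone daemons' own settings), a minimal sketch of pinning them down for a restrictive firewall (the hostname and port numbers are placeholders):

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("FirewallFriendlyApp")
  .set("spark.driver.host", "driver.example.internal") // placeholder hostname the workers can reach
  .set("spark.driver.port", "7001")                    // fixed port instead of a random one
  .set("spark.ui.port", "4040")                        // driver web UI
{% endhighlight %}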
+ # High Availability By default, standalone scheduling clusters are resilient to Worker failures (insofar as Spark itself is resilient to losing work by moving it to other workers). However, the scheduler uses a Master to make scheduling decisions, and this (by default) creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, we have two high availability schemes, detailed below. From a374369e63d1c48cd71c4280167eb62607f9c9c3 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 23:38:45 -0700 Subject: [PATCH 10/16] Line wrapping fixes --- docs/configuration.md | 159 +++++++++++++++++++++++------------------- 1 file changed, 87 insertions(+), 72 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index d8a2360a7b3f..900cb884dc31 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -39,12 +39,13 @@ Then, you can supply configuration values at runtime: ./bin/spark-submit --name "My fancy app" --master local[4] myApp.jar {% endhighlight %} -The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support -two ways to load configurations dynamically. The first are command line options, such as `--master`, as shown above. -Running `./bin/spark-submit --help` will show the entire list of options. +The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) +tool support two ways to load configurations dynamically. The first are command line options, +such as `--master`, as shown above. Running `./bin/spark-submit --help` will show the entire list +of options. -`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in which each line consists -of a key and a value separated by whitespace. For example: +`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in which +each line consists of a key and a value separated by whitespace. For example: spark.master spark://5.6.7.8:7077 spark.executor.memory 512m @@ -81,8 +82,8 @@ of the most common options to set are: spark.master (none) - The cluster manager to connect to. See the list of - allowed master URL's. + The cluster manager to connect to. See the list of + allowed master URL's. @@ -98,10 +99,12 @@ of the most common options to set are: org.apache.spark.serializer.
JavaSerializer Class to use for serializing objects that will be sent over the network or need to be cached - in serialized form. The default of Java serialization works with any Serializable Java object but is - quite slow, so we recommend using org.apache.spark.serializer.KryoSerializer - and configuring Kryo serialization when speed is necessary. Can be any subclass of - org.apache.spark.Serializer. + in serialized form. The default of Java serialization works with any Serializable Java object + but is quite slow, so we recommend using + org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization + when speed is necessary. Can be any subclass of + + org.apache.spark.Serializer. @@ -110,7 +113,8 @@ of the most common options to set are: If you use Kryo serialization, set this class to register your custom classes with Kryo. It should be set to a class that extends - KryoRegistrator. + + KryoRegistrator. See the tuning guide for more details. @@ -118,9 +122,9 @@ of the most common options to set are: spark.local.dir /tmp - Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored - on disk. This should be on a fast, local disk in your system. It can also be a comma-separated - list of multiple directories on different disks. + Directory to use for "scratch" space in Spark, including map output files and RDDs that get + stored on disk. This should be on a fast, local disk in your system. It can also be a + comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager. @@ -193,18 +197,18 @@ Apart from these, the following properties are also available, and may be useful spark.shuffle.consolidateFiles false - If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve - filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" - when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) - cores due to filesystem limitations. + If set to "true", consolidates intermediate files created during a shuffle. Creating fewer + files can improve filesystem performance for shuffles with large numbers of reduce tasks. It + is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option + might degrade performance on machines with many (>8) cores due to filesystem limitations. spark.shuffle.spill true - If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling - threshold is specified by spark.shuffle.memoryFraction. + If set to "true", limits the amount of memory used during reduces by spilling data out to disk. + This spilling threshold is specified by spark.shuffle.memoryFraction. @@ -254,8 +258,8 @@ Apart from these, the following properties are also available, and may be useful 48 Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since - each output requires us to create a buffer to receive it, this represents a fixed memory overhead - per reduce task, so keep it small unless you have a large amount of memory. + each output requires us to create a buffer to receive it, this represents a fixed memory + overhead per reduce task, so keep it small unless you have a large amount of memory. 
@@ -288,7 +292,8 @@ Apart from these, the following properties are also available, and may be useful spark.eventLog.enabled false - Whether to log spark events, useful for reconstructing the Web UI after the application has finished. + Whether to log spark events, useful for reconstructing the Web UI after the application has + finished. @@ -303,8 +308,8 @@ Apart from these, the following properties are also available, and may be useful file:///tmp/spark-events Base directory in which spark events are logged, if spark.eventLog.enabled is true. - Within this base directory, Spark creates a sub-directory for each application, and logs the events - specific to the application in this directory. + Within this base directory, Spark creates a sub-directory for each application, and logs the + events specific to the application in this directory. @@ -323,23 +328,26 @@ Apart from these, the following properties are also available, and may be useful spark.rdd.compress false - Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER). - Can save substantial space at the cost of some extra CPU time. + Whether to compress serialized RDD partitions (e.g. for + StorageLevel.MEMORY_ONLY_SER). Can save substantial space at the cost of some + extra CPU time. spark.io.compression.codec org.apache.spark.io.
LZFCompressionCodec - The codec used to compress internal data such as RDD partitions and shuffle outputs. By default, Spark provides two - codecs: org.apache.spark.io.LZFCompressionCodec and org.apache.spark.io.SnappyCompressionCodec. + The codec used to compress internal data such as RDD partitions and shuffle outputs. + By default, Spark provides two codecs: org.apache.spark.io.LZFCompressionCodec + and org.apache.spark.io.SnappyCompressionCodec. spark.io.compression.snappy.block.size 32768 - Block size (in bytes) used in Snappy compression, in the case when Snappy compression codec is used. + Block size (in bytes) used in Snappy compression, in the case when Snappy compression codec + is used. @@ -376,7 +384,8 @@ Apart from these, the following properties are also available, and may be useful Maximum object size to allow within Kryo (the library needs to create a buffer at least as large as the largest single object you'll serialize). Increase this if you get a "buffer limit - exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker. + exceeded" exception inside Kryo. Note that there will be one buffer per core on each + worker. @@ -394,8 +403,8 @@ Apart from these, the following properties are also available, and may be useful - Default number of tasks to use across the cluster for distributed shuffle operations (groupByKey, - reduceByKey, etc) when not set by user. + Default number of tasks to use across the cluster for distributed shuffle operations + (groupByKey, reduceByKey, etc) when not set by user. @@ -410,16 +419,16 @@ Apart from these, the following properties are also available, and may be useful 4096 Size of each piece of a block in kilobytes for TorrentBroadcastFactory. - Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, - BlockManager might take a performance hit. + Too large a value decreases parallelism during broadcast (makes it slower); however, if it is + too small, BlockManager might take a performance hit. spark.files.overwrite false - Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not - match those of the source. + Whether to overwrite files added through SparkContext.addFile() when the target file exists and + its contents do not match those of the source. @@ -435,8 +444,8 @@ Apart from these, the following properties are also available, and may be useful 0.6 Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" - generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase - it if you configure your own old generation size. + generation of objects in the JVM, which by default is given 0.6 of the heap, but you can + increase it if you configure your own old generation size. @@ -444,8 +453,8 @@ Apart from these, the following properties are also available, and may be useful System.getProperty("java.io.tmpdir") Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by - spark.tachyonStore.url. It can also be a comma-separated list of multiple directories - on Tachyon file system. + spark.tachyonStore.url. It can also be a comma-separated list of multiple + directories on Tachyon file system. 
@@ -502,33 +511,36 @@ Apart from these, the following properties are also available, and may be useful spark.akka.heartbeat.pauses 600 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you - plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to - control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and - `spark.akka.failure-detector.threshold` if you need to. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be + enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause + in seconds for akka. This can be used to control sensitivity to gc pauses. Tune this in + combination of `spark.akka.heartbeat.interval` and `spark.akka.failure-detector.threshold` + if you need to. spark.akka.failure-detector.threshold 300.0 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you - plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`. - Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be + enabled again, if you plan to use this feature (Not recommended). This maps to akka's + `akka.remote.transport-failure-detector.threshold`. Tune this in combination of + `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to. spark.akka.heartbeat.interval 1000 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you - plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a - smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination - of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use - case for using failure detector can be, a sensistive failure detector can help evict rogue executors really - quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. - Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the - network with those. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be + enabled again, if you plan to use this feature (Not recommended). A larger interval value in + seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for + akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and + `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using + failure detector can be, a sensistive failure detector can help evict rogue executors really + quick. However this is usually not the case as gc pauses and network lags are expected in a + real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats + between nodes leading to flooding the network with those. 
@@ -579,17 +591,17 @@ Apart from these, the following properties are also available, and may be useful If set to "true", runs over Mesos clusters in "coarse-grained" sharing mode, - where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. - This gives lower-latency scheduling for short queries, but leaves resources in use for the whole - duration of the Spark job. + where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per + Spark task. This gives lower-latency scheduling for short queries, but leaves resources in use + for the whole duration of the Spark job. spark.speculation false - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a - stage, they will be re-launched. + If set to "true", performs speculative execution of tasks. This means if one or more tasks are + running slowly in a stage, they will be re-launched. @@ -652,7 +664,8 @@ Apart from these, the following properties are also available, and may be useful spark.scheduler.revive.interval 1000 - The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds) + The interval length for the scheduler to revive the worker resource offers to run tasks. + (in milliseconds) @@ -664,8 +677,8 @@ Apart from these, the following properties are also available, and may be useful spark.authenticate false - Whether spark authenticates its internal connections. See spark.authenticate.secret if not - running on Yarn. + Whether spark authenticates its internal connections. See + spark.authenticate.secret if not running on Yarn. @@ -691,7 +704,8 @@ Apart from these, the following properties are also available, and may be useful Comma separated list of filter class names to apply to the Spark web ui. The filter should be a standard javax servlet Filter. Parameters to each filter can also be specified by setting a java system property of spark.<class name of filter>.params='param1=value1,param2=value2' - (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing') + (e.g. -Dspark.ui.filters=com.test.filter1 + -Dspark.com.test.filter1.params='param1=foo,param2=testing') @@ -721,10 +735,11 @@ Apart from these, the following properties are also available, and may be useful spark.cleaner.ttl (infinite) - Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.). - Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is - useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming - applications). Note that any RDD that persists in memory for more than this duration will be cleared as well. + Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks + generated, etc.). Periodic cleanups will ensure that metadata older than this duration will be + forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in + case of Spark Streaming applications). Note that any RDD that persists in memory for more than + this duration will be cleared as well. @@ -782,8 +797,8 @@ The following variables can be set in `spark-env.sh`: In addition to the above, there are also options for setting up the Spark -[standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each -machine and maximum memory. 
+[standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores +to use on each machine and maximum memory. Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface. From 27d57db59621703c948de02226bb1bc1d382aad1 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 11:19:28 -0700 Subject: [PATCH 11/16] Reverting changes to index.html (covered in #896) --- docs/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/index.md b/docs/index.md index fb75bc678c8a..c9b10376cc80 100644 --- a/docs/index.md +++ b/docs/index.md @@ -5,7 +5,7 @@ title: Spark Overview Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. -It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) (SQL on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). +It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). # Downloading From e0c17289ec77c7a2b9c717fbe5939435e2e2bb9e Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 11:33:45 -0700 Subject: [PATCH 12/16] Response to Matei's review --- docs/configuration.md | 65 ++++++++++++++++++++-------------------- docs/spark-standalone.md | 12 ++++---- 2 files changed, 39 insertions(+), 38 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index 900cb884dc31..9d00d25549ac 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -64,7 +64,7 @@ This is a useful place to check to make sure that your properties have been set that only values explicitly specified through either `spark-defaults.conf` or SparkConf will appear. For all other configuration properties, you can assume the default value is used. -## All Configuration Properties +## Available Properties Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are: @@ -72,14 +72,14 @@ of the most common options to set are: - + - + - - - - - @@ -292,7 +283,7 @@ Apart from these, the following properties are also available, and may be useful @@ -307,7 +298,7 @@ Apart from these, the following properties are also available, and may be useful @@ -457,6 +448,15 @@ Apart from these, the following properties are also available, and may be useful directories on Tachyon file system. + + + + + @@ -464,6 +464,17 @@ Apart from these, the following properties are also available, and may be useful The URL of the underlying Tachyon file system in the TachyonStore. + + + + +
Property Name | Default | Meaning
spark.app.namespark.app.name (none) The name of your application. This will appear in the UI and in log data.
spark.masterspark.master (none) The cluster manager to connect to. See the list of @@ -244,15 +244,6 @@ Apart from these, the following properties are also available, and may be useful reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
spark.storage.memoryMapThreshold8192 - Size of a block, in bytes, above which Spark memory maps when reading a block from disk. - This prevents Spark from memory mapping very small blocks. In general, memory - mapping has high overhead for blocks close to or below the page size of the operating system. -
spark.reducer.maxMbInFlight 48spark.eventLog.enabled false - Whether to log spark events, useful for reconstructing the Web UI after the application has + Whether to log Spark events, useful for reconstructing the Web UI after the application has finished.
spark.eventLog.dir file:///tmp/spark-events - Base directory in which spark events are logged, if spark.eventLog.enabled is true. + Base directory in which Spark events are logged, if spark.eventLog.enabled is true. Within this base directory, Spark creates a sub-directory for each application, and logs the events specific to the application in this directory.
spark.storage.memoryMapThreshold8192 + Size of a block, in bytes, above which Spark memory maps when reading a block from disk. + This prevents Spark from memory mapping very small blocks. In general, memory + mapping has high overhead for blocks close to or below the page size of the operating system. +
spark.tachyonStore.url tachyon://localhost:19998
spark.cleaner.ttl(infinite) + Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks + generated, etc.). Periodic cleanups will ensure that metadata older than this duration will be + forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in + case of Spark Streaming applications). Note that any RDD that persists in memory for more than + this duration will be cleared as well. +
#### Networking @@ -539,7 +550,7 @@ Apart from these, the following properties are also available, and may be useful `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensistive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a - real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats + real Spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those. @@ -677,8 +688,8 @@ Apart from these, the following properties are also available, and may be useful spark.authenticate false - Whether spark authenticates its internal connections. See - spark.authenticate.secret if not running on Yarn. + Whether Spark authenticates its internal connections. See + spark.authenticate.secret if not running on YARN. @@ -686,7 +697,7 @@ Apart from these, the following properties are also available, and may be useful None Set the secret key used for Spark to authenticate between components. This needs to be set if - not running on Yarn and authentication is enabled. + not running on YARN and authentication is enabled. @@ -702,7 +713,8 @@ Apart from these, the following properties are also available, and may be useful None Comma separated list of filter class names to apply to the Spark web ui. The filter should be a - standard javax servlet Filter. Parameters to each filter can also be specified by setting a + standard + javax servlet Filter. Parameters to each filter can also be specified by setting a java system property of spark.<class name of filter>.params='param1=value1,param2=value2' (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing') @@ -712,7 +724,7 @@ Apart from these, the following properties are also available, and may be useful spark.ui.acls.enable false - Whether spark web ui acls should are enabled. If enabled, this checks to see if the user has + Whether Spark web ui acls should are enabled. If enabled, this checks to see if the user has access permissions to view the web ui. See spark.ui.view.acls for more details. Also note this requires the user to be known, if the user comes across as null no checks are done. Filters can be used to authenticate and set the user. @@ -722,7 +734,7 @@ Apart from these, the following properties are also available, and may be useful spark.ui.view.acls Empty - Comma separated list of users that have view access to the spark web ui. By default only the + Comma separated list of users that have view access to the Spark web ui. By default only the user that started the Spark job has view access. @@ -731,17 +743,6 @@ Apart from these, the following properties are also available, and may be useful #### Spark Streaming - - - - - diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 15c5816182a8..20ae4b00115c 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -157,7 +157,7 @@ SPARK_MASTER_OPTS supports the following system properties:
Property Name | Default | Meaning
spark.cleaner.ttl(infinite) - Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks - generated, etc.). Periodic cleanups will ensure that metadata older than this duration will be - forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in - case of Spark Streaming applications). Note that any RDD that persists in memory for more than - this duration will be cleared as well. -
spark.streaming.blockInterval 200
- + - + - + @@ -239,7 +240,8 @@ Apart from these, the following properties are also available, and may be useful @@ -306,7 +308,8 @@ Apart from these, the following properties are also available, and may be useful
Property Name | Default | Meaning
spark.deploy.spreadOutspark.deploy.spreadOut true Whether the standalone cluster manager should spread applications out across nodes or try @@ -166,7 +166,7 @@ SPARK_MASTER_OPTS supports the following system properties:
spark.deploy.defaultCoresspark.deploy.defaultCores (infinite) Default number of cores to give to applications in Spark's standalone mode if they don't @@ -177,7 +177,7 @@ SPARK_MASTER_OPTS supports the following system properties:
spark.worker.timeoutspark.worker.timeout 60 Number of seconds after which the standalone deploy master considers a worker lost if it @@ -191,7 +191,7 @@ SPARK_WORKER_OPTS supports the following system properties: - + - + - +
Property Name | Default | Meaning
spark.worker.cleanup.enabledspark.worker.cleanup.enabled false Enable periodic cleanup of worker / application directories. Note that this only affects standalone @@ -200,7 +200,7 @@ SPARK_WORKER_OPTS supports the following system properties:
spark.worker.cleanup.intervalspark.worker.cleanup.interval 1800 (30 minutes) Controls the interval, in seconds, at which the worker cleans up old application work dirs @@ -208,7 +208,7 @@ SPARK_WORKER_OPTS supports the following system properties:
spark.worker.cleanup.appDataTtlspark.worker.cleanup.appDataTtl 7 * 24 * 3600 (7 days) The number of seconds to retain application work directories on each worker. This is a Time To Live From d9c264ff225b65d47f616aa1a1690933802b9973 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 11:44:23 -0700 Subject: [PATCH 13/16] Small fix --- docs/spark-standalone.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 20ae4b00115c..dca80a9a6961 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -218,6 +218,7 @@ SPARK_WORKER_OPTS supports the following system properties:
+ # Connecting an Application to the Cluster To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext` From 16ae7767e7deb5366ea46732f8d6d7e52d7f0d6f Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 14:41:50 -0700 Subject: [PATCH 14/16] Adding back header section --- docs/configuration.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index 9d00d25549ac..df50be003c7d 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -2,11 +2,17 @@ layout: global title: Spark Configuration --- - * This will become a table of contents (this text will be scraped). {:toc} -Spark provides several locations to configure the system: +Spark provides three locations to configure the system: + +* [Spark properties](#spark-properties) control most application parameters and can be set by passing + a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java + system properties. +* [Environment variables](#environment-variables) can be used to set per-machine settings, such as + the IP address, through the `conf/spark-env.sh` script on each node. +* [Logging](#configuring-logging) can be configured through `log4j.properties`. # Spark Properties From 6f66efc31612ff43814f48da946339f855f24f38 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 14:58:57 -0700 Subject: [PATCH 15/16] More feedback --- docs/configuration.md | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index df50be003c7d..d654975b8dc8 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -221,7 +221,8 @@ Apart from these, the following properties are also available, and may be useful
spark.shuffle.spill.compress true - Whether to compress data spilled during shuffles. + Whether to compress data spilled during shuffles. Compression will use + spark.io.compression.codec.
spark.shuffle.compress true - Whether to compress map output files. Generally a good idea. + Whether to compress map output files. Generally a good idea. Compression will use + spark.io.compression.codec.
Base directory in which Spark events are logged, if spark.eventLog.enabled is true. Within this base directory, Spark creates a sub-directory for each application, and logs the - events specific to the application in this directory. + events specific to the application in this directory. Users may want to set this to + an HDFS directory so that history files can be read by the history server. 
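If event logging is meant to feed the history server as described above, a `spark-defaults.conf` fragment in the same style as the earlier examples in this file might look like the following; the HDFS host and path are placeholders, not values taken from the patch:

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode:8021/spark-events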
@@ -336,7 +339,9 @@ Apart from these, the following properties are also available, and may be useful The codec used to compress internal data such as RDD partitions and shuffle outputs. By default, Spark provides two codecs: org.apache.spark.io.LZFCompressionCodec - and org.apache.spark.io.SnappyCompressionCodec. + and org.apache.spark.io.SnappyCompressionCodec. Of these two choices, + Snappy offers faster compression and decompression, while LZF offers a better compression + ratio. @@ -770,6 +775,14 @@ Apart from these, the following properties are also available, and may be useful +#### Cluster Managers (YARN, Mesos, Standalone) +Each cluster manager in Spark has additional configuration options. Configurations +can be found on the pages for each mode: + + * [Yarn](running-on-yarn.html#configuration) + * [Mesos](running-on-mesos.html) + * [Standalone Mode](spark-standalone.html#cluster-launch-scripts) + # Environment Variables Certain Spark settings can be configured through environment variables, which are read from the From 93f56c3e9248f2977d8db9d162b902fe2d52333e Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 15:47:25 -0700 Subject: [PATCH 16/16] Feedback from Matei --- docs/configuration.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index d654975b8dc8..b6e7fd34eae6 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -723,12 +723,14 @@ Apart from these, the following properties are also available, and may be useful spark.ui.filters None - Comma separated list of filter class names to apply to the Spark web ui. The filter should be a + Comma separated list of filter class names to apply to the Spark web UI. The filter should be a standard javax servlet Filter. Parameters to each filter can also be specified by setting a - java system property of spark.<class name of filter>.params='param1=value1,param2=value2' - (e.g. -Dspark.ui.filters=com.test.filter1 - -Dspark.com.test.filter1.params='param1=foo,param2=testing') + java system property of:
+ spark.<class name of filter>.params='param1=value1,param2=value2'
+ For example:
+ -Dspark.ui.filters=com.test.filter1
+ -Dspark.com.test.filter1.params='param1=foo,param2=testing' @@ -779,7 +781,7 @@ Apart from these, the following properties are also available, and may be useful Each cluster manager in Spark has additional configuration options. Configurations can be found on the pages for each mode: - * [Yarn](running-on-yarn.html#configuration) + * [YARN](running-on-yarn.html#configuration) * [Mesos](running-on-mesos.html) * [Standalone Mode](spark-standalone.html#cluster-launch-scripts) @@ -828,4 +830,3 @@ compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface. Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there. -
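Finally, to round out the `spark.ui.filters` discussion in the last patch, here is a minimal, hypothetical sketch of a standard javax servlet Filter that could be named in that property. The class and its `param1` parameter mirror the `com.test.filter1` example above and are not part of any Spark API.

{% highlight scala %}
import javax.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}

// Hypothetical filter in the spirit of the com.test.filter1 example.
// It reads the "param1" init parameter and passes every request through;
// a real filter would authenticate the request and establish the user here.
class SimpleUIFilter extends Filter {
  private var param1: String = _

  override def init(config: FilterConfig): Unit = {
    param1 = config.getInitParameter("param1")
  }

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    chain.doFilter(req, res)
  }

  override def destroy(): Unit = {}
}
{% endhighlight %}

Such a filter would then be enabled exactly as the patch shows, through the `spark.ui.filters` property and the matching `.params` system property.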