From 204b2480028a1a4256ed248f4dbf689b60723ac3 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 20:05:19 -0700 Subject: [PATCH 01/16] Small fixes --- docs/index.md | 2 +- docs/quick-start.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/index.md b/docs/index.md index c9b10376cc80..fb75bc678c8a 100644 --- a/docs/index.md +++ b/docs/index.md @@ -5,7 +5,7 @@ title: Spark Overview Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. -It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). +It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) (SQL on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). # Downloading diff --git a/docs/quick-start.md b/docs/quick-start.md index 33a0df103642..20e17ebf703f 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -252,11 +252,11 @@ we initialize a SparkContext as part of the program. We pass the SparkContext constructor a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object which contains information about our -application. We also call sc.addJar to make sure that when our application is launched in cluster -mode, the jar file containing it will be shipped automatically to worker nodes. +application. -This file depends on the Spark API, so we'll also include an sbt configuration file, `simple.sbt` -which explains that Spark is a dependency. This file also adds a repository that Spark depends on: +Our application depends on the Spark API, so we'll also include an sbt configuration file, +`simple.sbt` which explains that Spark is a dependency. This file also adds a repository that +Spark depends on: {% highlight scala %} name := "Simple Project" From 4af9e07494b4de99e1e099ff9c04a74fa3f02951 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 20:25:58 -0700 Subject: [PATCH 02/16] Adding SPARK_LOCAL_DIRS docs --- docs/spark-standalone.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index eb3211b6b0e4..489c6e36400d 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -95,6 +95,14 @@ You can optionally configure the cluster further by setting environment variable SPARK_MASTER_OPTS Configuration properties that apply only to the master in the form "-Dx=y" (default: none). + + SPARK_LOCAL_DIRS + + Directory to use for "scratch" space in Spark, including map output files and RDDs that get + stored on disk. This should be on a fast, local disk in your system. It can also be a + comma-separated list of multiple directories on different disks. + + SPARK_WORKER_CORES Total number of cores to allow Spark applications to use on the machine (default: all available cores). 
From 2d719efd9f68563119be1f527e97a19df4aa7485 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 20:26:30 -0700 Subject: [PATCH 03/16] Small fix --- docs/configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 4d41c36e38e2..7e95aa69a3d2 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -582,7 +582,7 @@ Apart from these, the following properties are also available, and may be useful spark.logConf false - Whether to log the supplied SparkConf as INFO at start of spark context. + Whether to log the supplied SparkConf as INFO when a SparkContext is started. From 29b54461e07557d66cfa7128f6c222106ce5a5e8 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 20:56:05 -0700 Subject: [PATCH 04/16] Better discussion of spark-submit in configuration docs --- docs/configuration.md | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index 7e95aa69a3d2..97e648d356d1 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -30,22 +30,28 @@ val conf = new SparkConf() val sc = new SparkContext(conf) {% endhighlight %} -## Loading Default Configurations +## Dynamically Loading Spark Properties +In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For +instance, if you'd like to run the same applicaiton with different masters or different +amounts of memory. -In the case of `spark-shell`, a SparkContext has already been created for you, so you cannot control -the configuration properties through SparkConf. However, you can still set configuration properties -through a default configuration file. By default, `spark-shell` (and more generally `spark-submit`) -will read configuration options from `conf/spark-defaults.conf`, in which each line consists of a -key and a value separated by whitespace. For example, +The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support two ways to load configurations dynamically. +When a SparkConf is created, it will read configuration options from `conf/spark-defaults.conf`, +in which each line consists of a key and a value separated by whitespace. For example, spark.master spark://5.6.7.8:7077 spark.executor.memory 512m spark.eventLog.enabled true spark.serializer org.apache.spark.serializer.KryoSerializer -Any values specified in the file will be passed on to the application, and merged with those -specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf` -and SparkConf, then the latter will take precedence as it is the most application-specific. + +In addition, when launching programs with the [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool, certain options can be configured as flags. For instance, the +`--master` flag to `spark-submit` will automatically set the master. Run `./bin/spark-submit --help` to see the entire list of options. + +Any values specified as flags or in the properties file will be passed on to the application +and merged with those specified through SparkConf. Properties set directly on the SparkConf +take highest precedence, then flags passed to `spark-submit` or `spark-shell`, then options +in the `spark-defaults.conf` file. 
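To make the precedence order described above concrete, here is a minimal sketch (the application name, memory values, and jar name are illustrative, not taken from the patch):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Assume conf/spark-defaults.conf contains:   spark.executor.memory  512m
// and the app is launched with:               ./bin/spark-submit --master local[4] myApp.jar
val conf = new SparkConf()
  .setAppName("PrecedenceExample")
  .set("spark.executor.memory", "1g")  // set directly on the SparkConf, so 1g wins

val sc = new SparkContext(conf)
println(sc.getConf.toDebugString)      // prints the merged, effective configuration
{% endhighlight %}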
## Viewing Spark Properties From 592e94ac20f4d209c9e2334875f33d811f5e1a64 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 00:28:10 -0700 Subject: [PATCH 05/16] Stash --- docs/configuration.md | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index 97e648d356d1..bef75c58c362 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -16,10 +16,10 @@ Spark provides three locations to configure the system: # Spark Properties Spark properties control most application settings and are configured separately for each -application. The preferred way is to set them through -[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and passing it as an argument to your -SparkContext. SparkConf allows you to configure most of the common properties to initialize a -cluster (e.g. master URL and application name), as well as arbitrary key-value pairs through the +application. These properties can be set directly on a +[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and passed as an argument to your +SparkContext. SparkConf allows you to configure some of the common properties +(e.g. master URL and application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could initialize an application as follows: {% highlight scala %} @@ -32,8 +32,13 @@ val sc = new SparkContext(conf) ## Dynamically Loading Spark Properties In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For -instance, if you'd like to run the same applicaiton with different masters or different -amounts of memory. +instance, if you'd like to run the same application with different masters or different +amounts of memory. Spark allows you to omit this in your code: + +{% highlight scala %} +val conf = new SparkConf().setAppName("myApp") +{% endhighlight %} + The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support two ways to load configurations dynamically. When a SparkConf is created, it will read configuration options from `conf/spark-defaults.conf`, From 54b184d4a3c10386fd73cf8b8d0db7800d4ac560 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sat, 24 May 2014 21:40:10 -0700 Subject: [PATCH 06/16] Adding standalone configs to the standalone page --- docs/spark-standalone.md | 70 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 2 deletions(-) diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 489c6e36400d..cd3fbe8a9427 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -93,7 +93,7 @@ You can optionally configure the cluster further by setting environment variable SPARK_MASTER_OPTS - Configuration properties that apply only to the master in the form "-Dx=y" (default: none). + Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See below for a list of possible options. SPARK_LOCAL_DIRS @@ -134,7 +134,7 @@ You can optionally configure the cluster further by setting environment variable SPARK_WORKER_OPTS - Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). + Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options. 
SPARK_DAEMON_MEMORY @@ -152,6 +152,72 @@ You can optionally configure the cluster further by setting environment variable **Note:** The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand. +SPARK_MASTER_OPTS supports the following system properties: + + + + + + + + + + + + + + + + + + +
+<table class="table">
+  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr>
+    <td><code>spark.deploy.spreadOut</code></td>
+    <td>true</td>
+    <td>
+      Whether the standalone cluster manager should spread applications out across nodes or try
+      to consolidate them onto as few nodes as possible. Spreading out is usually better for
+      data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.deploy.defaultCores</code></td>
+    <td>(infinite)</td>
+    <td>
+      Default number of cores to give to applications in Spark's standalone mode if they don't
+      set <code>spark.cores.max</code>. If not set, applications always get all available
+      cores unless they configure <code>spark.cores.max</code> themselves.
+      Set this lower on a shared cluster to prevent users from grabbing
+      the whole cluster by default.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.timeout</code></td>
+    <td>60</td>
+    <td>
+      Number of seconds after which the standalone deploy master considers a worker lost if it
+      receives no heartbeats.
+    </td>
+  </tr>
+</table>
+
+SPARK_WORKER_OPTS supports the following system properties:
+
+<table class="table">
+  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr>
+    <td><code>spark.worker.cleanup.enabled</code></td>
+    <td>false</td>
+    <td>
+      Enable periodic cleanup of worker / application directories. Note that this only affects
+      standalone mode, as YARN works differently. Application directories are cleaned up
+      regardless of whether the application is still running.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.cleanup.interval</code></td>
+    <td>1800 (30 minutes)</td>
+    <td>
+      Controls the interval, in seconds, at which the worker cleans up old application work dirs
+      on the local machine.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.cleanup.appDataTtl</code></td>
+    <td>7 * 24 * 3600 (7 days)</td>
+    <td>
+      The number of seconds to retain application work directories on each worker. This is a Time To Live
+      and should depend on the amount of available disk space you have. Application logs and jars are
+      downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space,
+      especially if you run jobs very frequently.
+    </td>
+  </tr>
+</table>
# Connecting an Application to the Cluster To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext` From f7e79bc42c1635686c3af01eef147dae92de2529 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 21:43:11 -0700 Subject: [PATCH 07/16] Re-organizing config options. This uses the following categories: - Runtime Environment - Shuffle Behavior - Spark UI - Compression and Serialization - Execution Behavior - Networking - Scheduling - Security - Spark Streaming --- docs/configuration.md | 592 +++++++++++++++++++++--------------------- 1 file changed, 300 insertions(+), 292 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index bef75c58c362..9bb542482db5 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -33,26 +33,29 @@ val sc = new SparkContext(conf) ## Dynamically Loading Spark Properties In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For instance, if you'd like to run the same application with different masters or different -amounts of memory. Spark allows you to omit this in your code: +amounts of memory. Spark allows you to simply create an empty conf: {% highlight scala %} -val conf = new SparkConf().setAppName("myApp") +val sc = new SparkContext(new SparkConf()) {% endhighlight %} +Then, you can supply configuration values at runtime: +{% highlight bash %} +./bin/spark-submit --name "My fancy app" --master local[4] myApp.jar +{% endhighlight %} + +The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support +two ways to load configurations dynamically. The first are command line options, such as `--master`, as shown above. +Running `./bin/spark-submit --help` will show the entire list of options. -The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support two ways to load configurations dynamically. -When a SparkConf is created, it will read configuration options from `conf/spark-defaults.conf`, -in which each line consists of a key and a value separated by whitespace. For example, +`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in which each line consists +of a key and a value separated by whitespace. For example: spark.master spark://5.6.7.8:7077 spark.executor.memory 512m spark.eventLog.enabled true spark.serializer org.apache.spark.serializer.KryoSerializer - -In addition, when launching programs with the [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool, certain options can be configured as flags. For instance, the -`--master` flag to `spark-submit` will automatically set the master. Run `./bin/spark-submit --help` to see the entire list of options. - Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to `spark-submit` or `spark-shell`, then options @@ -67,16 +70,31 @@ appear. For all other configuration properties, you can assume the default value ## All Configuration Properties -Most of the properties that control internal settings have reasonable default values. However, -there are at least five properties that you will commonly want to control: +Most of the properties that control internal settings have reasonable default values. 
Some +of the most common options to set are: + + + + + + + + + + @@ -109,49 +127,94 @@ there are at least five properties that you will commonly want to control: list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or - LOCAL_DIRS (YARN) envrionment variables set by the cluster manager. + LOCAL_DIRS (YARN) environment variables set by the cluster manager. - - + +
Property NameDefaultMeaning
spark.app.name(none) + The name of your application. This will appear in the UI and in log data. +
spark.master(none) + The cluster manager to connect to. See the list of [allowed master URL's](scala-programming-guide.html#master-urls). +
spark.executor.memory 512m - Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). + Amount of memory to use per executor process, in the same format as JVM memory strings + (e.g. 512m, 2g).
spark.cores.max(not set)spark.logConffalse - When running on a standalone deploy cluster or a - Mesos cluster in "coarse-grained" - sharing mode, the maximum amount of CPU cores to request for the application from - across the cluster (not from each machine). If not set, the default will be - spark.deploy.defaultCores on Spark's standalone cluster manager, or - infinite (all available cores) on Mesos. + Logs the effective SparkConf as INFO when a SparkContext is started.
- Apart from these, the following properties are also available, and may be useful in some situations: +#### Runtime Environment - + + + + + + - - + + + + + + + + + + + + + +
Property NameDefaultMeaning
spark.default.parallelismspark.executor.memory512m -
    -
-    <ul>
-      <li>Local mode: number of cores on the local machine</li>
-      <li>Mesos fine grained mode: 8</li>
-      <li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
-    </ul>
+ Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).
spark.executor.extraJavaOptions(none) - Default number of tasks to use across the cluster for distributed shuffle operations (groupByKey, - reduceByKey, etc) when not set by user. + A string of extra JVM options to pass to executors. For instance, GC settings or other + logging. Note that it is illegal to set Spark properties or heap size settings with this + option. Spark properties should be set using a SparkConf object or the + spark-defaults.conf file used with the spark-submit script. Heap size settings can be set + with spark.executor.memory.
spark.storage.memoryFraction0.6spark.executor.extraClassPath(none) - Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" - generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase - it if you configure your own old generation size. + Extra classpath entries to append to the classpath of executors. This exists primarily + for backwards-compatibility with older versions of Spark. Users typically should not need + to set this option. +
spark.executor.extraLibraryPath(none) + Set a special library path to use when launching executor JVM's. +
spark.files.userClassPathFirstfalse + (Experimental) Whether to give user-added jars precedence over Spark's own jars when + loading classes in Executors. This feature can be used to mitigate conflicts between + Spark's dependencies and user dependencies. It is currently an experimental feature. +
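As a minimal sketch of how the executor options in this table might be set programmatically (the GC flags and paths below are placeholders, not recommendations):

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ExecutorOptionsExample")
  .set("spark.executor.memory", "2g")  // heap size goes here, not in extraJavaOptions
  .set("spark.executor.extraJavaOptions", "-verbose:gc -XX:+PrintGCDetails")  // illustrative GC logging flags
  .set("spark.executor.extraClassPath", "/opt/legacy/jars/*")                 // placeholder path
  .set("spark.executor.extraLibraryPath", "/opt/native/lib")                  // placeholder path
{% endhighlight %}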
+ +#### Shuffle Behavior + + + + + + + + + + + + + + + + @@ -166,40 +229,43 @@ Apart from these, the following properties are also available, and may be useful - - + + - - + + - - + + - - + + +
Property NameDefaultMeaning
spark.shuffle.consolidateFilesfalse + If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve + filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" + when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) + cores due to filesystem limitations. +
spark.shuffle.spilltrue + If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling + threshold is specified by spark.shuffle.memoryFraction. +
spark.shuffle.spill.compresstrue + Whether to compress data spilled during shuffles.
spark.storage.memoryMapThreshold8192spark.shuffle.compresstrue - Size of a block, in bytes, above which Spark memory maps when reading a block from disk. - This prevents Spark from memory mapping very small blocks. In general, memory - mapping has high overhead for blocks close to or below the page size of the operating system. + Whether to compress map output files. Generally a good idea.
spark.tachyonStore.baseDirSystem.getProperty("java.io.tmpdir")spark.shuffle.file.buffer.kb100 - Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by spark.tachyonStore.url. - It can also be a comma-separated list of multiple directories on Tachyon file system. + Size of the in-memory buffer for each shuffle file output stream, in kilobytes. These buffers + reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
spark.tachyonStore.urltachyon://localhost:19998spark.storage.memoryMapThreshold8192 - The URL of the underlying Tachyon file system in the TachyonStore. + Size of a block, in bytes, above which Spark memory maps when reading a block from disk. + This prevents Spark from memory mapping very small blocks. In general, memory + mapping has high overhead for blocks close to or below the page size of the operating system.
spark.mesos.coarsefalsespark.reducer.maxMbInFlight48 - If set to "true", runs over Mesos clusters in - "coarse-grained" sharing mode, - where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. - This gives lower-latency scheduling for short queries, but leaves resources in use for the whole - duration of the Spark job. + Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since + each output requires us to create a buffer to receive it, this represents a fixed memory overhead + per reduce task, so keep it small unless you have a large amount of memory.
+ +#### Spark UI + + @@ -215,54 +281,40 @@ Apart from these, the following properties are also available, and may be useful - - + + - + - - - - - - - - - - - - + + - - + + +
Property NameDefaultMeaning
spark.ui.port 4040
spark.ui.filtersNonespark.ui.killEnabledtrue - Comma separated list of filter class names to apply to the Spark web ui. The filter should be a - standard javax servlet Filter. Parameters to each filter can also be specified by setting a - java system property of spark.<class name of filter>.params='param1=value1,param2=value2' - (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing') + Allows stages and corresponding jobs to be killed from the web ui.
spark.ui.acls.enablespark.eventLog.enabled false - Whether spark web ui acls should are enabled. If enabled, this checks to see if the user has - access permissions to view the web ui. See spark.ui.view.acls for more details. - Also note this requires the user to be known, if the user comes across as null no checks - are done. Filters can be used to authenticate and set the user. -
spark.ui.view.aclsEmpty - Comma separated list of users that have view access to the spark web ui. By default only the - user that started the Spark job has view access. -
spark.ui.killEnabledtrue - Allows stages and corresponding jobs to be killed from the web ui. + Whether to log spark events, useful for reconstructing the Web UI after the application has finished.
spark.shuffle.compresstruespark.eventLog.compressfalse - Whether to compress map output files. Generally a good idea. + Whether to compress logged events, if spark.eventLog.enabled is true.
spark.shuffle.spill.compresstruespark.eventLog.dirfile:///tmp/spark-events - Whether to compress data spilled during shuffles. + Base directory in which spark events are logged, if spark.eventLog.enabled is true. + Within this base directory, Spark creates a sub-directory for each application, and logs the events + specific to the application in this directory.
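A short sketch of turning on event logging so a finished application can still be inspected in the UI (the HDFS directory is a placeholder):

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("EventLogExample")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.compress", "true")
  .set("spark.eventLog.dir", "hdfs://namenode:8021/user/spark/eventlogs") // placeholder location
{% endhighlight %}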
+ +#### Compression and Serialization + + @@ -294,36 +346,21 @@ Apart from these, the following properties are also available, and may be useful - - - - - - - - - - - - + + - - + + @@ -345,15 +382,23 @@ Apart from these, the following properties are also available, and may be useful exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker. +
Property NameDefaultMeaning
spark.broadcast.compress true
spark.scheduler.modeFIFO - The scheduling mode between - jobs submitted to the same SparkContext. Can be set to FAIR - to use fair sharing instead of queueing jobs one after another. Useful for - multi-user services. -
spark.scheduler.revive.interval1000 - The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds) -
spark.reducer.maxMbInFlight48spark.closure.serializerorg.apache.spark.serializer.
JavaSerializer
- Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since - each output requires us to create a buffer to receive it, this represents a fixed memory overhead - per reduce task, so keep it small unless you have a large amount of memory. + Serializer class to use for closures. Currently only the Java serializer is supported.
spark.closure.serializerorg.apache.spark.serializer.
JavaSerializer
spark.serializer.objectStreamReset10000 - Serializer class to use for closures. Currently only the Java serializer is supported. + When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches + objects to prevent writing redundant data, however that stops garbage collection of those + objects. By calling 'reset' you flush that info from the serializer, and allow old + objects to be collected. To turn off this periodic reset set it to a value <= 0. + By default it will reset the serializer every 10,000 objects.
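To make the serialization settings concrete, a minimal Kryo sketch in the style of the tuning guide (the Point class and registrator name are hypothetical):

{% highlight scala %}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class Point(x: Double, y: Double)   // hypothetical application class

class MyRegistrator extends KryoRegistrator {
  def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Point])
  }
}

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator") // fully-qualified name of the registrator class
{% endhighlight %}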
+ +#### Execution Behavior + + - - + + @@ -364,73 +409,70 @@ Apart from these, the following properties are also available, and may be useful - - + + - - + + - - + + - - + + - - + + - - + + +
Property NameDefaultMeaning
spark.serializer.objectStreamReset10000spark.default.parallelism - When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches - objects to prevent writing redundant data, however that stops garbage collection of those - objects. By calling 'reset' you flush that info from the serializer, and allow old - objects to be collected. To turn off this periodic reset set it to a value <= 0. - By default it will reset the serializer every 10,000 objects. +
    +
+    <ul>
+      <li>Local mode: number of cores on the local machine</li>
+      <li>Mesos fine grained mode: 8</li>
+      <li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
+    </ul>
+
+ Default number of tasks to use across the cluster for distributed shuffle operations (groupByKey, + reduceByKey, etc) when not set by user.
spark.locality.wait3000spark.broadcast.blockSize4096 - Number of milliseconds to wait to launch a data-local task before giving up and launching it - on a less-local node. The same wait will be used to step through multiple locality levels - (process-local, node-local, rack-local and then any). It is also possible to customize the - waiting time for each level by setting spark.locality.wait.node, etc. - You should increase this setting if your tasks are long and see poor locality, but the - default usually works well. + Size of each piece of a block in kilobytes for TorrentBroadcastFactory. + Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, + BlockManager might take a performance hit.
spark.locality.wait.processspark.locality.waitspark.files.overwritefalse - Customize the locality wait for process locality. This affects tasks that attempt to access - cached data in a particular executor process. + Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source.
spark.locality.wait.nodespark.locality.waitspark.files.fetchTimeoutfalse - Customize the locality wait for node locality. For example, you can set this to 0 to skip - node locality and search immediately for rack locality (if your cluster has rack information). + Communication timeout to use when fetching files added through SparkContext.addFile() from + the driver.
spark.locality.wait.rackspark.locality.waitspark.storage.memoryFraction0.6 - Customize the locality wait for rack locality. + Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" + generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase + it if you configure your own old generation size.
spark.worker.timeout60spark.tachyonStore.baseDirSystem.getProperty("java.io.tmpdir") - Number of seconds after which the standalone deploy master considers a worker lost if it - receives no heartbeats. + Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by spark.tachyonStore.url. + It can also be a comma-separated list of multiple directories on Tachyon file system.
spark.worker.cleanup.enabledfalsespark.tachyonStore.urltachyon://localhost:19998 - Enable periodic cleanup of worker / application directories. Note that this only affects standalone - mode, as YARN works differently. Applications directories are cleaned up regardless of whether - the application is still running. + The URL of the underlying Tachyon file system in the TachyonStore.
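As an example of how spark.default.parallelism interacts with per-operation arguments (the input path is a placeholder):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "64")   // used when no partition count is given

val sc = new SparkContext(conf)
val pairs = sc.textFile("hdfs://namenode:8021/data/input.txt")  // placeholder path
  .flatMap(_.split(" "))
  .map(word => (word, 1))

val counts = pairs.reduceByKey(_ + _)           // 64 reduce tasks, from spark.default.parallelism
val countsWide = pairs.reduceByKey(_ + _, 128)  // explicit argument overrides the default
{% endhighlight %}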
+ +#### Networking + + - - + + - - + + @@ -478,47 +520,16 @@ Apart from these, the following properties are also available, and may be useful This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensistive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those. +
Property NameDefaultMeaning
spark.worker.cleanup.interval1800 (30 minutes)spark.driver.host(local hostname) - Controls the interval, in seconds, at which the worker cleans up old application work dirs - on the local machine. + Hostname or IP address for the driver to listen on.
spark.worker.cleanup.appDataTtl7 * 24 * 3600 (7 days)spark.driver.port(random) - The number of seconds to retain application work directories on each worker. This is a Time To Live - and should depend on the amount of available disk space you have. Application logs and jars are - downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, - especially if you run jobs very frequently. + Port for the driver to listen on.
+ +#### Scheduling + + - - - - - - - - - - - - - - - - - - - - - - + + @@ -530,35 +541,36 @@ Apart from these, the following properties are also available, and may be useful - - - - - - - - + + - - + + - - + + @@ -590,83 +602,52 @@ Apart from these, the following properties are also available, and may be useful - - - - - - - - - - - - - - - - - - - - - - + + - - + + - - + + - - + + - - + + +
Property NameDefaultMeaning
spark.driver.host(local hostname) - Hostname or IP address for the driver to listen on. -
spark.driver.port(random) - Port for the driver to listen on. -
spark.cleaner.ttl(infinite) - Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.). - Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is - useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming - applications). Note that any RDD that persists in memory for more than this duration will be cleared as well. -
spark.streaming.blockInterval200 - Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced - into blocks of data before storing them in Spark. -
spark.streaming.unpersisttruespark.task.cpus1 - Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from - Spark's memory. The raw input data received by Spark Streaming is also automatically cleared. - Setting this to false will allow the raw data and persisted RDDs to be accessible outside the - streaming application as they will not be cleared automatically. But it comes at the cost of - higher memory usage in Spark. + Number of cores to allocate for each task.
spark.broadcast.blockSize4096 - Size of each piece of a block in kilobytes for TorrentBroadcastFactory. - Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, BlockManager might take a performance hit. -
spark.shuffle.consolidateFilesfalsespark.scheduler.modeFIFO - If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) cores due to filesystem limitations. + The scheduling mode between + jobs submitted to the same SparkContext. Can be set to FAIR + to use fair sharing instead of queueing jobs one after another. Useful for + multi-user services.
spark.shuffle.file.buffer.kb100spark.cores.max(not set) - Size of the in-memory buffer for each shuffle file output stream, in kilobytes. These buffers - reduce the number of disk seeks and system calls made in creating intermediate shuffle files. + When running on a standalone deploy cluster or a + Mesos cluster in "coarse-grained" + sharing mode, the maximum amount of CPU cores to request for the application from + across the cluster (not from each machine). If not set, the default will be + spark.deploy.defaultCores on Spark's standalone cluster manager, or + infinite (all available cores) on Mesos.
spark.shuffle.spilltruespark.mesos.coarsefalse - If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling - threshold is specified by spark.shuffle.memoryFraction. + If set to "true", runs over Mesos clusters in + "coarse-grained" sharing mode, + where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. + This gives lower-latency scheduling for short queries, but leaves resources in use for the whole + duration of the Spark job.
spark.logConffalse - Whether to log the supplied SparkConf as INFO when a SparkContext is started. -
spark.eventLog.enabledfalse - Whether to log spark events, useful for reconstructing the Web UI after the application has finished. -
spark.eventLog.compressfalse - Whether to compress logged events, if spark.eventLog.enabled is true. -
spark.eventLog.dirfile:///tmp/spark-events - Base directory in which spark events are logged, if spark.eventLog.enabled is true. - Within this base directory, Spark creates a sub-directory for each application, and logs the events - specific to the application in this directory. -
spark.deploy.spreadOuttruespark.locality.wait3000 - Whether the standalone cluster manager should spread applications out across nodes or try - to consolidate them onto as few nodes as possible. Spreading out is usually better for - data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
- Note: this setting needs to be configured in the standalone cluster master, not in individual - applications; you can set it through SPARK_MASTER_OPTS in spark-env.sh. + Number of milliseconds to wait to launch a data-local task before giving up and launching it + on a less-local node. The same wait will be used to step through multiple locality levels + (process-local, node-local, rack-local and then any). It is also possible to customize the + waiting time for each level by setting spark.locality.wait.node, etc. + You should increase this setting if your tasks are long and see poor locality, but the + default usually works well.
spark.deploy.defaultCores(infinite)spark.locality.wait.processspark.locality.wait - Default number of cores to give to applications in Spark's standalone mode if they don't - set spark.cores.max. If not set, applications always get all available - cores unless they configure spark.cores.max themselves. - Set this lower on a shared cluster to prevent users from grabbing - the whole cluster by default.
- Note: this setting needs to be configured in the standalone cluster master, not in individual - applications; you can set it through SPARK_MASTER_OPTS in spark-env.sh. + Customize the locality wait for process locality. This affects tasks that attempt to access + cached data in a particular executor process.
spark.files.overwritefalsespark.locality.wait.nodespark.locality.wait - Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source. + Customize the locality wait for node locality. For example, you can set this to 0 to skip + node locality and search immediately for rack locality (if your cluster has rack information).
spark.files.fetchTimeoutfalsespark.locality.wait.rackspark.locality.wait - Communication timeout to use when fetching files added through SparkContext.addFile() from - the driver. + Customize the locality wait for rack locality.
spark.files.userClassPathFirstfalsespark.scheduler.revive.interval1000 - (Experimental) Whether to give user-added jars precedence over Spark's own jars when - loading classes in Executors. This feature can be used to mitigate conflicts between - Spark's dependencies and user dependencies. It is currently an experimental feature. + The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds)
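A rough sketch combining a few of the scheduling properties above for an application on a shared cluster (values are illustrative):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("SchedulingExample")
  .set("spark.scheduler.mode", "FAIR")  // fair sharing between jobs within this application
  .set("spark.cores.max", "32")         // cap the cores taken from a standalone/Mesos cluster
  .set("spark.locality.wait", "5000")   // wait up to 5 seconds for a data-local slot

val sc = new SparkContext(conf)
{% endhighlight %}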
+ +#### Security + + @@ -692,40 +673,67 @@ Apart from these, the following properties are also available, and may be useful - - + + - - + + - - + + +
Property NameDefaultMeaning
spark.authenticate false
spark.task.cpus1spark.ui.filtersNone - Number of cores to allocate for each task. + Comma separated list of filter class names to apply to the Spark web ui. The filter should be a + standard javax servlet Filter. Parameters to each filter can also be specified by setting a + java system property of spark.<class name of filter>.params='param1=value1,param2=value2' + (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
spark.executor.extraJavaOptions(none)spark.ui.acls.enablefalse - A string of extra JVM options to pass to executors. For instance, GC settings or other - logging. Note that it is illegal to set Spark properties or heap size settings with this - option. Spark properties should be set using a SparkConf object or the - spark-defaults.conf file used with the spark-submit script. Heap size settings can be set - with spark.executor.memory. + Whether Spark web UI ACLs are enabled. If enabled, this checks to see if the user has + access permissions to view the web ui. See spark.ui.view.acls for more details. + Also note this requires the user to be known; if the user comes across as null, no checks + are done. Filters can be used to authenticate and set the user.
spark.executor.extraClassPath(none)spark.ui.view.aclsEmpty - Extra classpath entries to append to the classpath of executors. This exists primarily - for backwards-compatibility with older versions of Spark. Users typically should not need - to set this option. + Comma separated list of users that have view access to the spark web ui. By default only the + user that started the Spark job has view access.
+ +#### Spark Streaming + + - - + + + + + + + + + + + + -
Property NameDefaultMeaning
spark.executor.extraLibraryPath(none)spark.cleaner.ttl(infinite) - Set a special library path to use when launching executor JVM's. + Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.). + Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is + useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming + applications). Note that any RDD that persists in memory for more than this duration will be cleared as well. +
spark.streaming.blockInterval200 + Interval (milliseconds) at which data received by Spark Streaming receivers is coalesced + into blocks of data before storing them in Spark. +
spark.streaming.unpersisttrue + Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from + Spark's memory. The raw input data received by Spark Streaming is also automatically cleared. + Setting this to false will allow the raw data and persisted RDDs to be accessible outside the + streaming application as they will not be cleared automatically. But it comes at the cost of + higher memory usage in Spark.
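A brief sketch of how these streaming properties might be applied; the socket source and batch interval are placeholders:

{% highlight scala %}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("StreamingConfExample")
  .set("spark.streaming.blockInterval", "100") // coalesce received data into 100 ms blocks
  .set("spark.streaming.unpersist", "true")    // drop raw input and generated RDDs once processed
  .set("spark.cleaner.ttl", "3600")            // forget metadata older than one hour

val ssc = new StreamingContext(conf, Seconds(1))
ssc.socketTextStream("localhost", 9999).count().print()
ssc.start()
ssc.awaitTermination()
{% endhighlight %}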
# Environment Variables From 106ee312469824959ef301ed4899f91d97099fdd Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 22:33:18 -0700 Subject: [PATCH 08/16] Small link fix --- docs/configuration.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/configuration.md b/docs/configuration.md index 9bb542482db5..c4b5c73e2df1 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -86,7 +86,8 @@ of the most common options to set are: spark.master (none) - The cluster manager to connect to. See the list of [allowed master URL's](scala-programming-guide.html#master-urls). + The cluster manager to connect to. See the list of + allowed master URL's. From 3289ea4f852408e440ca41056405265d80248089 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 23:14:00 -0700 Subject: [PATCH 09/16] Pulling in changes from #856 --- docs/configuration.md | 65 +++++++++++++++++------------ docs/spark-standalone.md | 88 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 127 insertions(+), 26 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index c4b5c73e2df1..f68f8d116c66 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -3,15 +3,8 @@ layout: global title: Spark Configuration --- -Spark provides three locations to configure the system: - -* [Spark properties](#spark-properties) control most application parameters and can be set by - passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, - or through the `conf/spark-defaults.conf` properties file. -* [Environment variables](#environment-variables) can be used to set per-machine settings, such as - the IP address, through the `conf/spark-env.sh` script on each node. -* [Logging](#configuring-logging) can be configured through `log4j.properties`. - +* This will become a table of contents (this text will be scraped). +{:toc} # Spark Properties @@ -149,7 +142,8 @@ Apart from these, the following properties are also available, and may be useful spark.executor.memory 512m - Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g). + Amount of memory to use per executor process, in the same format as JVM memory strings + (e.g. 512m, 2g). @@ -422,7 +416,8 @@ Apart from these, the following properties are also available, and may be useful spark.files.overwrite false - Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not match those of the source. + Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not + match those of the source. @@ -446,8 +441,9 @@ Apart from these, the following properties are also available, and may be useful spark.tachyonStore.baseDir System.getProperty("java.io.tmpdir") - Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by spark.tachyonStore.url. - It can also be a comma-separated list of multiple directories on Tachyon file system. + Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by + spark.tachyonStore.url. It can also be a comma-separated list of multiple directories + on Tachyon file system. @@ -504,21 +500,33 @@ Apart from these, the following properties are also available, and may be useful spark.akka.heartbeat.pauses 600 - This is set to a larger value to disable failure detector that comes inbuilt akka. 
It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and `spark.akka.failure-detector.threshold` if you need to. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you + plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to + control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and + `spark.akka.failure-detector.threshold` if you need to. spark.akka.failure-detector.threshold 300.0 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you + plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`. + Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to. spark.akka.heartbeat.interval 1000 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensistive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you + plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a + smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination + of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use + case for using failure detector can be, a sensistive failure detector can help evict rogue executors really + quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. + Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the + network with those. @@ -578,7 +586,8 @@ Apart from these, the following properties are also available, and may be useful spark.speculation false - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched. + If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a + stage, they will be re-launched. 
@@ -739,13 +748,13 @@ Apart from these, the following properties are also available, and may be useful # Environment Variables -Certain Spark settings can be configured through environment variables, which are read from the `conf/spark-env.sh` -script in the directory where Spark is installed (or `conf/spark-env.cmd` on Windows). In Standalone and Mesos modes, -this file can give machine specific information such as hostnames. It is also sourced when running local -Spark applications or submission scripts. +Certain Spark settings can be configured through environment variables, which are read from the +`conf/spark-env.sh` script in the directory where Spark is installed (or `conf/spark-env.cmd` on +Windows). In Standalone and Mesos modes, this file can give machine specific information such as +hostnames. It is also sourced when running local Spark applications or submission scripts. -Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can copy -`conf/spark-env.sh.template` to create it. Make sure you make the copy executable. +Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can +copy `conf/spark-env.sh.template` to create it. Make sure you make the copy executable. The following variables can be set in `spark-env.sh`: @@ -770,12 +779,16 @@ The following variables can be set in `spark-env.sh`: -In addition to the above, there are also options for setting up the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each machine and maximum memory. +In addition to the above, there are also options for setting up the Spark +[standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each +machine and maximum memory. Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface. # Configuring Logging -Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties` -file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there. +Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a +`log4j.properties` file in the `conf` directory. One way to start is to copy the existing +`log4j.properties.template` located there. + diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index cd3fbe8a9427..15c5816182a8 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -286,6 +286,94 @@ In addition, detailed log output for each job is also written to the work direct You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use a hdfs:// URL (typically `hdfs://:9000/path`, but you can find the right URL on your Hadoop Namenode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on). +# Configuring Ports for Network Security + +Spark makes heavy use of the network, and some environments have strict requirements for using tight +firewall settings. 
Below are the primary ports that Spark uses for its communication and how to +configure those ports. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+<table class="table">
+  <tr>
+    <th>From</th><th>To</th><th>Default Port</th><th>Purpose</th><th>Configuration
+    Setting</th><th>Notes</th>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>Standalone Cluster Master</td>
+    <td>8080</td>
+    <td>Web UI</td>
+    <td><code>master.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>Driver</td>
+    <td>4040</td>
+    <td>Web UI</td>
+    <td><code>spark.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>History Server</td>
+    <td>18080</td>
+    <td>Web UI</td>
+    <td><code>spark.history.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>Worker</td>
+    <td>8081</td>
+    <td>Web UI</td>
+    <td><code>worker.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Application</td>
+    <td>Standalone Cluster Master</td>
+    <td>7077</td>
+    <td>Submit job to cluster</td>
+    <td><code>spark.driver.port</code></td>
+    <td>Akka-based. Set to "0" to choose a port randomly</td>
+  </tr>
+  <tr>
+    <td>Worker</td>
+    <td>Standalone Cluster Master</td>
+    <td>7077</td>
+    <td>Join cluster</td>
+    <td><code>spark.driver.port</code></td>
+    <td>Akka-based. Set to "0" to choose a port randomly</td>
+  </tr>
+  <tr>
+    <td>Application</td>
+    <td>Worker</td>
+    <td>(random)</td>
+    <td>Join cluster</td>
+    <td><code>SPARK_WORKER_PORT</code> (standalone cluster)</td>
+    <td>Akka-based</td>
+  </tr>
+  <tr>
+    <td>Driver and other Workers</td>
+    <td>Worker</td>
+    <td>(random)</td>
+    <td>
+      <ul>
+        <li>File server for file and jars</li>
+        <li>Http Broadcast</li>
+        <li>Class file server (Spark Shell only)</li>
+      </ul>
+    </td>
+    <td>None</td>
+    <td>Jetty-based. Each of these services starts on a random port that cannot be configured</td>
+  </tr>
+</table>
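For the entries above that are controlled by Spark properties (as opposed to the standalone daemons' own settings), a minimal sketch of pinning them down for a restrictive firewall (the hostname and port numbers are placeholders):

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("FirewallFriendlyApp")
  .set("spark.driver.host", "driver.example.internal") // placeholder hostname the workers can reach
  .set("spark.driver.port", "7001")                    // fixed port instead of a random one
  .set("spark.ui.port", "4040")                        // driver web UI
{% endhighlight %}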
+ # High Availability By default, standalone scheduling clusters are resilient to Worker failures (insofar as Spark itself is resilient to losing work by moving it to other workers). However, the scheduler uses a Master to make scheduling decisions, and this (by default) creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, we have two high availability schemes, detailed below. From a374369e63d1c48cd71c4280167eb62607f9c9c3 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Sun, 25 May 2014 23:38:45 -0700 Subject: [PATCH 10/16] Line wrapping fixes --- docs/configuration.md | 159 +++++++++++++++++++++++------------------- 1 file changed, 87 insertions(+), 72 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index d8a2360a7b3f..900cb884dc31 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -39,12 +39,13 @@ Then, you can supply configuration values at runtime: ./bin/spark-submit --name "My fancy app" --master local[4] myApp.jar {% endhighlight %} -The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support -two ways to load configurations dynamically. The first are command line options, such as `--master`, as shown above. -Running `./bin/spark-submit --help` will show the entire list of options. +The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) +tool support two ways to load configurations dynamically. The first are command line options, +such as `--master`, as shown above. Running `./bin/spark-submit --help` will show the entire list +of options. -`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in which each line consists -of a key and a value separated by whitespace. For example: +`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in which +each line consists of a key and a value separated by whitespace. For example: spark.master spark://5.6.7.8:7077 spark.executor.memory 512m @@ -81,8 +82,8 @@ of the most common options to set are: spark.master (none) - The cluster manager to connect to. See the list of - allowed master URL's. + The cluster manager to connect to. See the list of + allowed master URL's. @@ -98,10 +99,12 @@ of the most common options to set are: org.apache.spark.serializer.
JavaSerializer Class to use for serializing objects that will be sent over the network or need to be cached - in serialized form. The default of Java serialization works with any Serializable Java object but is - quite slow, so we recommend using org.apache.spark.serializer.KryoSerializer - and configuring Kryo serialization when speed is necessary. Can be any subclass of - org.apache.spark.Serializer. + in serialized form. The default of Java serialization works with any Serializable Java object + but is quite slow, so we recommend using + org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization + when speed is necessary. Can be any subclass of + + org.apache.spark.Serializer. @@ -110,7 +113,8 @@ of the most common options to set are: If you use Kryo serialization, set this class to register your custom classes with Kryo. It should be set to a class that extends - KryoRegistrator. + + KryoRegistrator. See the tuning guide for more details. @@ -118,9 +122,9 @@ of the most common options to set are: spark.local.dir /tmp - Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored - on disk. This should be on a fast, local disk in your system. It can also be a comma-separated - list of multiple directories on different disks. + Directory to use for "scratch" space in Spark, including map output files and RDDs that get + stored on disk. This should be on a fast, local disk in your system. It can also be a + comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overriden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager. @@ -193,18 +197,18 @@ Apart from these, the following properties are also available, and may be useful spark.shuffle.consolidateFiles false - If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve - filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" - when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) - cores due to filesystem limitations. + If set to "true", consolidates intermediate files created during a shuffle. Creating fewer + files can improve filesystem performance for shuffles with large numbers of reduce tasks. It + is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option + might degrade performance on machines with many (>8) cores due to filesystem limitations. spark.shuffle.spill true - If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling - threshold is specified by spark.shuffle.memoryFraction. + If set to "true", limits the amount of memory used during reduces by spilling data out to disk. + This spilling threshold is specified by spark.shuffle.memoryFraction. @@ -254,8 +258,8 @@ Apart from these, the following properties are also available, and may be useful 48 Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since - each output requires us to create a buffer to receive it, this represents a fixed memory overhead - per reduce task, so keep it small unless you have a large amount of memory. + each output requires us to create a buffer to receive it, this represents a fixed memory + overhead per reduce task, so keep it small unless you have a large amount of memory. 
@@ -288,7 +292,8 @@ Apart from these, the following properties are also available, and may be useful spark.eventLog.enabled false - Whether to log spark events, useful for reconstructing the Web UI after the application has finished. + Whether to log spark events, useful for reconstructing the Web UI after the application has + finished. @@ -303,8 +308,8 @@ Apart from these, the following properties are also available, and may be useful file:///tmp/spark-events Base directory in which spark events are logged, if spark.eventLog.enabled is true. - Within this base directory, Spark creates a sub-directory for each application, and logs the events - specific to the application in this directory. + Within this base directory, Spark creates a sub-directory for each application, and logs the + events specific to the application in this directory. @@ -323,23 +328,26 @@ Apart from these, the following properties are also available, and may be useful spark.rdd.compress false - Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER). - Can save substantial space at the cost of some extra CPU time. + Whether to compress serialized RDD partitions (e.g. for + StorageLevel.MEMORY_ONLY_SER). Can save substantial space at the cost of some + extra CPU time. spark.io.compression.codec org.apache.spark.io.
LZFCompressionCodec - The codec used to compress internal data such as RDD partitions and shuffle outputs. By default, Spark provides two - codecs: org.apache.spark.io.LZFCompressionCodec and org.apache.spark.io.SnappyCompressionCodec. + The codec used to compress internal data such as RDD partitions and shuffle outputs. + By default, Spark provides two codecs: org.apache.spark.io.LZFCompressionCodec + and org.apache.spark.io.SnappyCompressionCodec. spark.io.compression.snappy.block.size 32768 - Block size (in bytes) used in Snappy compression, in the case when Snappy compression codec is used. + Block size (in bytes) used in Snappy compression, in the case when Snappy compression codec + is used. @@ -376,7 +384,8 @@ Apart from these, the following properties are also available, and may be useful Maximum object size to allow within Kryo (the library needs to create a buffer at least as large as the largest single object you'll serialize). Increase this if you get a "buffer limit - exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker. + exceeded" exception inside Kryo. Note that there will be one buffer per core on each + worker. @@ -394,8 +403,8 @@ Apart from these, the following properties are also available, and may be useful - Default number of tasks to use across the cluster for distributed shuffle operations (groupByKey, - reduceByKey, etc) when not set by user. + Default number of tasks to use across the cluster for distributed shuffle operations + (groupByKey, reduceByKey, etc) when not set by user. @@ -410,16 +419,16 @@ Apart from these, the following properties are also available, and may be useful 4096 Size of each piece of a block in kilobytes for TorrentBroadcastFactory. - Too large a value decreases parallelism during broadcast (makes it slower); however, if it is too small, - BlockManager might take a performance hit. + Too large a value decreases parallelism during broadcast (makes it slower); however, if it is + too small, BlockManager might take a performance hit. spark.files.overwrite false - Whether to overwrite files added through SparkContext.addFile() when the target file exists and its contents do not - match those of the source. + Whether to overwrite files added through SparkContext.addFile() when the target file exists and + its contents do not match those of the source. @@ -435,8 +444,8 @@ Apart from these, the following properties are also available, and may be useful 0.6 Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" - generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase - it if you configure your own old generation size. + generation of objects in the JVM, which by default is given 0.6 of the heap, but you can + increase it if you configure your own old generation size. @@ -444,8 +453,8 @@ Apart from these, the following properties are also available, and may be useful System.getProperty("java.io.tmpdir") Directories of the Tachyon File System that store RDDs. The Tachyon file system's URL is set by - spark.tachyonStore.url. It can also be a comma-separated list of multiple directories - on Tachyon file system. + spark.tachyonStore.url. It can also be a comma-separated list of multiple + directories on Tachyon file system. 
@@ -502,33 +511,36 @@ Apart from these, the following properties are also available, and may be useful spark.akka.heartbeat.pauses 600 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you - plan to use this feature (Not recommended). Acceptable heart beat pause in seconds for akka. This can be used to - control sensitivity to gc pauses. Tune this in combination of `spark.akka.heartbeat.interval` and - `spark.akka.failure-detector.threshold` if you need to. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be + enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause + in seconds for akka. This can be used to control sensitivity to gc pauses. Tune this in + combination of `spark.akka.heartbeat.interval` and `spark.akka.failure-detector.threshold` + if you need to. spark.akka.failure-detector.threshold 300.0 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you - plan to use this feature (Not recommended). This maps to akka's `akka.remote.transport-failure-detector.threshold`. - Tune this in combination of `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be + enabled again, if you plan to use this feature (Not recommended). This maps to akka's + `akka.remote.transport-failure-detector.threshold`. Tune this in combination of + `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to. spark.akka.heartbeat.interval 1000 - This is set to a larger value to disable failure detector that comes inbuilt akka. It can be enabled again, if you - plan to use this feature (Not recommended). A larger interval value in seconds reduces network overhead and a - smaller value ( ~ 1 s) might be more informative for akka's failure detector. Tune this in combination - of `spark.akka.heartbeat.pauses` and `spark.akka.failure-detector.threshold` if you need to. Only positive use - case for using failure detector can be, a sensistive failure detector can help evict rogue executors really - quick. However this is usually not the case as gc pauses and network lags are expected in a real spark cluster. - Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the - network with those. + This is set to a larger value to disable failure detector that comes inbuilt akka. It can be + enabled again, if you plan to use this feature (Not recommended). A larger interval value in + seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for + akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and + `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using + failure detector can be, a sensistive failure detector can help evict rogue executors really + quick. However this is usually not the case as gc pauses and network lags are expected in a + real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats + between nodes leading to flooding the network with those. 
@@ -579,17 +591,17 @@ Apart from these, the following properties are also available, and may be useful If set to "true", runs over Mesos clusters in "coarse-grained" sharing mode, - where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. - This gives lower-latency scheduling for short queries, but leaves resources in use for the whole - duration of the Spark job. + where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per + Spark task. This gives lower-latency scheduling for short queries, but leaves resources in use + for the whole duration of the Spark job. spark.speculation false - If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a - stage, they will be re-launched. + If set to "true", performs speculative execution of tasks. This means if one or more tasks are + running slowly in a stage, they will be re-launched. @@ -652,7 +664,8 @@ Apart from these, the following properties are also available, and may be useful spark.scheduler.revive.interval 1000 - The interval length for the scheduler to revive the worker resource offers to run tasks. (in milliseconds) + The interval length for the scheduler to revive the worker resource offers to run tasks. + (in milliseconds) @@ -664,8 +677,8 @@ Apart from these, the following properties are also available, and may be useful spark.authenticate false - Whether spark authenticates its internal connections. See spark.authenticate.secret if not - running on Yarn. + Whether spark authenticates its internal connections. See + spark.authenticate.secret if not running on Yarn. @@ -691,7 +704,8 @@ Apart from these, the following properties are also available, and may be useful Comma separated list of filter class names to apply to the Spark web ui. The filter should be a standard javax servlet Filter. Parameters to each filter can also be specified by setting a java system property of spark.<class name of filter>.params='param1=value1,param2=value2' - (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing') + (e.g. -Dspark.ui.filters=com.test.filter1 + -Dspark.com.test.filter1.params='param1=foo,param2=testing') @@ -721,10 +735,11 @@ Apart from these, the following properties are also available, and may be useful spark.cleaner.ttl (infinite) - Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks generated, etc.). - Periodic cleanups will ensure that metadata older than this duration will be forgotten. This is - useful for running Spark for many hours / days (for example, running 24/7 in case of Spark Streaming - applications). Note that any RDD that persists in memory for more than this duration will be cleared as well. + Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks + generated, etc.). Periodic cleanups will ensure that metadata older than this duration will be + forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in + case of Spark Streaming applications). Note that any RDD that persists in memory for more than + this duration will be cleared as well. @@ -782,8 +797,8 @@ The following variables can be set in `spark-env.sh`: In addition to the above, there are also options for setting up the Spark -[standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores to use on each -machine and maximum memory. 
+[standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores +to use on each machine and maximum memory. Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface. From 27d57db59621703c948de02226bb1bc1d382aad1 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 11:19:28 -0700 Subject: [PATCH 11/16] Reverting changes to index.html (covered in #896) --- docs/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/index.md b/docs/index.md index fb75bc678c8a..c9b10376cc80 100644 --- a/docs/index.md +++ b/docs/index.md @@ -5,7 +5,7 @@ title: Spark Overview Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. -It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) (SQL on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). +It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html). # Downloading From e0c17289ec77c7a2b9c717fbe5939435e2e2bb9e Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 11:33:45 -0700 Subject: [PATCH 12/16] Response to Matei's review --- docs/configuration.md | 65 ++++++++++++++++++++-------------------- docs/spark-standalone.md | 12 ++++---- 2 files changed, 39 insertions(+), 38 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index 900cb884dc31..9d00d25549ac 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -64,7 +64,7 @@ This is a useful place to check to make sure that your properties have been set that only values explicitly specified through either `spark-defaults.conf` or SparkConf will appear. For all other configuration properties, you can assume the default value is used. -## All Configuration Properties +## Available Properties Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are: @@ -72,14 +72,14 @@ of the most common options to set are: - + - + - - - - - @@ -292,7 +283,7 @@ Apart from these, the following properties are also available, and may be useful @@ -307,7 +298,7 @@ Apart from these, the following properties are also available, and may be useful @@ -457,6 +448,15 @@ Apart from these, the following properties are also available, and may be useful directories on Tachyon file system. + + + + + @@ -464,6 +464,17 @@ Apart from these, the following properties are also available, and may be useful The URL of the underlying Tachyon file system in the TachyonStore. + + + + +
Property Name | Default | Meaning
spark.app.namespark.app.name (none) The name of your application. This will appear in the UI and in log data.
spark.masterspark.master (none) The cluster manager to connect to. See the list of @@ -244,15 +244,6 @@ Apart from these, the following properties are also available, and may be useful reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
spark.storage.memoryMapThreshold8192 - Size of a block, in bytes, above which Spark memory maps when reading a block from disk. - This prevents Spark from memory mapping very small blocks. In general, memory - mapping has high overhead for blocks close to or below the page size of the operating system. -
spark.reducer.maxMbInFlight 48spark.eventLog.enabled false - Whether to log spark events, useful for reconstructing the Web UI after the application has + Whether to log Spark events, useful for reconstructing the Web UI after the application has finished.
spark.eventLog.dir file:///tmp/spark-events - Base directory in which spark events are logged, if spark.eventLog.enabled is true. + Base directory in which Spark events are logged, if spark.eventLog.enabled is true. Within this base directory, Spark creates a sub-directory for each application, and logs the events specific to the application in this directory.
spark.storage.memoryMapThreshold8192 + Size of a block, in bytes, above which Spark memory maps when reading a block from disk. + This prevents Spark from memory mapping very small blocks. In general, memory + mapping has high overhead for blocks close to or below the page size of the operating system. +
spark.tachyonStore.url tachyon://localhost:19998
spark.cleaner.ttl(infinite) + Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks + generated, etc.). Periodic cleanups will ensure that metadata older than this duration will be + forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in + case of Spark Streaming applications). Note that any RDD that persists in memory for more than + this duration will be cleared as well. +
#### Networking @@ -539,7 +550,7 @@ Apart from these, the following properties are also available, and may be useful `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using failure detector can be, a sensistive failure detector can help evict rogue executors really quick. However this is usually not the case as gc pauses and network lags are expected in a - real spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats + real Spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats between nodes leading to flooding the network with those. @@ -677,8 +688,8 @@ Apart from these, the following properties are also available, and may be useful spark.authenticate false - Whether spark authenticates its internal connections. See - spark.authenticate.secret if not running on Yarn. + Whether Spark authenticates its internal connections. See + spark.authenticate.secret if not running on YARN. @@ -686,7 +697,7 @@ Apart from these, the following properties are also available, and may be useful None Set the secret key used for Spark to authenticate between components. This needs to be set if - not running on Yarn and authentication is enabled. + not running on YARN and authentication is enabled. @@ -702,7 +713,8 @@ Apart from these, the following properties are also available, and may be useful None Comma separated list of filter class names to apply to the Spark web ui. The filter should be a - standard javax servlet Filter. Parameters to each filter can also be specified by setting a + standard + javax servlet Filter. Parameters to each filter can also be specified by setting a java system property of spark.<class name of filter>.params='param1=value1,param2=value2' (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing') @@ -712,7 +724,7 @@ Apart from these, the following properties are also available, and may be useful spark.ui.acls.enable false - Whether spark web ui acls should are enabled. If enabled, this checks to see if the user has + Whether Spark web ui acls should are enabled. If enabled, this checks to see if the user has access permissions to view the web ui. See spark.ui.view.acls for more details. Also note this requires the user to be known, if the user comes across as null no checks are done. Filters can be used to authenticate and set the user. @@ -722,7 +734,7 @@ Apart from these, the following properties are also available, and may be useful spark.ui.view.acls Empty - Comma separated list of users that have view access to the spark web ui. By default only the + Comma separated list of users that have view access to the Spark web ui. By default only the user that started the Spark job has view access. @@ -731,17 +743,6 @@ Apart from these, the following properties are also available, and may be useful #### Spark Streaming - - - - - diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 15c5816182a8..20ae4b00115c 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -157,7 +157,7 @@ SPARK_MASTER_OPTS supports the following system properties:
Property Name | Default | Meaning
spark.cleaner.ttl(infinite) - Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks - generated, etc.). Periodic cleanups will ensure that metadata older than this duration will be - forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in - case of Spark Streaming applications). Note that any RDD that persists in memory for more than - this duration will be cleared as well. -
spark.streaming.blockInterval 200
- + - + - + @@ -239,7 +240,8 @@ Apart from these, the following properties are also available, and may be useful @@ -306,7 +308,8 @@ Apart from these, the following properties are also available, and may be useful
Property Name | Default | Meaning
spark.deploy.spreadOutspark.deploy.spreadOut true Whether the standalone cluster manager should spread applications out across nodes or try @@ -166,7 +166,7 @@ SPARK_MASTER_OPTS supports the following system properties:
spark.deploy.defaultCoresspark.deploy.defaultCores (infinite) Default number of cores to give to applications in Spark's standalone mode if they don't @@ -177,7 +177,7 @@ SPARK_MASTER_OPTS supports the following system properties:
spark.worker.timeoutspark.worker.timeout 60 Number of seconds after which the standalone deploy master considers a worker lost if it @@ -191,7 +191,7 @@ SPARK_WORKER_OPTS supports the following system properties: - + - + - +
Property Name | Default | Meaning
spark.worker.cleanup.enabledspark.worker.cleanup.enabled false Enable periodic cleanup of worker / application directories. Note that this only affects standalone @@ -200,7 +200,7 @@ SPARK_WORKER_OPTS supports the following system properties:
spark.worker.cleanup.intervalspark.worker.cleanup.interval 1800 (30 minutes) Controls the interval, in seconds, at which the worker cleans up old application work dirs @@ -208,7 +208,7 @@ SPARK_WORKER_OPTS supports the following system properties:
spark.worker.cleanup.appDataTtlspark.worker.cleanup.appDataTtl 7 * 24 * 3600 (7 days) The number of seconds to retain application work directories on each worker. This is a Time To Live From d9c264ff225b65d47f616aa1a1690933802b9973 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 11:44:23 -0700 Subject: [PATCH 13/16] Small fix --- docs/spark-standalone.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 20ae4b00115c..dca80a9a6961 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -218,6 +218,7 @@ SPARK_WORKER_OPTS supports the following system properties:
+ # Connecting an Application to the Cluster To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext` From 16ae7767e7deb5366ea46732f8d6d7e52d7f0d6f Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 14:41:50 -0700 Subject: [PATCH 14/16] Adding back header section --- docs/configuration.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index 9d00d25549ac..df50be003c7d 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -2,11 +2,17 @@ layout: global title: Spark Configuration --- - * This will become a table of contents (this text will be scraped). {:toc} -Spark provides several locations to configure the system: +Spark provides three locations to configure the system: + +* [Spark properties](#spark-properties) control most application parameters and can be set by passing + a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java + system properties. +* [Environment variables](#environment-variables) can be used to set per-machine settings, such as + the IP address, through the `conf/spark-env.sh` script on each node. +* [Logging](#configuring-logging) can be configured through `log4j.properties`. # Spark Properties From 6f66efc31612ff43814f48da946339f855f24f38 Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 14:58:57 -0700 Subject: [PATCH 15/16] More feedback --- docs/configuration.md | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index df50be003c7d..d654975b8dc8 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -221,7 +221,8 @@ Apart from these, the following properties are also available, and may be useful
spark.shuffle.spill.compress true - Whether to compress data spilled during shuffles. + Whether to compress data spilled during shuffles. Compression will use + spark.io.compression.codec.
spark.shuffle.compress true - Whether to compress map output files. Generally a good idea. + Whether to compress map output files. Generally a good idea. Compression will use + spark.io.compression.codec.
Base directory in which Spark events are logged, if spark.eventLog.enabled is true. Within this base directory, Spark creates a sub-directory for each application, and logs the - events specific to the application in this directory. + events specific to the application in this directory. Users may want to set this to + an HDFS directory so that history files can be read by the history server. 
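If event logging is meant to feed the history server as described above, a `spark-defaults.conf` fragment in the same style as the earlier examples in this file might look like the following; the HDFS host and path are placeholders, not values taken from the patch:

    spark.eventLog.enabled  true
    spark.eventLog.dir      hdfs://namenode:8021/spark-events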
@@ -336,7 +339,9 @@ Apart from these, the following properties are also available, and may be useful The codec used to compress internal data such as RDD partitions and shuffle outputs. By default, Spark provides two codecs: org.apache.spark.io.LZFCompressionCodec - and org.apache.spark.io.SnappyCompressionCodec. + and org.apache.spark.io.SnappyCompressionCodec. Of these two choices, + Snappy offers faster compression and decompression, while LZF offers a better compression + ratio. @@ -770,6 +775,14 @@ Apart from these, the following properties are also available, and may be useful +#### Cluster Managers (YARN, Mesos, Standalone) +Each cluster manager in Spark has additional configuration options. Configurations +can be found on the pages for each mode: + + * [Yarn](running-on-yarn.html#configuration) + * [Mesos](running-on-mesos.html) + * [Standalone Mode](spark-standalone.html#cluster-launch-scripts) + # Environment Variables Certain Spark settings can be configured through environment variables, which are read from the From 93f56c3e9248f2977d8db9d162b902fe2d52333e Mon Sep 17 00:00:00 2001 From: Patrick Wendell Date: Wed, 28 May 2014 15:47:25 -0700 Subject: [PATCH 16/16] Feedback from Matei --- docs/configuration.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/configuration.md b/docs/configuration.md index d654975b8dc8..b6e7fd34eae6 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -723,12 +723,14 @@ Apart from these, the following properties are also available, and may be useful spark.ui.filters None - Comma separated list of filter class names to apply to the Spark web ui. The filter should be a + Comma separated list of filter class names to apply to the Spark web UI. The filter should be a standard javax servlet Filter. Parameters to each filter can also be specified by setting a - java system property of spark.<class name of filter>.params='param1=value1,param2=value2' - (e.g. -Dspark.ui.filters=com.test.filter1 - -Dspark.com.test.filter1.params='param1=foo,param2=testing') + java system property of:
+ spark.<class name of filter>.params='param1=value1,param2=value2'
+ For example:
+ -Dspark.ui.filters=com.test.filter1
+ -Dspark.com.test.filter1.params='param1=foo,param2=testing' @@ -779,7 +781,7 @@ Apart from these, the following properties are also available, and may be useful Each cluster manager in Spark has additional configuration options. Configurations can be found on the pages for each mode: - * [Yarn](running-on-yarn.html#configuration) + * [YARN](running-on-yarn.html#configuration) * [Mesos](running-on-mesos.html) * [Standalone Mode](spark-standalone.html#cluster-launch-scripts) @@ -828,4 +830,3 @@ compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface. Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there. -
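Finally, to round out the `spark.ui.filters` discussion in the last patch, here is a minimal, hypothetical sketch of a standard javax servlet Filter that could be named in that property. The class and its `param1` parameter mirror the `com.test.filter1` example above and are not part of any Spark API.

{% highlight scala %}
import javax.servlet.{Filter, FilterChain, FilterConfig, ServletRequest, ServletResponse}

// Hypothetical filter in the spirit of the com.test.filter1 example.
// It reads the "param1" init parameter and passes every request through;
// a real filter would authenticate the request and establish the user here.
class SimpleUIFilter extends Filter {
  private var param1: String = _

  override def init(config: FilterConfig): Unit = {
    param1 = config.getInitParameter("param1")
  }

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    chain.doFilter(req, res)
  }

  override def destroy(): Unit = {}
}
{% endhighlight %}

Such a filter would then be enabled exactly as the patch shows, through the `spark.ui.filters` property and the matching `.params` system property.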