diff --git a/docs/building-spark.md b/docs/building-spark.md
index 4b8e70655d59c..33d253a49dbf3 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -286,7 +286,7 @@ If use an individual repository or a repository on GitHub Enterprise, export bel
 ### Related environment variables
-
+
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 7da06a852089e..34913bd97a418 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -91,7 +91,7 @@ The [job scheduling overview](job-scheduling.html) describes this in more detail
 The following table summarizes terms you'll see used to refer to cluster concepts:
-
+
diff --git a/docs/configuration.md b/docs/configuration.md
index 1139beb66462f..3d0842a8a13d3 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -135,7 +135,7 @@ of the most common options to set are:
 ### Application Properties
-
+
@@ -520,7 +520,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Runtime Environment
-
+
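For readers skimming this hunk, the Application Properties table it touches covers settings such as `spark.app.name` and `spark.master`; a minimal, hypothetical sketch of setting the most common ones in code (app name, master URL and memory value are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the most common application properties, set programmatically.
val spark = SparkSession.builder()
  .appName("MyApp")                        // spark.app.name
  .master("local[4]")                      // spark.master
  .config("spark.executor.memory", "2g")   // any other property by key
  .getOrCreate()

println(spark.conf.get("spark.app.name"))
spark.stop()
```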
@@ -907,7 +907,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Shuffle Behavior
-
+
@@ -1282,7 +1282,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Spark UI
-
+
@@ -1674,7 +1674,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Compression and Serialization
-
+
@@ -1872,7 +1872,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Memory Management
-
+
@@ -1997,7 +1997,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Execution Behavior
-
+
@@ -2247,7 +2247,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Executor Metrics
-
+
@@ -2315,7 +2315,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Networking
-
+
@@ -2478,7 +2478,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Scheduling
-
+
@@ -2962,7 +2962,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Barrier Execution Mode
-
+
@@ -3009,7 +3009,7 @@ Apart from these, the following properties are also available, and may be useful
 ### Dynamic Allocation
-
+
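The Dynamic Allocation section touched above is controlled by a small family of `spark.dynamicAllocation.*` properties; a hedged sketch of turning it on through `SparkConf` (the executor bounds are illustrative, and shuffle tracking is one way to satisfy its shuffle requirement):

```scala
import org.apache.spark.SparkConf

// Sketch: enable dynamic allocation with shuffle tracking instead of an
// external shuffle service. The executor bounds are illustrative only.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
```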
@@ -3151,7 +3151,7 @@ finer granularity starting from driver and executor. Take RPC module as example
 like shuffle, just replace "rpc" with "shuffle" in the property names except
 spark.{driver|executor}.rpc.netty.dispatcher.numThreads, which is only for RPC module.
-
+
@@ -3294,7 +3294,7 @@ External users can query the static sql config values via `SparkSession.conf` or
 ### Spark Streaming
-
+
@@ -3426,7 +3426,7 @@ External users can query the static sql config values via `SparkSession.conf` or
 ### SparkR
-
+
@@ -3482,7 +3482,7 @@ External users can query the static sql config values via `SparkSession.conf` or
 ### GraphX
-
+
@@ -3497,7 +3497,7 @@ External users can query the static sql config values via `SparkSession.conf` or
 ### Deploy
-
+
@@ -3547,7 +3547,7 @@ copy `conf/spark-env.sh.template` to create it. Make sure you make the copy exec
 The following variables can be set in `spark-env.sh`:
-
+
@@ -3684,7 +3684,7 @@ Push-based shuffle helps improve the reliability and performance of spark shuffl
 ### External Shuffle service(server) side configuration options
-
+
@@ -3718,7 +3718,7 @@ Push-based shuffle helps improve the reliability and performance of spark shuffl
 ### Client side configuration options
-
+
diff --git a/docs/css/custom.css b/docs/css/custom.css
index 4576f45d1ab7d..e7416d9ded618 100644
--- a/docs/css/custom.css
+++ b/docs/css/custom.css
@@ -1110,5 +1110,18 @@ img {
 table {
   width: 100%;
   overflow-wrap: normal;
+  border-collapse: collapse; /* Ensures that the borders collapse into a single border */
 }
+table th, table td {
+  border: 1px solid #cccccc; /* Adds a border to each table header and data cell */
+  padding: 6px 13px; /* Optional: Adds padding inside each cell for better readability */
+}
+
+table tr {
+  background-color: white; /* Sets a default background color for all rows */
+}
+
+table tr:nth-child(2n) {
+  background-color: #F1F4F5; /* Sets a different background color for even rows */
+}
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index d184f4fe0257c..604b3245272fc 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -703,7 +703,7 @@ others.
 ### Available families
-
+
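The "Available families" table touched above describes the `family` parameter of logistic regression (auto, binomial, multinomial); a minimal sketch that forces the multinomial family on a tiny, made-up dataset:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("families-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Three classes with one feature, just enough to exercise the multinomial family.
val training = Seq(
  (0.0, Vectors.dense(0.1)),
  (1.0, Vectors.dense(1.1)),
  (2.0, Vectors.dense(2.1))
).toDF("label", "features")

val lr = new LogisticRegression().setFamily("multinomial").setMaxIter(10)
val model = lr.fit(training)
println(model.coefficientMatrix)
spark.stop()
```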
@@ -1224,7 +1224,7 @@ All output columns are optional; to exclude an output column, set its correspond
 ### Input Columns
-
+
@@ -1251,7 +1251,7 @@ All output columns are optional; to exclude an output column, set its correspond
 ### Output Columns
-
+
@@ -1326,7 +1326,7 @@ All output columns are optional; to exclude an output column, set its correspond
 #### Input Columns
-
+
@@ -1353,7 +1353,7 @@ All output columns are optional; to exclude an output column, set its correspond
 #### Output Columns (Predictions)
-
+
@@ -1407,7 +1407,7 @@ All output columns are optional; to exclude an output column, set its correspond
 #### Input Columns
-
+
@@ -1436,7 +1436,7 @@ Note that `GBTClassifier` currently only supports binary labels.
 #### Output Columns (Predictions)
-
+
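Since the hunk context notes that `GBTClassifier` currently only supports binary labels, a small sketch with 0/1 labels on made-up data:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gbt-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Binary labels only, per the note above.
val data = Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0)),
  (0.0, Vectors.dense(0.2, 0.8)),
  (1.0, Vectors.dense(0.9, 0.1))
).toDF("label", "features")

val model = new GBTClassifier().setMaxIter(5).fit(data)
model.transform(data).select("label", "prediction", "probability").show()
spark.stop()
```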
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index 00a156b6645ce..fdb8173ce3bbe 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -40,7 +40,7 @@ called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
 ### Input Columns
-
+
@@ -61,7 +61,7 @@ called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
 ### Output Columns
-
+
@@ -204,7 +204,7 @@ model.
 ### Input Columns
-
+
@@ -225,7 +225,7 @@ model.
 ### Output Columns
-
+
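The k-means Input/Output Columns tables touched above boil down to a `features` input column and a `prediction` output column; a minimal sketch on made-up points:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kmeans-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Two obvious clusters; the fitted model adds the "prediction" output column.
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
).map(Tuple1.apply).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(points)
model.transform(points).show()
model.clusterCenters.foreach(println)
spark.stop()
```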
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
index 10cb85e392029..b3305314abc56 100644
--- a/docs/mllib-classification-regression.md
+++ b/docs/mllib-classification-regression.md
@@ -26,7 +26,7 @@ classification](http://en.wikipedia.org/wiki/Multiclass_classification), and
 [regression analysis](http://en.wikipedia.org/wiki/Regression_analysis). The table below outlines the
 supported algorithms for each type of problem.
-
+
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 174255c48b699..0d9886315e288 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -51,7 +51,7 @@ The *node impurity* is a measure of the homogeneity of the labels at the node. T
 implementation provides two impurity measures for classification (Gini impurity and entropy) and one
 impurity measure for regression (variance).
-
+
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index b1006f2730db5..fdad7ae68dd49 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -191,7 +191,7 @@ Note that each loss is applicable to one of classification or regression, not bo
 Notation: $N$ = number of instances. $y_i$ = label of instance $i$. $x_i$ = features of instance $i$.
 $F(x_i)$ = model's predicted label for instance $i$.
-
+
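Using the notation quoted in the hunk context above ($N$ instances, labels $y_i$, features $x_i$, prediction $F(x_i)$), the losses that the ensembles table lists are, to the best of my recollection of that page (worth confirming against the rendered docs):

```latex
% Classification labels are assumed to be y_i \in \{-1, +1\}.
\text{Log Loss (classification):}\quad 2\sum_{i=1}^{N}\log\bigl(1+\exp(-2\,y_i\,F(x_i))\bigr)
\text{Squared Error (regression):}\quad \sum_{i=1}^{N}\bigl(y_i - F(x_i)\bigr)^{2}
\text{Absolute Error (regression):}\quad \sum_{i=1}^{N}\bigl|\,y_i - F(x_i)\,\bigr|
```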
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index f82f6a01136b9..30acc3dc634be 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -76,7 +76,7 @@ plots (recall, false positive rate) points.
 **Available metrics**
-
+
@@ -179,7 +179,7 @@ For this section, a modified delta function $\hat{\delta}(x)$ will prove useful
 $$\hat{\delta}(x) = \begin{cases}1 & \text{if $x = 0$}, \\ 0 & \text{otherwise}.\end{cases}$$
-
+
@@ -296,7 +296,7 @@ The following definition of indicator function $I_A(x)$ on a set $A$ will be nec
 $$I_A(x) = \begin{cases}1 & \text{if $x \in A$}, \\ 0 & \text{otherwise}.\end{cases}$$
-
+
@@ -447,7 +447,7 @@ documents, returns a relevance score for the recommended document.
 $$rel_D(r) = \begin{cases}1 & \text{if $r \in D$}, \\ 0 & \text{otherwise}.\end{cases}$$
-
+
@@ -553,7 +553,7 @@ variable from a number of independent variables.
 **Available metrics**
-
+
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index b535d2de307a9..448d881f794a5 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -72,7 +72,7 @@ training error) and minimizing model complexity (i.e., to avoid overfitting).
 The following table summarizes the loss functions and their gradients or sub-gradients for the
 methods `spark.mllib` supports:
-
+
@@ -105,7 +105,7 @@ The purpose of the
 encourage simple models and avoid overfitting. We support the following
 regularizers in `spark.mllib`:
-
+
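For reference, the loss functions and regularizers summarized by the two `spark.mllib` tables touched above have these standard forms (with weight vector $w$, features $x$, and label $y \in \{-1,+1\}$; check the rendered guide for the exact variants used there):

```latex
\text{hinge loss:}\quad    L(w; x, y) = \max\{0,\; 1 - y\,w^{T}x\}
\text{logistic loss:}\quad L(w; x, y) = \log\bigl(1 + \exp(-y\,w^{T}x)\bigr)
\text{squared loss:}\quad  L(w; x, y) = \tfrac{1}{2}\,(w^{T}x - y)^{2}

\text{L2 regularizer:}\quad R(w) = \tfrac{1}{2}\lVert w\rVert_{2}^{2}
\qquad
\text{L1 regularizer:}\quad R(w) = \lVert w\rVert_{1}
\qquad
\text{elastic net:}\quad    R(w) = \alpha\lVert w\rVert_{1} + (1-\alpha)\,\tfrac{1}{2}\lVert w\rVert_{2}^{2}
```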
diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md
index e20d7c2fe4e17..02b5fda7a36df 100644
--- a/docs/mllib-pmml-model-export.md
+++ b/docs/mllib-pmml-model-export.md
@@ -28,7 +28,7 @@ license: |
 The table below outlines the `spark.mllib` models that can be exported to PMML and their equivalent PMML model.
-
+
diff --git a/docs/monitoring.md b/docs/monitoring.md
index ebd8781fd0071..ae6a9a5eaee81 100644
--- a/docs/monitoring.md
+++ b/docs/monitoring.md
@@ -69,7 +69,7 @@ The history server can be configured as follows:
 ### Environment Variables
-
+
@@ -145,7 +145,7 @@ Use it with caution.
 Security options for the Spark History Server are covered more detail in the
 [Security](security.html#web-ui) page.
-
+
@@ -470,7 +470,7 @@ only for applications in cluster mode, not applications in client mode. Applicat
 can be identified by their `[attempt-id]`. In the API listed below, when running in YARN cluster mode,
 `[app-id]` will actually be `[base-app-id]/[attempt-id]`, where `[base-app-id]` is the YARN application ID.
-
+
@@ -669,7 +669,7 @@ The REST API exposes the values of the Task Metrics collected by Spark executors
 of task execution. The metrics can be used for performance troubleshooting and workload characterization.
 A list of the available metrics, with a short description:
-
+
@@ -827,7 +827,7 @@ In addition, aggregated per-stage peak values of the executor memory metrics are
 Executor memory metrics are also exposed via the Spark metrics system based on the
 [Dropwizard metrics library](https://metrics.dropwizard.io/4.2.0).
 A list of the available metrics, with a short description:
-
+
diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md
index aee22ad484e60..cc897aea06c93 100644
--- a/docs/rdd-programming-guide.md
+++ b/docs/rdd-programming-guide.md
@@ -378,7 +378,7 @@ resulting Java objects using [pickle](https://github.com/irmen/pickle/). When sa
 PySpark does the reverse. It unpickles Python objects into Java objects and then converts them to Writables.
 The following Writables are automatically converted:
-
+
 Writable Type | Python Type
 Text | str
 IntWritable | int
@@ -954,7 +954,7 @@ and pair RDD functions doc
 [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
 for details.
-
+
@@ -1069,7 +1069,7 @@ and pair RDD functions doc
 [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
 for details.
-
+
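The transformations and actions tables touched above are easiest to read next to a tiny example; a sketch that uses a few of the listed operations on in-memory data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-ops-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val lines  = sc.parallelize(Seq("a b", "b c", "c d"))
val words  = lines.flatMap(_.split(" "))        // transformation: flatMap(func)
val pairs  = words.map(w => (w, 1))             // transformation: map(func)
val counts = pairs.reduceByKey(_ + _)           // transformation: reduceByKey(func)
val total  = counts.map(_._2).reduce(_ + _)     // action: reduce(func)
counts.collect().foreach(println)               // action: collect()
println(s"total words: $total")
spark.stop()
```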
@@ -1214,7 +1214,7 @@ to `persist()`. The `cache()` method is a shorthand for using the default storag
 which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The full set of
 storage levels is:
-
+
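As the context notes, `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`; a short sketch of picking a different level from the storage-levels table:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000000)

rdd.cache()                                 // same as persist(StorageLevel.MEMORY_ONLY)
rdd.unpersist()                             // the level can only be changed after unpersisting
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions that do not fit in memory to disk
println(rdd.count())
spark.stop()
```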
diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md
index 707a76196f3ab..1927372221608 100644
--- a/docs/running-on-kubernetes.md
+++ b/docs/running-on-kubernetes.md
@@ -579,7 +579,7 @@ See the [configuration page](configuration.html) for information on Spark config
 #### Spark Properties
-
+
@@ -1645,7 +1645,7 @@ See the below table for the full list of pod specifications that will be overwri
 ### Pod Metadata
-
+
@@ -1681,7 +1681,7 @@ See the below table for the full list of pod specifications that will be overwri
 ### Pod Spec
-
+
@@ -1734,7 +1734,7 @@ See the below table for the full list of pod specifications that will be overwri
 The following affect the driver and executor containers. All other containers in the pod spec will be unaffected.
-
+
diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md
index b1a54a089a542..3d1c57030982d 100644
--- a/docs/running-on-mesos.md
+++ b/docs/running-on-mesos.md
@@ -374,7 +374,7 @@ See the [configuration page](configuration.html) for information on Spark config
 #### Spark Properties
-
+
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 5eec6c490cb1f..f30251c99aa5c 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -143,7 +143,7 @@ To use a custom metrics.properties for the application master and executors, upd
 #### Spark Properties
-
+
@@ -696,7 +696,7 @@ To use a custom metrics.properties for the application master and executors, upd
 #### Available patterns for SHS custom executor log URL
-
+
@@ -779,7 +779,7 @@ staging directory of the Spark application.
 ## YARN-specific Kerberos Configuration
-
+
@@ -878,7 +878,7 @@ to avoid garbage collection issues during shuffle.
 The following extra configuration options are available when the shuffle service is running on YARN:
-
+
diff --git a/docs/security.md b/docs/security.md
index 3c6fd507fec6d..c5d132f680a41 100644
--- a/docs/security.md
+++ b/docs/security.md
@@ -60,7 +60,7 @@ distributing the shared secret. Each application will use a unique shared secret
 the case of YARN, this feature relies on YARN RPC encryption being enabled for the distribution of
 secrets to be secure.
-
+
@@ -82,7 +82,7 @@ that any user that can list pods in the namespace where the Spark application is
 also see their authentication secret. Access control rules should be properly set up by the
 Kubernetes admin to ensure that Spark authentication is secure.
-
+
@@ -103,7 +103,7 @@ Kubernetes admin to ensure that Spark authentication is secure.
 Alternatively, one can mount authentication secrets using files and Kubernetes secrets that
 the user mounts into their pods.
-
+
@@ -159,7 +159,7 @@ is still required when talking to shuffle services from Spark versions older tha
 The following table describes the different options available for configuring this feature.
-
+
@@ -219,7 +219,7 @@ encrypting output data generated by applications with APIs such as `saveAsHadoop
 The following settings cover enabling encryption for data written to disk:
-
+
@@ -287,7 +287,7 @@ below.
 The following options control the authentication of Web UIs:
-
+
@@ -391,7 +391,7 @@ servlet filters.
 To enable authorization in the SHS, a few extra options are used:
-
+
@@ -440,7 +440,7 @@ protocol-specific settings. This way the user can easily provide the common sett
 protocols without disabling the ability to configure each one individually. The following table
 describes the SSL configuration namespaces:
-
+
@@ -471,7 +471,7 @@ describes the SSL configuration namespaces:
 The full breakdown of available SSL options can be found below. The `${ns}` placeholder should be
 replaced with one of the above namespaces.
-
+
@@ -641,7 +641,7 @@ Apache Spark can be configured to include HTTP headers to aid in preventing Cros
 (XSS), Cross-Frame Scripting (XFS), MIME-Sniffing, and also to enforce HTTP Strict Transport
 Security.
-
+
@@ -697,7 +697,7 @@ configure those ports.
 ## Standalone mode only
-
+
@@ -748,7 +748,7 @@ configure those ports.
 ## All cluster managers
-
+
@@ -824,7 +824,7 @@ deployment-specific page for more information.
 The following options provides finer-grained control for this feature:
-
+
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index 3e87edad0aadd..d606e7df6a187 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -53,7 +53,7 @@ You should see the new node listed there, along with its number of CPUs and memo
 Finally, the following configuration options can be passed to the master and worker:
-
+
@@ -116,7 +116,7 @@ Note that these scripts must be executed on the machine you want to run the Spar
 You can optionally configure the cluster further by setting environment variables in `conf/spark-env.sh`.
 Create this file by starting with the `conf/spark-env.sh.template`, and _copy it to all your worker machines_
 for the settings to take effect. The following settings are available:
-
+
@@ -188,7 +188,7 @@ You can optionally configure the cluster further by setting environment variable
 SPARK_MASTER_OPTS supports the following system properties:
-
+
@@ -287,7 +287,7 @@ SPARK_MASTER_OPTS supports the following system properties:
 SPARK_WORKER_OPTS supports the following system properties:
-
+
@@ -392,7 +392,7 @@ You can also pass an option `--total-executor-cores ` to control the n
 Spark applications supports the following configuration properties specific to standalone mode:
-
+
@@ -541,7 +541,7 @@ ZooKeeper is the best way to go for production-level high availability, but if y
 In order to enable this recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spark-env using this configuration:
-
+
diff --git a/docs/sparkr.md b/docs/sparkr.md
index 8e6a98e40b680..a34a1200c4c00 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -77,7 +77,7 @@ sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g
 The following Spark driver properties can be set in `sparkConfig` with `sparkR.session` from RStudio:
-
+
@@ -588,7 +588,7 @@ The following example shows how to save/load a MLlib model by SparkR.
 {% include_example read_write r/ml/ml.R %}
 # Data type mapping between R and Spark
-
+
@@ -728,7 +728,7 @@ function is masking another function.
 The following functions are masked by the SparkR package:
-
+
diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md
index b01174b918245..c846116ebf3e3 100644
--- a/docs/sql-data-sources-avro.md
+++ b/docs/sql-data-sources-avro.md
@@ -233,7 +233,7 @@ Data source options of Avro can be set via:
 * the `.option` method on `DataFrameReader` or `DataFrameWriter`.
 * the `options` parameter in function `from_avro`.
-
+
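The Avro options in the table touched above are passed through `.option(...)`; a hedged sketch that writes and reads a tiny Avro dataset (assumes the external spark-avro module is on the classpath, e.g. via `--packages org.apache.spark:spark-avro_2.12:<version>`; the path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/users_avro"   // placeholder location

Seq(("alice", 1), ("bob", 2)).toDF("name", "favorite_number")
  .write.format("avro").mode("overwrite").save(path)

val usersDF = spark.read.format("avro").load(path)
usersDF.show()
spark.stop()
```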
@@ -331,7 +331,7 @@ Data source options of Avro can be set via:
 ## Configuration
 Configuration of Avro can be done using the `setConf` method on SparkSession or by running `SET key=value` commands using SQL.
-
+
@@ -418,7 +418,7 @@ Submission Guide for more details.
 ## Supported types for Avro -> Spark SQL conversion
 Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.11.2/specification/#primitive-types) and [complex types](https://avro.apache.org/docs/1.11.2/specification/#complex-types) under records of Avro.
-
+
@@ -483,7 +483,7 @@ All other union types are considered complex. They will be mapped to StructType
 It also supports reading the following Avro [logical types](https://avro.apache.org/docs/1.11.2/specification/#logical-types):
-
+
@@ -516,7 +516,7 @@ At the moment, it ignores docs, aliases and other properties present in the Avro
 ## Supported types for Spark SQL -> Avro conversion
 Spark supports writing of all Spark SQL types into Avro. For most types, the mapping from Spark types to Avro types is straightforward (e.g. IntegerType gets converted to int); however, there are a few special cases which are listed below:
-
+
@@ -552,7 +552,7 @@ Spark supports writing of all Spark SQL types into Avro. For most types, the map
 You can also specify the whole output Avro schema with the option `avroSchema`, so that Spark SQL types can be converted into other Avro types. The following conversions are not applied by default and require user specified Avro schema:
-
+
diff --git a/docs/sql-data-sources-csv.md b/docs/sql-data-sources-csv.md
index 31167f5514302..241aae3571221 100644
--- a/docs/sql-data-sources-csv.md
+++ b/docs/sql-data-sources-csv.md
@@ -52,7 +52,7 @@ Data source options of CSV can be set via:
 * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
-
+
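The CSV options in the table touched above are likewise set per read; a small self-contained sketch (the path is a placeholder, and a sample file is written first so the read actually works):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val path = "/tmp/people_csv"   // placeholder location
Seq("name;age", "Alice;29", "Bob;35").toDF("value")
  .coalesce(1).write.mode("overwrite").text(path)

// A few of the common options from the table: custom separator, header row, schema inference.
val people = spark.read
  .option("sep", ";")
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path)
people.show()
people.printSchema()
spark.stop()
```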
diff --git a/docs/sql-data-sources-hive-tables.md b/docs/sql-data-sources-hive-tables.md
index 0de573ec64b89..13cd8fc2cc056 100644
--- a/docs/sql-data-sources-hive-tables.md
+++ b/docs/sql-data-sources-hive-tables.md
@@ -75,7 +75,7 @@ format("serde", "input format", "output format"), e.g. `CREATE TABLE src(id int)
 By default, we will read the table files as plain text. Note that, Hive storage handler is not supported yet when
 creating table, you can create a table using storage handler at Hive side, and use Spark SQL to read it.
-
+
@@ -123,7 +123,7 @@ will compile against built-in Hive and use those classes for internal execution
 The following options can be used to configure the version of Hive that is used to retrieve metadata:
-
+
diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md
index f96776514c672..edcdef4bf0084 100644
--- a/docs/sql-data-sources-jdbc.md
+++ b/docs/sql-data-sources-jdbc.md
@@ -51,7 +51,7 @@ For connection properties, users can specify the JDBC connection properties in t
 user and password are normally provided as connection properties for logging into the data sources.
-
+
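The JDBC properties in the table touched above are usually easiest to see as reader options; a hedged sketch (connection details are placeholders, and a reachable database plus its JDBC driver on the classpath are assumed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-sketch").master("local[*]").getOrCreate()

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbserver:5432/mydb")   // placeholder
  .option("dbtable", "schema.tablename")                   // placeholder
  .option("user", "username")
  .option("password", "password")
  .load()
jdbcDF.printSchema()
spark.stop()
```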
diff --git a/docs/sql-data-sources-json.md b/docs/sql-data-sources-json.md
index 881a69cb1cea4..4ade5170a6d81 100644
--- a/docs/sql-data-sources-json.md
+++ b/docs/sql-data-sources-json.md
@@ -109,7 +109,7 @@ Data source options of JSON can be set via:
 * `schema_of_json`
 * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
-
+
diff --git a/docs/sql-data-sources-load-save-functions.md b/docs/sql-data-sources-load-save-functions.md
index 9d0a3f9c72b9a..31f6d944bc972 100644
--- a/docs/sql-data-sources-load-save-functions.md
+++ b/docs/sql-data-sources-load-save-functions.md
@@ -218,7 +218,7 @@ present. It is important to realize that these save modes do not utilize any loc
 atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
 new data.
-
+
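The save-modes table touched above maps `SaveMode` constants to their string equivalents; a minimal sketch showing overwrite and append (the output path is a placeholder):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("savemode-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
val path = "/tmp/savemode_demo"   // placeholder

// Default is SaveMode.ErrorIfExists. As the context above says, "overwrite"
// deletes existing data at the path before writing; "append" adds to it.
df.write.mode(SaveMode.Overwrite).parquet(path)
df.write.mode(SaveMode.Append).parquet(path)
println(spark.read.parquet(path).count())   // 4
spark.stop()
```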
diff --git a/docs/sql-data-sources-orc.md b/docs/sql-data-sources-orc.md
index 4e492598f595d..561f601aa4e56 100644
--- a/docs/sql-data-sources-orc.md
+++ b/docs/sql-data-sources-orc.md
@@ -129,7 +129,7 @@ When reading from Hive metastore ORC tables and inserting to Hive metastore ORC
 ### Configuration
-
+
@@ -230,7 +230,7 @@ Data source options of ORC can be set via:
 * `DataStreamWriter`
 * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
-
+
diff --git a/docs/sql-data-sources-parquet.md b/docs/sql-data-sources-parquet.md
index 925e47504e5ef..f49bbd7a9d042 100644
--- a/docs/sql-data-sources-parquet.md
+++ b/docs/sql-data-sources-parquet.md
@@ -386,7 +386,7 @@ Data source options of Parquet can be set via:
 * `DataStreamWriter`
 * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
-
+
@@ -434,7 +434,7 @@ Other generic options can be found in
-
+
diff --git a/docs/sql-data-sources-text.md b/docs/sql-data-sources-text.md
index bb485d29c396a..aed8a2e9942fb 100644
--- a/docs/sql-data-sources-text.md
+++ b/docs/sql-data-sources-text.md
@@ -47,7 +47,7 @@ Data source options of text can be set via:
 * `DataStreamWriter`
 * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
-
+
diff --git a/docs/sql-distributed-sql-engine-spark-sql-cli.md b/docs/sql-distributed-sql-engine-spark-sql-cli.md
index a67e009b9ae10..6d506cbb09c21 100644
--- a/docs/sql-distributed-sql-engine-spark-sql-cli.md
+++ b/docs/sql-distributed-sql-engine-spark-sql-cli.md
@@ -62,7 +62,7 @@ For example: `/path/to/spark-sql-cli.sql` equals to `file:///path/to/spark-sql-c
 ## Supported comment types
-
+
@@ -115,7 +115,7 @@ Use `;` (semicolon) to terminate commands. Notice:
 ```
 However, if ';' is the end of the line, it terminates the SQL statement. The example above will be terminated into
 `/* This is a comment contains ` and `*/ SELECT 1`, Spark will submit these two commands separated and throw parser error
 (`unclosed bracketed comment` and `Syntax error at or near '*/'`).
-
+
diff --git a/docs/sql-error-conditions-sqlstates.md b/docs/sql-error-conditions-sqlstates.md
index 5529c961b3bfb..49cfb56b36626 100644
--- a/docs/sql-error-conditions-sqlstates.md
+++ b/docs/sql-error-conditions-sqlstates.md
@@ -33,7 +33,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `0A`: feature not supported
-
+
@@ -48,7 +48,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `21`: cardinality violation
-
+
@@ -63,7 +63,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `22`: data exception
-
+
@@ -168,7 +168,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `23`: integrity constraint violation
-
+
@@ -183,7 +183,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `2B`: dependent privilege descriptors still exist
-
+
@@ -198,7 +198,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `38`: external routine exception
-
+
@@ -213,7 +213,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `39`: external routine invocation exception
-
+
@@ -228,7 +228,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `42`: syntax error or access rule violation
-
+
@@ -648,7 +648,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `46`: java ddl 1
-
+
@@ -672,7 +672,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `53`: insufficient resources
-
+
@@ -687,7 +687,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `54`: program limit exceeded
-
+
@@ -702,7 +702,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `HY`: CLI-specific condition
-
+
@@ -717,7 +717,7 @@ Spark SQL uses the following `SQLSTATE` classes:
 ## Class `XX`: internal error
-
+
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 5fc323ec1b0ea..090d96db46a1c 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -468,7 +468,7 @@ license: |
 ## Upgrading from Spark SQL 2.3 to 2.4
   - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
-
+
@@ -582,7 +582,7 @@ license: |
   - Since Spark 2.3, the Join/Filter's deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters are not eligible for predicate pushdown.
   - Partition column inference previously found incorrect common type for different inferred types, for example, previously it ended up with double type as the common type for double type and date type. Now it finds the correct common type for such conflicts. The conflict resolution follows the table below:
-
+
diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 1467409bb500d..2dec65cc553ed 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -34,7 +34,7 @@ memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableNam
 Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running
 `SET key=value` commands using SQL.
-
+
@@ -62,7 +62,7 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp
 The following options can also be used to tune the performance of query execution. It is possible
 that these options will be deprecated in future release as more optimizations are performed automatically.
-
+
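The in-memory caching options touched above sit next to `spark.catalog.cacheTable`, which the hunk context mentions; a minimal sketch combining the two:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-sketch").master("local[*]").getOrCreate()
import spark.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "value").createOrReplaceTempView("tableName")

spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.catalog.cacheTable("tableName")      // cache in the in-memory columnar format
spark.sql("SELECT COUNT(*) FROM tableName").show()
spark.catalog.uncacheTable("tableName")    // remove it from memory
spark.stop()
```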
@@ -253,7 +253,7 @@ Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that ma
 ### Coalescing Post Shuffle Partitions
 This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configurations are true. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
-
+
@@ -298,7 +298,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
 ### Spliting skewed shuffle partitions
-
+
@@ -320,7 +320,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
 ### Converting sort-merge join to broadcast join
 AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the adaptive broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it's better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic(if `spark.sql.adaptive.localShuffleReader.enabled` is true)
-
+
@@ -342,7 +342,7 @@ AQE converts sort-merge join to broadcast hash join when the runtime statistics
 ### Converting sort-merge join to shuffled hash join
 AQE converts sort-merge join to shuffled hash join when all post shuffle partitions are smaller than a threshold, the max threshold can see the config `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold`.
-
+
@@ -356,7 +356,7 @@ AQE converts sort-merge join to shuffled hash join when all post shuffle partiti
 ### Optimizing Skew Join
 Data skew can severely downgrade the performance of join queries. This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled` configurations are enabled.
-
+
@@ -393,7 +393,7 @@ Data skew can severely downgrade the performance of join queries. This feature d
 ### Misc
-
+
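The AQE features walked through above are all switched with plain SQL configs; a sketch that enables the ones named in the surrounding hunks (the flags exist as described, but defaults vary by version, so treat the values as illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("aqe-sketch").master("local[*]").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
spark.stop()
```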
diff --git a/docs/storage-openstack-swift.md b/docs/storage-openstack-swift.md
index 73b21a1f7c27b..5b30786bdd7f9 100644
--- a/docs/storage-openstack-swift.md
+++ b/docs/storage-openstack-swift.md
@@ -60,7 +60,7 @@ required by Keystone.
 The following table contains a list of Keystone mandatory parameters. PROVIDER can be
 any (alphanumeric) name.
-
+
diff --git a/docs/streaming-custom-receivers.md b/docs/streaming-custom-receivers.md
index 591a4415bb1a5..11a52232510fd 100644
--- a/docs/streaming-custom-receivers.md
+++ b/docs/streaming-custom-receivers.md
@@ -243,7 +243,7 @@ interval in the [Spark Streaming Programming Guide](streaming-programming-guide.
 The following table summarizes the characteristics of both types of receivers
-
+
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index f8f98ca54425d..4b93fb7c89ad1 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -433,7 +433,7 @@ Streaming core
 artifact `spark-streaming-xyz_{{site.SCALA_BINARY_VERSION}}` to the dependencies. For example,
 some of the common ones are as follows.
-
+
 Source | Artifact
 Kafka | spark-streaming-kafka-0-10_{{site.SCALA_BINARY_VERSION}}
 Kinesis | spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}} [Amazon Software License]
@@ -820,7 +820,7 @@ Similar to that of RDDs, transformations allow the data from the input DStream t
 DStreams support many of the transformations available on normal Spark RDD's.
 Some of the common ones are as follows.
-
+
@@ -1109,7 +1109,7 @@ JavaPairDStream windowedWordCounts = pairs.reduceByKeyAndWindow
 Some of the common window operations are as follows. All of these operations take the
 said two parameters - windowLength and slideInterval.
-
+
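The window-operations hunk above quotes `pairs.reduceByKeyAndWindow(...)`; a Scala sketch of the same pattern on a socket stream (host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("window-sketch")
val ssc = new StreamingContext(conf, Seconds(1))

val pairs = ssc.socketTextStream("localhost", 9999)   // placeholder source
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Reduce over the last 30 seconds of data (windowLength), sliding every 10 seconds (slideInterval).
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()

ssc.start()
ssc.awaitTermination()
```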
@@ -1280,7 +1280,7 @@ Since the output operations actually allow the transformed data to be consumed b
 they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
 Currently, the following output operations are defined:
-
+
@@ -2485,7 +2485,7 @@ enabled](#deploying-applications) and reliable receivers, there is zero data los
 The following table summarizes the semantics under failures:
-
+
diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md
index 66e6efb1c8a9f..c5ffdf025b173 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -297,7 +297,7 @@ df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
 Each row in the source has the following schema:
-
+
@@ -336,7 +336,7 @@ Each row in the source has the following schema:
 The following options must be set for the Kafka source
 for both batch and streaming queries.
-
+
@@ -368,7 +368,7 @@ for both batch and streaming queries.
 The following configurations are optional:
-
+
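The Kafka source schema and required options covered by the hunks above look like this in practice; a hedged sketch (assumes the spark-sql-kafka-0-10 artifact is on the classpath; brokers and topic are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-sketch").master("local[*]").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")   // required
  .option("subscribe", "topic1")                                // one of assign/subscribe/subscribePattern
  .load()

// key and value arrive as binary in the fixed source schema, so cast them before use.
val kv = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "topic", "partition", "offset")

val query = kv.writeStream.format("console").start()
query.awaitTermination()
```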
@@ -607,7 +607,7 @@ The caching key is built up from the following information:
 The following properties are available to configure the consumer pool:
-
+
@@ -657,7 +657,7 @@ Note that it doesn't leverage Apache Commons Pool due to the difference of chara
 The following properties are available to configure the fetched data pool:
-
+
@@ -685,7 +685,7 @@ solution to remove duplicates when reading the written data could be to introduc
 that can be used to perform de-duplication when reading.
 The Dataframe being written to Kafka should have the following columns in schema:
-
+
@@ -725,7 +725,7 @@ will be used.
 The following options must be set for the Kafka sink
 for both batch and streaming queries.
-
+
@@ -736,7 +736,7 @@ for both batch and streaming queries.
 The following configurations are optional:
-
+
@@ -912,7 +912,7 @@ It will use different Kafka producer when delegation token is renewed; Kafka pro
 The following properties are available to configure the producer pool:
-
+
@@ -1039,7 +1039,7 @@ When none of the above applies then unsecure connection assumed.
 Delegation tokens can be obtained from multiple clusters and ${cluster} is an arbitrary unique identifier
 which helps to group different configurations.
-
+
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index 76a22621a0e32..864c969fda6c2 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -545,7 +545,7 @@ checkpointed offsets after a failure. See the earlier section on
 [fault-tolerance semantics](#fault-tolerance-semantics).
 Here are the details of all the sources in Spark.
-
+
@@ -1819,7 +1819,7 @@ regarding watermark delays and whether data will be dropped or not.
 ##### Support matrix for joins in streaming queries
-
+
@@ -2307,7 +2307,7 @@ to `org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider`.
 Here are the configs regarding to RocksDB instance of the state store provider:
-
+
@@ -2474,7 +2474,7 @@ More information to be added in future releases.
 Different types of streaming queries support different output modes.
 Here is the compatibility matrix.
-
+
@@ -2613,7 +2613,7 @@ meant for debugging purposes only. See the earlier section on
 [fault-tolerance semantics](#fault-tolerance-semantics).
 Here are the details of all the sinks in Spark.
-
+
@@ -3201,7 +3201,7 @@ The trigger settings of a streaming query define the timing of streaming data pr
 the query is going to be executed as micro-batch query with a fixed batch interval or as a continuous processing query.
 Here are the different kinds of triggers that are supported.
-
+
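The triggers table touched above distinguishes default micro-batch, fixed-interval, available-now and continuous processing; a minimal sketch using the built-in rate source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-sketch").master("local[*]").getOrCreate()

val df = spark.readStream.format("rate").load()   // built-in testing source

// Fixed-interval micro-batches every 5 seconds; Trigger.AvailableNow() (Spark 3.3+)
// would instead process everything available and then stop.
val query = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
query.awaitTermination()
```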
diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md
index becdfb4b18f5d..4821f883eef9d 100644
--- a/docs/submitting-applications.md
+++ b/docs/submitting-applications.md
@@ -159,7 +159,7 @@ export HADOOP_CONF_DIR=XXX
 The master URL passed to Spark can be in one of the following formats:
-
+
 Master URL | Meaning
 local | Run Spark locally with one worker thread (i.e. no parallelism at all).
 local[K] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
diff --git a/docs/web-ui.md b/docs/web-ui.md
index 079bc6137f020..cdf62e0d8ec0b 100644
--- a/docs/web-ui.md
+++ b/docs/web-ui.md
@@ -380,7 +380,7 @@ operator shows the number of bytes written by a shuffle.
 Here is the list of SQL metrics:
-
+
 SQL metrics | Meaning | Operators
 number of output rows | the number of output rows of the operator | Aggregate operators, Join operators, Sample, Range, Scan operators, Filter, etc.
 data size | the size of broadcast/shuffled/collected data of the operator | BroadcastExchange, ShuffleExchange, Subquery