diff --git a/docs/building-spark.md b/docs/building-spark.md
index 90a520a62a989..23d6f49a4fe81 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -284,7 +284,7 @@ If use an individual repository or a repository on GitHub Enterprise, export bel
 
 ### Related environment variables
 
-
+
 <thead><tr><th>Variable Name</th><th>Default</th><th>Meaning</th></tr></thead>
 <tr>
   <td><code>SPARK_PROJECT_URL</code></td>
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 119412f96094d..c2145e35f7f24 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -89,7 +89,7 @@ The [job scheduling overview](job-scheduling.html) describes this in more detail
 
 The following table summarizes terms you'll see used to refer to cluster concepts:
 
-
+
 <thead><tr><th>Term</th><th>Meaning</th></tr></thead>
diff --git a/docs/configuration.md b/docs/configuration.md
index 75f597fdb4c6c..b13250a7786e6 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -135,7 +135,7 @@ of the most common options to set are:
 
 ### Application Properties
 
-
+
 <thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr></thead>
 <tr>
   <td><code>spark.app.name</code></td>
@@ -528,7 +528,7 @@ Apart from these, the following properties are also available, and may be useful
 
 ### Runtime Environment
 
-
+
@@ -915,7 +915,7 @@ Apart from these, the following properties are also available, and may be useful ### Shuffle Behavior -
Property NameDefaultMeaningSince Version
spark.driver.extraClassPath
+
@@ -1290,7 +1290,7 @@ Apart from these, the following properties are also available, and may be useful ### Spark UI -
Property NameDefaultMeaningSince Version
spark.reducer.maxSizeInFlight
+
@@ -1682,7 +1682,7 @@ Apart from these, the following properties are also available, and may be useful ### Compression and Serialization -
Property NameDefaultMeaningSince Version
spark.eventLog.logBlockUpdates.enabled
+
@@ -1880,7 +1880,7 @@ Apart from these, the following properties are also available, and may be useful ### Memory Management -
Property NameDefaultMeaningSince Version
spark.broadcast.compress
+
@@ -2005,7 +2005,7 @@ Apart from these, the following properties are also available, and may be useful ### Execution Behavior -
Property NameDefaultMeaningSince Version
spark.memory.fraction
+
@@ -2250,7 +2250,7 @@ Apart from these, the following properties are also available, and may be useful ### Executor Metrics -
Property NameDefaultMeaningSince Version
spark.broadcast.blockSize
+
@@ -2318,7 +2318,7 @@ Apart from these, the following properties are also available, and may be useful ### Networking -
Property NameDefaultMeaningSince Version
spark.eventLog.logStageExecutorMetrics
+
@@ -2481,7 +2481,7 @@ Apart from these, the following properties are also available, and may be useful ### Scheduling -
Property NameDefaultMeaningSince Version
spark.rpc.message.maxSize
+
@@ -2962,7 +2962,7 @@ Apart from these, the following properties are also available, and may be useful ### Barrier Execution Mode -
Property NameDefaultMeaningSince Version
spark.cores.max
+
@@ -3009,7 +3009,7 @@ Apart from these, the following properties are also available, and may be useful ### Dynamic Allocation -
Property NameDefaultMeaningSince Version
spark.barrier.sync.timeout
+
@@ -3151,7 +3151,7 @@ finer granularity starting from driver and executor. Take RPC module as example like shuffle, just replace "rpc" with "shuffle" in the property names except spark.{driver|executor}.rpc.netty.dispatcher.numThreads, which is only for RPC module. -
Property NameDefaultMeaningSince Version
spark.dynamicAllocation.enabled
+
@@ -3294,7 +3294,7 @@ External users can query the static sql config values via `SparkSession.conf` or ### Spark Streaming -
Property NameDefaultMeaningSince Version
spark.{driver|executor}.rpc.io.serverThreads
+
@@ -3426,7 +3426,7 @@ External users can query the static sql config values via `SparkSession.conf` or ### SparkR -
Property NameDefaultMeaningSince Version
spark.streaming.backpressure.enabled
+
@@ -3482,7 +3482,7 @@ External users can query the static sql config values via `SparkSession.conf` or ### GraphX -
Property NameDefaultMeaningSince Version
spark.r.numRBackendThreads
+
@@ -3519,7 +3519,7 @@ copy `conf/spark-env.sh.template` to create it. Make sure you make the copy exec The following variables can be set in `spark-env.sh`: -
Property NameDefaultMeaningSince Version
spark.graphx.pregel.checkpointInterval
+
@@ -3656,7 +3656,7 @@ Push-based shuffle helps improve the reliability and performance of spark shuffl ### External Shuffle service(server) side configuration options -
Environment VariableMeaning
JAVA_HOME
+
@@ -3690,7 +3690,7 @@ Push-based shuffle helps improve the reliability and performance of spark shuffl ### Client side configuration options -
Property NameDefaultMeaningSince Version
spark.shuffle.push.server.mergedShuffleFileManagerImpl
+
diff --git a/docs/css/custom.css b/docs/css/custom.css
index c4388c9650bf4..71de2b8c7803f 100644
--- a/docs/css/custom.css
+++ b/docs/css/custom.css
@@ -1111,5 +1111,18 @@ img {
 
 table {
   width: 100%;
   overflow-wrap: normal;
+  border-collapse: collapse; /* Ensures that the borders collapse into a single border */
 }
+
+table th, table td {
+  border: 1px solid #cccccc; /* Adds a border to each table header and data cell */
+  padding: 6px 13px; /* Optional: Adds padding inside each cell for better readability */
+}
+
+table tr {
+  background-color: white; /* Sets a default background color for all rows */
+}
+
+table tr:nth-child(2n) {
+  background-color: #F1F4F5; /* Sets a different background color for even rows */
+}
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index d184f4fe0257c..604b3245272fc 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -703,7 +703,7 @@ others.
 
 ### Available families
 
-
Property NameDefaultMeaningSince Version
spark.shuffle.push.enabled
+
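The `docs/css/custom.css` hunk above moves all table styling into the shared stylesheet, so the docs pages only need a bare `<table>` element. Below is a minimal sketch of such a table; it is not part of the patch, and the property names and values in it are hypothetical, used only to show how the global rules (collapsed 1px #cccccc borders, 6px/13px cell padding, white odd rows, #F1F4F5 even rows) apply without any per-table class.

```html
<!-- Illustrative only; this table is not part of the patch. -->
<!-- With the global rules in docs/css/custom.css, no class attribute is needed. -->
<table>
<thead><tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr></thead>
<tr>
  <td><code>spark.example.option</code></td> <!-- hypothetical property, for layout only -->
  <td>(none)</td>
  <td>Rows pick up borders and cell padding from the shared stylesheet.</td>
</tr>
<tr>
  <td><code>spark.example.other</code></td>  <!-- hypothetical property, for layout only -->
  <td>true</td>
  <td>Even rows get the #F1F4F5 background via the <code>tr:nth-child(2n)</code> rule.</td>
</tr>
</table>
```

Because the selectors target plain `table`, `th`, `td`, and `tr` elements, every existing docs table picks up the same look without further markup changes.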
@@ -1224,7 +1224,7 @@ All output columns are optional; to exclude an output column, set its correspond ### Input Columns -
Family
+
@@ -1251,7 +1251,7 @@ All output columns are optional; to exclude an output column, set its correspond ### Output Columns -
Param name
+
@@ -1326,7 +1326,7 @@ All output columns are optional; to exclude an output column, set its correspond #### Input Columns -
Param name
+
@@ -1353,7 +1353,7 @@ All output columns are optional; to exclude an output column, set its correspond #### Output Columns (Predictions) -
Param name
+
@@ -1407,7 +1407,7 @@ All output columns are optional; to exclude an output column, set its correspond #### Input Columns -
Param name
+
@@ -1436,7 +1436,7 @@ Note that `GBTClassifier` currently only supports binary labels. #### Output Columns (Predictions) -
Param name
+
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md index 00a156b6645ce..fdb8173ce3bbe 100644 --- a/docs/ml-clustering.md +++ b/docs/ml-clustering.md @@ -40,7 +40,7 @@ called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). ### Input Columns -
Param name
+
@@ -61,7 +61,7 @@ called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf). ### Output Columns -
Param name
+
@@ -204,7 +204,7 @@ model. ### Input Columns -
Param name
+
@@ -225,7 +225,7 @@ model. ### Output Columns -
Param name
+
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md index 10cb85e392029..b3305314abc56 100644 --- a/docs/mllib-classification-regression.md +++ b/docs/mllib-classification-regression.md @@ -26,7 +26,7 @@ classification](http://en.wikipedia.org/wiki/Multiclass_classification), and [regression analysis](http://en.wikipedia.org/wiki/Regression_analysis). The table below outlines the supported algorithms for each type of problem. -
Param name
+
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md index 174255c48b699..0d9886315e288 100644 --- a/docs/mllib-decision-tree.md +++ b/docs/mllib-decision-tree.md @@ -51,7 +51,7 @@ The *node impurity* is a measure of the homogeneity of the labels at the node. T implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance). -
Problem TypeSupported Methods
+
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md index b1006f2730db5..fdad7ae68dd49 100644 --- a/docs/mllib-ensembles.md +++ b/docs/mllib-ensembles.md @@ -191,7 +191,7 @@ Note that each loss is applicable to one of classification or regression, not bo Notation: $N$ = number of instances. $y_i$ = label of instance $i$. $x_i$ = features of instance $i$. $F(x_i)$ = model's predicted label for instance $i$. -
ImpurityTaskFormulaDescription
+
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md index f82f6a01136b9..30acc3dc634be 100644 --- a/docs/mllib-evaluation-metrics.md +++ b/docs/mllib-evaluation-metrics.md @@ -76,7 +76,7 @@ plots (recall, false positive rate) points. **Available metrics** -
LossTaskFormulaDescription
+
@@ -179,7 +179,7 @@ For this section, a modified delta function $\hat{\delta}(x)$ will prove useful $$\hat{\delta}(x) = \begin{cases}1 & \text{if $x = 0$}, \\ 0 & \text{otherwise}.\end{cases}$$ -
MetricDefinition
+
@@ -296,7 +296,7 @@ The following definition of indicator function $I_A(x)$ on a set $A$ will be nec $$I_A(x) = \begin{cases}1 & \text{if $x \in A$}, \\ 0 & \text{otherwise}.\end{cases}$$ -
MetricDefinition
+
@@ -447,7 +447,7 @@ documents, returns a relevance score for the recommended document. $$rel_D(r) = \begin{cases}1 & \text{if $r \in D$}, \\ 0 & \text{otherwise}.\end{cases}$$ -
MetricDefinition
+
@@ -553,7 +553,7 @@ variable from a number of independent variables. **Available metrics** -
MetricDefinitionNotes
+
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index b535d2de307a9..448d881f794a5 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -72,7 +72,7 @@ training error) and minimizing model complexity (i.e., to avoid overfitting). The following table summarizes the loss functions and their gradients or sub-gradients for the methods `spark.mllib` supports: -
MetricDefinition
+
@@ -105,7 +105,7 @@ The purpose of the encourage simple models and avoid overfitting. We support the following regularizers in `spark.mllib`: -
loss function $L(\wv; \x, y)$gradient or sub-gradient
+
diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md index e20d7c2fe4e17..02b5fda7a36df 100644 --- a/docs/mllib-pmml-model-export.md +++ b/docs/mllib-pmml-model-export.md @@ -28,7 +28,7 @@ license: | The table below outlines the `spark.mllib` models that can be exported to PMML and their equivalent PMML model. -
regularizer $R(\wv)$gradient or sub-gradient
+
diff --git a/docs/monitoring.md b/docs/monitoring.md index 7336be9bb67e0..056543deb0946 100644 --- a/docs/monitoring.md +++ b/docs/monitoring.md @@ -69,7 +69,7 @@ The history server can be configured as follows: ### Environment Variables -
spark.mllib modelPMML model
+
@@ -145,7 +145,7 @@ Use it with caution. Security options for the Spark History Server are covered more detail in the [Security](security.html#web-ui) page. -
Environment VariableMeaning
SPARK_DAEMON_MEMORY
+
@@ -470,7 +470,7 @@ only for applications in cluster mode, not applications in client mode. Applicat can be identified by their `[attempt-id]`. In the API listed below, when running in YARN cluster mode, `[app-id]` will actually be `[base-app-id]/[attempt-id]`, where `[base-app-id]` is the YARN application ID. -
Property Name
+
@@ -669,7 +669,7 @@ The REST API exposes the values of the Task Metrics collected by Spark executors of task execution. The metrics can be used for performance troubleshooting and workload characterization. A list of the available metrics, with a short description: -
EndpointMeaning
/applications
+
@@ -827,7 +827,7 @@ In addition, aggregated per-stage peak values of the executor memory metrics are Executor memory metrics are also exposed via the Spark metrics system based on the [Dropwizard metrics library](https://metrics.dropwizard.io/4.2.0). A list of the available metrics, with a short description: -
Spark Executor Task Metric name
+
diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md index 7764f0bbb5f8f..b92b3da09c5c5 100644 --- a/docs/rdd-programming-guide.md +++ b/docs/rdd-programming-guide.md @@ -378,7 +378,7 @@ resulting Java objects using [pickle](https://github.com/irmen/pickle/). When sa PySpark does the reverse. It unpickles Python objects into Java objects and then converts them to Writables. The following Writables are automatically converted: -
Executor Level Metric name Short description
+
@@ -954,7 +954,7 @@ and pair RDD functions doc [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)) for details. -
Writable TypePython Type
Textstr
IntWritableint
+
@@ -1069,7 +1069,7 @@ and pair RDD functions doc [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)) for details. -
TransformationMeaning
map(func)
+
@@ -1214,7 +1214,7 @@ to `persist()`. The `cache()` method is a shorthand for using the default storag which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The full set of storage levels is: -
ActionMeaning
reduce(func)
+
diff --git a/docs/running-on-kubernetes.md b/docs/running-on-kubernetes.md index 24e0575d83e4d..cc70c025792f7 100644 --- a/docs/running-on-kubernetes.md +++ b/docs/running-on-kubernetes.md @@ -592,7 +592,7 @@ See the [configuration page](configuration.html) for information on Spark config #### Spark Properties -
Storage LevelMeaning
MEMORY_ONLY
+
@@ -1658,7 +1658,7 @@ See the below table for the full list of pod specifications that will be overwri ### Pod Metadata -
Property NameDefaultMeaningSince Version
spark.kubernetes.context
+
@@ -1694,7 +1694,7 @@ See the below table for the full list of pod specifications that will be overwri ### Pod Spec -
Pod metadata keyModified valueDescription
name
+
@@ -1747,7 +1747,7 @@ See the below table for the full list of pod specifications that will be overwri The following affect the driver and executor containers. All other containers in the pod spec will be unaffected. -
Pod spec keyModified valueDescription
imagePullSecrets
+
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index 11ed7e9e87737..52afb178a5156 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -143,7 +143,7 @@ To use a custom metrics.properties for the application master and executors, upd #### Spark Properties -
Container spec keyModified valueDescription
env
+
@@ -696,7 +696,7 @@ To use a custom metrics.properties for the application master and executors, upd #### Available patterns for SHS custom executor log URL -
Property NameDefaultMeaningSince Version
spark.yarn.am.memory
+
@@ -783,7 +783,7 @@ staging directory of the Spark application. ## YARN-specific Kerberos Configuration -
PatternMeaning
{{HTTP_SCHEME}}
+
@@ -882,7 +882,7 @@ to avoid garbage collection issues during shuffle. The following extra configuration options are available when the shuffle service is running on YARN: -
Property NameDefaultMeaningSince Version
spark.kerberos.keytab
+
diff --git a/docs/security.md b/docs/security.md index 2a1105fea33fe..755c7ce8b430d 100644 --- a/docs/security.md +++ b/docs/security.md @@ -60,7 +60,7 @@ distributing the shared secret. Each application will use a unique shared secret the case of YARN, this feature relies on YARN RPC encryption being enabled for the distribution of secrets to be secure. -
Property NameDefaultMeaning
spark.yarn.shuffle.stopOnFailure
+
@@ -82,7 +82,7 @@ that any user that can list pods in the namespace where the Spark application is also see their authentication secret. Access control rules should be properly set up by the Kubernetes admin to ensure that Spark authentication is secure. -
Property NameDefaultMeaningSince Version
spark.yarn.shuffle.server.recovery.disabled
+
@@ -103,7 +103,7 @@ Kubernetes admin to ensure that Spark authentication is secure. Alternatively, one can mount authentication secrets using files and Kubernetes secrets that the user mounts into their pods. -
Property NameDefaultMeaningSince Version
spark.authenticate
+
@@ -178,7 +178,7 @@ is still required when talking to shuffle services from Spark versions older tha The following table describes the different options available for configuring this feature. -
Property NameDefaultMeaningSince Version
spark.authenticate.secret.file
+
@@ -249,7 +249,7 @@ encrypting output data generated by applications with APIs such as `saveAsHadoop The following settings cover enabling encryption for data written to disk: -
Property NameDefaultMeaningSince Version
spark.network.crypto.enabled
+
@@ -317,7 +317,7 @@ below. The following options control the authentication of Web UIs: -
Property NameDefaultMeaningSince Version
spark.io.encryption.enabled
+
@@ -421,7 +421,7 @@ servlet filters. To enable authorization in the SHS, a few extra options are used: -
Property NameDefaultMeaningSince Version
spark.ui.allowFramingFrom
+
@@ -472,7 +472,7 @@ are inherited this way, *except* for `spark.ssl.rpc.enabled` which must be expli The following table describes the SSL configuration namespaces: -
Property NameDefaultMeaningSince Version
spark.history.ui.acls.enable
+
@@ -507,7 +507,7 @@ The following table describes the SSL configuration namespaces: The full breakdown of available SSL options can be found below. The `${ns}` placeholder should be replaced with one of the above namespaces. -
Config Namespace
+
@@ -726,7 +726,7 @@ Apache Spark can be configured to include HTTP headers to aid in preventing Cros (XSS), Cross-Frame Scripting (XFS), MIME-Sniffing, and also to enforce HTTP Strict Transport Security. -
Property NameDefaultMeaningSupported Namespaces
${ns}.enabled
+
@@ -782,7 +782,7 @@ configure those ports. ## Standalone mode only -
Property NameDefaultMeaningSince Version
spark.ui.xXssProtection
+
FromToDefault PortPurposeConfiguration @@ -833,7 +833,7 @@ configure those ports. ## All cluster managers - +
FromToDefault PortPurposeConfiguration @@ -909,7 +909,7 @@ deployment-specific page for more information. The following options provides finer-grained control for this feature: - +
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 2ab68d2a8049f..7a89c8124bdfe 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -53,7 +53,7 @@ You should see the new node listed there, along with its number of CPUs and memo Finally, the following configuration options can be passed to the master and worker: -
Property NameDefaultMeaningSince Version
spark.security.credentials.${service}.enabled
+
@@ -116,7 +116,7 @@ Note that these scripts must be executed on the machine you want to run the Spar You can optionally configure the cluster further by setting environment variables in `conf/spark-env.sh`. Create this file by starting with the `conf/spark-env.sh.template`, and _copy it to all your worker machines_ for the settings to take effect. The following settings are available: -
ArgumentMeaning
-h HOST, --host HOST
+
@@ -188,7 +188,7 @@ You can optionally configure the cluster further by setting environment variable SPARK_MASTER_OPTS supports the following system properties: -
Environment VariableMeaning
SPARK_MASTER_HOST
+
@@ -386,7 +386,7 @@ SPARK_MASTER_OPTS supports the following system properties: SPARK_WORKER_OPTS supports the following system properties: -
Property NameDefaultMeaningSince Version
spark.master.ui.port
+
@@ -501,7 +501,7 @@ You can also pass an option `--total-executor-cores ` to control the n Spark applications supports the following configuration properties specific to standalone mode: -
Property NameDefaultMeaningSince Version
spark.worker.cleanup.enabled
+
@@ -551,7 +551,7 @@ via http://[host:port]/[version]/submissions/[action] where version is a protocol version, v1 as of today, and action is one of the following supported actions. -
Property NameDefault ValueMeaningSince Version
spark.standalone.submit.waitAppCompletion
+
@@ -730,7 +730,7 @@ ZooKeeper is the best way to go for production-level high availability, but if y In order to enable this recovery mode, you can set SPARK_DAEMON_JAVA_OPTS in spark-env using this configuration: -
CommandDescriptionHTTP METHODSince Version
create
+
diff --git a/docs/sparkr.md b/docs/sparkr.md index 8e6a98e40b680..a34a1200c4c00 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -77,7 +77,7 @@ sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g The following Spark driver properties can be set in `sparkConfig` with `sparkR.session` from RStudio: -
System propertyDefault ValueMeaningSince Version
spark.deploy.recoveryMode
+
@@ -588,7 +588,7 @@ The following example shows how to save/load a MLlib model by SparkR. {% include_example read_write r/ml/ml.R %} # Data type mapping between R and Spark -
Property NameProperty groupspark-submit equivalent
spark.master
+
@@ -728,7 +728,7 @@ function is masking another function. The following functions are masked by the SparkR package: -
RSpark
byte
+
diff --git a/docs/sql-data-sources-avro.md b/docs/sql-data-sources-avro.md index 72741b0e9d1c1..82f876eae2c56 100644 --- a/docs/sql-data-sources-avro.md +++ b/docs/sql-data-sources-avro.md @@ -233,7 +233,7 @@ Data source options of Avro can be set via: * the `.option` method on `DataFrameReader` or `DataFrameWriter`. * the `options` parameter in function `from_avro`. -
Masked functionHow to Access
cov in package:stats
+
@@ -331,7 +331,7 @@ Data source options of Avro can be set via: ## Configuration Configuration of Avro can be done using the `setConf` method on SparkSession or by running `SET key=value` commands using SQL. -
Property NameDefaultMeaningScopeSince Version
avroSchema
+
@@ -418,7 +418,7 @@ Submission Guide for more details. ## Supported types for Avro -> Spark SQL conversion Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.11.3/specification/#primitive-types) and [complex types](https://avro.apache.org/docs/1.11.3/specification/#complex-types) under records of Avro. -
Property NameDefaultMeaningSince Version
spark.sql.legacy.replaceDatabricksSparkAvro.enabled
+
@@ -483,7 +483,7 @@ All other union types are considered complex. They will be mapped to StructType It also supports reading the following Avro [logical types](https://avro.apache.org/docs/1.11.3/specification/#logical-types): -
Avro typeSpark SQL type
boolean
+
@@ -516,7 +516,7 @@ At the moment, it ignores docs, aliases and other properties present in the Avro ## Supported types for Spark SQL -> Avro conversion Spark supports writing of all Spark SQL types into Avro. For most types, the mapping from Spark types to Avro types is straightforward (e.g. IntegerType gets converted to int); however, there are a few special cases which are listed below: -
Avro logical typeAvro typeSpark SQL type
date
+
@@ -552,7 +552,7 @@ Spark supports writing of all Spark SQL types into Avro. For most types, the map You can also specify the whole output Avro schema with the option `avroSchema`, so that Spark SQL types can be converted into other Avro types. The following conversions are not applied by default and require user specified Avro schema: -
Spark SQL typeAvro typeAvro logical type
ByteType
+
diff --git a/docs/sql-data-sources-csv.md b/docs/sql-data-sources-csv.md index 721563d1681e5..a7bb3633d64cb 100644 --- a/docs/sql-data-sources-csv.md +++ b/docs/sql-data-sources-csv.md @@ -52,7 +52,7 @@ Data source options of CSV can be set via: * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) -
Spark SQL typeAvro typeAvro logical type
BinaryType
+
diff --git a/docs/sql-data-sources-hive-tables.md b/docs/sql-data-sources-hive-tables.md index a2c37e69f9a42..0d16272ed6f86 100644 --- a/docs/sql-data-sources-hive-tables.md +++ b/docs/sql-data-sources-hive-tables.md @@ -75,7 +75,7 @@ format("serde", "input format", "output format"), e.g. `CREATE TABLE src(id int) By default, we will read the table files as plain text. Note that, Hive storage handler is not supported yet when creating table, you can create a table using storage handler at Hive side, and use Spark SQL to read it. -
Property NameDefaultMeaningScope
sep
+
@@ -123,7 +123,7 @@ will compile against built-in Hive and use those classes for internal execution The following options can be used to configure the version of Hive that is used to retrieve metadata: -
Property NameMeaning
fileFormat
+
diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md index f96776514c672..edcdef4bf0084 100644 --- a/docs/sql-data-sources-jdbc.md +++ b/docs/sql-data-sources-jdbc.md @@ -51,7 +51,7 @@ For connection properties, users can specify the JDBC connection properties in t user and password are normally provided as connection properties for logging into the data sources. -
Property NameDefaultMeaningSince Version
spark.sql.hive.metastore.version
+
diff --git a/docs/sql-data-sources-json.md b/docs/sql-data-sources-json.md index 881a69cb1cea4..4ade5170a6d81 100644 --- a/docs/sql-data-sources-json.md +++ b/docs/sql-data-sources-json.md @@ -109,7 +109,7 @@ Data source options of JSON can be set via: * `schema_of_json` * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) -
Property NameDefaultMeaningScope
url
+
diff --git a/docs/sql-data-sources-load-save-functions.md b/docs/sql-data-sources-load-save-functions.md index 6bdb74acedea1..b42f6e84076d2 100644 --- a/docs/sql-data-sources-load-save-functions.md +++ b/docs/sql-data-sources-load-save-functions.md @@ -224,7 +224,7 @@ present. It is important to realize that these save modes do not utilize any loc atomic. Additionally, when performing an `Overwrite`, the data will be deleted before writing out the new data. -
Property NameDefaultMeaningScope
+
diff --git a/docs/sql-data-sources-orc.md b/docs/sql-data-sources-orc.md index 4e492598f595d..561f601aa4e56 100644 --- a/docs/sql-data-sources-orc.md +++ b/docs/sql-data-sources-orc.md @@ -129,7 +129,7 @@ When reading from Hive metastore ORC tables and inserting to Hive metastore ORC ### Configuration -
Scala/JavaAny LanguageMeaning
SaveMode.ErrorIfExists (default)
+
@@ -230,7 +230,7 @@ Data source options of ORC can be set via: * `DataStreamWriter` * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) -
Property NameDefaultMeaningSince Version
spark.sql.orc.impl
+
diff --git a/docs/sql-data-sources-parquet.md b/docs/sql-data-sources-parquet.md index c2af58248ea66..20f6d556cdf76 100644 --- a/docs/sql-data-sources-parquet.md +++ b/docs/sql-data-sources-parquet.md @@ -386,7 +386,7 @@ Data source options of Parquet can be set via: * `DataStreamWriter` * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) -
Property NameDefaultMeaningScope
mergeSchema
+
@@ -434,7 +434,7 @@ Other generic options can be found in +
Property NameDefaultMeaningScope
datetimeRebaseMode
diff --git a/docs/sql-data-sources-protobuf.md b/docs/sql-data-sources-protobuf.md index c8ee139e344fe..28e3e83bef7c7 100644 --- a/docs/sql-data-sources-protobuf.md +++ b/docs/sql-data-sources-protobuf.md @@ -279,7 +279,7 @@ StreamingQuery query = output Currently Spark supports reading [protobuf scalar types](https://developers.google.com/protocol-buffers/docs/proto3#scalar), [enum types](https://developers.google.com/protocol-buffers/docs/proto3#enum), [nested type](https://developers.google.com/protocol-buffers/docs/proto3#nested), and [maps type](https://developers.google.com/protocol-buffers/docs/proto3#maps) under messages of Protobuf. In addition to the these types, `spark-protobuf` also introduces support for Protobuf `OneOf` fields. which allows you to handle messages that can have multiple possible sets of fields, but only one set can be present at a time. This is useful for situations where the data you are working with is not always in the same format, and you need to be able to handle messages with different sets of fields without encountering errors. -
Property NameDefaultMeaningSince Version
spark.sql.parquet.binaryAsString
+
@@ -333,7 +333,7 @@ In addition to the these types, `spark-protobuf` also introduces support for Pro It also supports reading the following Protobuf types [Timestamp](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#timestamp) and [Duration](https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#duration) -
Protobuf typeSpark SQL type
boolean
+
@@ -351,7 +351,7 @@ It also supports reading the following Protobuf types [Timestamp](https://develo Spark supports the writing of all Spark SQL types into Protobuf. For most types, the mapping from Spark types to Protobuf types is straightforward (e.g. IntegerType gets converted to int); -
Protobuf logical typeProtobuf schemaSpark SQL type
duration
+
diff --git a/docs/sql-data-sources-text.md b/docs/sql-data-sources-text.md index bb485d29c396a..aed8a2e9942fb 100644 --- a/docs/sql-data-sources-text.md +++ b/docs/sql-data-sources-text.md @@ -47,7 +47,7 @@ Data source options of text can be set via: * `DataStreamWriter` * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) -
Spark SQL typeProtobuf type
BooleanType
+
diff --git a/docs/sql-data-sources-xml.md b/docs/sql-data-sources-xml.md index 4537ade43d2cf..c21d56bdc3719 100644 --- a/docs/sql-data-sources-xml.md +++ b/docs/sql-data-sources-xml.md @@ -52,7 +52,7 @@ Data source options of XML can be set via: * `schema_of_xml` * `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) -
Property NameDefaultMeaningScope
wholetext
+
diff --git a/docs/sql-distributed-sql-engine-spark-sql-cli.md b/docs/sql-distributed-sql-engine-spark-sql-cli.md index a67e009b9ae10..6d506cbb09c21 100644 --- a/docs/sql-distributed-sql-engine-spark-sql-cli.md +++ b/docs/sql-distributed-sql-engine-spark-sql-cli.md @@ -62,7 +62,7 @@ For example: `/path/to/spark-sql-cli.sql` equals to `file:///path/to/spark-sql-c ## Supported comment types -
Property NameDefaultMeaningScope
rowTag
+
@@ -115,7 +115,7 @@ Use `;` (semicolon) to terminate commands. Notice: ``` However, if ';' is the end of the line, it terminates the SQL statement. The example above will be terminated into `/* This is a comment contains ` and `*/ SELECT 1`, Spark will submit these two commands separated and throw parser error (`unclosed bracketed comment` and `Syntax error at or near '*/'`). -
CommentExample
simple comment
+
diff --git a/docs/sql-error-conditions-sqlstates.md b/docs/sql-error-conditions-sqlstates.md index 5529c961b3bfb..49cfb56b36626 100644 --- a/docs/sql-error-conditions-sqlstates.md +++ b/docs/sql-error-conditions-sqlstates.md @@ -33,7 +33,7 @@ Spark SQL uses the following `SQLSTATE` classes: ## Class `0A`: feature not supported -
CommandDescription
quit or exit
+
@@ -48,7 +48,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
0A000
## Class `21`: cardinality violation - +
@@ -63,7 +63,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
21000
## Class `22`: data exception - +
@@ -168,7 +168,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
22003
## Class `23`: integrity constraint violation - +
@@ -183,7 +183,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
23505
## Class `2B`: dependent privilege descriptors still exist - +
@@ -198,7 +198,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
2BP01
## Class `38`: external routine exception - +
@@ -213,7 +213,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
38000
## Class `39`: external routine invocation exception - +
@@ -228,7 +228,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
39000
## Class `42`: syntax error or access rule violation - +
@@ -648,7 +648,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
42000
## Class `46`: java ddl 1 - +
@@ -672,7 +672,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
46110
## Class `53`: insufficient resources - +
@@ -687,7 +687,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
53200
## Class `54`: program limit exceeded - +
@@ -702,7 +702,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
54000
## Class `HY`: CLI-specific condition - +
@@ -717,7 +717,7 @@ Spark SQL uses the following `SQLSTATE` classes:
SQLSTATEDescription and issuing error classes
HY008
## Class `XX`: internal error - +
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 664bccf26651b..cb4f59323c3ae 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -478,7 +478,7 @@ license: | ## Upgrading from Spark SQL 2.3 to 2.4 - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below. -
SQLSTATEDescription and issuing error classes
XX000
+
@@ -592,7 +592,7 @@ license: | - Since Spark 2.3, the Join/Filter's deterministic predicates that are after the first non-deterministic predicates are also pushed down/through the child operators, if possible. In prior Spark versions, these filters are not eligible for predicate pushdown. - Partition column inference previously found incorrect common type for different inferred types, for example, previously it ended up with double type as the common type for double type and date type. Now it finds the correct common type for such conflicts. The conflict resolution follows the table below: - +
diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md index 1467409bb500d..2dec65cc553ed 100644 --- a/docs/sql-performance-tuning.md +++ b/docs/sql-performance-tuning.md @@ -34,7 +34,7 @@ memory usage and GC pressure. You can call `spark.catalog.uncacheTable("tableNam Configuration of in-memory caching can be done using the `setConf` method on `SparkSession` or by running `SET key=value` commands using SQL. - +
@@ -62,7 +62,7 @@ Configuration of in-memory caching can be done using the `setConf` method on `Sp The following options can also be used to tune the performance of query execution. It is possible that these options will be deprecated in future release as more optimizations are performed automatically. -
Property NameDefaultMeaningSince Version
spark.sql.inMemoryColumnarStorage.compressed
+
@@ -253,7 +253,7 @@ Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that ma ### Coalescing Post Shuffle Partitions This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configurations are true. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration. -
Property NameDefaultMeaningSince Version
spark.sql.files.maxPartitionBytes
+
@@ -298,7 +298,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
Property NameDefaultMeaningSince Version
spark.sql.adaptive.coalescePartitions.enabled
### Spliting skewed shuffle partitions - +
@@ -320,7 +320,7 @@ This feature coalesces the post shuffle partitions based on the map output stati ### Converting sort-merge join to broadcast join AQE converts sort-merge join to broadcast hash join when the runtime statistics of any join side is smaller than the adaptive broadcast hash join threshold. This is not as efficient as planning a broadcast hash join in the first place, but it's better than keep doing the sort-merge join, as we can save the sorting of both the join sides, and read shuffle files locally to save network traffic(if `spark.sql.adaptive.localShuffleReader.enabled` is true) -
Property NameDefaultMeaningSince Version
spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled
+
@@ -342,7 +342,7 @@ AQE converts sort-merge join to broadcast hash join when the runtime statistics ### Converting sort-merge join to shuffled hash join AQE converts sort-merge join to shuffled hash join when all post shuffle partitions are smaller than a threshold, the max threshold can see the config `spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold`. -
Property NameDefaultMeaningSince Version
spark.sql.adaptive.autoBroadcastJoinThreshold
+
@@ -356,7 +356,7 @@ AQE converts sort-merge join to shuffled hash join when all post shuffle partiti ### Optimizing Skew Join Data skew can severely downgrade the performance of join queries. This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled` configurations are enabled. -
Property NameDefaultMeaningSince Version
spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold
+
@@ -393,7 +393,7 @@ Data skew can severely downgrade the performance of join queries. This feature d
Property NameDefaultMeaningSince Version
spark.sql.adaptive.skewJoin.enabled
### Misc - +
diff --git a/docs/storage-openstack-swift.md b/docs/storage-openstack-swift.md index 52e2d11b99126..18e879ff78da7 100644 --- a/docs/storage-openstack-swift.md +++ b/docs/storage-openstack-swift.md @@ -60,7 +60,7 @@ required by Keystone. The following table contains a list of Keystone mandatory parameters. PROVIDER can be any (alphanumeric) name. -
Property NameDefaultMeaningSince Version
spark.sql.adaptive.optimizer.excludedRules
+
diff --git a/docs/streaming-custom-receivers.md b/docs/streaming-custom-receivers.md index 591a4415bb1a5..11a52232510fd 100644 --- a/docs/streaming-custom-receivers.md +++ b/docs/streaming-custom-receivers.md @@ -243,7 +243,7 @@ interval in the [Spark Streaming Programming Guide](streaming-programming-guide. The following table summarizes the characteristics of both types of receivers -
Property NameMeaningRequired
fs.swift.service.PROVIDER.auth.url
+
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md index 7814de818d4cf..e5053f1af3626 100644 --- a/docs/streaming-programming-guide.md +++ b/docs/streaming-programming-guide.md @@ -433,7 +433,7 @@ Streaming core artifact `spark-streaming-xyz_{{site.SCALA_BINARY_VERSION}}` to the dependencies. For example, some of the common ones are as follows. -
Receiver Type
+
@@ -820,7 +820,7 @@ Similar to that of RDDs, transformations allow the data from the input DStream t DStreams support many of the transformations available on normal Spark RDD's. Some of the common ones are as follows. -
SourceArtifact
Kafka spark-streaming-kafka-0-10_{{site.SCALA_BINARY_VERSION}}
Kinesis
spark-streaming-kinesis-asl_{{site.SCALA_BINARY_VERSION}} [Amazon Software License]
+
@@ -1109,7 +1109,7 @@ JavaPairDStream windowedWordCounts = pairs.reduceByKeyAndWindow Some of the common window operations are as follows. All of these operations take the said two parameters - windowLength and slideInterval. -
TransformationMeaning
map(func)
+
@@ -1280,7 +1280,7 @@ Since the output operations actually allow the transformed data to be consumed b they trigger the actual execution of all the DStream transformations (similar to actions for RDDs). Currently, the following output operations are defined: -
TransformationMeaning
window(windowLength, slideInterval)
+
@@ -2470,7 +2470,7 @@ enabled](#deploying-applications) and reliable receivers, there is zero data los The following table summarizes the semantics under failures: -
Output OperationMeaning
print()
+
diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md index 66e6efb1c8a9f..c5ffdf025b173 100644 --- a/docs/structured-streaming-kafka-integration.md +++ b/docs/structured-streaming-kafka-integration.md @@ -297,7 +297,7 @@ df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"); Each row in the source has the following schema: -
Deployment Scenario
+
@@ -336,7 +336,7 @@ Each row in the source has the following schema: The following options must be set for the Kafka source for both batch and streaming queries. -
ColumnType
key
+
@@ -368,7 +368,7 @@ for both batch and streaming queries. The following configurations are optional: -
Optionvaluemeaning
assign
+
@@ -607,7 +607,7 @@ The caching key is built up from the following information: The following properties are available to configure the consumer pool: -
Optionvaluedefaultquery typemeaning
startingTimestamp
+
@@ -657,7 +657,7 @@ Note that it doesn't leverage Apache Commons Pool due to the difference of chara The following properties are available to configure the fetched data pool: -
Property NameDefaultMeaningSince Version
spark.kafka.consumer.cache.capacity
+
@@ -685,7 +685,7 @@ solution to remove duplicates when reading the written data could be to introduc that can be used to perform de-duplication when reading. The Dataframe being written to Kafka should have the following columns in schema: -
Property NameDefaultMeaningSince Version
spark.kafka.consumer.fetchedData.cache.timeout
+
@@ -725,7 +725,7 @@ will be used. The following options must be set for the Kafka sink for both batch and streaming queries. -
ColumnType
key (optional)
+
@@ -736,7 +736,7 @@ for both batch and streaming queries. The following configurations are optional: -
Optionvaluemeaning
kafka.bootstrap.servers
+
@@ -912,7 +912,7 @@ It will use different Kafka producer when delegation token is renewed; Kafka pro The following properties are available to configure the producer pool: -
Optionvaluedefaultquery typemeaning
topic
+
@@ -1039,7 +1039,7 @@ When none of the above applies then unsecure connection assumed. Delegation tokens can be obtained from multiple clusters and ${cluster} is an arbitrary unique identifier which helps to group different configurations. -
Property NameDefaultMeaningSince Version
spark.kafka.producer.cache.timeout
+
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 33b9453a18c37..7a4249f9d6fc6 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -545,7 +545,7 @@ checkpointed offsets after a failure. See the earlier section on [fault-tolerance semantics](#fault-tolerance-semantics). Here are the details of all the sources in Spark. -
Property NameDefaultMeaningSince Version
spark.kafka.clusters.${cluster}.auth.bootstrap.servers
+
@@ -1819,7 +1819,7 @@ regarding watermark delays and whether data will be dropped or not. ##### Support matrix for joins in streaming queries -
Source
+
@@ -2307,7 +2307,7 @@ to `org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider`. Here are the configs regarding to RocksDB instance of the state store provider: -
Left Input
+
@@ -2496,7 +2496,7 @@ More information to be added in future releases. Different types of streaming queries support different output modes. Here is the compatibility matrix. -
Config Name
+
@@ -2635,7 +2635,7 @@ meant for debugging purposes only. See the earlier section on [fault-tolerance semantics](#fault-tolerance-semantics). Here are the details of all the sinks in Spark. -
Query Type
+
@@ -3223,7 +3223,7 @@ The trigger settings of a streaming query define the timing of streaming data pr the query is going to be executed as micro-batch query with a fixed batch interval or as a continuous processing query. Here are the different kinds of triggers that are supported. -
Sink
+
diff --git a/docs/structured-streaming-state-data-source.md b/docs/structured-streaming-state-data-source.md index a9353861c532c..ae323f6b0c141 100644 --- a/docs/structured-streaming-state-data-source.md +++ b/docs/structured-streaming-state-data-source.md @@ -83,7 +83,7 @@ Dataset df = spark Each row in the source has the following schema: -
Trigger Type
+
@@ -107,7 +107,7 @@ Users are encouraged to query about the schema via df.schema() / df.printSchema( The following options must be set for the source. -
ColumnTypeNote
key
+
@@ -118,7 +118,7 @@ The following options must be set for the source. The following configurations are optional: -
Optionvaluemeaning
path
+
@@ -203,7 +203,7 @@ Dataset df = spark Each row in the source has the following schema: -
Optionvaluedefaultmeaning
batchId
+
diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md index 30da917339cc9..61517d5feacd7 100644 --- a/docs/submitting-applications.md +++ b/docs/submitting-applications.md @@ -148,7 +148,7 @@ export HADOOP_CONF_DIR=XXX The master URL passed to Spark can be in one of the following formats: -
ColumnTypeNote
operatorId
+
 <thead><tr><th>Master URL</th><th>Meaning</th></tr></thead>
 <tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
 <tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
diff --git a/docs/web-ui.md b/docs/web-ui.md
index 079bc6137f020..cdf62e0d8ec0b 100644
--- a/docs/web-ui.md
+++ b/docs/web-ui.md
@@ -380,7 +380,7 @@ operator shows the number of bytes written by a shuffle.
 
 Here is the list of SQL metrics:
 
-
+
 <thead><tr><th>SQL metrics</th><th>Meaning</th><th>Operators</th></tr></thead>
 <tr><td> number of output rows </td><td> the number of output rows of the operator </td><td> Aggregate operators, Join operators, Sample, Range, Scan operators, Filter, etc. </td></tr>
 <tr><td> data size </td><td> the size of broadcast/shuffled/collected data of the operator </td><td> BroadcastExchange, ShuffleExchange, Subquery </td></tr>