8 changes: 6 additions & 2 deletions docs/_data/menu-sql.yaml
@@ -15,6 +15,8 @@
url: sql-getting-started.html#creating-datasets
- text: Interoperating with RDDs
url: sql-getting-started.html#interoperating-with-rdds
- text: Scalar Functions
url: sql-getting-started.html#scalar-functions
- text: Aggregations
url: sql-getting-started.html#aggregations
- text: Data Sources
@@ -34,6 +36,8 @@
url: sql-data-sources-jdbc.html
- text: Avro Files
url: sql-data-sources-avro.html
- text: Whole Binary Files
url: sql-data-sources-binaryFile.html
- text: Troubleshooting
url: sql-data-sources-troubleshooting.html
- text: Performance Tuning
@@ -43,8 +47,8 @@
url: sql-performance-tuning.html#caching-data-in-memory
- text: Other Configuration Options
url: sql-performance-tuning.html#other-configuration-options
- text: Broadcast Hint for SQL Queries
url: sql-performance-tuning.html#broadcast-hint-for-sql-queries
- text: Join Strategy Hints for SQL Queries
url: sql-performance-tuning.html#join-strategy-hints-for-sql-queries
- text: Distributed SQL Engine
url: sql-distributed-sql-engine.html
subitems:
2 changes: 1 addition & 1 deletion docs/sql-migration-guide.md
@@ -519,7 +519,7 @@ license: |

Note that, for <b>DecimalType(38,0)*</b>, the table above intentionally does not cover all the other combinations of scales and precisions, because currently we only infer a decimal type for `BigInteger`/`BigInt`-like values. For example, 1.1 is inferred as double type.

- Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Broadcast Hint](sql-performance-tuning.html#broadcast-hint-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489).
- Since Spark 2.3, when either broadcast hash join or broadcast nested loop join is applicable, we prefer broadcasting the table that is explicitly specified in a broadcast hint. For details, see the section [Join Strategy Hints for SQL Queries](sql-performance-tuning.html#join-strategy-hints-for-sql-queries) and [SPARK-22489](https://issues.apache.org/jira/browse/SPARK-22489).

- Since Spark 2.3, when all inputs are binary, `functions.concat()` returns its output as binary; otherwise, it returns a string. Before Spark 2.3, it always returned a string regardless of the input types. To keep the old behavior, set `spark.sql.function.concatBinaryAsString` to `true`.
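An illustrative sketch of this behavior and of the legacy flag (an editor's example, not part of the patch; it assumes an active `spark` session, and the column names are made up):

{% highlight scala %}
import org.apache.spark.sql.functions.{col, concat}
import spark.implicits._

// Since 2.3, concatenating two binary columns yields a binary column.
val df = Seq((Array[Byte](1, 2), Array[Byte](3))).toDF("a", "b")
df.select(concat(col("a"), col("b"))).printSchema()  // prints a binary column

// Restore the pre-2.3 behavior (always return a string):
spark.conf.set("spark.sql.function.concatBinaryAsString", "true")
{% endhighlight %}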

26 changes: 19 additions & 7 deletions docs/sql-performance-tuning.md
@@ -129,26 +129,23 @@ a specific strategy may not support all join types.
<div data-lang="scala" markdown="1">

{% highlight scala %}
import org.apache.spark.sql.functions.broadcast
broadcast(spark.table("src")).join(spark.table("records"), "key").show()
spark.table("src").join(spark.table("records").hint("broadcast"), "key").show()
{% endhighlight %}

</div>

<div data-lang="java" markdown="1">

{% highlight java %}
import static org.apache.spark.sql.functions.broadcast;
broadcast(spark.table("src")).join(spark.table("records"), "key").show();
spark.table("src").join(spark.table("records").hint("broadcast"), "key").show();
{% endhighlight %}

</div>

<div data-lang="python" markdown="1">

{% highlight python %}
from pyspark.sql.functions import broadcast
broadcast(spark.table("src")).join(spark.table("records"), "key").show()
spark.table("src").join(spark.table("records").hint("broadcast"), "key").show()
{% endhighlight %}

</div>
@@ -158,7 +155,7 @@ broadcast(spark.table("src")).join(spark.table("records"), "key").show()
{% highlight r %}
src <- sql("SELECT * FROM src")
records <- sql("SELECT * FROM records")
head(join(broadcast(src), records, src$key == records$key))
Contributor: did we remove the broadcast method?

Member (Author): No, the broadcast method is still available (spark/R/pkg/R/DataFrame.R, lines 4129 to 4134 in 18431c7):

setMethod("broadcast",
          signature(x = "SparkDataFrame"),
          function(x) {
            sdf <- callJStatic("org.apache.spark.sql.functions", "broadcast", x@sdf)
            dataFrame(sdf)
          })

Contributor: so what's the recommended API to add a hint, df.hint(...) or hint(df, ...)?

Member (Author): hint(df, ...) is for R; there's no df.hint(...) in R.

Contributor: ah, never mind; didn't realize it's R code.

head(join(src, hint(records, "broadcast"), src$key == records$key))
{% endhighlight %}

</div>
@@ -172,3 +169,18 @@ SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key

</div>
</div>
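
As an illustrative follow-up (an editor's sketch, not part of the patch, reusing the `src` and `records` tables from the examples above), the physical plan can be inspected to confirm whether the broadcast hint was honored:

{% highlight scala %}
// When the hint is honored, the plan typically contains a BroadcastHashJoin
// (or broadcast nested loop join) node instead of a SortMergeJoin.
spark.table("src")
  .join(spark.table("records").hint("broadcast"), "key")
  .explain()
{% endhighlight %}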

## Coalesce Hints for SQL Queries

Coalesce hints allow Spark SQL users to control the number of output files just like
`coalesce`, `repartition` and `repartitionByRange` in the Dataset API; they can be used for
performance tuning and for reducing the number of output files. The "COALESCE" hint only has a
partition number as a parameter. The "REPARTITION" hint has a partition number, columns, or both
of them as parameters. The "REPARTITION_BY_RANGE" hint must have column names, and a partition
number is optional.

    SELECT /*+ COALESCE(3) */ * FROM t
    SELECT /*+ REPARTITION(3) */ * FROM t
    SELECT /*+ REPARTITION(c) */ * FROM t
    SELECT /*+ REPARTITION(3, c) */ * FROM t
    SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
    SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
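
For comparison, a rough sketch of the Dataset API counterparts of these hints (an editor's illustration, not part of the patch, assuming a table `t` with a column `c`):

{% highlight scala %}
import org.apache.spark.sql.functions.col

val df = spark.table("t")

// Each call returns a new Dataset with the requested partitioning.
df.coalesce(3)                      // ~ /*+ COALESCE(3) */
df.repartition(3)                   // ~ /*+ REPARTITION(3) */
df.repartition(col("c"))            // ~ /*+ REPARTITION(c) */
df.repartition(3, col("c"))         // ~ /*+ REPARTITION(3, c) */
df.repartitionByRange(col("c"))     // ~ /*+ REPARTITION_BY_RANGE(c) */
df.repartitionByRange(3, col("c"))  // ~ /*+ REPARTITION_BY_RANGE(3, c) */
{% endhighlight %}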