[SPARK-16475][SQL] Broadcast Hint for SQL Queries #14426

dongjoon-hyun · 2016-07-30T23:57:50Z

What changes were proposed in this pull request?

This PR aims to achieve the following two goals in Spark SQL.

1. Generic Hint Syntax
Generic hints are parsed and then transformed into concrete hints by the SubstituteHints rule of the Analyzer. Unknown hints are removed. For example, Hint("MAPJOIN") is currently transformed into BroadcastJoin, and all other hints are removed.

SELECT /*+ MAPJOIN(t) */ * FROM t
SELECT /*+ STREAMTABLE(a,b,c) */ * FROM t
SELECT /*+ INDEX(t emp_job_ix) */ * FROM t

Unlike Hive, an arbitrary hint name such as NEWMAPJOIN(t) is allowed, so that new Spark hints can be accepted.
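The substitution step described above can be sketched with a toy plan model. This is only an illustration under stated assumptions: the `Plan`, `Relation`, `Join`, and `BroadcastHint` types and the `substitute` and `mark` functions below are hypothetical stand-ins written for this example, not Spark's Catalyst classes or the code in this PR, and table names are compared by exact string match.

```scala
// Toy model of generic hint substitution. All types here are illustrative
// stand-ins (assumptions), not Spark's actual Catalyst classes.
object HintSubstitutionSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Join(left: Plan, right: Plan) extends Plan
  // Generic hint as a parser might produce it: a name plus parameters.
  case class Hint(name: String, parameters: Seq[String], child: Plan) extends Plan
  // Concrete hint marking a subtree as a broadcast candidate.
  case class BroadcastHint(child: Plan) extends Plan

  // Hint names treated as broadcast hints (the three listed in this PR).
  private val broadcastNames = Set("MAPJOIN", "BROADCAST", "BROADCASTJOIN")

  // Wrap every relation whose name is listed in `tables`.
  private def mark(plan: Plan, tables: Set[String]): Plan = plan match {
    case r @ Relation(n) if tables.contains(n) => BroadcastHint(r)
    case Join(l, r) => Join(mark(l, tables), mark(r, tables))
    case other => other
  }

  // Replace known hints with concrete ones and drop unknown hints.
  def substitute(plan: Plan): Plan = plan match {
    case Hint(name, params, child) if broadcastNames.contains(name.toUpperCase) =>
      mark(substitute(child), params.toSet)
    case Hint(_, _, child) => substitute(child) // unknown hint: removed
    case Join(l, r) => Join(substitute(l), substitute(r))
    case BroadcastHint(c) => BroadcastHint(substitute(c))
    case other => other
  }
}
```

In this sketch, `substitute(Hint("MAPJOIN", Seq("u"), Join(Relation("t"), Relation("u"))))` wraps only `u` in a `BroadcastHint`, while a hint such as `Hint("INDEX", Seq("t", "emp_job_ix"), Relation("t"))` is simply dropped and its child returned unchanged.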

2. Broadcast Hints
The following hints are recognized. Technically, broadcast hints are matched against UnresolvedRelation so that Hive MetastoreRelation is supported as well. Qualified names of the form database_name.table_name are not allowed in this PR.

SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id
SELECT /*+ BROADCAST(u) */ * FROM t JOIN u ON t.id = u.id
SELECT /*+ BROADCASTJOIN(u) */ * FROM t JOIN u ON t.id = u.id

Examples

scala> spark.range(1000000000).createOrReplaceTempView("t")
scala> spark.range(1000000000).createOrReplaceTempView("u")

scala> sql("SELECT * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#4L], Inner
:- *Sort [id#0L ASC], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *Range (0, 1000000000, splits=8)
+- *Sort [id#4L ASC], false, 0
   +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)

scala> sql("SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
:  +- *Range (0, 1000000000, splits=8)
+- *Range (0, 1000000000, splits=8)

scala> sql("SELECT /*+ MAPJOIN(u) */ * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
:- *Range (0, 1000000000, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Range (0, 1000000000, splits=8)

scala> sql("CREATE TABLE hive_t(id INT)")
res5: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE TABLE hive_u(id INT)")
res6: org.apache.spark.sql.DataFrame = []

scala> sql("SELECT /*+ MAPJOIN(hive_u) */ * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#28], [id#29], Inner, BuildRight
:- *Filter isnotnull(id#28)
:  +- HiveTableScan [id#28], MetastoreRelation default, hive_t
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- *Filter isnotnull(id#29)
      +- HiveTableScan [id#29], MetastoreRelation default, hive_u

scala> sql("SELECT * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
== Physical Plan ==
*SortMergeJoin [id#36], [id#37], Inner
:- *Sort [id#36 ASC], false, 0
:  +- Exchange hashpartitioning(id#36, 200)
:     +- *Filter isnotnull(id#36)
:        +- HiveTableScan [id#36], MetastoreRelation default, hive_t
+- *Sort [id#37 ASC], false, 0
   +- Exchange hashpartitioning(id#37, 200)
      +- *Filter isnotnull(id#37)
         +- HiveTableScan [id#37], MetastoreRelation default, hive_u

The extensive previous discussion of this issue is in #14132.

How was this patch tested?

Pass the Jenkins tests with new test cases.

SparkQA · 2016-07-31T01:58:49Z

Test build #63048 has finished for PR 14426 at commit ee8bb14.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

rxin · 2016-07-31T05:44:15Z

Why create a new pull request? All the discussion was in the other pull request.

dongjoon-hyun · 2016-07-31T17:00:09Z

Oh, it's just because the previous PR page, with 363 comments, has become too slow to view on my laptop. Since the most recent discussion ended two days ago with an implementation, I think it's safe, and better, to have it reviewed again here. I can also move a summary of the decisions from the previous discussion here.

SparkQA · 2016-08-08T08:31:03Z

Test build #63348 has finished for PR 14426 at commit 79c91be.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-08-12T00:15:49Z

Test build #63639 has finished for PR 14426 at commit 67330a7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-08-14T21:44:58Z

Test build #63756 has finished for PR 14426 at commit ff4f428.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

dongjoon-hyun · 2016-08-15T05:22:59Z

Hi, @cloud-fan .
Could you review this PR about HINT when you have some time?

SparkQA · 2016-08-19T04:37:30Z

Test build #64042 has finished for PR 14426 at commit d722be2.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-08-20T00:06:44Z

Test build #64108 has finished for PR 14426 at commit 71954e2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

dongjoon-hyun · 2016-08-20T06:30:32Z

Resolve conflicts.

SparkQA · 2016-08-20T08:46:01Z

Test build #64132 has finished for PR 14426 at commit 351dfef.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

watermen · 2016-08-22T01:54:14Z

Can this PR support multiple JOINs (e.g., SELECT * FROM t1 JOIN t2 ON t1.key = t2.key JOIN t3 ON t1.key = t3.key)?

dongjoon-hyun · 2016-08-23T16:10:57Z

Oh, sorry for the late response, @watermen. I missed your message. Yes, this supports multiple joins.

dongjoon-hyun · 2016-08-23T16:30:00Z

Hi, @watermen. You can try it like this.

scala> spark.range(1000000000).createOrReplaceTempView("t1")
scala> spark.range(1000000000).createOrReplaceTempView("t2")
scala> spark.range(1000000000).createOrReplaceTempView("t3")
scala> sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#8L], Inner
:- *SortMergeJoin [id#0L], [id#4L], Inner
:  :- *Sort [id#0L ASC], false, 0
:  :  +- Exchange hashpartitioning(id#0L, 200)
:  :     +- *Range (0, 1000000000, splits=8)
:  +- *Sort [id#4L ASC], false, 0
:     +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
+- *Sort [id#8L ASC], false, 0
   +- ReusedExchange [id#8L], Exchange hashpartitioning(id#0L, 200)

scala> sql("SELECT /*+ MAPJOIN(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#8L], Inner
:- *Sort [id#0L ASC], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *BroadcastHashJoin [id#0L], [id#4L], Inner, BuildLeft
:        :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
:        :  +- *Range (0, 1000000000, splits=8)
:        +- *Range (0, 1000000000, splits=8)
+- *Sort [id#8L ASC], false, 0
   +- Exchange hashpartitioning(id#8L, 200)
      +- *Range (0, 1000000000, splits=8)

scala> sql("SELECT /*+ MAPJOIN(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#8L], Inner
:- *Sort [id#0L ASC], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
:        :- *Range (0, 1000000000, splits=8)
:        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
:           +- *Range (0, 1000000000, splits=8)
+- *Sort [id#8L ASC], false, 0
   +- Exchange hashpartitioning(id#8L, 200)
      +- *Range (0, 1000000000, splits=8)

scala> sql("SELECT /*+ MAPJOIN(t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#8L], Inner, BuildRight
:- *SortMergeJoin [id#0L], [id#4L], Inner
:  :- *Sort [id#0L ASC], false, 0
:  :  +- Exchange hashpartitioning(id#0L, 200)
:  :     +- *Range (0, 1000000000, splits=8)
:  +- *Sort [id#4L ASC], false, 0
:     +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Range (0, 1000000000, splits=8)

SparkQA · 2016-08-24T17:40:50Z

Test build #64360 has finished for PR 14426 at commit ef1abc7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-08-27T11:53:11Z

Test build #64529 has finished for PR 14426 at commit 6461e41.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-08-29T19:08:34Z

Test build #64574 has finished for PR 14426 at commit 377b625.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-09-02T22:52:58Z

Test build #64875 has finished for PR 14426 at commit 2cc19b3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-09-06T05:25:47Z

Test build #64966 has finished for PR 14426 at commit 42248a1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-09-07T07:02:25Z

Test build #65023 has finished for PR 14426 at commit 0d19e28.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-09-09T08:43:15Z

Test build #65142 has finished for PR 14426 at commit 01105f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-09-12T10:48:31Z

Test build #65249 has finished for PR 14426 at commit 47d98e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

dongjoon-hyun · 2016-09-15T07:48:17Z

Rebased to resolve conflicts.

SparkQA · 2016-09-15T09:54:26Z

Test build #65432 has finished for PR 14426 at commit b1f314c.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-09-20T18:20:49Z

Test build #65663 has finished for PR 14426 at commit 5ac4457.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-09-26T20:38:32Z

Test build #65927 has finished for PR 14426 at commit e9ba01e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-10-01T17:57:06Z

Test build #66220 has finished for PR 14426 at commit b180019.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

dongjoon-hyun · 2016-10-01T19:01:50Z

Hi, @rxin .
Could you give me some guidance on this Broadcast Hint for SQL Queries PR when you have some time?

dongjoon-hyun · 2016-10-07T18:12:21Z

Hi, @gatorsmile .
Could you review this PR when you have some time?

SparkQA · 2016-10-07T20:41:52Z

Test build #66515 has finished for PR 14426 at commit 5290081.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

dongjoon-hyun · 2016-10-08T00:04:26Z

Rebased to resolve the conflicts.

SparkQA · 2016-10-08T02:15:47Z

Test build #66546 has finished for PR 14426 at commit 778cede.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-10-10T04:42:58Z

Test build #66619 has finished for PR 14426 at commit 57adfd3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-10-19T12:58:31Z

Test build #67189 has finished for PR 14426 at commit 7483889.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

SparkQA · 2016-10-24T05:41:57Z

Test build #67425 has finished for PR 14426 at commit dfe6a3e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

dongjoon-hyun · 2016-11-03T21:06:54Z

Resolve the conflicts.

SparkQA · 2016-11-03T22:06:29Z

Test build #68088 has finished for PR 14426 at commit 539782d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

dongjoon-hyun · 2016-11-03T23:00:12Z

Retest this please

SparkQA · 2016-11-04T01:44:24Z

Test build #68092 has finished for PR 14426 at commit 539782d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

rxin · 2016-11-04T07:36:53Z

Mind closing this for now? Let's reopen it once we've done the view canonicalization work. This PR will be much simpler then. We should do both in 2.2.

dongjoon-hyun · 2016-11-04T15:21:47Z

Thank you for the guidance.

rxin · 2017-02-14T11:40:03Z

@dongjoon-hyun do you have time to update the pull request now that the view canonicalization work is done? Basically we can remove all the SQL generation stuff.

[SPARK-16475][SQL] Broadcast Hint for SQL Queries

rxin · 2017-02-14T12:17:16Z

Actually I have some time. I will submit a pr based on this.

dongjoon-hyun · 2017-02-14T16:40:34Z

Oh.

## What changes were proposed in this pull request?

This pull request introduces a simple hint infrastructure to SQL and implements the broadcast join hint using that infrastructure.

The hint syntax looks like the following:

```
SELECT /*+ BROADCAST(t) */ * FROM t
```

For the broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that matches the specified name.

The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go beyond any existing broadcast hints or subquery aliases.

This rule happens before common table expressions.

Note that there was an earlier patch in apache#14426. This is a rewrite of that patch, with different semantics and simpler test cases.

## How was this patch tested?

Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for the parser change (in PlanParserSuite). Also added an end-to-end test case in BroadcastSuite.

Author: Reynold Xin <rxin@databricks.com>
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#16925 from rxin/SPARK-16475-broadcast-hint.
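The traversal semantics described in this commit message can be sketched on a toy plan tree. This is a minimal sketch under assumptions, not the code from apache#16925: the `Plan` hierarchy and the `applyBroadcast` function are hypothetical names invented for this example, common table expressions are left out, and names are compared by exact string match for simplicity.

```scala
// Toy model of broadcast-hint resolution by recursive traversal.
// All types and names are hypothetical stand-ins, not Spark's classes.
object BroadcastTraversalSketch {
  sealed trait Plan
  case class Relation(table: String, alias: Option[String] = None) extends Plan
  case class SubqueryAlias(alias: String, child: Plan) extends Plan
  case class Join(left: Plan, right: Plan) extends Plan
  case class BroadcastHint(child: Plan) extends Plan

  def applyBroadcast(plan: Plan, names: Set[String]): Plan = plan match {
    // A relation matches by its alias if it has one, otherwise by table name,
    // so a relation that is "aliased differently" does not match.
    case r @ Relation(table, alias) if names.contains(alias.getOrElse(table)) =>
      BroadcastHint(r)
    // A subquery matches by its alias.
    case s @ SubqueryAlias(alias, _) if names.contains(alias) =>
      BroadcastHint(s)
    // Stop here: do not descend past existing hints or other subquery aliases.
    case _: BroadcastHint | _: SubqueryAlias => plan
    // Otherwise keep walking down both sides of a join.
    case Join(l, r) => Join(applyBroadcast(l, names), applyBroadcast(r, names))
    case other => other
  }
}
```

With the plan `Join(Relation("t", Some("a")), SubqueryAlias("s", Relation("u")))`, the name `a` marks the left relation and `s` marks the whole subquery, whereas `t` (aliased differently) and `u` (hidden behind the subquery alias) leave the plan unchanged in this sketch.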

dongjoon-hyun mentioned this pull request Jul 30, 2016

[SPARK-16475][SQL] Broadcast Hint for SQL Queries #14132

Closed

ericl mentioned this pull request Sep 3, 2016

[SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query #14690

Closed

[SPARK-16475][SQL] Broadcast Hint for SQL Queries

539782d

dongjoon-hyun closed this Nov 4, 2016

rxin added a commit to rxin/spark that referenced this pull request Feb 14, 2017

Merge pull request apache#14426 from dongjoon-hyun/SPARK-16475-HINT

318bc03

[SPARK-16475][SQL] Broadcast Hint for SQL Queries

rxin mentioned this pull request Feb 14, 2017

[SPARK-16475][SQL] Broadcast hint for SQL Queries #16925

Closed

dongjoon-hyun deleted the SPARK-16475-HINT branch January 7, 2019 07:03
