@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jul 30, 2016

What changes were proposed in this pull request?

This PR aims to achieve the following two goals in Spark SQL.

1. Generic Hint Syntax
Generic hints are parsed and then transformed into concrete hints by the SubstituteHints rule of the Analyzer, which also removes unknown hints. For example, Hint("MAPJOIN") is transformed into BroadcastJoin, and all other hints are currently removed.

SELECT /*+ MAPJOIN(t) */ * FROM t
SELECT /*+ STREAMTABLE(a,b,c) */ * FROM t
SELECT /*+ INDEX(t emp_job_ix) */ * FROM t

Unlike Hive, NEWMAPJOIN(t) is allowed, so that new Spark hints can be accepted.
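To make the generic syntax concrete, here is a minimal sketch of pulling a hint name and its parameters out of a query string. It is a toy regex-based illustration, not the parser change in this PR; `HintSyntaxSketch`, `ParsedHint`, and `parse` are hypothetical names, and accepting both commas and spaces as separators is an assumption drawn only from the three examples above.

```scala
// Toy illustration only: Spark parses hints in its SQL grammar, not with a regex.
object HintSyntaxSketch {
  // Hypothetical holder for a hint name and its parameters (no child plan here).
  final case class ParsedHint(name: String, parameters: Seq[String])

  // Matches a hint comment of the form NAME(arg, arg ...) inside the hint delimiters.
  private val HintPattern = """/\*\+\s*(\w+)\s*\(([^)]*)\)\s*\*/""".r

  // Returns the first hint found in the query text, if any.
  def parse(sql: String): Option[ParsedHint] =
    HintPattern.findFirstMatchIn(sql).map { m =>
      // Assumption: parameters are separated by commas and/or whitespace.
      val params = m.group(2).split("[,\\s]+").filter(_.nonEmpty).toSeq
      ParsedHint(m.group(1), params)
    }
}
```

Because the name is captured as a plain identifier, an unrecognized name such as NEWMAPJOIN parses the same way as MAPJOIN; deciding what to do with it is left to a later step.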

2. Broadcast Hints
The following hints are recognized. Technically, broadcast hints are matched against UnresolvedRelation to support Hive MetastoreRelation. The database_name.table_name style is not allowed in this PR.

SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id
SELECT /*+ BROADCAST(u) */ * FROM t JOIN u ON t.id = u.id
SELECT /*+ BROADCASTJOIN(u) */ * FROM t JOIN u ON t.id = u.id
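The substitution step can be pictured with a small toy model. This is a sketch under stated assumptions, not the actual SubstituteHints rule: the plan types below (`Plan`, `Relation`, `Join`, `BroadcastHint`) are simplified stand-ins defined only for this example, `Relation` plays the role of UnresolvedRelation, and treating hint names case-insensitively is an assumption.

```scala
// Toy plan model; these are NOT Spark's classes, only stand-ins for illustration.
object SubstituteHintsSketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan // stands in for UnresolvedRelation
  final case class Join(left: Plan, right: Plan) extends Plan
  final case class Hint(name: String, parameters: Seq[String], child: Plan) extends Plan
  final case class BroadcastHint(child: Plan) extends Plan

  // The three hint names listed above as recognized broadcast hints.
  private val BroadcastNames = Set("MAPJOIN", "BROADCAST", "BROADCASTJOIN")

  // Wrap every relation whose bare name is listed in `tables` with a broadcast marker.
  private def mark(plan: Plan, tables: Set[String]): Plan = plan match {
    case r @ Relation(n) if tables.contains(n) => BroadcastHint(r)
    case Join(l, r)                            => Join(mark(l, tables), mark(r, tables))
    case other                                 => other
  }

  // Turn known hints into broadcast markers and drop unknown hints.
  def substitute(plan: Plan): Plan = plan match {
    case Hint(n, params, child) if BroadcastNames(n.toUpperCase) =>
      mark(substitute(child), params.toSet)
    case Hint(_, _, child)    => substitute(child) // unknown hint: removed
    case Join(l, r)           => Join(substitute(l), substitute(r))
    case BroadcastHint(child) => BroadcastHint(substitute(child))
    case r: Relation          => r
  }
}
```

In this toy, `substitute(Hint("MAPJOIN", Seq("u"), Join(Relation("t"), Relation("u"))))` marks only `u`, which corresponds to the BuildRight plans shown in the examples that follow.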

Examples

scala> spark.range(1000000000).createOrReplaceTempView("t")
scala> spark.range(1000000000).createOrReplaceTempView("u")

scala> sql("SELECT * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#4L], Inner
:- *Sort [id#0L ASC], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *Range (0, 1000000000, splits=8)
+- *Sort [id#4L ASC], false, 0
   +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)

scala> sql("SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
:  +- *Range (0, 1000000000, splits=8)
+- *Range (0, 1000000000, splits=8)

scala> sql("SELECT /*+ MAPJOIN(u) */ * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
:- *Range (0, 1000000000, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Range (0, 1000000000, splits=8)

scala> sql("CREATE TABLE hive_t(id INT)")
res5: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE TABLE hive_u(id INT)")
res6: org.apache.spark.sql.DataFrame = []

scala> sql("SELECT /*+ MAPJOIN(hive_u) */ * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#28], [id#29], Inner, BuildRight
:- *Filter isnotnull(id#28)
:  +- HiveTableScan [id#28], MetastoreRelation default, hive_t
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- *Filter isnotnull(id#29)
      +- HiveTableScan [id#29], MetastoreRelation default, hive_u

scala> sql("SELECT * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
== Physical Plan ==
*SortMergeJoin [id#36], [id#37], Inner
:- *Sort [id#36 ASC], false, 0
:  +- Exchange hashpartitioning(id#36, 200)
:     +- *Filter isnotnull(id#36)
:        +- HiveTableScan [id#36], MetastoreRelation default, hive_t
+- *Sort [id#37 ASC], false, 0
   +- Exchange hashpartitioning(id#37, 200)
      +- *Filter isnotnull(id#37)
         +- HiveTableScan [id#37], MetastoreRelation default, hive_u

The many previous discussions on this issue are in #14132.

How was this patch tested?

Pass the Jenkins tests with new testcases.

@SparkQA

SparkQA commented Jul 31, 2016

Test build #63048 has finished for PR 14426 at commit ee8bb14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@rxin
Contributor

rxin commented Jul 31, 2016

Why create a new pull request? All the discussions were in the other pull request.

@dongjoon-hyun
Member Author

Oh, it's just because the previous PR page, with 363 comments, has become too slow to view on my laptop. Since the most recent discussion ended two days ago with implementations, I think it's safe and better to have it reviewed again here. I can move a summary of the decisions from the previous discussion here, too.

@SparkQA

SparkQA commented Aug 8, 2016

Test build #63348 has finished for PR 14426 at commit 79c91be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Aug 12, 2016

Test build #63639 has finished for PR 14426 at commit 67330a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Aug 14, 2016

Test build #63756 has finished for PR 14426 at commit ff4f428.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@dongjoon-hyun
Member Author

Hi, @cloud-fan .
Could you review this PR about HINT when you have some time?

@SparkQA

SparkQA commented Aug 19, 2016

Test build #64042 has finished for PR 14426 at commit d722be2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Aug 20, 2016

Test build #64108 has finished for PR 14426 at commit 71954e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@dongjoon-hyun
Member Author

Resolve conflicts.

@SparkQA

SparkQA commented Aug 20, 2016

Test build #64132 has finished for PR 14426 at commit 351dfef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@watermen
Contributor

watermen commented Aug 22, 2016

Can this PR support multiple JOINs (SELECT * FROM t1 JOIN t2 ON t1.key = t2.key JOIN t3 ON t1.key = t3.key)?

@dongjoon-hyun
Member Author

Oh, sorry for the late response, @watermen. I missed your message. Yes, this supports multiple joins.

@dongjoon-hyun
Member Author

Hi, @watermen. You can try it like this.

scala> spark.range(1000000000).createOrReplaceTempView("t1")
scala> spark.range(1000000000).createOrReplaceTempView("t2")
scala> spark.range(1000000000).createOrReplaceTempView("t3")
scala> sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#8L], Inner
:- *SortMergeJoin [id#0L], [id#4L], Inner
:  :- *Sort [id#0L ASC], false, 0
:  :  +- Exchange hashpartitioning(id#0L, 200)
:  :     +- *Range (0, 1000000000, splits=8)
:  +- *Sort [id#4L ASC], false, 0
:     +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
+- *Sort [id#8L ASC], false, 0
   +- ReusedExchange [id#8L], Exchange hashpartitioning(id#0L, 200)

scala> sql("SELECT /*+ MAPJOIN(t1) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#8L], Inner
:- *Sort [id#0L ASC], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *BroadcastHashJoin [id#0L], [id#4L], Inner, BuildLeft
:        :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
:        :  +- *Range (0, 1000000000, splits=8)
:        +- *Range (0, 1000000000, splits=8)
+- *Sort [id#8L ASC], false, 0
   +- Exchange hashpartitioning(id#8L, 200)
      +- *Range (0, 1000000000, splits=8)

scala> sql("SELECT /*+ MAPJOIN(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#8L], Inner
:- *Sort [id#0L ASC], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
:        :- *Range (0, 1000000000, splits=8)
:        +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
:           +- *Range (0, 1000000000, splits=8)
+- *Sort [id#8L ASC], false, 0
   +- Exchange hashpartitioning(id#8L, 200)
      +- *Range (0, 1000000000, splits=8)

scala> sql("SELECT /*+ MAPJOIN(t3) */ * FROM t1 JOIN t2 ON t1.id = t2.id JOIN t3 ON t1.id = t3.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#8L], Inner, BuildRight
:- *SortMergeJoin [id#0L], [id#4L], Inner
:  :- *Sort [id#0L ASC], false, 0
:  :  +- Exchange hashpartitioning(id#0L, 200)
:  :     +- *Range (0, 1000000000, splits=8)
:  +- *Sort [id#4L ASC], false, 0
:     +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Range (0, 1000000000, splits=8)

@SparkQA

SparkQA commented Aug 24, 2016

Test build #64360 has finished for PR 14426 at commit ef1abc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Aug 27, 2016

Test build #64529 has finished for PR 14426 at commit 6461e41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Aug 29, 2016

Test build #64574 has finished for PR 14426 at commit 377b625.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Sep 2, 2016

Test build #64875 has finished for PR 14426 at commit 2cc19b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Sep 6, 2016

Test build #64966 has finished for PR 14426 at commit 42248a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Sep 7, 2016

Test build #65023 has finished for PR 14426 at commit 0d19e28.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Sep 9, 2016

Test build #65142 has finished for PR 14426 at commit 01105f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Sep 12, 2016

Test build #65249 has finished for PR 14426 at commit 47d98e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@dongjoon-hyun
Member Author

Rebased to resolve conflicts.

@SparkQA

SparkQA commented Sep 15, 2016

Test build #65432 has finished for PR 14426 at commit b1f314c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Sep 20, 2016

Test build #65663 has finished for PR 14426 at commit 5ac4457.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Sep 26, 2016

Test build #65927 has finished for PR 14426 at commit e9ba01e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Oct 1, 2016

Test build #66220 has finished for PR 14426 at commit b180019.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@dongjoon-hyun
Member Author

Hi, @rxin.
Could you give me some guidance on this Broadcast Hint for SQL Queries if you have some time?

@dongjoon-hyun
Member Author

Hi, @gatorsmile .
Could you review this PR when you have some time?

@SparkQA

SparkQA commented Oct 7, 2016

Test build #66515 has finished for PR 14426 at commit 5290081.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@dongjoon-hyun
Member Author

Rebased to resolve the conflicts.

@SparkQA

SparkQA commented Oct 8, 2016

Test build #66546 has finished for PR 14426 at commit 778cede.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66619 has finished for PR 14426 at commit 57adfd3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Oct 19, 2016

Test build #67189 has finished for PR 14426 at commit 7483889.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@SparkQA

SparkQA commented Oct 24, 2016

Test build #67425 has finished for PR 14426 at commit dfe6a3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@dongjoon-hyun
Member Author

Resolve the conflicts.

@SparkQA

SparkQA commented Nov 3, 2016

Test build #68088 has finished for PR 14426 at commit 539782d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@dongjoon-hyun
Member Author

Retest this please

@SparkQA

SparkQA commented Nov 4, 2016

Test build #68092 has finished for PR 14426 at commit 539782d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@rxin
Contributor

rxin commented Nov 4, 2016

Mind closing this for now? Let's open it once we've done the view canonicalization work. This pr will be much simpler. We should do both in 2.2.

@dongjoon-hyun
Member Author

Thank you for the guidance.

@rxin
Contributor

rxin commented Feb 14, 2017

@dongjoon-hyun do you have time to update the pull request now that the view canonicalization work is done? Basically we can remove all the SQL generation stuff.

rxin added a commit to rxin/spark that referenced this pull request Feb 14, 2017
[SPARK-16475][SQL] Broadcast Hint for SQL Queries
@rxin
Contributor

rxin commented Feb 14, 2017

Actually I have some time. I will submit a pr based on this.

@dongjoon-hyun
Member Author

Oh.

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 14, 2017
## What changes were proposed in this pull request?
This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure.

The hint syntax looks like the following:
```
SELECT /*+ BROADCAST(t) */ * FROM t
```

For the broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that matches the specified name.

The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go past any existing broadcast hints or subquery aliases. This rule happens before common table expressions.
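The traversal described above can be sketched on a toy plan tree. This is only an illustration of the stated stopping conditions, not the rule from the commit: `BroadcastTraversalSketch`, its plan types, and `applyBroadcast` are hypothetical names defined for this example, and common table expressions are left out.

```scala
// Toy plan model; these are NOT Spark's classes, only stand-ins for illustration.
object BroadcastTraversalSketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class SubqueryAlias(alias: String, child: Plan) extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan
  final case class BroadcastHint(child: Plan) extends Plan

  // Walk down the plan and wrap matching relations or subqueries with a hint node.
  def applyBroadcast(plan: Plan, aliases: Set[String]): Plan = plan match {
    case r @ Relation(n) if aliases(n)         => BroadcastHint(r)
    case s @ SubqueryAlias(a, _) if aliases(a) => BroadcastHint(s)
    // Stop here: a relation under a different alias is "aliased differently".
    case s: SubqueryAlias                      => s
    // Stop here: do not traverse past an existing broadcast hint.
    case b: BroadcastHint                      => b
    case Join(l, r)                            => Join(applyBroadcast(l, aliases), applyBroadcast(r, aliases))
    case r: Relation                           => r
  }
}
```

With this model, a hint naming `t` leaves `SubqueryAlias("x", Relation("t"))` untouched, while a hint naming `x` wraps the whole aliased subquery.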

Note that there was an earlier patch in apache#14426. This is a rewrite of that patch, with different semantics and simpler test cases.

## How was this patch tested?
Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite.

Author: Reynold Xin <rxin@databricks.com>
Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#16925 from rxin/SPARK-16475-broadcast-hint.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
@dongjoon-hyun dongjoon-hyun deleted the SPARK-16475-HINT branch January 7, 2019 07:03