Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Jul 11, 2016

What changes were proposed in this pull request?

This PR aims to achieve the following two goals in Spark SQL.

1. Generic Hint Syntax
The generic hints are parsed and transformed into concrete hints by SubstituteHints of Analyzer. The unknown hints are removed, too. For example, Hint("MAPJOIN") is transformed into BroadcastJoin and other hints are removed currently.

SELECT /*+ MAPJOIN(t) */ * FROM t
SELECT /*+ STREAMTABLE(a,b,c) */ * FROM t
SELECT /*+ INDEX(t emp_job_ix) */ * FROM t

Unlink Hive, NEWMAPJOIN(t) is allowed for accepting new Spark Hints.

2. Broadcast Hints
The followings are recognized. Technically, broadcast hints are matched UnresolvedRelation to support Hive MetastoreRelation. The style of database_name.table_name is not allowed in this PR.

SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id
SELECT /*+ BROADCAST(u) */ * FROM t JOIN u ON t.id = u.id
SELECT /*+ BROADCASTJOIN(u) */ * FROM t JOIN u ON t.id = u.id

Examples

scala> spark.range(1000000000).createOrReplaceTempView("t")
scala> spark.range(1000000000).createOrReplaceTempView("u")

scala> sql("SELECT * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*SortMergeJoin [id#0L], [id#4L], Inner
:- *Sort [id#0L ASC], false, 0
:  +- Exchange hashpartitioning(id#0L, 200)
:     +- *Range (0, 1000000000, splits=8)
+- *Sort [id#4L ASC], false, 0
   +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200)

scala> sql("SELECT /*+ MAPJOIN(t) */ * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
:  +- *Range (0, 1000000000, splits=8)
+- *Range (0, 1000000000, splits=8)

scala> sql("SELECT /*+ MAPJOIN(u) */ * FROM t JOIN u ON t.id = u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
:- *Range (0, 1000000000, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Range (0, 1000000000, splits=8)

scala> sql("CREATE TABLE hive_t(id INT)")
res5: org.apache.spark.sql.DataFrame = []

scala> sql("CREATE TABLE hive_u(id INT)")
res6: org.apache.spark.sql.DataFrame = []

scala> sql("SELECT /*+ MAPJOIN(hive_u) */ * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
== Physical Plan ==
*BroadcastHashJoin [id#28], [id#29], Inner, BuildRight
:- *Filter isnotnull(id#28)
:  +- HiveTableScan [id#28], MetastoreRelation default, hive_t
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- *Filter isnotnull(id#29)
      +- HiveTableScan [id#29], MetastoreRelation default, hive_u

scala> sql("SELECT * FROM hive_t JOIN hive_u ON hive_t.id = hive_u.id").explain
== Physical Plan ==
*SortMergeJoin [id#36], [id#37], Inner
:- *Sort [id#36 ASC], false, 0
:  +- Exchange hashpartitioning(id#36, 200)
:     +- *Filter isnotnull(id#36)
:        +- HiveTableScan [id#36], MetastoreRelation default, hive_t
+- *Sort [id#37 ASC], false, 0
   +- Exchange hashpartitioning(id#37, 200)
      +- *Filter isnotnull(id#37)
         +- HiveTableScan [id#37], MetastoreRelation default, hive_u

How was this patch tested?

Pass the Jenkins tests with new testcases.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16475][SQL] Broadcast Hint for SQL Queries [SPARK-16475][SQL][WIP] Broadcast Hint for SQL Queries Jul 11, 2016
@dongjoon-hyun
Copy link
Member Author

cc @rxin and @hvanhovell .

@dongjoon-hyun
Copy link
Member Author

If the direction is right, I can move on adding BROADCAST JOIN syntax.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this because the BRACKETED_COMMENT rule is now expecting at least one character?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Could you give me some workaround advice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My advice would be to add the HINT_PREFIX rule ('/*+')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try to add this as a case (| '/**/' -> channel(HIDDEN)) to the BRACKETED_COMMENT rule?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. It seems we can not do that due to channel(HIDDEN).

->command in lexer rule BRACKETED_COMMENT must be last element of single outermost alt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok nvm....

@hvanhovell
Copy link
Contributor

This looks pretty good!

@dongjoon-hyun
Copy link
Member Author

Thank you for quick review! I'll let you know after updating.

@SparkQA
Copy link

SparkQA commented Jul 11, 2016

Test build #62082 has finished for PR 14132 at commit 6bd704e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Oops. Five errors, too. I'll fix these tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we trying to support here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can also do this (is easier in the AST builder): | hintName=identifier '(' parameters+=identifier parameters+=identifier ')'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It supports 'INDEX(t idx_emp)' style. For example, I included one in the test.

@SparkQA
Copy link

SparkQA commented Jul 12, 2016

Test build #62162 has finished for PR 14132 at commit edec2e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode

@dongjoon-hyun
Copy link
Member Author

For the HINT_PREFIX, I tried at the first, but still have some problem. So, I couldn't include the very previous commit.

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Jul 12, 2016

Thank you always, @hvanhovell . And, sorry for the delay. Since last Saturday, I need to do some important personal stuff offline, so the time is limited for me. :(

@dongjoon-hyun
Copy link
Member Author

I'm back. I'll resolve them.

@dongjoon-hyun
Copy link
Member Author

So far, I couldn't do the following two advices. I just inform you that I'm still working these. :)

  • HINT_PREFIX
  • | hintName=identifier '(' parameters+=identifier parameters+=identifier ')'

@dongjoon-hyun
Copy link
Member Author

Now, only minor HINT_PREFIX remains.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-16475][SQL][WIP] Broadcast Hint for SQL Queries [SPARK-16475][SQL] Broadcast Hint for SQL Queries Jul 12, 2016
@dongjoon-hyun
Copy link
Member Author

Hi, @hvanhovell . So far, I tried in the following way for HINT_PREFIX.
Due to the prefix overlapping, BRACKETED_COMMENT seems to eat the HINT.
Could you give me some advice more?

 hint
-    : '/*+' hintStatement '*/'
+    : HINT_PREFIX hintStatement '*/'
     ;

 hintStatement
@@ -961,12 +961,12 @@ SIMPLE_COMMENT
     : '--' ~[\r\n]* '\r'? '\n'? -> channel(HIDDEN)
     ;

-BRACKETED_EMPTY_COMMENT
-    : '/**/' -> channel(HIDDEN)
+HINT_PREFIX
+    : '/*+'
     ;

 BRACKETED_COMMENT
-    : '/*' ~[+] .*? '*/' -> channel(HIDDEN)
+    : '/*' .*? '*/' -> channel(HIDDEN)             <--- The original one.
     ;

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62196 has finished for PR 14132 at commit 5c76e25.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 13, 2016

Test build #62198 has finished for PR 14132 at commit 61e4b1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Could you review this PR again when you have some time?
All the features are implemented and Spark Parser accepts general hints now.

@dongjoon-hyun
Copy link
Member Author

Hi, @hvanhovell .
In this PR, if possible, could we skip adding HINT_PREFIX rule? I finished all the other things.
Actually, there is no change on the behavior of parser if we can use HINT_PREFIX.

@hvanhovell
Copy link
Contributor

@dongjoon-hyun sure. It was merely a suggestion to get rid of the BRACKETED_EMPTY_COMMENT rule.

@dongjoon-hyun
Copy link
Member Author

Thank you, @hvanhovell .
For the other, any more comments are welcome!

@dongjoon-hyun
Copy link
Member Author

PR description is updated.

@dongjoon-hyun
Copy link
Member Author

Oh, the one failure is due to a new MAPJOIN testcase in the master branch (I added yesterday.)
I'll rebase and update that, too.

@SparkQA
Copy link

SparkQA commented Jul 28, 2016

Test build #62975 has finished for PR 14132 at commit 4023d97.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @cloud-fan .
Could you review this PR again please?

@dongjoon-hyun
Copy link
Member Author

dongjoon-hyun commented Jul 30, 2016

This PR grows too much. Sometime, scrolling is too slow. I close this and open a new one #14426 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants