Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Feb 25, 2016

JIRA: https://issues.apache.org/jira/browse/SPARK-13321

What changes were proposed in this pull request?

Looks like the following SQL can be parsed now with new ANTLR4 parser:

SELECT  `u_1`.`id` FROM (((SELECT  `t0`.`id` FROM `default`.`t0`) UNION ALL (SELECT  `t0`.`id` FROM `default`.`t0`)) UNION ALL (SELECT  `t0`.`id` FROM `default`.`t0`)) AS u_1

We just need to add test cases.

How was this patch tested?

New tests are added to PlanParserSuit and HiveQuerySuite.

@viirya
Copy link
Member Author

viirya commented Feb 25, 2016

cc @rxin

I am not sure if HiveCompatibilitySuite.union16 was hanging from this. Because I copy the same queries from union16 to HiveQuerySuite and they are working locally. I will see how jenkins outputs from this change as it now should run HiveCompatibilitySuite.

@SparkQA
Copy link

SparkQA commented Feb 25, 2016

Test build #51928 has finished for PR 11361 at commit 5ff5ac2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Feb 25, 2016

@rxin Looks like HiveCompatibilitySuite.union16 doesn't hang from this. But it actually takes long time to finish that test ([info] - union16 (13 minutes, 21 seconds)). I don't know if it is different than previous run before this pr. Because this only change parser rule, I think it should not modify the time to run the union test.

@rxin
Copy link
Contributor

rxin commented Feb 25, 2016

Can you take a look why would 2 queries take 13 mins?

EXPLAIN
SELECT count(1) FROM (
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src) src;


SELECT count(1) FROM (
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src) src;

When I was running this, this was running in parser forever.

@viirya
Copy link
Member Author

viirya commented Feb 25, 2016

@rxin ok. I got why it takes so long to finish the test.

The original query:

SELECT count(1) FROM (
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL

  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src UNION ALL
  SELECT key, value FROM src) src;

will result an analyzed plan like:

== Analyzed Logical Plan ==
count(1): bigint
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count(1)#393L]
+- SubqueryAlias src
   +- Union
      :- Union
      :  :- Union
      :  :  :- Union
      :  :  :  :- Union
      :  :  :  :  :- Union
      :  :  :  :  :  :- Union
      :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :  :- Union
      ...(skip)

In HiveComparisonTest, we will try to use SQLBuilder to convert analyzed plan back to sql query.

Because PR #11195 adds a () to wrap sql queries for union's children, it will generate a deeply nested sql query for union16 query:

SELECT count(1) AS `count(1)` FROM (((((((((((((((((((((((((SELECT `src`.`key`, `src`.`value` FROM `default`.`src`) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) UNION ALL (SELECT `src`.`key`, `src`.`value` FROM `default`.`src`)) AS src

Basically the parser processes nested union query with a recursive approach, to parse such deeply nested query cost much time. That is why union16 takes so long to finish.

If we remove the () from the sql queries for union's children in SQLBuilder, the generated sql query would be:

SELECT count(1) AS `count(1)` FROM (SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src` UNION ALL SELECT `src`.`key`, `src`.`value` FROM `default`.`src`) AS src

Then the union16 can normally finish under this patch.

@viirya
Copy link
Member Author

viirya commented Feb 25, 2016

@rxin I don't think we should convert union plan back to nested sql query. I would like to remove the () from SQLBuilder for union. What do you think?

@rxin
Copy link
Contributor

rxin commented Feb 25, 2016

Any other thing we can do for this perf problem? It's only 25 levels of nesting. It seems strange to me that the parser would take mins to parse this ... It's hard for me to believe it's just because of some recursion. Is there some exponential complexity here?

cc @hvanhovell

@viirya
Copy link
Member Author

viirya commented Feb 25, 2016

ok. I will continue to see if we can improve the performance of parsing nested union.

@hvanhovell
Copy link
Contributor

@viirya I am currently working ANTLR4 based version of the parsers (see my repo for a few initial commits). It is basically a port of the presto parser. I need another week or so to get most of the HQL functionality working. Perhaps we should wait with this until the new parser is ready.

(edited)

@viirya
Copy link
Member Author

viirya commented Feb 26, 2016

@hvanhovell Great to see your initial work. It looks promising. I think this can wait until the new parser. Besides, are we going to retire ANTLR3 used now?

@hvanhovell
Copy link
Contributor

@viirya I really don't see any reason to keep ANTLR3 around after we migrate the parser.

viirya added 2 commits April 1, 2016 06:33
Conflicts:
	sql/catalyst/src/main/antlr3/org/apache/spark/sql/catalyst/parser/SparkSqlParser.g
	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/CatalystQlSuite.scala
@viirya viirya changed the title [SPARK-13321][SQL] Support nested UNION in parser [SPARK-13321][SQL] Add nested union test cases Apr 1, 2016
@viirya
Copy link
Member Author

viirya commented Apr 1, 2016

cc @hvanhovell @rxin Because new ANTLR4 parser seems can support this syntax. I updated this to add test cases only. Please take a look. Thanks!

|(SELECT `t0`.`id` FROM `default`.`t0`)) AS u_1
""".stripMargin)

val expected = Project(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: @viirya could you update this use the DSL and assertEqual equals? It makes this a bit easier to read.

BTW this test is very similar to the following test case: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala#L384-L392

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. indeed. If so, I think I can close this pr now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell Is new ANTLR4 parser natively to solve this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya the new parser handles nested queries a lot better. This is mainly due to ANTLR4's better parsing algorithms.

@viirya viirya closed this Apr 1, 2016
@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54689 has finished for PR 11361 at commit 6775020.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya deleted the nested-union branch December 27, 2023 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants