[SPARK-31026] [SPARK-31060] [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots #27780

dbtsai · 2020-03-03T23:56:02Z

What changes were proposed in this pull request?

Parquet predicate pushdown on columns with dots was disabled in SPARK-20364 because of a limitation in the Parquet APIs.

A new set of APIs is proposed in PARQUET-1809 to generalize support for both column names containing dots and nested columns.

This PR implements a new Parquet filter API that supports both column names containing dots and nested columns. We will remove this code from the Spark codebase once we upgrade to a Parquet release that contains this implementation.
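To illustrate the limitation, here is a minimal sketch in plain Java. It assumes the underlying problem is that a column reference given as a single dot-separated string cannot distinguish a top-level column literally named a.b from a field b nested under a. ColumnPathSketch and both of its methods are hypothetical stand-ins for illustration only; they are not the Parquet classes or the SparkFilterApi added in this PR.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a column path; NOT the real Parquet class.
final class ColumnPathSketch {
    private ColumnPathSketch() {}

    // Dot-string style: every '.' is treated as a nesting separator,
    // so a top-level column named "a.b" is misread as field b inside a.
    static List<String> fromDotString(String name) {
        return Arrays.asList(name.split("\\."));
    }

    // Segment style: the caller passes each path element explicitly,
    // so a dot inside one element is kept as part of the name.
    static List<String> fromSegments(String... segments) {
        return Arrays.asList(segments);
    }

    public static void main(String[] args) {
        // Ambiguous: was this one column "a.b" or a nested field?
        System.out.println(fromDotString("a.b"));    // prints [a, b]
        // Unambiguous: one top-level column whose name contains a dot.
        System.out.println(fromSegments("a.b"));     // prints [a.b]
        // Unambiguous: field b nested under a.
        System.out.println(fromSegments("a", "b"));  // prints [a, b]
    }
}
```

With explicit segments, the same representation can describe both a dotted top-level name and a nested field, which is why one API can cover both cases.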

Why are the changes needed?

Many production tables use dots in their column names, and without predicate pushdown on those columns, query performance suffers.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests and one new test.

dbtsai · 2020-03-03T23:57:02Z

This depends on #27778. Once that PR is merged, I will rebase against master. Thanks!

dongjoon-hyun · 2020-03-04T00:30:41Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala

Maybe, a test left-over? Shall we remove this?

Oops. Thanks.

HyukjinKwon · 2020-03-04T01:10:32Z

sql/core/src/main/java/org/apache/parquet/filter2/predicate/SparkFilterApi.java

Maybe I am misremembering, but I initially tried to allow filters with dots using a similar approach (#18000). It was suggested that we simply disable it, so I did, and @rdblue didn't like the approach either. Am I correct, @rdblue?

This looks more useful now, as it can support not only column names with dots but also nested fields.

SparkQA · 2020-03-04T04:44:08Z

Test build #119262 has finished for PR 27780 at commit f7326f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class SparkFilterApi

SparkQA · 2020-03-04T05:31:49Z

Test build #119265 has finished for PR 27780 at commit ecf1f9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-03-04T17:10:12Z

Please rebase onto master now that the related sub-PR is merged.

SparkQA · 2020-03-04T22:14:27Z

Test build #119328 has finished for PR 27780 at commit 5ebafd5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class SparkFilterApi

dongjoon-hyun · 2020-03-05T03:49:48Z

...e/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala

Can we test both the vectorized and non-vectorized readers?

+1, we can merge this into the code block above, which is inside a Seq(true, false).foreach { vectorized =>

Done. Thanks.

dongjoon-hyun · 2020-03-05T04:47:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Implicits.scala

Could you make another PR for this renaming first because this is orthogonal to this PR?

Here is a PR for renaming and consolidating two quoteIfNeeded implementations. #27814

SparkQA · 2020-03-05T06:27:35Z

Test build #119349 has finished for PR 27780 at commit 5b77ecf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-05T06:40:46Z

Test build #119354 has finished for PR 27780 at commit b6229e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-03-05T13:19:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala

To confirm, this PR doesn't support nested fields yet, right?

This PR doesn't support nested fields yet, but it is one step forward.

cloud-fan

LGTM except for one comment in the test. Thanks for cleaning this up and fixing it!

SparkQA · 2020-03-06T05:24:06Z

Test build #119435 has finished for PR 27780 at commit 4cc2ff6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class SparkFilterApi

dbtsai · 2020-03-06T21:27:31Z

This depends on https://github.com/apache/spark/pull/27817/files

SparkQA · 2020-03-07T02:22:55Z

Test build #119489 has finished for PR 27780 at commit 0e94952.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-07T02:26:25Z

Test build #119491 has finished for PR 27780 at commit 77bf26e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-03-08T03:23:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala

This one is actually a pretty breaking change. Not all data source implementations will have syntax to handle backquotes. There are many non-DBMS implementations out there, such as Elasticsearch and MongoDB, for which I see relevant tickets in the Spark JIRA from time to time.

In particular, this is a stable API. Can we update the migration guide at the very least?
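To make the concern concrete, here is a small sketch of the kind of quoting under discussion. It assumes the rule is that a name part is wrapped in backquotes only when it contains a dot or a backquote, with embedded backquotes doubled; QuotingSketch and its methods are hypothetical and written for illustration, so the helper in the Spark codebase may differ.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical illustration of backquote quoting for filter attribute names.
final class QuotingSketch {
    private QuotingSketch() {}

    // Assumed rule: quote a single name part only if it contains '.' or '`'.
    static String quoteIfNeeded(String part) {
        if (part.contains(".") || part.contains("`")) {
            return "`" + part.replace("`", "``") + "`";
        }
        return part;
    }

    // Join the parts of a multi-part name with dots, quoting each part as needed.
    static String quoted(String... parts) {
        return Arrays.stream(parts)
                .map(QuotingSketch::quoteIfNeeded)
                .collect(Collectors.joining("."));
    }

    public static void main(String[] args) {
        System.out.println(quoted("a.b"));       // prints `a.b`
        System.out.println(quoted("a", "b"));    // prints a.b
        System.out.println(quoted("a.b", "c"));  // prints `a.b`.c
    }
}
```

Under this assumption, a data source that previously received the raw name a.b for a top-level column would now receive the backquoted form, and a source without backquote handling would fail to match the column.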

HyukjinKwon · 2020-03-08T03:25:45Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala

nit: protected[sql] -> protected[orc]

dbtsai · 2020-03-10T17:42:19Z

Closing this and merging it with https://github.com/apache/spark/pull/27728/files. Thanks, all, for reviewing.

SparkQA · 2020-03-10T22:06:50Z

Test build #119626 has finished for PR 27780 at commit 1e00859.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai requested review from cloud-fan, dongjoon-hyun, gengliangwang and wangyum March 3, 2020 23:56

dbtsai requested a review from HyukjinKwon March 3, 2020 23:57

dongjoon-hyun reviewed Mar 4, 2020

dbtsai changed the title ~~[SPARK-31026] [SQL] Parquet predicate pushdown on columns with dots~~ [SPARK-31026] [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots Mar 4, 2020

dbtsai requested a review from rdblue March 4, 2020 00:49

This was referenced Mar 4, 2020

[SPARK-20364][SQL] Disable Parquet predicate pushdown for fields having dots in the names #18000

Closed

[SPARK-20364][SQL] Support Parquet predicate pushdown on columns with dots #17680

Closed

HyukjinKwon reviewed Mar 4, 2020

dongjoon-hyun added the SQL label Mar 4, 2020

dbtsai force-pushed the SPARK-31026 branch from ecf1f9d to 5ebafd5 Compare March 4, 2020 17:51

dongjoon-hyun reviewed Mar 5, 2020

cloud-fan reviewed Mar 5, 2020

dbtsai force-pushed the SPARK-31026 branch from b6229e7 to 4cc2ff6 Compare March 6, 2020 00:49

cloud-fan approved these changes Mar 6, 2020

dbtsai mentioned this pull request Mar 6, 2020

[SPARK-31060][SQL] Handle column names containing dots in data source Filter #27817

Closed

dbtsai force-pushed the SPARK-31026 branch from 4cc2ff6 to 0e94952 Compare March 6, 2020 21:27

dbtsai force-pushed the SPARK-31026 branch from 6d10926 to 77bf26e Compare March 6, 2020 21:34

HyukjinKwon reviewed Mar 8, 2020

dbtsai changed the title ~~[SPARK-31026] [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots~~ [SPARK-31026] SPARK-31060 [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots Mar 10, 2020

dbtsai changed the title ~~[SPARK-31026] SPARK-31060 [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots~~ [SPARK-31026] [SPARK-31060] [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots Mar 10, 2020

SPARK-31026

1e00859

dbtsai force-pushed the SPARK-31026 branch from 77bf26e to 1e00859 Compare March 10, 2020 17:33

dbtsai closed this Mar 10, 2020
