Skip to content

Conversation

@dbtsai
Copy link
Member

@dbtsai dbtsai commented Mar 3, 2020

What changes were proposed in this pull request?

Parquet predicate pushdown on columns with dots is disabled in SPARK-20364 due to the limitation of Parquet APIs.

A new set of APIs is purposed in PARQUET-1809 to generalize the support for both cols containing dot and nested cols.

This PR implements a new Parquet filter APIs that supports both column names containing dot and nested columns. We will remove those code from Spark codebase once we upgrade to a new release of Parquet that contains this implementation.

Why are the changes needed?

Many tables in production are using dot as part of the column names, and without predicate pushdown on those columns, the performance is suffering.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests and one new test.

@dbtsai
Copy link
Member Author

dbtsai commented Mar 3, 2020

This depends on #27778 . Once the other one is merged, I will rebase against master. Thanks!

@dbtsai dbtsai requested a review from HyukjinKwon March 3, 2020 23:57
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, a test left-over? Shall we remove this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Thanks.

@dbtsai dbtsai changed the title [SPARK-31026] [SQL] Parquet predicate pushdown on columns with dots [SPARK-31026] [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots Mar 4, 2020
@dbtsai dbtsai requested a review from rdblue March 4, 2020 00:49
Copy link
Member

@HyukjinKwon HyukjinKwon Mar 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am remembering wrongly but I initially tried to allow filters with dots with the similar approach here (#18000). It was suggested simply to disable it so I did it, and @rdblue didn't like it either. Am I correct, @rdblue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more useful now as it can not only support column name with dots, but also nested fields.

@SparkQA
Copy link

SparkQA commented Mar 4, 2020

Test build #119262 has finished for PR 27780 at commit f7326f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class SparkFilterApi

@SparkQA
Copy link

SparkQA commented Mar 4, 2020

Test build #119265 has finished for PR 27780 at commit ecf1f9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Please rebase to the master because the related sub-PR is merged now.

@SparkQA
Copy link

SparkQA commented Mar 4, 2020

Test build #119328 has finished for PR 27780 at commit 5ebafd5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class SparkFilterApi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test both vectorized and non-vectorized reader?

Copy link
Contributor

@cloud-fan cloud-fan Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we can merge this to the code block above, which is inside a Seq(true, false).foreach { vectorized =>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Mar 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make another PR for this renaming first because this is orthogonal to this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

@dbtsai dbtsai Mar 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a PR for renaming and consolidating two quoteIfNeeded implementations. #27814

@SparkQA
Copy link

SparkQA commented Mar 5, 2020

Test build #119349 has finished for PR 27780 at commit 5b77ecf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 5, 2020

Test build #119354 has finished for PR 27780 at commit b6229e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to confirm, this PR doesn't support nested fields yet, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR doesn't support nested fields yet, but it's a one step forward.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except one comment in test, thanks for cleaning this up and fix it!

@SparkQA
Copy link

SparkQA commented Mar 6, 2020

Test build #119435 has finished for PR 27780 at commit 4cc2ff6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class SparkFilterApi

@dbtsai
Copy link
Member Author

dbtsai commented Mar 6, 2020

@SparkQA
Copy link

SparkQA commented Mar 7, 2020

Test build #119489 has finished for PR 27780 at commit 0e94952.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 7, 2020

Test build #119491 has finished for PR 27780 at commit 77bf26e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is actually a pretty breaking change. Not all implementations of the data sources will have the syntax to handle backquotes - there are so many non-DBMS implementations outside like elasticsearch, mongodb, etc. which I see relevant tickets in Spark JIRAs time to time.

In particular, this is a stable API. Can we update the migration guide at the very least?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: protected[sql] -> protected[orc]

@dbtsai dbtsai changed the title [SPARK-31026] [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots [SPARK-31026] SPARK-31060 [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots Mar 10, 2020
@dbtsai dbtsai changed the title [SPARK-31026] SPARK-31060 [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots [SPARK-31026] [SPARK-31060] [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots Mar 10, 2020
@dbtsai
Copy link
Member Author

dbtsai commented Mar 10, 2020

Closing it and merging with https://github.com/apache/spark/pull/27728/files Thanks all for reviewing.

@dbtsai dbtsai closed this Mar 10, 2020
@SparkQA
Copy link

SparkQA commented Mar 10, 2020

Test build #119626 has finished for PR 27780 at commit 1e00859.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants