[SPARK-31026] [SPARK-31060] [SQL] [test-hive1.2] Parquet predicate pushdown on columns with dots #27780
This depends on #27778. Once the other one is merged, I will rebase against master. Thanks!
Maybe, a test left-over? Shall we remove this?
Oops. Thanks.
This looks more useful now as it can not only support column name with dots, but also nested fields.
Test build #119262 has finished for PR 27780 at commit
Test build #119265 has finished for PR 27780 at commit
Please rebase to the master because the related sub-PR is merged now.
Test build #119328 has finished for PR 27780 at commit
Can we test both vectorized and non-vectorized reader?
+1, we can merge this to the code block above, which is inside a Seq(true, false).foreach { vectorized =>
Done. Thanks.
Could you make another PR for this renaming first because this is orthogonal to this PR?
+1
Here is a PR for renaming and consolidating two quoteIfNeeded implementations. #27814
Test build #119349 has finished for PR 27780 at commit
Test build #119354 has finished for PR 27780 at commit
to confirm, this PR doesn't support nested fields yet, right?
This PR doesn't support nested fields yet, but it's one step forward.
cloud-fan left a comment
LGTM except one comment in the test. Thanks for cleaning this up and fixing it!
Test build #119435 has finished for PR 27780 at commit
This depends on https://github.com/apache/spark/pull/27817/files
Test build #119489 has finished for PR 27780 at commit
Test build #119491 has finished for PR 27780 at commit
This one is actually a pretty breaking change. Not all data source implementations will have the syntax to handle backquotes. There are many non-DBMS implementations out there, such as Elasticsearch and MongoDB, for which I see relevant tickets in the Spark JIRA from time to time.
In particular, this is a stable API. Can we update the migration guide at the very least?
nit: protected[sql] -> protected[orc]
Closing it and merging with https://github.com/apache/spark/pull/27728/files. Thanks all for reviewing.
Test build #119626 has finished for PR 27780 at commit
What changes were proposed in this pull request?
Parquet predicate pushdown on columns with dots was disabled in SPARK-20364 due to a limitation of the Parquet APIs.
A new set of APIs is proposed in PARQUET-1809 to generalize the support to both columns containing a dot and nested columns.
This PR implements new Parquet filter APIs that support both column names containing a dot and nested columns. We will remove this code from the Spark codebase once we upgrade to a new release of Parquet that contains this implementation.
Why are the changes needed?
Many tables in production use a dot as part of their column names, and without predicate pushdown on those columns, performance suffers.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests and one new test.
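The ambiguity behind this change can be sketched in a few lines. A dot-joined string such as a.b cannot tell a top-level column literally named a.b apart from a field b nested inside a struct a, whereas a list of name parts can. The Python below is only an illustration and is not code from this PR: quote_if_needed, to_dotted and parse_dotted are hypothetical helpers, and the quoting rule (backquote any part that is not a plain identifier, doubling embedded backquotes) is an assumed simplification of the backquote handling discussed in the review.

```python
import re
from typing import List

# Hypothetical helpers for illustration only; not Spark or Parquet APIs.
_PLAIN_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def quote_if_needed(part: str) -> str:
    """Backquote a single name part unless it is a plain identifier."""
    if _PLAIN_IDENTIFIER.match(part):
        return part
    # Assumed escaping rule: a literal backquote is written twice.
    return "`" + part.replace("`", "``") + "`"


def to_dotted(parts: List[str]) -> str:
    """Join name parts with dots, quoting any part that needs it."""
    return ".".join(quote_if_needed(p) for p in parts)


def parse_dotted(name: str) -> List[str]:
    """Split a dotted name back into parts, honouring backquotes."""
    parts: List[str] = []
    buf: List[str] = []
    quoted = False
    i = 0
    while i < len(name):
        ch = name[i]
        if quoted:
            if ch == "`":
                if i + 1 < len(name) and name[i + 1] == "`":
                    buf.append("`")  # doubled backquote is a literal one
                    i += 1
                else:
                    quoted = False   # closing backquote
            else:
                buf.append(ch)       # dots inside quotes are literal
        elif ch == "`":
            quoted = True
        elif ch == ".":
            parts.append("".join(buf))  # unquoted dot separates parts
            buf = []
        else:
            buf.append(ch)
        i += 1
    if quoted:
        raise ValueError("unterminated backquote in %r" % name)
    parts.append("".join(buf))
    return parts


# A top-level column named "a.b" and a nested field a.b stay distinct.
dotted_column = to_dotted(["a.b"])      # one part containing a dot
nested_field = to_dotted(["a", "b"])    # two parts
```

A consumer that receives only the dotted string has to understand the quoting to recover the parts, which is the compatibility concern raised in the review about data sources that do not handle backquotes.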