-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt) #21889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt) #21889
Conversation
|
ok to test |
|
Test build #93646 has finished for PR 21889 at commit
|
|
If there's no objection within few days, let me get this in cc @cloud-fan and @gatorsmile and make other works and comments separate. @mallman, if we are all happy here, mind taking a look #21320 (comment) and #21320 (comment) I will fix my own comments as a followup by myself. Will credit this to you FWIW. |
I'm working on the former. I don't understand what kind of "second PR" @gatorsmile is referring to in the latter comment. I know we need to do something to fix the ignored tests, but those tests are in this PR. Am I supposed to create a PR on this PR? |
If this one got merged, yes, we could fix them in a separate PR to enable them. I remember there need some fixes in |
|
Just FYI, we are unable to merge it if it has a correctness bug. |
|
@gatorsmile, just for clarification, you mean some regressions about correctness bug in existing features, right? |
|
I mean #21320 (comment) |
|
Ah, I think that's a bug when this feature is enabled. How about this: since this PR is open already and that's a bug in the new feature, we can get this in, I address my comments and he works on that comment and fixes in |
|
Hi @gatorsmile. Where do you see us at this point? Do you still want to get this into Spark 2.4? |
|
We are still targeting this to 2.4, but we need to fix all the identified bugs before merging it. |
|
Normally, we change the default to false or revert the whole PR if the bugs are found during the RC (release candidate) stage. |
|
In that case, I hope my comments are addressed together before merging it in. They are non trivial and I don't usually go with them if there isn't special reason for it. |
|
Where does that leave both of these PRs? Do we still want this one with the code refactoring or to go back to the original? Are there any comments for this PR that would block merging? I've set the default to false in this PR. |
|
If the revert is very risky, we normally change the default from true to false when the bug is identified in the RC stage. |
@ajacques Do you want to rebase your PR off my last commit? Then I think it's up to @HyukjinKwon and @gatorsmile on how to proceed. |
|
@mallman, sounds good I'll get this PR updated with your latest changes as soon as I can. |
|
Test build #93925 has finished for PR 21889 at commit
|
|
Test build #93932 has finished for PR 21889 at commit
|
Hi @ajacques. I'm not sure if you're aware of this, but you can run the scalastyle checks locally with |
|
Test build #93952 has finished for PR 21889 at commit
|
|
These test failures are in Spark streaming. Is this just an intermittent test failure or actually caused by this PR? |
I was able to run the first failing test successfully. Can we get a retest, please? |
@ajacques I just rebased and pushed my branch off of master. Perhaps the easiest thing to do would be for you to rebase again off of my branch and do another force-push to your branch. |
|
Test build #94037 has finished for PR 21889 at commit
|
|
Test build #94073 has finished for PR 21889 at commit
|
|
Anybody else able to reproduce this failure? It succeeded on my developer machine. |
|
retest this please |
It worked for me, too. Let's see what a retest does. |
|
Test build #94101 has finished for PR 21889 at commit
|
|
Test build #94536 has finished for PR 21889 at commit
|
I spent some time yesterday digging deeper into why the hack I wrote worked, and I think I understand now. Practically speaking, my follow-on PR will be about the same as the commit I removed. However, I can support it with some explanatory comments instead of just "this throws an exception sometimes". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we really avoid AttributeReference(name, _, _, _) pattern per https://github.com/databricks/scala-style-guide#pattern-matching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For instance, these , _, _, _), _, _, looks excessive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
private def buildPrunedDataSchema(
fileDataSchema: StructType,
requestedRootFields: Seq[RootField]) = {per https://github.com/databricks/scala-style-guide#spacing-and-indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix all this. Seems like this is an opportunity to catch some of this with static analysis to avoid wasting the reviewer's time. Is Scalastyle not configured to catch this stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know, no. I use IntelliJ (integration with scalastyle) to catch basic style nits but rather then that, I do it manually. It took me a while to get used to it before FWIW.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[[..]] syntax is not needed in inlined comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation looks weird here per https://github.com/databricks/scala-style-guide#documentation-style
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we have multiple relations with split schema for each? It wouldn't necessarily have one big deep nested schema which makes hard to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Array.empty[FullName]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a definition in a constructor, so I don't think we can do:
case class Contact(
[...]
friends: Array.empty[FullName],
relatives: Map.empty[String, FullName]
)Scala wants a colon, so I opted for:
case class Contact(
[...]
friends: Array[FullName] = Array.empty,
relatives: Map[String, FullName] = Map.empty
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Map.empty[String, FullName]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: avoid infix notation per https://github.com/databricks/scala-style-guide#infix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid unrelated refactoring and revert this back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we really unable to make some private functions and split some logics at least?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
|
@ajacques I added a commit to enable schema pruning by default. It's a little more complete than your commit to do the same. Please rebase off my branch and remove your commit. |
|
ok to test |
|
Test build #94731 has finished for PR 21889 at commit
|
|
Test build #94785 has finished for PR 21889 at commit
|
|
retest this please |
|
Test build #94790 has finished for PR 21889 at commit
|
That would be my preference, yes, especially if it means less work for you. |
|
Test build #94805 has finished for PR 21889 at commit
|
|
Test build #4278 has finished for PR 21889 at commit
|
This PR doesn't only include stylistic changes. Since stylistic changes do not usually block a PR, mind fixing the PR description? |
|
Essentially, this PR was created to take the management of #21320 out of my hands, with a view towards facilitating its incorporation into Spark 2.4. It was my suggestion, one based in frustration. In hindsight, I no longer believe this strategy is the best—or most expedient—approach towards progress. Indeed, I believe the direction of this PR has become orthogonal to its motivating goal, becoming a dispute between myself and @HyukjinKwon rather than a means to move things along. I believe I can shepherd #21320 in a way that will promote greater progress. @ajacques, I mean no disrespect, and I thank you for volunteering your time, patience and effort for the sake of all that are interested in seeing this patch become a part of Spark. And I apologize for letting you down, letting everyone down. In my conduct leading up to the creation of this PR I did not act with the greatest maturity or patience. And I did not act in the best interests of the community. No one has spent more time or more effort, taken more responsibility or exhibited more patience with this 2+ year patch-set-in-the-making than myself. I respectfully submit it is mine to present and manage, and no one else's. Insofar as I have expressed otherwise in the past, I admit my error—one made in frustration—and recant in hindsight. @ajacques, at this point I respectfully assert that managing the patch set I submitted in #21320 is not your responsibility, nor is it anyone else's but mine. I ask you to close this PR so that we can resume the review in #21320. As I stated there, you are welcome to open a PR on https://github.com/VideoAmp/spark-public/tree/spark-4502-parquet_column_pruning-foundation to submit the changes you've made for review. Thank you. |
|
Thanks for the response all. @mailman If it's really your preference, I will create a PR against that branch and close this one. My intention was never to take away from your efforts, and I still consider my work here to be just minor stylistic tweaks on top of your work. I did this as service to help bridge the divide and hopefully alleviate frustrations. But this has been a bit frustrating being stuck between two sides of this PR/changing merge strategies often and don't wish to continue playing this role. As such, I will create a PR, but hope it does not dragged out to settle any differences in opinions between maintainers and submitters. My goal is to make sure this valuable feature gets merged so many can benefit. |
(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)
This is a restart of #21320. Most of the discussion has taken place over there. I've only taken it as a based to make changes based on the code review to help move things along.
Due to the urgency of the upcoming 2.4 code freeze, I'm going to open this PR to collect any feedback. This can be closed if you prefer to continue to the work in the original PR.
What changes were proposed in this pull request?
One of the hallmarks of a column-oriented data storage format is the ability to read data from a subset of columns, efficiently skipping reads from other columns. Spark has long had support for pruning unneeded top-level schema fields from the scan of a parquet file. For example, consider a table, contacts, backed by parquet with the following Spark SQL schema:
Parquet stores this table's data in three physical columns: name.first, name.last and address. To answer the query
select address from contactsSpark will read only from the address column of parquet data. However, to answer the query
select name.first from contactsSpark will read name.first and name.last from parquet.
This PR modifies Spark SQL to support a finer-grain of schema pruning. With this patch, Spark reads only the name.first column to answer the previous query.
Implementation
There are two main components of this patch. First, there is a ParquetSchemaPruning optimizer rule for gathering the required schema fields of a PhysicalOperation over a parquet file, constructing a new schema based on those required fields and rewriting the plan in terms of that pruned schema. The pruned schema fields are pushed down to the parquet requested read schema. ParquetSchemaPruning uses a new ProjectionOverSchema extractor for rewriting a catalyst expression in terms of a pruned schema.
Second, the ParquetRowConverter has been patched to ensure the ordinals of the parquet columns read are correct for the pruned schema. ParquetReadSupport has been patched to address a compatibility mismatch between Spark's built in vectorized reader and the parquet-mr library's reader.
Limitation
Among the complex Spark SQL data types, this patch supports parquet column pruning of nested sequences of struct fields only.
How was this patch tested?
Care has been taken to ensure correctness and prevent regressions. A more advanced version of this patch incorporating optimizations for rewriting queries involving aggregations and joins has been running on a production Spark cluster at VideoAmp for several years. In that time, one bug was found and fixed early on, and we added a regression test for that bug.
We forward-ported this patch to Spark master in June 2016 and have been running this patch against Spark 2.x branches on ad-hoc clusters since then.
Known Issues
Highlighted in #21320 (comment)