[SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet #9940
Conversation
Test build #46611 has finished for PR 9940 at commit
retest this please.
Test build #46626 has finished for PR 9940 at commit
Test build #46656 has finished for PR 9940 at commit
ping @liancheng @yhuai
I don't really get the idea of this one. What exactly does "oneSide" mean?
Indentation of the map block is off; please help fix it. Thanks.
@liancheng The naming might be a little confusing. It means that the field only exists in one of the schemas being merged. For example, when we merge the two schemas {a: Int, b: String, c: Double} and {a: Int, c: Double}, the field b: String exists in only one of them. In other words, this patch tries to mark the differences between the merged schemas.
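The merging-and-tagging idea described above can be sketched with a minimal model of schemas in plain Scala. This is not Spark's actual StructType/Metadata API; the `Field`/`Schema` classes and the "optional" metadata key are illustrative stand-ins:

```scala
// Minimal stand-ins for Spark's StructField/StructType (illustrative only).
case class Field(name: String, dataType: String, metadata: Map[String, Boolean] = Map.empty)
case class Schema(fields: Seq[Field])

// Merge two schemas, tagging every field that appears in only one of them,
// so that filters on such fields can later be skipped during pushdown.
def mergeAndTag(a: Schema, b: Schema): Schema = {
  val namesA = a.fields.map(_.name).toSet
  val namesB = b.fields.map(_.name).toSet
  val shared = namesA intersect namesB
  val merged = a.fields ++ b.fields.filterNot(f => namesA(f.name))
  Schema(merged.map { f =>
    if (shared(f.name)) f
    else f.copy(metadata = f.metadata + ("optional" -> true)) // only in one schema
  })
}

val s1 = Schema(Seq(Field("a", "Int"), Field("b", "String"), Field("c", "Double")))
val s2 = Schema(Seq(Field("a", "Int"), Field("c", "Double")))
val merged = mergeAndTag(s1, s2)
// Field b carries the "optional" tag; a and c, present in both schemas, do not.
```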
Oh I see. So basically it indicates that a field doesn't exist in the schemas of all part-files, and we try to skip filters that involve this kind of field when doing filter push-down. Right?
Yes, you are right.
Test build #46848 has finished for PR 9940 at commit
Is it a regression?
Hi @yhuai, can you explain more? What did you mean by a regression?
Oh, I thought it was a bug fix (so I was wondering whether it's a regression from 1.5 or not). But it is actually an improvement?
@yhuai I think so. We currently skip pushdown filters completely if schema merging is enabled. This patch improves that.
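The improvement can be sketched as a filtering step before pushdown: drop any predicate that references a field tagged as missing from some part-files. The predicate classes below are illustrative, not Spark's actual sources.Filter hierarchy:

```scala
// Minimal predicate model (illustrative; not Spark's actual Filter classes).
sealed trait Predicate { def references: Set[String] }
case class Eq(col: String, value: Any) extends Predicate { def references = Set(col) }
case class Gt(col: String, value: Any) extends Predicate { def references = Set(col) }

// A predicate is safe to push down only if it touches no field that is
// missing from some part-files (the fields tagged during schema merging).
def safeToPushDown(p: Predicate, optionalFields: Set[String]): Boolean =
  (p.references intersect optionalFields).isEmpty

val optionalFields = Set("b") // "b" is present in only some part-files
val predicates = Seq(Eq("a", 1), Gt("b", "x"), Eq("c", 2.0))
val pushed = predicates.filter(safeToPushDown(_, optionalFields))
// Only the predicates on a and c are pushed down; the one on b is left for Spark to evaluate.
```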
@yhuai Here is the background: Parquet filter push-down is disabled when schema merging is turned on because of PARQUET-389. I'm somewhat hesitant to have this change. On one hand, I do want to have filter push-down in case of schema merging since it's generally a very useful optimization. On the other hand, this change is a little bit hacky, since schema metadata is not intended to be used in this way. Anyway, at least, let's have two more updates for this PR:
Renamed. I will update this for the second point later.
Test build #46881 has finished for PR 9940 at commit
Test build #47229 has finished for PR 9940 at commit
Test build #47241 has finished for PR 9940 at commit
@viirya Thanks for your efforts! Would you mind me revisiting this after the 1.6 release? I would like to see whether we can have PARQUET-389 fixed in the Parquet community ASAP, so that we may not need to work around it in Spark 1.7/2.0.
@liancheng Sure. Thank you!
@liancheng Can we revisit this now? Or do we want to wait a bit longer?
I guess we resort to equalsIgnoreCompatibleNullability because of the extra metadata in the merged schema? Can we also add an assertion for the added metadata instead of working around it with equalsIgnoreCompatibleNullability?
Overall LGTM except for several minor issues. Another thing is that we probably want to use a more special name (something like
Test build #50190 has finished for PR 9940 at commit
ping @liancheng Please see whether the latest updates look good to you. Thanks.
Thanks! I'm going to merge this to master.
If we add clear() to MetadataBuilder, this can be lifted above the fields.map; inside the map operation we would just clear the MetadataBuilder. What do you think?
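For illustration, a builder with a clear() method along the lines suggested here might look like the sketch below. Note this is a hypothetical stand-in: Spark's actual MetadataBuilder has no clear() method, which is exactly the change being debated:

```scala
import scala.collection.mutable

// Hypothetical builder with clear(), as suggested in this thread.
// Not Spark's actual MetadataBuilder; shown only to illustrate the reuse pattern.
class SimpleMetadataBuilder {
  private val entries = mutable.Map.empty[String, Any]
  def putBoolean(key: String, value: Boolean): this.type = { entries += key -> value; this }
  def clear(): this.type = { entries.clear(); this }   // reset so one builder serves many fields
  def build(): Map[String, Any] = entries.toMap
}

val builder = new SimpleMetadataBuilder
val first = builder.putBoolean("optional", true).build()
val second = builder.clear().putBoolean("optional", false).build() // same builder, reused
```

With clear(), a single builder instance could be allocated once outside a fields.map and reset per field, instead of constructing a new builder inside the loop.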
This PR is mostly a workaround for a parquet-mr bug (PARQUET-389), and I'd assume it will be fixed in the near future; then we can remove this workaround. So it doesn't seem worth modifying MetadataBuilder, which is part of the public API.
Agreed.
This workaround may be taken out in the future. However, MetadataBuilder is used in many other places: http://pastebin.com/nVjNfrgp. I feel adding clear() to MetadataBuilder would help in current and future use cases.
Unfortunately, unless we have a timeline to actually fix the Parquet bug, I don't think we can expect the workaround to be removed in the near future. It's been almost half a year since the original patch went in, and the patch is still necessary.
Hey guys, this patch is severely under-documented. It also isn't great to introduce something in the metadata to tag a column as optional.
Also, do you actually see a real use case for this issue, or are you just thinking it might be useful to optimize this?
```scala
protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
  StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

def removeMetadata(key: String, dt: DataType): DataType =
```
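A helper like removeMetadata can be sketched as a recursion over the type tree. The model below is simplified and illustrative, not Spark's actual DataType hierarchy or the real implementation under review:

```scala
// Simplified data-type model (illustrative; not Spark's actual DataType hierarchy).
sealed trait DType
case object IntT extends DType
case class StructT(fields: Seq[(String, DType, Map[String, Boolean])]) extends DType

// Recursively strip one metadata key from every field, including nested structs.
def removeMetadata(key: String, dt: DType): DType = dt match {
  case StructT(fields) =>
    StructT(fields.map { case (name, t, md) => (name, removeMetadata(key, t), md - key) })
  case other => other // leaf types carry no field metadata in this model
}

val nested = StructT(Seq(
  ("inner", StructT(Seq(("x", IntT, Map("optional" -> true)))), Map.empty),
  ("y", IntT, Map("optional" -> true))
))
val cleaned = removeMetadata("optional", nested)
// The "optional" tag is gone from both the top-level field y and the nested field x.
```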
why are we adding a public API for a patch fix?
FYI #14074. I'm actually thinking we should perhaps revert this patch, but I don't want to do it when we are so close to the RC.
This was introduced for a real use case and it is very useful. I actually developed this patch to fix a problem at my previous company, which was a heavy Spark user for machine learning and ETL. In practice, schema merging is common because of schema expansion, and under such use cases users can't get the advantages of predicate pushdown. Instead of reverting this, I would suggest we fix the issues (documenting it and hiding the API, as you just did). Without this patch, such cases certainly suffer a performance regression.
OK thanks.
JIRA: https://issues.apache.org/jira/browse/SPARK-11955
Currently we simply skip pushing down filters to Parquet if schema merging is enabled.
However, we can mark particular fields in the merged schema so that filters on the remaining fields can still be safely pushed down to Parquet.