[SPARK-27707][SQL] Prune unnecessary nested fields from Generate #24637
Conversation
Test build #105516 has finished for PR 24637 at commit

@viirya - looks great! Exactly what I had in mind but wasn't sure how to implement it.

Test build #105522 has finished for PR 24637 at commit

retest this please.

Test build #105526 has finished for PR 24637 at commit

cc @dbtsai
case Project(projectList, child)
    if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
  getAliasSubMap(projectList)
case Project(projectList, child) => getAliasSubMap(projectList)
@viirya. Sorry, but this is a regression for all existing code paths. We should avoid the getAliasSubMap invocation when it isn't needed. https://github.com/apache/spark/pull/24637/files#diff-a636a87d8843eeccca90140be91d4fafR635 doesn't prevent the getAliasSubMap invocation inside unapply, does it?
I see. If so, I need to make a little change to prevent it. Will change it later.
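A minimal sketch of the kind of guard being discussed, assuming a deliberately simplified extractor (the real NestedColumnAliasing.unapply in Optimizer.scala is more involved); the point is only that the cheap config check runs before getAliasSubMap:

import org.apache.spark.sql.catalyst.expressions.{Alias, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.internal.SQLConf

// Hypothetical, simplified extractor: the config check guards the potentially
// expensive alias-map computation, so unapply is essentially free when nested
// schema pruning is disabled.
object SimplifiedNestedColumnAliasing {
  def unapply(plan: LogicalPlan): Option[Map[NamedExpression, Alias]] = plan match {
    case Project(projectList, _) if SQLConf.get.nestedSchemaPruningEnabled =>
      getAliasSubMap(projectList)
    case _ => None
  }

  // Stand-in for the real helper; its actual signature differs.
  private def getAliasSubMap(
      projectList: Seq[NamedExpression]): Option[Map[NamedExpression, Alias]] = None
}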
Thanks!
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/MiscBenchmark.scala
Test build #105564 has finished for PR 24637 at commit

Test build #105568 has finished for PR 24637 at commit
p.copy(child = g.copy(child = newChild, unrequiredChildIndex = unrequiredIndices))

// prune unrequired nested fields
case p @ Project(projectList, g: Generate) =>
Why do we need to special-case Generate? I see there is a general case for nested column pruning below.
I think this is the more general case you are looking for:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala, lines 623 to 624 in 163a6e2:

case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
  NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, attrToAliases)
The general case does pruning only when the flag (nestedSchemaPruningEnabled) is enabled, and it only considers the operators that a nested projection can be pushed through. Generate isn't one of them, so the general case doesn't cover this one.
Generate is special for another reason: we can't prune an output of its child even if only a nested field of that output is used in the top project list, because the generator itself could still use it.
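A small illustration of that constraint, assuming a spark-shell session (spark and spark.implicits._ in scope) and made-up column names: the top project only touches st._1, but the generator reads st._3, so st cannot simply be pruned from Generate's child; pruning has to happen by aliasing the needed nested fields instead.

import org.apache.spark.sql.functions.explode
import spark.implicits._

// `st` is a struct column; its third field is an array that gets exploded.
val df = Seq((("a", 1, Seq(10, 20, 30)), "other")).toDF("st", "other")

// Only st._1 appears in the top project list, but explode consumes st._3,
// so the `st` output of Generate's child is still required.
df.select($"st._1", explode($"st._3").as("v")).explain(true)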
Shall we add an if SQLConf.get.nestedSchemaPruningEnabled guard?
nestedSchemaPruningEnabled is for pruning nested fields from a logical relation, but this fix doesn't address the same cause. Even for data sources whose nested fields can't be pruned, it is still useful to apply this fix.
Out of curiosity, if spark.sql.optimizer.nestedSchemaPruning.enabled isn't part of this optimization, why don't we need another configuration, or a fix to the doc of spark.sql.optimizer.nestedSchemaPruning.enabled?
This looks like a general fix regardless of the nestedSchemaPruning setting. Do we need a config to disable it?
My impression was that we need a configuration, but I think you or @dongjoon-hyun have more context than I do about the nested pruning work. @cloud-fan, @dongjoon-hyun, @gatorsmile, can you make a call here on whether we need a config or not?
nestedSchemaPruning is vague, as it can refer to nested pruning at the scan or to general nested pruning in other operators. Currently the config affects scan-level nested pruning, so I feel it is not tightly related to the fix here, because this fix isn't about the scan.
If we are considering a config here, I think we should either add a different config or fix the doc of nestedSchemaPruning to explicitly say it also covers general nested pruning.
Let me fix the doc of nestedSchemaPruning and apply this pruning when nestedSchemaPruning is enabled.
Test build #105650 has finished for PR 24637 at commit

Test build #106494 has finished for PR 24637 at commit

Retest this please.

Test build #106546 has finished for PR 24637 at commit

Retest this please.

Test build #106651 has finished for PR 24637 at commit
...alyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
range/limit/sum:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
range/limit/sum wholestage off                      191            205          19      2738.4           0.4       1.0X
range/limit/sum wholestage on                       112            124          13      4699.4           0.2       1.7X
Ur, this is irrelevant, but the ratio looks weird.
This seems to have improved for some reason. It is consistently better in both @viirya's tests and mine.
[info] OpenJDK 64-Bit Server VM 1.8.0_212-b04 on Linux 3.10.0-862.3.2.el7.x86_64
[info] Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[info] range/limit/sum: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] range/limit/sum wholestage off 222 226 6 2359.5 0.4 1.0X
[info] range/limit/sum wholestage on 114 121 8 4608.4 0.2 2.0X
Thanks @dongjoon-hyun for the advice! I will add a few more test cases targeting other Generator implementations.

Thank you so much, @viirya !

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
Thank you for adding a new test. Are you going to add more tests? Another approach is simply reducing the scope to the original goal: we can match only the generators we care about. Later, to cover more patterns, I think we need a more general rule. Since this PR has been here for a long time, how about finishing here with the reduced scope?
Test build #107788 has finished for PR 24637 at commit

@dongjoon-hyun I added more generators. I think all existing generators should now be covered by the test.

Test build #107827 has finished for PR 24637 at commit

retest this please

Test build #107834 has finished for PR 24637 at commit

Thank you for adding more. Then, please take a white-list approach for those four expressions. cc @cloud-fan and @gatorsmile

I'm fine with adding a white-list, though I think this approach is not generator-specific. It is more conservative and safer, anyway.
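For reference, a sketch of what such a generator white-list could look like; the predicate name and the exact set of generators here are assumptions and may not match the merged code.

import org.apache.spark.sql.catalyst.expressions.{Explode, Generator, Inline, PosExplode}

object GeneratorWhitelist {
  // Hypothetical predicate: only generators that simply forward elements of
  // their input collection are considered safe for nested-field pruning.
  def canPruneGenerator(g: Generator): Boolean = g match {
    case _: Explode    => true
    case _: PosExplode => true
    case _: Inline     => true
    case _             => false
  }
}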
Test build #107870 has finished for PR 24637 at commit
dongjoon-hyun left a comment:
+1, LGTM.
Thank you so much for working on this.
Merged to master.
We hit an exception caused by this rule. The plan becomes invalid after optimization. We generate

@viirya Can you help take a look? Thanks!

@cloud-fan Yeah, I will look at it tomorrow. Do you have a test case? If not, I may try to reproduce it.

It's a very long query and I'm trying to minimize it. Let's see if there is some clue in the query plan.

Thank you for reporting, @cloud-fan !

Could you file a JIRA for that, @cloud-fan ?

Is

I re-checked the current rule. I still cannot find a clue from it or from the query plan above. At first glance, I suspected there might be nested column accesses in the top Project but not in the Generate; I tested that and re-checked the rule, and it looks fine. @cloud-fan Do you have more clues?
What changes were proposed in this pull request?
A performance issue was found when using explode: a complex field containing a huge array gets duplicated as many times as the number of exploded array elements. Given the example:
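For illustration only, a query of this shape, with hypothetical column names (st is a struct carrying a large array; values is the array being exploded), not the original benchmark code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().master("local[*]").appName("explode-demo").getOrCreate()
import spark.implicits._

// `st` is a wide struct (its second field is a large array); `values` is exploded.
val df = Seq((("key", Seq.fill(1000)(0L)), Seq(1, 2, 3))).toDF("st", "values")

// Only st._1 is needed on top of the Generate, yet without nested-field
// pruning the whole `st` struct is duplicated once per exploded element.
df.select($"st._1", explode($"values").as("v")).explain(true)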
The explode causes st to be duplicated as many times as there are exploded elements. Benchmarking it:
The query plan:
This patch takes the nested column pruning approach to prune unnecessary nested fields. It adds a projection of the needed nested fields as aliases on the child of Generate, and substitutes them with alias attributes in the projection on top of Generate. Benchmarking it after the change:
The query plan:
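Roughly, the shape of the rewrite is as follows (a simplified, hypothetical plan sketch; attribute names are made up and the actual optimizer output differs in detail):

Before:
  Project [st.f1, v]
  +- Generate explode(values), ..., [v]
     +- Relation [st, values]

After:
  Project [_extract_f1 AS f1, v]
  +- Generate explode(values), ..., [v]
     +- Project [st.f1 AS _extract_f1, values]
        +- Relation [st, values]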
This behavior is controlled by the SQL config spark.sql.optimizer.expression.nestedPruning.enabled.
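For example, the pruning could be toggled per session like this (a usage sketch; it assumes a SparkSession named spark, as in spark-shell):

// Disable the Generate nested-field pruning for the current session.
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", "false")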
How was this patch tested?
Added benchmark.