-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-45760][SQL] Add With expression to avoid duplicating expressions #43623
Conversation
} | ||
|
||
var exprsToAdd = commonExprs.toSeq | ||
val newChildren = p.children.map { child => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be?
val newChildren = p.children.map { child => | |
val newChildren = newPlan.children.map { child => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the same, the new plan was produced by transformExpressionsDown
so the children won't change.
} | ||
|
||
if (exprsToAdd.nonEmpty) { | ||
// If we cannot rewrite the common expressions, force to inline them so that the query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we cannot rewrite them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. It is like an explicit/manual CSE? I wonder if we want to expose this as a expression function?
@viirya yes we can add SQL syntax in the future, following https://github.com/google/zetasql/blob/a745bef47b315bb11fecab4eeefa2bcc41be5951/docs/operators.md?plain=1#L2865 |
Why not add an optimizer rule to find the common expressions and insert a |
We need this |
BTW, I feel it's useful to have a way to do explicit/manual CSE, instead of relying on optimizer features or codegen features. |
object Helper extends RuleExecutor[LogicalPlan] { | ||
val batches = | ||
Batch("Finish Analysis", Once, ReplaceExpressions) :: | ||
Batch("Rewrite With expression", FixedPoint(10), RewriteWithExpression) :: Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix the Scala style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Seems there are still test failures?
Looks good to me, but I think we can make the new rule idempotent with a small refactor: cloud-fan#19 |
What's the longer term goal with this, especially in the context of all the attempts to add CSE to conditional expressions? Is the idea to make a Just out of curiosity, since this is basically manual CSE at the optimizer stage, would it make sense to do the existing fully recursive, automatic CSE at the optimizer stage instead of the physical stage to achieve a similar affect? |
I think the final state should be implementing CSE at the logical plan level, so that it works for both codegen backend and native (vectorized) backend. But we still have gaps now. The |
* make RewriteWithExpression idempotent * restore def index usage in alias, minor change to shorten code
The doc generation issue is unrelated to my PR
I think we need to upgrade pandas version on GA machines. cc @HyukjinKwon @LuciferYang |
I'm merging it to master, thanks for the reviews! |
Already upgrade: #43689 :) |
### What changes were proposed in this pull request? This is a followup of #43623 to fix a regression. For `With` inside conditional branches, they may not be evaluated at all and we should not pull out the common expressions into a `Project`, but just inline. ### Why are the changes needed? avoid perf regression ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes #43978 from cloud-fan/with. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This is a followup of apache/spark#43623 to fix a regression. For `With` inside conditional branches, they may not be evaluated at all and we should not pull out the common expressions into a `Project`, but just inline. ### Why are the changes needed? avoid perf regression ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? No Closes #43978 from cloud-fan/with. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Sometimes we need to duplicate expressions when rewriting the plan. It's OK for small query, as codegen has common-subexpression-elimination (CSE) to avoid evaluating the same expression. However, when the query is big, duplicating expressions can lead to a very big expression tree and make catalyst rules very slow, or even OOM when updating a leaf node (need to copy all tree nodes).
This PR introduces a new expression to do expression-level CTE: it adds a Project to pre-evaluate the common expressions, so that they appear only once on the query plan tree, and are evaluated only once.
NullIf
now uses this new expression to avoid duplicating theleft
child expression.Why are the changes needed?
make catalyst more efficient.
Does this PR introduce any user-facing change?
No
How was this patch tested?
new test suite
Was this patch authored or co-authored using generative AI tooling?
No