-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make common expression alias human-readable #10333
Conversation
I don't think removing the accumulation and a couple of clones will increase perf, but let's see /benchmark |
Benchmark resultsBenchmarks comparing b412dba (main) and dc5cf09 (PR)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @MohamedAbdeen21 I went through this code carefully and it seems really nice to me -- not only is it more efficient (less copying of Identifier
) the code is also simpler.
I think it is good to go from my perspective given all the existing tests pass 🚀
cc @wiedld as I think you have worked on this code recently and @waynexia and @peter-toth who I think is working to improve this code in terms of copies
Thanks for the review @alamb. I was trying to implement it duckdb style with The biggest problem I found was that the rule is applied twice, meaning that we need a shared/global mapping of expr -> id between rule calls. Also, the ordering of the created projections should be incremental ( |
can we please wait with merging this PR until we sort out the issues of the rule in #10396 |
Per the discussion on #10396, we merged that one first and now this one needs to be rebased / resolved. Marking it as draft until we can do that I believe @MohamedAbdeen21 plans to do it this weekend (per #10396 (comment)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense (again) to me -- thank you @MohamedAbdeen21 and @peter-toth
let mut desc = String::new(); | ||
|
||
while let Some(item) = self.visit_stack.pop() { | ||
fn pop_enter_mark(&mut self) -> Option<usize> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shoudn't change this part.
The logic that builds up an identifier using visit_stack
/ 3 kinds of VisitRecord
is neccessary and actually a very clever and way to build up an identifier from the current node and sub-identifiers. (An identifier to be a String
was not that a clever decision and will be fixed in #10426, but that's a different issue).
This PR shouldn't change what an identifier is / how it is built up otherwise we end up with identifier colliding bugs again. The IdArray
, ExprStats
and CommonExprs
datastructures require an identifier to represent a full expression subtreee. This means that:
fn expr_identifier(expr: &Expr) -> Identifier {
format!("#{{{expr}}}")
}
would cause bugs as shown in 1. of #10396.
I.e. if we encountered both col("a") + col("b")
and col("a + b")
in the expression list to be CSEd and we used "{expr}"
(the non-unique stringified representation) as identifiers then the equal identifier ("a + b"
) of those 2 different expressions would collide and we counted 2 for the occurance of one of the 2 expressions (and the other expression's count would be lost) resulting wrong CSE.
Please note that currently the identifier of col("a") + col("b")
is "{a + b|b|a}"
so it doesn't collide with col("a + b")
's identifier: "{a + b}"
.
Again, this is hard to test now because of the resolution bug: #10413.
I.e. if we write a test where we have
select a + b, "a + b" from (
select 1 as a, 2 as b, 1 as "a + b"
)
then currently it gets resolved as
select "a + b", "a + b" from (
select 1 as a, 2 as b, 1 as "a + b"
)
and this prevents us to create a test case for CSE identifier collision.
(Please note that I'm simplifying the identifier collision exmple as simple columns (col("a + b")
) are not subject to CSE.)
What this PR can do is to change the aliases (use something else than identifiers) to make the plans more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peter-toth -- thank you -- I see now the danger of the lack of test coverage in this area.
I am still a little unclear on if the bug you described above is something that can actually be hit in practice (due to lack of a test case), or if it will be masked by #10413
What shall we do now? I can see three possibilities:
- Revert this PR and try and make a follow on one that doesn't change it
- Create a PR to revert just ID change
- You can fix the ID change in your fix for Make
CommonSubexprEliminate
faster by stop copying so many strings #10426
Please let me know your thoughts. I can potentially help with 1 or 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still a little unclear on if the bug you described above is something that can actually be hit in practice (due to lack of a test case), or if it will be masked by #10413
That's a good question. I think once #10413 gets resolved we can hit the identifier collision issue due to this PR. But actually, let's try to add an identifier collision test case after #10413.
I would suggest 1., revert the commit and a new version of the PR where only the aliases are made simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert PR is merged: #10436
And I am pretty stoked that @jonahgao is looking at #10413 (comment)
@@ -885,15 +876,15 @@ mod test { | |||
)?; | |||
|
|||
let expected = vec![ | |||
(8, "{(SUM(a + Int32(1)) - AVG(c)) * Int32(2)|{Int32(2)}|{SUM(a + Int32(1)) - AVG(c)|{AVG(c)|{c}}|{SUM(a + Int32(1))|{a + Int32(1)|{Int32(1)}|{a}}}}}"), | |||
(6, "{SUM(a + Int32(1)) - AVG(c)|{AVG(c)|{c}}|{SUM(a + Int32(1))|{a + Int32(1)|{Int32(1)}|{a}}}}"), | |||
(8, "#{(SUM(a + Int32(1)) - AVG(c)) * Int32(2)}"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IdArray
shouldn't change in this PR. Actually, new fields can be added to the tuples but the identifier field should remain as it is.
At this point I think we'd better revert this, close it and revisit it after #10413. @alamb @peter-toth |
Sounds like the consensus is to revert this PR -- could you possible make a revert PR @MohamedAbdeen21 ? |
This reverts commit 2a15614.
This reverts commit 2a15614.
Revert PR: #10436 |
* use only current expr symbol + remove visit stack * Fix tpch tests/benchmarks * update docs * removing some clones * fix clippy * update docs * rebase * restore docs * fix clippy
…10436) * Revert "make common expression alias human-readable (apache#10333)" This reverts commit 2a15614. * keep some ok parts
Which issue does this PR close?
Closes #10280.
Thanks @JasonLi-cn for pointing out the relevant code snippet and the detailed description, saved me some time ❤️
Rationale for this change
Making the plans human-readable when common subexpression elimination is applied (check example in ticket and edited tests in this PR).
What changes are included in this PR?
Changing
alias_symbol
fromcurr_expr_identifier + sub_expr_identifier
to#{curr_expr_identifier}
, making it more readable.Also, since we no longer need the sub_expr for the alias, we can remove the entire visit stack + visit record enum.
Are these changes tested?
Changed existing tests to reflect the change.
Are there any user-facing changes?
Readable plans when CSE is applied