-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support no distinct aggregate sum/min/max in single_distinct_to_group_by
rule
#8266
Conversation
Thank you @haohuaijin -- I plan to review this PR tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @haohuaijin -- this looks really neat.
I had some small improvement suggestions that are not required as well as a a few additional tests to add
I am also running a full ClickBench benchmark run to see how this branch compares. I'll post the results to this PR
Please let me know if you are willing to make changes to this PR or if I should merge it in as is.
Thanks again, this is really neat
for e in args { | ||
fields_set.insert(e.canonical_name()); | ||
} | ||
} else if !matches!(fun, Sum | Min | Max) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a comment here explaining why this is checking for only Sum
, Min
and Max
I think it is because these functions have the property that they can be used to combine themselves -- so like SUM(SUM(x))
is the same as SUM(x)
I think this transformation could be made more general by using a different combination. For example we could support COUNT
by using SUM(COUNT(x))
SELECT a, COUNT(DINSTINCT b), COUNT(c)
FROM t
GROUP BY a
SELECT a, COUNT(alias1), SUM(alias2) -- <-- This is SUM, not COUNT
FROM (
SELECT a, b as alias1, COUNT(c) as alias2
FROM t
GROUP BY a, b
)
GROUP BY a
We can do this as a follow on PR ( I can file a ticket)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also do crazier stuff for AVG
like
SELECT a, COUNT(DINSTINCT b), AVG(c)
FROM t
GROUP BY a
SELECT a, COUNT(alias1), SUM(alias2) / SUM(alias3) -- <-- This is combining partial sum / counts to compute AVG
FROM (
SELECT a, b as alias1, SUM(c) as alias2, COUNT(c) as alias3,
FROM t
GROUP BY a, b
)
GROUP BY a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rewrite Count
requires us to implement Sum0
. you can refer to the discussion in this link for details.
vec![ | ||
sum(col("c")), | ||
count_distinct(col("b")), | ||
Expr::AggregateFunction(expr::AggregateFunction::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this could use https://docs.rs/datafusion/latest/datafusion/logical_expr/expr_fn/fn.max.html ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because we only have count_distinct
interface. Should we also add sum_distinct
, max_distinct
(maybe no sense, because the result of max_distinct
is equal to max
) and more?
Thanks for review @alamb , those suggestions are very helpful. I will apply those suggestions tonight or tomorrow. |
Thanks again @haohuaijin |
Which issue does this PR close?
Closes #8123
Rationale for this change
Generate test data
Compare
in this pr(release mode)
in main 393e48f (release mode)
What changes are included in this PR?
add no-distinct sum/min/max aggregate support in single_distinct_to_group_by rule
Are these changes tested?
yes, add some tests
Are there any user-facing changes?