Introduce `Partitioned` aggregation mode #6892
Comments
cc @mingmwang who I think was considering something similar recently
Working on this one
Probably #6937 makes more sense to do than doing this statically, closing this issue
@Dandandan @alamb I'm not sure for TPC-H queries whether this will help or not, but for some TPC-DS queries, this is quite useful.
I think at least q17 has a high cardinality grouping that it might help for (in the subquery), for what it is worth
Yes that is the idea
I have higher hopes for #6937, which I think might be similar to the "adaptive partial aggregation" of Snowflake / Teradata?
Based on the experiment, I think the only way to do this reliably enough during planning is to have cardinality statistics on the group-by columns, so the cardinality of the aggregation can be estimated.
I think it is for this reason that most people argue for doing the adaptation at runtime (not at plan time): the statistics always have edge cases (like data skew) that can result in poor plans without any degree of predictability.
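To illustrate what a plan-time decision based on cardinality statistics could look like, here is a minimal Python sketch. It is not DataFusion code: the function names, the independence assumption between group-by columns, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
def estimate_group_count(row_count, distinct_counts):
    """Estimate the number of groups from per-column distinct counts.

    Assumes the group-by columns are independent, so the estimate is the
    product of the distinct counts, capped at the number of input rows.
    """
    estimate = 1
    for distinct in distinct_counts:
        estimate *= distinct
        if estimate >= row_count:
            return row_count
    return estimate


def choose_mode(row_count, distinct_counts, threshold=0.5):
    """Pick an aggregation strategy (hypothetical heuristic).

    If most rows are expected to form their own group, a Partial step
    reduces little, so aggregate once on hash-partitioned input instead.
    """
    groups = estimate_group_count(row_count, distinct_counts)
    if row_count > 0 and groups / row_count >= threshold:
        return "Partitioned"
    return "Partial+FinalPartitioned"


# High-cardinality key: pre-aggregation is unlikely to pay off.
print(choose_mode(1_000_000, [900_000]))
# Low-cardinality keys: pre-aggregation shrinks the data before the shuffle.
print(choose_mode(1_000_000, [10, 20]))
```

An estimate like this only looks at distinct counts, so skewed data or correlated columns can push it to the wrong choice, which is the argument made above for adapting at runtime.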
Is your feature request related to a problem or challenge?
Currently, we often use two modes in aggregations: `Partial` + `FinalPartitioned`. `FinalPartitioned` requires the input to be hash-repartitioned, so a `RepartitionExec` is added in between.
In certain cases, like when only aggregating on a single column, it is faster to skip the `Partial` aggregation and directly perform the aggregation on hash-partitioned input, as doing `Partial` + `RepartitionExec` + `FinalPartitioned` will be more work than doing the aggregation in one step (`RepartitionExec` + `Partitioned`).
Reasoning: `RepartitionExec` (hash) is itself faster than `Partial` (although it doesn't reduce the output) and is necessary for `FinalPartitioned`.
Describe the solution you'd like
A `Partitioned` aggregation mode that requires input to be partitioned on the group-by keys (e.g. `SELECT COUNT(DISTINCT a) FROM T`).
Describe alternatives you've considered
No response
Additional context
No response
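The two plans compared in the description can be sketched with a small Python simulation of a `COUNT(*) ... GROUP BY key` query. This is only a toy model under stated assumptions: partitions are plain lists of integer keys, and the helper names (`repartition`, `partial_then_final`, `single_partitioned`) are hypothetical, not DataFusion APIs. Each function also returns how many rows pass through the hash repartitioning step.

```python
from collections import Counter


def repartition(partitions, n_out, key_of):
    """Hash-repartition rows so equal keys land in the same output partition."""
    out = [[] for _ in range(n_out)]
    for part in partitions:
        for row in part:
            out[hash(key_of(row)) % n_out].append(row)
    return out


def partial_then_final(partitions, n_out):
    """Partial + repartition + FinalPartitioned."""
    # Partial: pre-aggregate each input partition independently.
    partial = [list(Counter(part).items()) for part in partitions]
    # Repartition the (key, count) partial states on the key.
    shuffled = repartition(partial, n_out, key_of=lambda kv: kv[0])
    # FinalPartitioned: merge the partial counts within each partition.
    result = Counter()
    for part in shuffled:
        for key, count in part:
            result[key] += count
    return dict(result), sum(len(p) for p in shuffled)


def single_partitioned(partitions, n_out):
    """Repartition + Partitioned: aggregate once on hash-partitioned input."""
    shuffled = repartition(partitions, n_out, key_of=lambda key: key)
    result = {}
    for part in shuffled:
        # Keys are disjoint across partitions, so a plain update is enough.
        result.update(Counter(part))
    return result, sum(len(p) for p in shuffled)


# High cardinality: every key is distinct, so Partial reduces nothing.
high = [list(range(0, 1000)), list(range(1000, 2000))]
# Low cardinality: four keys, so Partial shrinks each partition to four rows.
low = [[i % 4 for i in range(1000)] for _ in range(2)]

for name, data in (("high", high), ("low", low)):
    two_step, rows_two_step = partial_then_final(data, n_out=4)
    one_step, rows_one_step = single_partitioned(data, n_out=4)
    print(name, two_step == one_step, rows_two_step, rows_one_step)
```

In this model both plans return the same counts. With all-distinct keys the `Partial` step hashes every row without shrinking what gets repartitioned, while with only four keys it cuts the repartitioned rows from 2000 to 8, which is why the benefit depends on the grouping cardinality.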