Improve performance of last_value by implementing special GroupsAccumulator
#15542
Thanks @UBarney, I think this is the right direction. I would suggest, though, splitting the PR into smaller ones. The fix itself is important, but there are a bunch of renames, code moves, etc. It would be nice to start with a PR containing just the fix and a description of the performance benefits.
}

struct FirstPrimitiveGroupsAccumulator<T>
fn create_group_acc(
Suggested change: rename `fn create_group_acc(` to `fn create_group_accumulator(`.
fn create_group_acc(
    args: AccumulatorArgs,
    pick_first_in_group: bool,
) -> std::result::Result<Box<dyn GroupsAccumulator>, DataFusionError> {
We can use `DFResult` here instead of `Result` with `DataFusionError`.
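To illustrate the suggestion, here is a minimal standalone sketch. It assumes `DFResult<T>` is simply an alias for `Result<T, DataFusionError>`. The local `DataFusionError` enum and the `pick_first_from_name` helper are hypothetical stand-ins so the snippet compiles without the DataFusion crates; they are not the project's actual definitions.

```rust
/// Hypothetical stand-in for DataFusion's error type, defined locally
/// so this sketch compiles on its own.
#[derive(Debug, PartialEq)]
enum DataFusionError {
    NotImplemented(String),
}

/// Assumed shape of the alias: the error type is fixed once here,
/// so signatures no longer need to repeat it.
type DFResult<T> = std::result::Result<T, DataFusionError>;

/// Hypothetical helper: maps an aggregate name to the
/// `pick_first_in_group` flag. The return type reads as
/// `DFResult<bool>` instead of
/// `std::result::Result<bool, DataFusionError>`.
fn pick_first_from_name(name: &str) -> DFResult<bool> {
    match name {
        "first_value" => Ok(true),
        "last_value" => Ok(false),
        other => Err(DataFusionError::NotImplemented(format!(
            "no first/last accumulator for {other}"
        ))),
    }
}
```

The alias changes only how the signature is spelled; the underlying type is the same `Result`.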
this function has been deleted
@comphead Thanks for reviewing. I have split this PR, and it now contains only the performance improvements. After it is merged, I will open a refactoring PR to handle the renames and code moves.
}
}

// TODO: rename to PrimitiveGroupsAccumulator
I think it's better to include this kind of improvement in the same PR to avoid confusion. Only changes that are highly independent should be considered for splitting into smaller PRs.
Thanks @UBarney and @jayzhan211
03)----StreamingTableExec: partition_sizes=1, projection=[a, b, c], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST, c@2 ASC NULLS LAST]

query III
SELECT a, b, LAST_VALUE(c) as last_c
The behavior of this query changed, and the query was updated to produce the original results. This seems like it was unintended. I filed #15676
…cumulator` (apache#15542) * Improve performance of `last_value` by implementing special `GroupsAccumulator` * less diff
Which issue does this PR close?
`first` and `last` #13998.
Rationale for this change
Achieved significant performance improvement when cardinality is high.
select id2, id4, last_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4;
select l_shipmode, last_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;
What changes are included in this PR?
Add `pick_first_in_group: bool` to `PrimitiveGroupsAccumulator`. If true, take the first element in an aggregation group according to the requested ordering; otherwise, take the last element.
Additional context
#15266
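For readers unfamiliar with the idea, the effect of a `pick_first_in_group` flag can be sketched in a few lines of standalone Rust. This is a simplified illustration and not the code in this PR: the `FirstLastGroups` type, its methods, the single `i64` ordering key, and the tie-breaking rule are all assumptions made for the example, and it does not use the real `GroupsAccumulator` trait or Arrow arrays.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified per-group first/last accumulator.
/// One slot per group instead of one accumulator object per group.
struct FirstLastGroups {
    /// true: keep the row with the smallest ordering key (first value);
    /// false: keep the row with the largest ordering key (last value).
    pick_first_in_group: bool,
    /// Per group: (ordering key, value) of the current winner,
    /// or None if the group has not seen any row yet.
    state: Vec<Option<(i64, i64)>>,
}

impl FirstLastGroups {
    fn new(pick_first_in_group: bool) -> Self {
        Self { pick_first_in_group, state: Vec::new() }
    }

    /// Process one batch; `group_indices[i]` is the group of row `i`.
    fn update_batch(
        &mut self,
        group_indices: &[usize],
        order_keys: &[i64],
        values: &[i64],
        total_num_groups: usize,
    ) {
        self.state.resize(total_num_groups, None);
        for ((&g, &key), &val) in group_indices.iter().zip(order_keys).zip(values) {
            let replace = match self.state[g] {
                None => true,
                Some((current, _)) => match key.cmp(&current) {
                    Ordering::Less => self.pick_first_in_group,
                    Ordering::Greater => !self.pick_first_in_group,
                    // Simplification: on equal keys, keep the existing row.
                    Ordering::Equal => false,
                },
            };
            if replace {
                self.state[g] = Some((key, val));
            }
        }
    }

    /// One output value per group, None for groups without rows.
    fn evaluate(&self) -> Vec<Option<i64>> {
        self.state.iter().map(|s| s.map(|(_, v)| v)).collect()
    }
}
```

With the flag set to `false`, each group keeps the value whose ordering key is largest; with `true`, the smallest. Keeping all groups in one flat vector is the general reason a groups-level accumulator tends to help when there are many groups, since no per-group accumulator object has to be allocated or dispatched to.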
Are these changes tested?
Yes
Are there any user-facing changes?
No