Contributor

@UBarney UBarney commented Apr 2, 2025

Which issue does this PR close?

Rationale for this change

Significantly improves `last_value` performance when group cardinality is high: roughly 5x on the h2o query in the benchmark below.

| benchmark sql | main | this PR |
| --- | --- | --- |
| `select id2, id4, last_value(v1 order by id2, id4) as r2 from '~/h2o_100m.parquet' group by id2, id4;` | 36.546s | 7.276s |
| `select l_shipmode, last_value(l_partkey order by l_orderkey, l_linenumber, l_comment, l_suppkey, l_tax) from 'benchmarks/data/tpch_sf10/lineitem' group by l_shipmode;` | 0.962s | 0.801s |

What changes are included in this PR?

  • Add a field `pick_first_in_group: bool` to `PrimitiveGroupsAccumulator`. If true, take the first element in an aggregation group according to the requested ordering; otherwise take the last element.
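To illustrate the idea behind the flag, here is a minimal sketch of a per-group accumulator that keeps either the first or the last value under a requested ordering. It is not the code from this PR: `FirstLastSketch`, its single `i64` ordering key, and the tie-breaking rule are all assumptions made for illustration, and the real accumulator works on Arrow arrays and multi-column orderings.

```rust
/// Hypothetical, simplified stand-in for the accumulator described above.
/// Values and the ordering key are plain `i64`s here (an assumption).
struct FirstLastSketch {
    /// If true, keep the first element per group under the ordering; otherwise the last.
    pick_first_in_group: bool,
    /// Per-group state: (ordering key, value) of the current winner, if any row was seen.
    state: Vec<Option<(i64, i64)>>,
}

impl FirstLastSketch {
    fn new(pick_first_in_group: bool) -> Self {
        Self { pick_first_in_group, state: Vec::new() }
    }

    /// Fold one batch into the per-group state. Row `i` belongs to group `group_indices[i]`.
    fn update_batch(
        &mut self,
        group_indices: &[usize],
        order_keys: &[i64],
        values: &[i64],
        total_num_groups: usize,
    ) {
        self.state.resize(total_num_groups, None);
        for ((&g, &key), &val) in group_indices.iter().zip(order_keys).zip(values) {
            let replace = match self.state[g] {
                // First row seen for this group always becomes the winner.
                None => true,
                Some((best, _)) => {
                    if self.pick_first_in_group {
                        key < best // strictly smaller key wins; on ties the earlier row stays
                    } else {
                        key >= best // larger or equal key wins; on ties the later row wins
                    }
                }
            };
            if replace {
                self.state[g] = Some((key, val));
            }
        }
    }

    /// One output per group; `None` for groups that never received a row.
    fn evaluate(&self) -> Vec<Option<i64>> {
        self.state.iter().map(|s| s.map(|(_, v)| v)).collect()
    }
}
```

Keeping one small state slot per group in a flat vector, instead of one boxed accumulator per group, is the kind of layout that tends to pay off when the number of groups is large. The tie-breaking choice in this sketch is arbitrary and may differ from what the PR does.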

Additional context

#15266

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Apr 2, 2025
@UBarney UBarney marked this pull request as ready for review April 3, 2025 01:34
Contributor

@comphead comphead left a comment

Thanks @UBarney, I think this is the right direction. I would suggest, though, splitting the PR into smaller ones. The fix itself is important, but there are a bunch of renames, code moves, etc. It would be nice to start with a PR that contains just the fix and a description of the performance benefits.

}

struct FirstPrimitiveGroupsAccumulator<T>
fn create_group_acc(
Contributor

Suggested change
- fn create_group_acc(
+ fn create_group_accumulator(

fn create_group_acc(
    args: AccumulatorArgs,
    pick_first_in_group: bool,
) -> std::result::Result<Box<dyn GroupsAccumulator>, DataFusionError> {
Contributor

We can use `DFResult` instead of `Result` with `DataFusionError`.
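As a rough illustration of this suggestion, the sketch below shows how a result alias shortens a signature. It assumes `DFResult` is an alias of the form `Result<T, DataFusionError>`; `SketchError` is a hypothetical stand-in for `DataFusionError` so that the example compiles without the DataFusion crates, and `parse_flag` is an invented function.

```rust
/// Hypothetical stand-in for `DataFusionError` (assumption, not the real type).
#[derive(Debug, PartialEq)]
struct SketchError(String);

/// Alias in the spirit of `DFResult`: the error type is fixed once, here.
type DFResult<T> = std::result::Result<T, SketchError>;

/// Before: the error type is spelled out in the signature.
fn parse_flag_verbose(s: &str) -> std::result::Result<bool, SketchError> {
    s.parse::<bool>().map_err(|e| SketchError(e.to_string()))
}

/// After: same behavior, shorter signature via the alias.
fn parse_flag(s: &str) -> DFResult<bool> {
    parse_flag_verbose(s)
}
```

Both functions have the same type; the alias only changes how the signature reads.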

Contributor Author

this function has been deleted

@UBarney UBarney marked this pull request as draft April 5, 2025 14:14
@UBarney UBarney marked this pull request as ready for review April 6, 2025 03:56
Contributor Author

UBarney commented Apr 6, 2025

@comphead Thanks for reviewing. I have split this PR, so it now contains only the performance improvements. After this PR is merged, I will open a refactoring PR to handle the renames and code moves.

@alamb alamb added the performance Make DataFusion faster label Apr 7, 2025
@alamb alamb mentioned this pull request Apr 7, 2025
12 tasks
}
}

// TODO: rename to PrimitiveGroupsAccumulator
Contributor

@jayzhan211 jayzhan211 Apr 10, 2025

I think it's better to include this kind of improvement in the same PR to avoid confusion. Only changes that are highly independent should be considered for splitting into smaller PRs

Contributor

@comphead comphead left a comment

Thanks @UBarney and @jayzhan211

@comphead comphead merged commit b22e4d2 into apache:main Apr 10, 2025
27 checks passed
03)----StreamingTableExec: partition_sizes=1, projection=[a, b, c], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST, c@2 ASC NULLS LAST]

query III
SELECT a, b, LAST_VALUE(c) as last_c
Member

The behavior of this query changed, and the query was updated to produce the original results. This seems like it was unintended. I filed #15676

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…cumulator` (apache#15542)

* Improve performance of `last_value` by implementing special `GroupsAccumulator`

* less diff

Labels

core Core DataFusion crate functions Changes to functions implementation performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Support fast group accumulator for first and last

5 participants