
Proposal: Change Accumulator trait to accept RecordBatch / num_rows to allow faster Count #8067

Open
Dandandan opened this issue Nov 6, 2023 · 4 comments
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Contributor

Dandandan commented Nov 6, 2023

Is your feature request related to a problem or challenge?

Currently, the CountAccumulator implementation requires values: &[ArrayRef] to be passed.

To eliminate scanning the (first) column, we need to be able to accept a RecordBatch or num_rows instead of values: &[ArrayRef].

Describe the solution you'd like

Rather than changing every method to accept a RecordBatch (and needing to update the code), I propose adding two new methods:

update_record_batch(&mut self, recordbatch: &RecordBatch)
retract_record_batch(&mut self, recordbatch: &RecordBatch)

The default implementations of these methods can delegate to update_batch and retract_batch (i.e., assuming the batch has at least one column).

In the aggregation code, we call update_record_batch/retract_record_batch instead.
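The proposal above can be sketched as follows. This is a hypothetical, simplified illustration, not the real DataFusion/arrow types: `RecordBatch`, the `Accumulator` trait, and `CountAccumulator` here are stand-ins, and `retract_record_batch` would mirror `update_record_batch` in the same way.

```rust
// Stand-in for arrow's RecordBatch, for illustration only.
struct RecordBatch {
    columns: Vec<Vec<i64>>,
    num_rows: usize,
}

trait Accumulator {
    fn update_batch(&mut self, values: &[Vec<i64>]);

    // New method with a default implementation that delegates to
    // update_batch (assumes the batch has at least one column), so
    // existing accumulators need no changes.
    fn update_record_batch(&mut self, batch: &RecordBatch) {
        self.update_batch(&batch.columns);
    }
}

struct CountAccumulator {
    count: usize,
}

impl Accumulator for CountAccumulator {
    fn update_batch(&mut self, values: &[Vec<i64>]) {
        // Old path: touches the first column just to learn its length.
        self.count += values[0].len();
    }

    // Override: reads only the row count, no column access at all.
    fn update_record_batch(&mut self, batch: &RecordBatch) {
        self.count += batch.num_rows;
    }
}

fn main() {
    let batch = RecordBatch { columns: vec![vec![1, 2, 3]], num_rows: 3 };
    let mut acc = CountAccumulator { count: 0 };
    acc.update_record_batch(&batch);
    println!("{}", acc.count); // 3
}
```

Because the new methods have defaults, the change is additive: only accumulators that can exploit the row count (like Count) need to override them.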

Describe alternatives you've considered

No response

Additional context

No response

@Dandandan Dandandan added enhancement New feature or request datafusion Changes in the datafusion crate api change Changes the API exposed to users of the crate performance Make DataFusion faster labels Nov 6, 2023
@Dandandan Dandandan changed the title Proposal: Change Accumulator trait to accept RecordBatch / num_rows Proposal: Change Accumulator trait to accept RecordBatch / num_rows to allow faster Count Nov 6, 2023
@2010YOUY01
Contributor

Is this because the counting operation could be done during scanning?

It looks like a case of aggregate pushdown. For min()/max()/count() aggregate functions on Parquet, it's possible to compute the result over the whole column using only metadata, without a full scan.

To do that, I think update_record_batch() is needed; it might also help to allow RecordBatch to carry more flexible payloads.
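The aggregate-pushdown idea above can be sketched with a toy stand-in for Parquet row-group statistics (the `RowGroupStats` struct and functions here are hypothetical; real Parquet statistics can be absent or affected by nulls, which this sketch ignores):

```rust
// Hypothetical mirror of per-row-group Parquet metadata.
struct RowGroupStats {
    num_rows: i64,
    min: i64,
    max: i64,
}

// count(*) answered purely from metadata: sum of row counts.
fn pushed_down_count(groups: &[RowGroupStats]) -> i64 {
    groups.iter().map(|g| g.num_rows).sum()
}

// min()/max() answered from per-group statistics, no column scan.
fn pushed_down_min(groups: &[RowGroupStats]) -> Option<i64> {
    groups.iter().map(|g| g.min).min()
}

fn pushed_down_max(groups: &[RowGroupStats]) -> Option<i64> {
    groups.iter().map(|g| g.max).max()
}

fn main() {
    let groups = [
        RowGroupStats { num_rows: 100, min: 1, max: 9 },
        RowGroupStats { num_rows: 50, min: -3, max: 7 },
    ];
    println!("{}", pushed_down_count(&groups)); // 150
    println!("{:?}", pushed_down_min(&groups)); // Some(-3)
    println!("{:?}", pushed_down_max(&groups)); // Some(9)
}
```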

@alamb
Contributor

alamb commented Nov 7, 2023

Using a RecordBatch rather than &[ArrayRef] makes sense to me.

If we are going to change the API anyway, I recommend considering changing the signature to ColumnarValue so it can handle either a RecordBatch or a ScalarValue.
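The appeal of a ColumnarValue-style signature can be sketched with a simplified stand-in (DataFusion's real ColumnarValue wraps ArrayRef/ScalarValue; the `i64`-only enum and helper below are illustrative assumptions): a scalar input can report how many rows it logically covers without ever being materialized into an array.

```rust
// Simplified stand-in for DataFusion's ColumnarValue: either a full
// column of values, or a single scalar logically repeated across rows.
enum ColumnarValue {
    Array(Vec<i64>),
    Scalar(i64),
}

// For counting, a scalar contributes the batch's row count directly,
// with no array materialization.
fn num_contributing_rows(value: &ColumnarValue, batch_num_rows: usize) -> usize {
    match value {
        ColumnarValue::Array(a) => a.len(),
        ColumnarValue::Scalar(_) => batch_num_rows,
    }
}

fn main() {
    let col = ColumnarValue::Array(vec![1, 2]);
    let lit = ColumnarValue::Scalar(1);
    println!("{}", num_contributing_rows(&col, 8)); // 2
    println!("{}", num_contributing_rows(&lit, 8)); // 8
}
```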

@alamb
Contributor

alamb commented Nov 7, 2023

The other thing maybe we can think about while messing with the Accumulator trait is how we might expose GroupsAccumulator as well 🤔

@Dandandan
Contributor Author

Dandandan commented Nov 8, 2023

I looked a bit more into this; it seems we currently mostly get away with converting the scalar 1 in the count expression (count(Int64(1))) to an array with to_array_of_size.
This is a bit wasteful, but also not extremely bad (as long as the size is not enormous).
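The waste being described can be illustrated with a toy version of the conversion (the function below mimics the idea of arrow's ScalarValue::to_array_of_size, but is a hypothetical simplification): an array of `num_rows` copies of the literal is allocated only so the count path can take its length.

```rust
// Illustration only: materialize a scalar into a full-length Vec, the
// way count(Int64(1)) currently gets an array to count over.
fn to_array_of_size(scalar: i64, size: usize) -> Vec<i64> {
    vec![scalar; size] // allocates `size` identical values
}

fn main() {
    let num_rows = 1024;
    let arr = to_array_of_size(1, num_rows); // wasted allocation...
    let count = arr.len();                   // ...all we needed was num_rows
    println!("{count}"); // 1024
}
```

With an update_record_batch-style API, the allocation disappears entirely: the accumulator would read num_rows directly.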
