
Proposal: Change Accumulator trait to accept RecordBatch / num_rows to allow faster Count #8067

Open
Dandandan opened this issue Nov 6, 2023 · 4 comments
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Contributor

Dandandan commented Nov 6, 2023

Is your feature request related to a problem or challenge?

Currently, the CountAccumulator implementation requires values: &[ArrayRef] to be passed.

To eliminate scanning the (first) column, we need to be able to accept a RecordBatch or num_rows instead of values: &[ArrayRef].

Describe the solution you'd like

Rather than changing every method to accept a RecordBatch (and needing to update the code), I propose adding two new methods:

update_record_batch(&mut self, recordbatch: &RecordBatch)
retract_record_batch(&mut self, recordbatch: &RecordBatch)

The default implementations of these methods can delegate to update_batch and retract_batch (i.e., assuming the batch has at least one column).

In the aggregation code, we call update_record_batch/retract_record_batch instead.
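The proposal above can be sketched as follows. This is a hypothetical, simplified illustration, not the real DataFusion/arrow types: `RecordBatch`, the `Accumulator` trait, and `CountAccumulator` here are stand-ins, and `retract_record_batch` would mirror `update_record_batch` in the same way.

```rust
// Stand-in for arrow's RecordBatch, for illustration only.
struct RecordBatch {
    columns: Vec<Vec<i64>>,
    num_rows: usize,
}

trait Accumulator {
    fn update_batch(&mut self, values: &[Vec<i64>]);

    // New method with a default implementation that delegates to
    // update_batch (assumes the batch has at least one column), so
    // existing accumulators need no changes.
    fn update_record_batch(&mut self, batch: &RecordBatch) {
        self.update_batch(&batch.columns);
    }
}

struct CountAccumulator {
    count: usize,
}

impl Accumulator for CountAccumulator {
    fn update_batch(&mut self, values: &[Vec<i64>]) {
        // Old path: touches the first column just to learn its length.
        self.count += values[0].len();
    }

    // Override: reads only the row count, no column access at all.
    fn update_record_batch(&mut self, batch: &RecordBatch) {
        self.count += batch.num_rows;
    }
}

fn main() {
    let batch = RecordBatch { columns: vec![vec![1, 2, 3]], num_rows: 3 };
    let mut acc = CountAccumulator { count: 0 };
    acc.update_record_batch(&batch);
    println!("{}", acc.count); // 3
}
```

Because the new methods have defaults, the change is additive: only accumulators that can exploit the row count (like Count) need to override them.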

Describe alternatives you've considered

No response

Additional context

No response

@Dandandan Dandandan added enhancement New feature or request datafusion Changes in the datafusion crate api change Changes the API exposed to users of the crate performance Make DataFusion faster labels Nov 6, 2023
@Dandandan Dandandan changed the title Proposal: Change Accumulator trait to accept RecordBatch / num_rows Proposal: Change Accumulator trait to accept RecordBatch / num_rows to allow faster Count Nov 6, 2023
@2010YOUY01
Contributor

Is this because the counting operation could be done during scanning?

It looks like a case of aggregate pushdown. For min()/max()/count() aggregate functions on Parquet, it's possible to compute the result over the whole column using only metadata, without a full scan.

To do that, I think update_record_batch() is needed; it might also help to allow RecordBatch to carry more flexible payloads.
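The aggregate-pushdown idea above can be sketched with a toy stand-in for Parquet row-group statistics (the `RowGroupStats` struct and functions here are hypothetical; real Parquet statistics can be absent or affected by nulls, which this sketch ignores):

```rust
// Hypothetical mirror of per-row-group Parquet metadata.
struct RowGroupStats {
    num_rows: i64,
    min: i64,
    max: i64,
}

// count(*) answered purely from metadata: sum of row counts.
fn pushed_down_count(groups: &[RowGroupStats]) -> i64 {
    groups.iter().map(|g| g.num_rows).sum()
}

// min()/max() answered from per-group statistics, no column scan.
fn pushed_down_min(groups: &[RowGroupStats]) -> Option<i64> {
    groups.iter().map(|g| g.min).min()
}

fn pushed_down_max(groups: &[RowGroupStats]) -> Option<i64> {
    groups.iter().map(|g| g.max).max()
}

fn main() {
    let groups = [
        RowGroupStats { num_rows: 100, min: 1, max: 9 },
        RowGroupStats { num_rows: 50, min: -3, max: 7 },
    ];
    println!("{}", pushed_down_count(&groups)); // 150
    println!("{:?}", pushed_down_min(&groups)); // Some(-3)
    println!("{:?}", pushed_down_max(&groups)); // Some(9)
}
```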

@alamb
Contributor

alamb commented Nov 7, 2023

Using a RecordBatch rather than &[ArrayRef] makes sense to me.

If we are going to change the API anyway, I recommend considering changing the signature to ColumnarValue so it can handle either a RecordBatch or a ScalarValue.
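The appeal of a ColumnarValue-style signature can be sketched with a simplified stand-in (DataFusion's real ColumnarValue wraps ArrayRef/ScalarValue; the `i64`-only enum and helper below are illustrative assumptions): a scalar input can report how many rows it logically covers without ever being materialized into an array.

```rust
// Simplified stand-in for DataFusion's ColumnarValue: either a full
// column of values, or a single scalar logically repeated across rows.
enum ColumnarValue {
    Array(Vec<i64>),
    Scalar(i64),
}

// For counting, a scalar contributes the batch's row count directly,
// with no array materialization.
fn num_contributing_rows(value: &ColumnarValue, batch_num_rows: usize) -> usize {
    match value {
        ColumnarValue::Array(a) => a.len(),
        ColumnarValue::Scalar(_) => batch_num_rows,
    }
}

fn main() {
    let col = ColumnarValue::Array(vec![1, 2]);
    let lit = ColumnarValue::Scalar(1);
    println!("{}", num_contributing_rows(&col, 8)); // 2
    println!("{}", num_contributing_rows(&lit, 8)); // 8
}
```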

@alamb
Contributor

alamb commented Nov 7, 2023

The other thing maybe we can think about while messing with the Accumulator trait is how we might expose GroupsAccumulator as well 🤔

@Dandandan
Contributor Author

Dandandan commented Nov 8, 2023

I looked a bit more into this; it seems we currently mostly get away with converting the scalar 1 in the count expression (count(Int64(1))) to an array with to_array_of_size.
This is a bit wasteful, but also not extremely bad (as long as the size is not enormous).
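The waste being described can be illustrated with a toy version of the conversion (the function below mimics the idea of arrow's ScalarValue::to_array_of_size, but is a hypothetical simplification): an array of `num_rows` copies of the literal is allocated only so the count path can take its length.

```rust
// Illustration only: materialize a scalar into a full-length Vec, the
// way count(Int64(1)) currently gets an array to count over.
fn to_array_of_size(scalar: i64, size: usize) -> Vec<i64> {
    vec![scalar; size] // allocates `size` identical values
}

fn main() {
    let num_rows = 1024;
    let arr = to_array_of_size(1, num_rows); // wasted allocation...
    let count = arr.len();                   // ...all we needed was num_rows
    println!("{count}"); // 1024
}
```

With an update_record_batch-style API, the allocation disappears entirely: the accumulator would read num_rows directly.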
