-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate GroupByHash output in multiple RecordBatch
es rather than one large one
#9562
Comments
take |
This issue is interesting, let me try implement it |
FYI this issue may be tricky -- as it will be performance critical -- I will be happy to assist |
this issue is a bit tricky for me 😢 , I can only think of the following approaches: |
I think this approach sounds good -- nice proposal One thing that could help keep the PRs small and manageable would be to switch the APIs as described above but you could avoid having to change all the Then we can make subsequent PRs to switch over other groups accumulators as needed |
Actually, I found the The alternative may be the way mentioned in this issue. |
Hi @guojidan, seems there is no update for some months, can I pick this up? |
Thanks. You can take this. I am working on another project now. |
Is your feature request related to a problem or challenge?
AggregateExec
generates one single (giant)RecordBatch
on output (source)RecordBatch::slice()
, which does not actually allocate any additional memory) (source)This has at least two potential downsides:
RecordBatch
s, those slices are treated as though they were an additional allocation that needs to be tracked (source)Something like this in pictures:
Describe the solution you'd like
If we had infinite time / engineering hours I think a better approach would actually be to change GroupByHash so it didn't create a single giant contiguous
RecordBatch
Instead it would be better if GroupByHash produced a
Vec<RecordBatch>
and then incrementally fed those batches outDoing this would allow the GroupByHash to release memory incrementally as it output. This is analogous to how @korowa made join output incremental in #8658
Perhaps something like
Describe alternatives you've considered
No response
Additional context
@yjshen notes:
The text was updated successfully, but these errors were encountered: