-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EPIC] Improved Externalized / Spilling / Large than Memory Hash Aggregation #13123
Comments
@2010YOUY01 says in #13090 (comment)
DF doesn't have a buffer pool in the traditional sense, and the way arrow-rs allocates memory directly from the system allocator makes it quite hard to implement. However, I think the fact that we have arrow-rs and the arrow IPC offers lots of opportunity.
I think starting with stability and then optimizing is a great idea 💯 Note that one challenge of TPCH specifically is that it contains many joins and is largely focused on that, so in order to run TPCH-SF1000 we would also need to implement spilling joins Another potential option would be to work on running clickbench with a very small memory (100MB)? Or maybe we could figure out another large dataset 🤔 |
Maybe @comphead 's work to get SMJ working in #13111 will help this (e.g. we could always use SMJ for the large TPCH queries 🤔 ) |
This is a good idea, we should get clickbench work under memory constraints before TPCH |
Here is a PR to optimize the spill format: |
This is a collection of items to improve external (spilling) aggregation
Background
in the Solid State Age (DuckDB external aggregation paper))
DataFusion has supported memory limited / spilling hash aggregation since @kazuyukitanimura added it last year in #7400.
We can likely improve this feature and @2010YOUY01 is considering working on it
Tasks the solution you'd like
The text was updated successfully, but these errors were encountered: