Further refine the Top K sort operator #9417
Comments
TL;DR: The issue is caused by "double" memory accounting for sliced batches in AggExec and TopkExec. The primary cause of resource exhaustion is incorrect memory accounting for record batches stored in TopK's RecordBatchStore, as highlighted in the issue description (approximately 220MB per batch). Upon inspecting the memory size calculation output:
It becomes evident that the batch is a zero-copy slice of a larger batch, resulting in a discrepancy between actual and expected memory used by TopK, considering only 8192 rows. Analyzing the physical plan:
and AggExec: For the current plan, we can see that each batch we insert into TopK is a slice of the Agg output batch, which AggExec should track. We need to avoid double memory accounting for sliced batches to fix this issue. And for option3, there is |
Note that in this query, |
Ah good point, the cardinality of the input grouped data is indeed very large (~17M). Indeed, a quick Google search brought up the following recent paper (citing DataFusion/your blog post) about a new high-cardinality top K aggregation technique: https://www.microsoft.com/en-us/research/publication/cache-efficient-top-k-aggregation-over-high-cardinality-large-datasets/
From DataFusion's memory management perspective, I found that I suggest we have |
I agree with @Dandandan in #9417 (comment) that the core problem is with accounting
If we had infinite time / engineering hours I think a better approach would actually be to change GroupByHash so it didn't create a single giant contiguous Instead it would be better if GroupByHash produced a Doing this would allow the GroupByHash to release memory incrementally as it output. This is analogous to how @korowa made join output incremental in #8658 |
If incremental output of Grouping sounds reasonable to people I can file a follow on ticket to track the work.
I agree that the core problem for the issue is accounting and that the most overreported batch slice would come from AggExec's mono output record batch. But I also believe there's a distinction between optimizing AggExec's output pattern and handling memory accounting. To improve AggExec's mono output pattern, #7065 might be similar to the idea of incremental output. Regarding the memory accounting side, I'm curious if you have considered alternatives that allow for more accurate accounting for different batches. The idea of having sliced batches not reporting their memory usage or using |
Yes, please do
Makes sense to me as well, thank you 🙏
I agree
I think it is tricky business and depends on what we are using the memory accounting for. At the moment I think the memory accounting is mostly to prevent OOM kills (over commit of memory), since memory for a sliced However, ensuring we don't double count is important too (like two slices to the same 1M row RecordBatch will count as a total of 2M rows, even though there is only a single allocation).
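To make the double-counting concern concrete, here is a minimal, std-only sketch. `BatchSlice`, `naive_bytes`, and `deduplicated_bytes` are hypothetical names invented for illustration; they are a simplified stand-in for a sliced record batch and are not Arrow or DataFusion APIs. The assumption is that a slice shares its parent's allocation, so charging every slice for the whole buffer overcounts, while charging each distinct allocation once does not.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a zero-copy batch slice:
/// a shared buffer plus a row range into it.
#[derive(Clone)]
struct BatchSlice {
    buffer: Arc<Vec<u64>>, // shared allocation, never copied by `slice`
    offset: usize,
    len: usize,
}

impl BatchSlice {
    fn new(values: Vec<u64>) -> Self {
        let len = values.len();
        Self { buffer: Arc::new(values), offset: 0, len }
    }

    /// Zero-copy slice: only the offset and length change.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }

    /// Size of the whole underlying allocation, regardless of `len`.
    fn buffer_bytes(&self) -> usize {
        self.buffer.len() * std::mem::size_of::<u64>()
    }
}

/// Naive accounting: every slice is charged for its full underlying buffer.
fn naive_bytes(slices: &[BatchSlice]) -> usize {
    slices.iter().map(|s| s.buffer_bytes()).sum()
}

/// De-duplicated accounting: each distinct allocation is charged once,
/// identified here by the address of the shared buffer.
fn deduplicated_bytes(slices: &[BatchSlice]) -> usize {
    let mut seen = HashSet::new();
    slices
        .iter()
        .filter(|s| seen.insert(Arc::as_ptr(&s.buffer)))
        .map(|s| s.buffer_bytes())
        .sum()
}
```

With two 8192-row slices of one 1M-row buffer, `naive_bytes` charges the 8 MB allocation twice, whereas `deduplicated_bytes` charges it once. Whether per-pointer de-duplication is practical inside a real memory pool is a separate question this sketch does not answer.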
Filed #9562 to track incremental group by output
I think #10511 is related to this, except it's using |
Coming back to this, I guess we can implement another option without implementing spilling: force compaction once we hit the limit.
I think this is a great idea
Implementing this "reduce memory usage when under pressure" might be a more interesting general approach to improve DataFusion's performance under memory pressure (e.g. maybe we can trigger other operators to clear memory (like partial aggregates) when we hit memory pressure) 🤔
That's an interesting idea :)
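A rough sketch of what "force compaction once we hit the limit" could look like, using only the standard library. `Entry` and `CompactingStore` are hypothetical types made up for this illustration and do not mirror TopK's actual record batch store; the assumption is that each stored entry references a few rows inside a much larger shared buffer, and that copying just those rows lets the large buffer be released.

```rust
use std::sync::Arc;

/// Hypothetical stored entry: a row range inside a shared,
/// possibly much larger buffer.
struct Entry {
    buffer: Arc<Vec<u64>>,
    offset: usize,
    len: usize,
}

impl Entry {
    /// The rows this entry actually needs.
    fn rows(&self) -> &[u64] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}

/// Hypothetical store that compacts itself when its accounted
/// size passes `limit_bytes`.
struct CompactingStore {
    entries: Vec<Entry>,
    limit_bytes: usize,
}

impl CompactingStore {
    fn new(limit_bytes: usize) -> Self {
        Self { entries: Vec::new(), limit_bytes }
    }

    /// Pessimistic accounting: each entry is charged for its whole buffer.
    fn accounted_bytes(&self) -> usize {
        self.entries
            .iter()
            .map(|e| e.buffer.len() * std::mem::size_of::<u64>())
            .sum()
    }

    /// Store a row range; compact if the accounted size exceeds the limit.
    fn insert(&mut self, buffer: Arc<Vec<u64>>, offset: usize, len: usize) {
        assert!(offset + len <= buffer.len(), "range out of bounds");
        self.entries.push(Entry { buffer, offset, len });
        if self.accounted_bytes() > self.limit_bytes {
            self.compact();
        }
    }

    /// Copy only the referenced rows into right-sized buffers, dropping
    /// this store's references to the large shared ones.
    fn compact(&mut self) {
        for entry in &mut self.entries {
            let copied: Vec<u64> = entry.rows().to_vec();
            entry.offset = 0;
            entry.len = copied.len();
            entry.buffer = Arc::new(copied);
        }
    }
}
```

The trade-off is an extra copy of the retained rows, paid only when the limit is reached. The large buffer is actually freed only once no other holder (for example the upstream operator) still references it.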
Is your feature request related to a problem or challenge?
The Top-K operator has recently been added for a specialized use case when encountering ORDER BY and LIMIT clauses together (#7250, #7721), as a way to optimize the memory usage of the sorting procedure.
Still, the present implementation relies on keeping in memory the input record batches with potential row candidates for the final K output rows. This means that in the pathological case, there can be K batches in memory per TopK operator, and one TopK operator is spawned per input partition.
In particular this leads to the following error for ClickBench query 19:
In the above case I see 12 partitions x ~3.5 batches per TopK operator in memory x 223 MB per batch (which is kind of strange for 4 columns) = 9366 MB, thus peaking above the set memory limit of 8GB.
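The estimate above can be reproduced with a small calculation. The helper names below are hypothetical, and interpreting the 8GB limit as 8 * 1024 MB is an assumption made only for this sketch.

```rust
/// Hypothetical helper: estimated peak memory in MB held across all TopK
/// operators, given the per-partition figures observed in the issue.
fn estimated_peak_mb(partitions: f64, batches_per_operator: f64, mb_per_batch: f64) -> f64 {
    partitions * batches_per_operator * mb_per_batch
}

/// Assumption: the 8GB limit is interpreted as 8 * 1024 MB.
const MEMORY_LIMIT_MB: f64 = 8.0 * 1024.0;

/// True when the estimated peak is above the assumed memory limit.
fn exceeds_limit(peak_mb: f64) -> bool {
    peak_mb > MEMORY_LIMIT_MB
}
```

Calling `estimated_peak_mb(12.0, 3.5, 223.0)` gives the 9366 MB figure quoted above, which `exceeds_limit` reports as over the assumed limit.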
Describe the solution you'd like
Ideally something that doesn't hurt performance but reduces the memory footprint even more. Failing that, something that perhaps hurts performance only once the memory limit threshold has been surpassed (e.g. by spilling), but without crashing the query.
Describe alternatives you've considered
Option 1
Increasing or not setting a memory limit.
Option 2
Introduce spilling to disk for the TopK operator as a fallback when the memory limit is hit.
Option 3
Potentially something like converting the column arrays of the input record batch to rows, like for the evaluated sort keys
https://github.com/apache/arrow-datafusion/blob/b2ff249bfb918ac6697dbc92b51262a7bdbb5971/datafusion/physical-plan/src/topk/mod.rs#L163
and then making TopKRow track the projected rows, in addition to the sort keys, but compare only against the sort key. This would enable the BinaryHeap to discard the unneeded rows.
Finally one could use arrow_row::RowConverter::convert_rows to get back the columns when emitting.
However this is almost guaranteed to lead to worse performance in the general case due to all of the row-conversion taking place.
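A minimal sketch of the heap side of Option 3, using only the standard library. `KeyedRow` and `top_k` are hypothetical names and not the existing TopKRow implementation. The byte vectors stand in for row-encoded sort keys and row-encoded projected columns; the conversion to and from rows (for example via arrow_row::RowConverter) is assumed to happen outside this sketch and is not shown.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Hypothetical heap entry: an encoded sort key plus the encoded projected row.
struct KeyedRow {
    sort_key: Vec<u8>,
    payload: Vec<u8>,
}

// Equality and ordering look only at the sort key; the payload rides along.
impl PartialEq for KeyedRow {
    fn eq(&self, other: &Self) -> bool {
        self.sort_key == other.sort_key
    }
}

impl Eq for KeyedRow {}

impl PartialOrd for KeyedRow {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for KeyedRow {
    fn cmp(&self, other: &Self) -> Ordering {
        self.sort_key.cmp(&other.sort_key)
    }
}

/// Keep the `k` rows with the smallest sort keys and return them in
/// ascending key order.
fn top_k(rows: impl IntoIterator<Item = KeyedRow>, k: usize) -> Vec<KeyedRow> {
    let mut heap = BinaryHeap::new();
    for row in rows {
        heap.push(row);
        if heap.len() > k {
            // The max-heap root holds the largest key seen so far; popping it
            // drops that row's payload right away instead of pinning a batch.
            heap.pop();
        }
    }
    heap.into_sorted_vec()
}
```

Because each entry owns its payload, the heap never holds more than `k` rows plus the one being considered, at the cost of encoding every candidate row up front.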
Additional context
Potentially relevant for #7195.