Avoiding spilling in TopK queries by reinserting the to-spill data to memory buffer #3579

Dandandan opened this issue Sep 21, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@Dandandan (Contributor)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We recently added optimizations for ORDER BY expr LIMIT by pushing limits down to individual operations (saving memory and CPU time, and limiting output rows) and by executing sorts in parallel.

The disk-spill path in SortExec currently still assumes the to-spill data doesn't fit in memory.
However, after sorting we only need to keep the batch(es) containing the top `fetch` rows and store those, which will usually avoid spilling to disk.

Describe the solution you'd like
We can detect that the to-spill data fits in memory after it has been merged/sorted, and avoid spilling to disk in that case.
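The idea can be sketched in plain Rust. This is not DataFusion's actual API: `Batch` below is a hypothetical stand-in for Arrow's `RecordBatch`, and `mem_size` approximates what a real implementation would get from the batch's memory accounting.

```rust
/// Hypothetical stand-in for a RecordBatch with a single i64 sort column.
struct Batch {
    rows: Vec<i64>,
}

impl Batch {
    /// Rough in-memory size of the batch, in bytes.
    fn mem_size(&self) -> usize {
        self.rows.len() * std::mem::size_of::<i64>()
    }
}

/// Merge all buffered batches, sort them, and keep only the top `fetch`
/// rows -- mirroring a sort with a pushed-down LIMIT.
fn sort_and_truncate(batches: Vec<Batch>, fetch: usize) -> Batch {
    let mut rows: Vec<i64> = batches.into_iter().flat_map(|b| b.rows).collect();
    rows.sort_unstable();
    rows.truncate(fetch); // only the top-`fetch` rows survive
    Batch { rows }
}

/// If the truncated result fits within `budget` bytes, the caller can
/// re-insert it into the memory buffer instead of spilling to disk.
fn fits_in_memory(batch: &Batch, budget: usize) -> bool {
    batch.mem_size() <= budget
}

fn main() {
    // 10 buffered batches of 100 rows each, but only the top 50 are needed.
    let batches: Vec<Batch> = (0..10i64)
        .map(|i| Batch { rows: (i * 100..(i + 1) * 100).collect() })
        .collect();
    let top = sort_and_truncate(batches, 50);
    assert_eq!(top.rows.len(), 50);
    // 50 * 8 = 400 bytes, far below the original 1000 rows' footprint,
    // so it can stay in memory instead of spilling.
    assert!(fits_in_memory(&top, 1024));
}
```

The key point is that the fit check runs on the post-sort, post-truncate size, not on the raw buffered size.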


@isidentical (Contributor)

I've been looking into this and noticed that there are currently two places where this needs to be handled (and, from what I understand, they are actually two separate optimizations).

The first place is what you mentioned in this issue: during the spill process, we sort everything we have and limit the actual output. For example, if we have 10 in-memory batches under a limit of 50, and we can easily handle 100 records, we won't need to spill after sorting everything in place, since the resulting batch would hold at most 50 rows:
https://github.com/apache/arrow-datafusion/blob/add10a67c8e16aca0a683957ddbea29a2a3a4156/datafusion/core/src/physical_plan/sorts/sort.rs#L281-L282
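The spill-path decision above can be sketched as a small function. The names and numbers here are illustrative (they mirror the worked example of 10 batches, limit 50, capacity 100), not DataFusion's actual code:

```rust
#[derive(Debug, PartialEq)]
enum SpillDecision {
    KeepInMemory,
    SpillToDisk,
}

/// After sorting everything in place, the output holds at most `fetch`
/// rows, so compare that -- not the raw buffered row count -- against
/// the remaining capacity.
fn decide(buffered_rows: usize, fetch: Option<usize>, capacity_rows: usize) -> SpillDecision {
    let rows_after_sort = match fetch {
        Some(limit) => buffered_rows.min(limit),
        None => buffered_rows,
    };
    if rows_after_sort <= capacity_rows {
        SpillDecision::KeepInMemory
    } else {
        SpillDecision::SpillToDisk
    }
}

fn main() {
    // 10 in-memory batches of 100 rows, LIMIT 50, room for 100 rows:
    // after the in-place sort only 50 rows remain, so no spill is needed.
    assert_eq!(decide(10 * 100, Some(50), 100), SpillDecision::KeepInMemory);
    // Without a fetch limit, all 1000 rows still exceed capacity.
    assert_eq!(decide(10 * 100, None, 100), SpillDecision::SpillToDisk);
}
```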

The second place is where we make the first allocation. We calculate the size of the incoming batch, and then assume further down that the size of the batch produced by the sort is the same as the size of the input batch.
https://github.com/apache/arrow-datafusion/blob/add10a67c8e16aca0a683957ddbea29a2a3a4156/datafusion/core/src/physical_plan/sorts/sort.rs#L119-L121

But thanks to the TopK PR, this is no longer the case. The sort may shrink the batch while we are still inserting it, so we could in theory free all the redundant memory immediately. That would help a lot, since the next insert_batch would then have much more free space to work with. In a rough benchmark (on a very small subset of the data from the ClickHouse benchmark), just adjusting the tracked memory size correctly saved around 320 spills (it goes from 320 spills to no spills).
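A minimal sketch of the accounting fix, assuming a simple byte-counting tracker. This is not DataFusion's memory-manager API; `MemoryTracker`, `try_grow`, and `shrink` are hypothetical names used only to show how releasing the freed bytes after a TopK sort avoids unnecessary spills:

```rust
struct MemoryTracker {
    used: usize,
    budget: usize,
}

impl MemoryTracker {
    fn new(budget: usize) -> Self {
        Self { used: 0, budget }
    }

    /// Try to reserve `bytes`; returns false when the caller would
    /// have to spill instead.
    fn try_grow(&mut self, bytes: usize) -> bool {
        if self.used + bytes <= self.budget {
            self.used += bytes;
            true
        } else {
            false
        }
    }

    /// Release bytes after a sort-with-limit shrank a batch, so the
    /// tracker reflects the batch's real (reduced) size.
    fn shrink(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }
}

fn main() {
    let mut tracker = MemoryTracker::new(1_000);

    // Reserve 900 bytes for an incoming batch. Before the fix, the
    // tracker keeps charging 900 bytes even after the TopK sort
    // reduces the batch.
    assert!(tracker.try_grow(900));

    // Suppose the sort keeps only the top rows and the batch shrinks
    // to 100 bytes. Releasing the 800 freed bytes lets the next
    // insert proceed in memory instead of forcing a spill.
    tracker.shrink(800);
    assert!(tracker.try_grow(400)); // 100 + 400 <= 1000
}
```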

@Dandandan (Contributor, Author)

Great find and analysis @isidentical!

I assumed it was already tracking the correct size of the reduced batch. Could you create a new issue for this "memory tracking bug"?

@isidentical (Contributor)

@Dandandan definitely! I've created #3596 for it.
