-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoiding spilling in TopK queries by reinserting the to-spill data to memory buffer #3579
Comments
I've been looking into this and noticed that there are currently two places where this needs to be handled (and from what I've understood, they are actually separate optimizations). The first place is what you mentioned in this issue, which is during the spill process we sort everything we have and limit the actual output (which would mean that if we have 10 in-memory batches of limit 50, and if we can easily handle 100 records, we won't need to spill after sorting everything in place since the resulting batch would only have 50 items at most): The second place is actually where we make the first allocation. We calculate the size of the given batch, and then we assume down there that the size of the batch we got from the sort is actually the same as the size of the input batch. But thanks to the TopK PR, this is no longer the case. The sorting might actually reduce the size while we are still inserting it, so we could in theory free all the redundant memory first. This would actually help a lot, since the next |
Great find and analysis @isidentical! I assumed it was already tracking the correct size of the reduced batch. Could you create a new issue for this "memory tracking bug"? |
@Dandandan definitely! I've created #3596 for it. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We recently added optimizations for
ORDER BY expr LIMIT
by pushing limits to individual operations (saving memory, CPU time + limiting output rows) and executing sorts in parallel.The disk spill operation in
SortExec
currently still assumes the to-spill disk doesn't fit in memory.However after sorting we only have to keep the batch(es) with top
fetch
rows and store those, which probably avoids spilling to disk.Describe the solution you'd like
We can identify that the to-spill data fits in memory after being merged / sorted and avoid spilling to disk.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
The text was updated successfully, but these errors were encountered: