Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SortPreservingMerge does not account for memory usage #5885

Closed
tustvold opened this issue Apr 5, 2023 · 3 comments · Fixed by #7130
Closed

SortPreservingMerge does not account for memory usage #5885

tustvold opened this issue Apr 5, 2023 · 3 comments · Fixed by #7130
Assignees
Labels
enhancement New feature or request

Comments

@tustvold
Copy link
Contributor

tustvold commented Apr 5, 2023

Is your feature request related to a problem or challenge?

SortPreservingMerge currently has extremely limited memory accounting functionality, with no accounting for buffered batches or cursors.

The only memory accounting is a static assignment at construction time by ExternalSorter of the size of the in memory batches, when merging spilled and in-memory data. This assignment is never decremented, and does not take into account any memory usage resulting from loading the spilled data back into memory.

Describe the solution you'd like

SortPreservingMerge should account for the memory usage of the data it has buffered

Additionally the various streams created by ExternalSorter, both for in-memory and spilled data, should be accounted for

Describe alternatives you've considered

No response

Additional context

#5879 tracks unifying the sorting implementations, which may help make this story more consistent.

@tustvold tustvold added the enhancement New feature or request label Apr 5, 2023
@alamb alamb changed the title Sort Memory Accounting SortPreservingMerge does not account for memory usage May 10, 2023
@tustvold tustvold self-assigned this May 16, 2023
tustvold added a commit to tustvold/arrow-datafusion that referenced this issue May 16, 2023
tustvold added a commit to tustvold/arrow-datafusion that referenced this issue May 16, 2023
tustvold added a commit to tustvold/arrow-datafusion that referenced this issue May 16, 2023
tustvold added a commit to tustvold/arrow-datafusion that referenced this issue May 18, 2023
tustvold added a commit to tustvold/arrow-datafusion that referenced this issue May 18, 2023
tustvold added a commit to tustvold/arrow-datafusion that referenced this issue May 18, 2023
richox pushed a commit to richox/arrow-datafusion that referenced this issue May 29, 2023
richox pushed a commit to richox/arrow-datafusion that referenced this issue May 29, 2023
@alamb
Copy link
Contributor

alamb commented Jun 5, 2023

There is a PR open for this feature -- I think it is valuable to complete. However, it currently doesn't pass the PR tests (I think the tests need some updating)

richox pushed a commit to richox/arrow-datafusion that referenced this issue Jul 5, 2023
richox pushed a commit to richox/arrow-datafusion that referenced this issue Jul 5, 2023
@alamb
Copy link
Contributor

alamb commented Jul 18, 2023

We saw this cause troubles for us during some internal testing. More details to come

@alamb alamb assigned alamb and unassigned tustvold Jul 28, 2023
@alamb
Copy link
Contributor

alamb commented Jul 28, 2023

Specifically, we have pretty good evidence that for dictionary encoded arrays with high cardinality, the interned dictionary values consume an ever increasing amount of memory

yjshen pushed a commit that referenced this issue Aug 9, 2023
* Account for memory usage in SortPreservingMerge

* Review Comments: Improve documentation and comments

* Review Comments: Improve documentation and comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants