This repository has been archived by the owner on Feb 18, 2024. It is now read-only.
Fixed error in memory usage of sliced binary/list/utf8arrays #1293
fixes #1292
The `estimated_bytes_size` function did not report a smaller size when a Utf8 array was sliced, which led to a slice recursion that continued until only 3 elements were left in the array. As a result, 175e7 / 3 pages were written to the parquet file, which was extremely slow.

This PR adapts `estimated_bytes_size` so that it reports the size of the sliced array. It already does this for other data types, and I think this is the most consistent and least error-prone approach.

If you think we should make it a dedicated function or a branch flagged by an extra argument, that's also fine.
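
To illustrate the difference between the two ways of measuring, here is a minimal, self-contained sketch. It is not the code from this PR: `SimpleUtf8Array`, `whole_buffer_bytes`, and `sliced_bytes` are hypothetical names, and the layout (a shared value buffer plus `i32` offsets) is a simplified assumption about how a Utf8 array is stored.

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-in for a Utf8 array (not the library's type).
/// A slice shares the same buffers and only changes `start` and `len`.
struct SimpleUtf8Array {
    values: Arc<Vec<u8>>,   // all string bytes, back to back
    offsets: Arc<Vec<i32>>, // offsets[i]..offsets[i + 1] is the i-th string
    start: usize,           // first visible element
    len: usize,             // number of visible elements
}

impl SimpleUtf8Array {
    fn from_strs(items: &[&str]) -> Self {
        let mut values = Vec::new();
        let mut offsets = vec![0i32];
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Self {
            values: Arc::new(values),
            offsets: Arc::new(offsets),
            start: 0,
            len: items.len(),
        }
    }

    /// Zero-copy slice: the buffers are shared, only the window moves.
    fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.len, "slice out of bounds");
        Self {
            values: Arc::clone(&self.values),
            offsets: Arc::clone(&self.offsets),
            start: self.start + offset,
            len: length,
        }
    }

    /// Size of the underlying buffers; this does not shrink when slicing.
    fn whole_buffer_bytes(&self) -> usize {
        self.values.len() + self.offsets.len() * std::mem::size_of::<i32>()
    }

    /// Size of only the visible window; this shrinks with the slice.
    fn sliced_bytes(&self) -> usize {
        let first = self.offsets[self.start] as usize;
        let last = self.offsets[self.start + self.len] as usize;
        (last - first) + (self.len + 1) * std::mem::size_of::<i32>()
    }
}
```

In this sketch, slicing `["a", "bb", "ccc", "dddd"]` down to its two middle elements leaves `whole_buffer_bytes` unchanged, while `sliced_bytes` drops to the bytes of `"bb"` and `"ccc"` plus three offsets. Code that keeps halving an array until its estimated size fits a target only terminates early if the estimate behaves like `sliced_bytes`.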