Fixed error in memory usage of sliced binary/list/utf8arrays #1293

ritchie46 · 2022-11-05T14:14:59Z

The size reported by estimated_bytes_size did not shrink when a Utf8 array was sliced, so the slice recursion continued until only 3 elements were left in the array. This led to writing 175e7 / 3 pages to the parquet file, which was insanely slow.
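To make the blow-up concrete, here is a minimal sketch of the failure mode. It is not the arrow2 writer: `count_pages`, `unsliced_estimate`, `sliced_estimate`, and the numbers in `main` are hypothetical, and it assumes a writer that halves a slice while its estimated size exceeds a page budget and stops once 3 or fewer elements remain.

```rust
/// Hypothetical size estimate that ignores slicing: it always reports
/// the size of the full backing buffer, whatever the slice length is.
fn unsliced_estimate(_len: usize, full_buffer_bytes: usize) -> usize {
    full_buffer_bytes
}

/// Hypothetical slice-aware estimate, assuming a fixed size per element.
fn sliced_estimate(len: usize, bytes_per_elem: usize) -> usize {
    len * bytes_per_elem
}

/// Count the pages produced by halving a slice of `len` elements until
/// `estimate(len)` fits in `page_limit` or at most 3 elements remain.
fn count_pages(len: usize, page_limit: usize, estimate: &dyn Fn(usize) -> usize) -> usize {
    if estimate(len) > page_limit && len > 3 {
        let left = len / 2;
        count_pages(left, page_limit, estimate)
            + count_pages(len - left, page_limit, estimate)
    } else {
        1
    }
}

fn main() {
    let n = 1_000;
    let bytes_per_elem = 8;
    let full = n * bytes_per_elem;
    let page_limit = 1_024;

    // The estimate never shrinks, so only the `len > 3` guard stops the recursion.
    let broken = count_pages(n, page_limit, &|len| unsliced_estimate(len, full));
    // The estimate shrinks with the slice, so the recursion stops early.
    let fixed = count_pages(n, page_limit, &|len| sliced_estimate(len, bytes_per_elem));

    println!("pages with unsliced estimate: {broken}");
    println!("pages with sliced estimate:   {fixed}");
}
```

With an estimate that never shrinks, every page ends up holding at most 3 elements, so the page count grows roughly as the element count divided by 3.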

This PR adapts estimated_bytes_size so that it reports the sliced array size. It already does this for other data types, and I think this is the most consistent and least error-prone approach.
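The difference between the two ways of measuring can be sketched with a toy offsets-plus-values layout. This is an illustration only, not the arrow2 implementation: `ToyUtf8Array` and its methods are hypothetical names, and the validity bitmap is ignored.

```rust
use std::sync::Arc;

/// Toy UTF-8 array: `offsets[i]..offsets[i + 1]` is the byte range of
/// element `i` in `values`. `start` and `len` describe the visible slice.
struct ToyUtf8Array {
    offsets: Arc<Vec<i32>>,
    values: Arc<Vec<u8>>,
    start: usize,
    len: usize,
}

impl ToyUtf8Array {
    fn from_strs(items: &[&str]) -> Self {
        let mut offsets = vec![0i32];
        let mut values = Vec::new();
        for s in items {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Self {
            offsets: Arc::new(offsets),
            values: Arc::new(values),
            start: 0,
            len: items.len(),
        }
    }

    /// Zero-copy slice: the buffers are shared, only the window changes.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            offsets: Arc::clone(&self.offsets),
            values: Arc::clone(&self.values),
            start: self.start + offset,
            len,
        }
    }

    /// Size of the whole backing buffers; does not shrink when sliced.
    fn buffer_bytes(&self) -> usize {
        self.values.len() + self.offsets.len() * std::mem::size_of::<i32>()
    }

    /// Size of only the bytes and offsets that the slice refers to.
    fn sliced_bytes(&self) -> usize {
        let first = self.offsets[self.start] as usize;
        let last = self.offsets[self.start + self.len] as usize;
        (last - first) + (self.len + 1) * std::mem::size_of::<i32>()
    }
}

fn main() {
    let array = ToyUtf8Array::from_strs(&["aa", "bbbb", "c", "dddddd"]);
    let sliced = array.slice(1, 2); // "bbbb", "c"
    println!("backing buffers: {} bytes", sliced.buffer_bytes());
    println!("sliced estimate: {} bytes", sliced.sliced_bytes());
}
```

Reading the first and last offset of the window is enough to know how many value bytes the slice covers, so the slice-aware estimate stays cheap to compute.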

If you think we should make it a dedicated function or a branch flagged by an extra argument, that's also fine.

codecov · 2022-11-05T14:18:38Z

Codecov Report

Base: 83.12% // Head: 83.11% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (61fdb87) compared to base (562de6a).
Patch has no changes to coverable lines.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1293      +/-   ##
==========================================
- Coverage   83.12%   83.11%   -0.01%     
==========================================
  Files         369      369              
  Lines       40187    40187              
==========================================
- Hits        33405    33402       -3     
- Misses       6782     6785       +3

Impacted Files	Coverage Δ
src/compute/aggregate/memory.rs	`35.71% <ø> (ø)`
src/io/ipc/read/array/utf8.rs	`92.75% <0.00%> (-5.80%)`	⬇️
src/bitmap/utils/slice_iterator.rs	`98.78% <0.00%> (+1.21%)`	⬆️


ritchie46 · 2022-11-05T14:23:02Z

Clippy fails, but it is unrelated to this PR.

jorgecarleitao · 2022-11-13T04:14:54Z

Thanks @ritchie46 !

…rleitao#1293)

report sliced memory usage in binary/list/utf8arrays

61fdb87

ritchie46 mentioned this pull request Nov 5, 2022

write_parquet performs very badly on large files compared to write_csv pola-rs/polars#3845

Closed

ritchie46 mentioned this pull request Nov 6, 2022

fix(rust, python): fix freeze/stall when writing more than 2^31 string values to parquet pola-rs/polars#5366

Merged

jorgecarleitao changed the title ~~report sliced memory usage in binary/list/utf8arrays~~ Fixed error in memory usage of sliced binary/list/utf8arrays Nov 13, 2022

jorgecarleitao added the bug Something isn't working label Nov 13, 2022

jorgecarleitao merged commit 48a5322 into jorgecarleitao:main Nov 13, 2022

ritchie46 added a commit to ritchie46/arrow2 that referenced this pull request Mar 29, 2023

Fixed error in memory usage of sliced binary/list/utf8arrays (jorgeca…

b652400

…rleitao#1293)

ritchie46 added a commit to ritchie46/arrow2 that referenced this pull request Apr 5, 2023

Fixed error in memory usage of sliced binary/list/utf8arrays (jorgeca…

492d130

…rleitao#1293)
