
Reduce memory usage in Parquet->Arrow decimal column chunk conversion #751

Merged: 4 commits merged into jorgecarleitao:main on Jan 13, 2022

Conversation

danburkert (Contributor)

This PR reduces memory usage in the Parquet->Arrow conversion of Decimal chunks, both in total bytes allocated and in the number of allocations. There are two optimizations (sketched in code after this list):

1. Instead of using `slice::concat` to expand each value's buffer to 16 bytes, a stack-allocated 16-byte buffer is used. This removes one heap allocation per value.
2. Data is expanded from the encoded Parquet fixed-size binary pages into a byte buffer, which is then converted to a buffer of `i128`s. To reduce the size of the intermediate byte buffer, this conversion is now done page by page.

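A minimal sketch of optimization 1, assuming Parquet's big-endian two's-complement encoding for fixed-size decimal values; the function name and signature are illustrative, not the crate's actual API:

```rust
/// Illustrative sketch: sign-extends a big-endian two's-complement value of
/// at most 16 bytes into an i128, using a stack-allocated scratch buffer
/// instead of something like `[prefix, bytes].concat()`, which would
/// heap-allocate once per value.
fn sign_extended_i128(bytes: &[u8]) -> i128 {
    debug_assert!(!bytes.is_empty() && bytes.len() <= 16);
    // Fill with 0xFF when the value is negative (top bit set), 0x00 otherwise.
    let mut scratch = if bytes[0] & 0x80 != 0 { [0xFFu8; 16] } else { [0u8; 16] };
    scratch[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(scratch)
}
```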
@danburkert (Contributor, Author)

For reasons I don't fully understand, optimization #2 was yielding the wrong results, so I've backed it out in the second commit and replaced it with a `FromIterator` call.
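A hedged sketch of what such an iterator-based replacement could look like, reusing the hypothetical `sign_extended_i128` helper sketched above; it builds the `i128` buffer via `FromIterator` (`collect`) rather than mutating an intermediate byte buffer:

```rust
// Illustrative only: `values` stands in for a decoded page's contiguous
// bytes and `size` for the decimal's fixed byte width.
fn decode_values(values: &[u8], size: usize) -> Vec<i128> {
    values
        .chunks_exact(size)      // one slice per encoded decimal value
        .map(sign_extended_i128) // expand to 16 bytes on the stack
        .collect()               // FromIterator builds the output buffer
}
```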

codecov bot commented on Jan 11, 2022

Codecov Report

Merging #751 (d000eff) into main (2493f7d) will increase coverage by 0.19%.
The diff coverage is 68.42%.


```diff
@@            Coverage Diff             @@
##             main     #751      +/-   ##
==========================================
+ Coverage   70.80%   71.00%   +0.19%
==========================================
  Files         313      313
  Lines       16930    16912      -18
==========================================
+ Hits        11988    12008      +20
+ Misses       4942     4904      -38
```
| Impacted Files | Coverage Δ |
|---|---|
| src/io/parquet/read/mod.rs | 39.31% <68.42%> (+0.47%) ⬆️ |
| src/array/fixed_size_binary/iterator.rs | 83.33% <0.00%> (-8.34%) ⬇️ |
| src/compute/aggregate/min_max.rs | 65.90% <0.00%> (-0.76%) ⬇️ |
| src/compute/nullif.rs | 0.00% <0.00%> (ø) |
| src/compute/comparison/mod.rs | 39.53% <0.00%> (ø) |
| src/io/parquet/write/stream.rs | 0.00% <0.00%> (ø) |
| src/compute/comparison/primitive.rs | 100.00% <0.00%> (ø) |
| src/io/json/read/infer_schema.rs | 85.57% <0.00%> (+5.08%) ⬆️ |
| src/compute/utils.rs | 95.65% <0.00%> (+17.08%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao (Owner) left a comment:

Thanks! Clippy is missing, but otherwise ready to ship.

@jorgecarleitao (Owner) left a comment:

Noticed an extra optimization we can do here

(Review thread on src/io/parquet/read/mod.rs; marked outdated and resolved.)
danburkert and others added 2 commits on January 11, 2022 at 11:14
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
@danburkert (Contributor, Author)

> Noticed an extra optimization we can do here

@jorgecarleitao nice, good call. I had to insert a `chunks_exact()` call in order to get that to compile, but I think that was the intention.

@jorgecarleitao (Owner)

Whoops, we are missing an iterator over the values of a `FixedSizeBinary` array. PR here: #757
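For context, a hypothetical sketch of what such a values iterator might look like (the real implementation lives in #757 and may differ); it simply walks the array's backing byte buffer in fixed-width steps:

```rust
/// Hypothetical: yields each fixed-size binary value as a byte slice.
struct FixedSizeBinaryValuesIter<'a> {
    values: &'a [u8], // the array's contiguous backing buffer
    size: usize,      // fixed byte width of each value
}

impl<'a> Iterator for FixedSizeBinaryValuesIter<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<&'a [u8]> {
        if self.size == 0 || self.values.len() < self.size {
            return None;
        }
        // Split off one value's worth of bytes and advance the cursor.
        let (value, rest) = self.values.split_at(self.size);
        self.values = rest;
        Some(value)
    }
}
```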

@jorgecarleitao merged commit 6b7af9f into jorgecarleitao:main on Jan 13, 2022
@jorgecarleitao (Owner)

Thanks again, very clean :)
