
Reduce memory usage in Parquet->Arrow decimal column chunk conversion #751

Merged: 4 commits merged into jorgecarleitao:main on Jan 13, 2022

Conversation

danburkert (Contributor)

This PR reduces memory usage in the Parquet->Arrow conversion of Decimal chunks, both in total bytes allocated and in the number of allocations. There are two optimizations (sketched in code after this list):

1. Instead of using `slice::concat` to expand each value's buffer to 16 bytes, a stack-allocated 16-byte buffer is used. This removes one heap allocation per value.
2. Data is expanded from the encoded Parquet fixed-size binary pages into a byte buffer, which is then converted to a buffer of `i128`s. To reduce the size of the intermediate byte buffer, this conversion is now done page by page.

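A minimal sketch of optimization 1, assuming Parquet's big-endian two's-complement encoding for fixed-size decimal values; the function name and signature are illustrative, not the crate's actual API:

```rust
/// Illustrative sketch: sign-extends a big-endian two's-complement value of
/// at most 16 bytes into an i128, using a stack-allocated scratch buffer
/// instead of something like `[prefix, bytes].concat()`, which would
/// heap-allocate once per value.
fn sign_extended_i128(bytes: &[u8]) -> i128 {
    debug_assert!(!bytes.is_empty() && bytes.len() <= 16);
    // Fill with 0xFF when the value is negative (top bit set), 0x00 otherwise.
    let mut scratch = if bytes[0] & 0x80 != 0 { [0xFFu8; 16] } else { [0u8; 16] };
    scratch[16 - bytes.len()..].copy_from_slice(bytes);
    i128::from_be_bytes(scratch)
}
```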
@danburkert (Contributor, Author)

For reasons I don't fully understand, optimization #2 was yielding the wrong results, so I've backed it out in the second commit and replaced it with a `FromIterator` call.
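A hedged sketch of what such an iterator-based replacement could look like, reusing the hypothetical `sign_extended_i128` helper sketched above; it builds the `i128` buffer via `FromIterator` (`collect`) rather than mutating an intermediate byte buffer:

```rust
// Illustrative only: `values` stands in for a decoded page's contiguous
// bytes and `size` for the decimal's fixed byte width.
fn decode_values(values: &[u8], size: usize) -> Vec<i128> {
    values
        .chunks_exact(size)      // one slice per encoded decimal value
        .map(sign_extended_i128) // expand to 16 bytes on the stack
        .collect()               // FromIterator builds the output buffer
}
```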

codecov bot commented on Jan 11, 2022

Codecov Report

Merging #751 (d000eff) into main (2493f7d) will increase coverage by 0.19%.
The diff coverage is 68.42%.


```diff
@@            Coverage Diff             @@
##             main     #751      +/-   ##
==========================================
+ Coverage   70.80%   71.00%   +0.19%
==========================================
  Files         313      313
  Lines       16930    16912      -18
==========================================
+ Hits        11988    12008      +20
+ Misses       4942     4904      -38
```
| Impacted Files | Coverage Δ |
|---|---|
| src/io/parquet/read/mod.rs | 39.31% <68.42%> (+0.47%) ⬆️ |
| src/array/fixed_size_binary/iterator.rs | 83.33% <0.00%> (-8.34%) ⬇️ |
| src/compute/aggregate/min_max.rs | 65.90% <0.00%> (-0.76%) ⬇️ |
| src/compute/nullif.rs | 0.00% <0.00%> (ø) |
| src/compute/comparison/mod.rs | 39.53% <0.00%> (ø) |
| src/io/parquet/write/stream.rs | 0.00% <0.00%> (ø) |
| src/compute/comparison/primitive.rs | 100.00% <0.00%> (ø) |
| src/io/json/read/infer_schema.rs | 85.57% <0.00%> (+5.08%) ⬆️ |
| src/compute/utils.rs | 95.65% <0.00%> (+17.08%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao (Owner) left a comment:

Thanks! Clippy is missing, but otherwise ready to ship.

@jorgecarleitao (Owner) left a comment:

Noticed an extra optimization we can do here

(Review thread on src/io/parquet/read/mod.rs; marked outdated and resolved.)
danburkert and others added 2 commits on January 11, 2022 at 11:14
Co-authored-by: Jorge Leitao <jorgecarleitao@gmail.com>
@danburkert (Contributor, Author)

> Noticed an extra optimization we can do here

@jorgecarleitao nice, good call. I had to insert a `chunks_exact()` call in order to get that to compile, but I think that was the intention.

@jorgecarleitao (Owner)

Whoops, we are missing an iterator over the values of a `FixedSizeBinary` array. PR here: #757
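For context, a hypothetical sketch of what such a values iterator might look like (the real implementation lives in #757 and may differ); it simply walks the array's backing byte buffer in fixed-width steps:

```rust
/// Hypothetical: yields each fixed-size binary value as a byte slice.
struct FixedSizeBinaryValuesIter<'a> {
    values: &'a [u8], // the array's contiguous backing buffer
    size: usize,      // fixed byte width of each value
}

impl<'a> Iterator for FixedSizeBinaryValuesIter<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<&'a [u8]> {
        if self.size == 0 || self.values.len() < self.size {
            return None;
        }
        // Split off one value's worth of bytes and advance the cursor.
        let (value, rest) = self.values.split_at(self.size);
        self.values = rest;
        Some(value)
    }
}
```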

@jorgecarleitao merged commit 6b7af9f into jorgecarleitao:main on Jan 13, 2022
@jorgecarleitao (Owner)

Thanks again, very clean :)
