Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved performance of reading Binary from parquet #1190

Merged
merged 1 commit into from
Jul 29, 2022

Conversation

ritchie46
Copy link
Collaborator

During a flamegraph of a parquet file I saw that Binary<O>::push spend a large amount reallocation. This PR pre-allocates a the values buffer in parquet hinted by the capacity. This can prevent a few reallocations. The string/binary size is arbitrarily set to the size of a minimal string, e.g. 3 pointers. Could of course be something else that better fits real world data. I could do some runs to find a distribution?

@codecov
Copy link

codecov bot commented Jul 28, 2022

Codecov Report

Merging #1190 (8a74c75) into main (2d6ee11) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #1190      +/-   ##
==========================================
- Coverage   83.54%   83.53%   -0.01%     
==========================================
  Files         364      364              
  Lines       36398    36398              
==========================================
- Hits        30407    30406       -1     
- Misses       5991     5992       +1     
Impacted Files Coverage Δ
src/io/parquet/read/deserialize/binary/utils.rs 72.15% <100.00%> (ø)
src/chunk.rs 83.33% <0.00%> (-7.15%) ⬇️
src/bitmap/utils/slice_iterator.rs 97.56% <0.00%> (-1.22%) ⬇️
src/array/binary/mod.rs 90.12% <0.00%> (+1.23%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d6ee11...8a74c75. Read the comment docs.

@jorgecarleitao
Copy link
Owner

Thanks, @ritchie46! Should we make it a power of 2, i.e. 16 or 32?

@jorgecarleitao jorgecarleitao changed the title parquet binary: pre-alloc binary buffer Improved performance of reading Binary from parquet Jul 29, 2022
@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Jul 29, 2022
@jorgecarleitao jorgecarleitao merged commit 8604cb7 into jorgecarleitao:main Jul 29, 2022
@jorgecarleitao
Copy link
Owner

Thinking about it, capacity is not a multiple, so my comment is not very insightful.

@ritchie46
Copy link
Collaborator Author

Thinking about it, capacity is not a multiple, so my comment is not very insightful.

Yes, we could also round up to a power of two? Maybe the allocator does something like that already?

In any case, I have no strong opinion here.

@ritchie46 ritchie46 deleted the prealloc_capacity branch July 29, 2022 06:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement An improvement to an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants