This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved performance of reading utf8 required from parquet (-15%) #670

Merged

jorgecarleitao merged 2 commits into main from bench_required on Dec 10, 2021

Conversation

jorgecarleitao (Owner)

Closes #666

Two optimizations:

  • pre-allocate
  • inline reading of the u32 length in little endian (parquet2's get_length is not inlined and causes some performance problems); see the sketch below
read required utf8 2^20 time:   [8.0038 ms 8.0496 ms 8.0977 ms]                                    
                        change: [-17.183% -16.507% -15.848%] (p = 0.00 < 0.05)
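
Roughly, the two changes amount to something like the following sketch (a hedged illustration with made-up names, not the crate's actual code): pre-allocate the offsets/values buffers and read each 4-byte little-endian length inline rather than through a non-inlined helper.

    /// Inlined little-endian u32 read (illustrative stand-in for parquet2's get_length).
    #[inline]
    fn read_u32_le(b: &[u8]) -> u32 {
        u32::from_le_bytes([b[0], b[1], b[2], b[3]])
    }

    /// Decodes a PLAIN-encoded byte-array page `[len][bytes][len][bytes]...`
    /// into Arrow-style `offsets` and `values` buffers.
    fn decode_plain(page: &[u8], num_values: usize) -> (Vec<i32>, Vec<u8>) {
        let mut offsets = Vec::with_capacity(num_values + 1);
        // every value carries a 4-byte length prefix, so the payload occupies
        // at most `page.len() - 4 * num_values` bytes
        let mut values = Vec::with_capacity(page.len().saturating_sub(4 * num_values));
        offsets.push(0i32);

        let mut remaining = page;
        for _ in 0..num_values {
            let len = read_u32_le(remaining) as usize;
            values.extend_from_slice(&remaining[4..4 + len]);
            offsets.push(values.len() as i32);
            remaining = &remaining[4 + len..];
        }
        (offsets, values)
    }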

Thanks @ldn9638 for the idea!

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Dec 9, 2021
codecov bot commented Dec 9, 2021

Codecov Report

Merging #670 (bb98a1d) into main (a18555c) will increase coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   69.59%   69.67%   +0.07%     
==========================================
  Files         299      301       +2     
  Lines       16746    16803      +57     
==========================================
+ Hits        11655    11707      +52     
- Misses       5091     5096       +5     
Impacted Files Coverage Δ
src/io/parquet/read/binary/basic.rs 68.32% <100.00%> (+0.60%) ⬆️
src/io/parquet/read/utils.rs 78.94% <100.00%> (ø)
src/compute/utils.rs 72.72% <0.00%> (-2.28%) ⬇️
src/scalar/utf8.rs 92.30% <0.00%> (-2.14%) ⬇️
src/buffer/immutable.rs 96.22% <0.00%> (-1.51%) ⬇️
src/compute/upper.rs 100.00% <0.00%> (ø)
src/compute/lower.rs 100.00% <0.00%> (ø)
src/scalar/binary.rs 96.15% <0.00%> (+1.70%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a18555c...bb98a1d. Read the comment docs.

ldn9638 commented Dec 10, 2021

Very cool!
But I still have an idea for variable-length binary types. The current memory layout is
{size,data|size,data|size,data|size,data|.......}
If it were changed to
{size_buf_size|size,size,size|data,data,data}

would reading be more efficient? It would avoid the per-value extend_from_slice in BinaryIter.
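
For illustration, reading the proposed layout could look roughly like this hedged sketch (hypothetical names and format, not parquet's actual encoding nor this crate's code): the lengths are turned into offsets up front, and the whole data region is copied with a single extend_from_slice instead of one copy per value.

    // Sketch of decoding the proposed {size_buf_size | size,size,... | data,data,...} layout.
    fn decode_proposed(page: &[u8]) -> (Vec<i32>, Vec<u8>) {
        // first 4 bytes: how many bytes the length section occupies
        let size_buf_size =
            u32::from_le_bytes([page[0], page[1], page[2], page[3]]) as usize;
        let (lengths, data) = page[4..].split_at(size_buf_size);

        // build offsets from the contiguous length section
        let mut offsets = Vec::with_capacity(size_buf_size / 4 + 1);
        offsets.push(0i32);
        for chunk in lengths.chunks_exact(4) {
            let len = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as i32;
            let last = *offsets.last().unwrap();
            offsets.push(last + len);
        }

        // a single copy for the whole values region
        let mut values = Vec::with_capacity(data.len());
        values.extend_from_slice(data);
        (offsets, values)
    }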

ldn9638 commented Dec 10, 2021

The main reason is that if the row group contains many data pages, page_stream_to_array stays very slow while constructing the array.

jorgecarleitao (Owner, Author)

The [size][values]...[size][values] layout comes from parquet's PLAIN encoding. If the page is DELTA_LENGTH_BYTE_ARRAY, it has the format that you mention (see also ARROW-13388).

In general a column chunk should have a small number of pages (usually 2-3 for a dict-encoded column and 1 for a non-dict-encoded one). More pages mean more fragmentation and thus less compression. Since many parquet implementations do not write or support filter pushdown at the page level, more fragmentation does not even help in skipping pages.

> it will continue to be very slow when page_stream_to_array constructs the array.

I think that this is the fastest possible under the constraints of the parquet format, since we can't tell the total size of a data page prior to reading its header (but I would love to be proven wrong and learn something new ^_^)

ldn9638 commented Dec 10, 2021

Thank you for your reply.
Can I use metadata.uncompressed_size() - metadata.num_values() * 4 as the initial capacity of the values container?

jorgecarleitao (Owner, Author)

> Can I use metadata.uncompressed_size() - metadata.num_values() * 4 as the initial capacity of the values container?

I think that the following holds:

metadata.uncompressed_size() = sum_{pages} page.header + page.buffer

The pages may include a dictionary page (we only know once we read them), and individual pages may be encoded differently (e.g. mixing PLAIN with DELTA_LENGTH_BYTE_ARRAY); metadata.uncompressed_size will vary accordingly. So it is a bit difficult to make a general statement for optimizations.

I think that the DELTA_LENGTH_BYTE_ARRAY encoding is the solution for this; it has a better compression ratio and is faster to convert to and from arrow.

For reference, I took the equation above from looking at the writer (here and here).
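
For illustration only, under the (strong) assumptions that all data pages in the chunk are PLAIN-encoded byte arrays and there is no dictionary page, the estimate from the question above could be written as the sketch below; given the caveats above it over-estimates by the page headers and breaks down for other encodings. The parameter names stand in for the metadata fields mentioned in the question.

    // Hedged sketch of the capacity estimate discussed above, assuming PLAIN-only pages
    // (4-byte length prefix per value) and no dictionary page.
    fn estimated_values_capacity(uncompressed_size: i64, num_values: i64) -> usize {
        // page headers are included in uncompressed_size, so this slightly over-estimates
        (uncompressed_size - 4 * num_values).max(0) as usize
    }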

ldn9638 commented Dec 10, 2021

Thanks for your reply. I'll study this further.

@jorgecarleitao jorgecarleitao merged commit 2b86bc9 into main Dec 10, 2021
@jorgecarleitao jorgecarleitao deleted the bench_required branch December 10, 2021 18:52
Labels
enhancement An improvement to an existing feature
Development

Successfully merging this pull request may close these issues.

reading of parquet binary type, is very slow:
2 participants