This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved performance of reading utf8 required from parquet (-15%) #670

Merged

jorgecarleitao merged 2 commits into main from bench_required on Dec 10, 2021

Conversation

jorgecarleitao (Owner)

Closes #666

Two optimizations:

  • pre-allocate
  • inline reading of the u32 length in little endian (parquet2's get_length is not inlined and causes some performance problems); see the sketch below
read required utf8 2^20 time:   [8.0038 ms 8.0496 ms 8.0977 ms]                                    
                        change: [-17.183% -16.507% -15.848%] (p = 0.00 < 0.05)
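
Roughly, the two changes amount to something like the following sketch (a hedged illustration with made-up names, not the crate's actual code): pre-allocate the offsets/values buffers and read each 4-byte little-endian length inline rather than through a non-inlined helper.

    /// Inlined little-endian u32 read (illustrative stand-in for parquet2's get_length).
    #[inline]
    fn read_u32_le(b: &[u8]) -> u32 {
        u32::from_le_bytes([b[0], b[1], b[2], b[3]])
    }

    /// Decodes a PLAIN-encoded byte-array page `[len][bytes][len][bytes]...`
    /// into Arrow-style `offsets` and `values` buffers.
    fn decode_plain(page: &[u8], num_values: usize) -> (Vec<i32>, Vec<u8>) {
        let mut offsets = Vec::with_capacity(num_values + 1);
        // every value carries a 4-byte length prefix, so the payload occupies
        // at most `page.len() - 4 * num_values` bytes
        let mut values = Vec::with_capacity(page.len().saturating_sub(4 * num_values));
        offsets.push(0i32);

        let mut remaining = page;
        for _ in 0..num_values {
            let len = read_u32_le(remaining) as usize;
            values.extend_from_slice(&remaining[4..4 + len]);
            offsets.push(values.len() as i32);
            remaining = &remaining[4 + len..];
        }
        (offsets, values)
    }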

Thanks @ldn9638 for the idea!

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Dec 9, 2021
codecov bot commented Dec 9, 2021

Codecov Report

Merging #670 (bb98a1d) into main (a18555c) will increase coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #670      +/-   ##
==========================================
+ Coverage   69.59%   69.67%   +0.07%     
==========================================
  Files         299      301       +2     
  Lines       16746    16803      +57     
==========================================
+ Hits        11655    11707      +52     
- Misses       5091     5096       +5     
Impacted Files Coverage Δ
src/io/parquet/read/binary/basic.rs 68.32% <100.00%> (+0.60%) ⬆️
src/io/parquet/read/utils.rs 78.94% <100.00%> (ø)
src/compute/utils.rs 72.72% <0.00%> (-2.28%) ⬇️
src/scalar/utf8.rs 92.30% <0.00%> (-2.14%) ⬇️
src/buffer/immutable.rs 96.22% <0.00%> (-1.51%) ⬇️
src/compute/upper.rs 100.00% <0.00%> (ø)
src/compute/lower.rs 100.00% <0.00%> (ø)
src/scalar/binary.rs 96.15% <0.00%> (+1.70%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a18555c...bb98a1d. Read the comment docs.

ldn9638 commented Dec 10, 2021

Very cool!
But I still have an idea for variable-length binary types. The current memory layout is
{size,data|size,data|size,data|size,data|.......}
If it were changed to
{size_buf_size|size,size,size|data,data,data}

would reading be more efficient? It would avoid the per-value extend_from_slice in BinaryIter.
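
For illustration, reading the proposed layout could look roughly like this hedged sketch (hypothetical names and format, not parquet's actual encoding nor this crate's code): the lengths are turned into offsets up front, and the whole data region is copied with a single extend_from_slice instead of one copy per value.

    // Sketch of decoding the proposed {size_buf_size | size,size,... | data,data,...} layout.
    fn decode_proposed(page: &[u8]) -> (Vec<i32>, Vec<u8>) {
        // first 4 bytes: how many bytes the length section occupies
        let size_buf_size =
            u32::from_le_bytes([page[0], page[1], page[2], page[3]]) as usize;
        let (lengths, data) = page[4..].split_at(size_buf_size);

        // build offsets from the contiguous length section
        let mut offsets = Vec::with_capacity(size_buf_size / 4 + 1);
        offsets.push(0i32);
        for chunk in lengths.chunks_exact(4) {
            let len = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]) as i32;
            let last = *offsets.last().unwrap();
            offsets.push(last + len);
        }

        // a single copy for the whole values region
        let mut values = Vec::with_capacity(data.len());
        values.extend_from_slice(data);
        (offsets, values)
    }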

ldn9638 commented Dec 10, 2021

The main reason is that if the row group contains many data pages, page_stream_to_array stays very slow while constructing the array.

jorgecarleitao (Owner, Author)

The [size][values]...[size][values] layout comes from parquet's PLAIN encoding. If the page is DELTA_LENGTH_BYTE_ARRAY, it has the format that you mention (see also ARROW-13388).

In general a column chunk should have a small number of pages (usually 2-3 for a dict-encoded column and 1 for a non-dict-encoded one). More pages mean more fragmentation and thus less compression. Since many parquet implementations do not write or support filter pushdown at the page level, more fragmentation does not even help in skipping pages.

> it will continue to be very slow when page_stream_to_array constructs the array.

I think that this is the fastest possible under the constraints of the parquet format, since we can't tell the total size of a data page prior to reading its header (but I would love to be proven wrong and learn something new ^_^)

ldn9638 commented Dec 10, 2021

Thank you for your reply.
Can I use metadata.uncompressed_size() - metadata.num_values() * 4 as the initial capacity of the values container?

jorgecarleitao (Owner, Author)

> Can I use metadata.uncompressed_size() - metadata.num_values() * 4 as the initial capacity of the values container?

I think that the following holds:

metadata.uncompressed_size() = sum_{pages} page.header + page.buffer

The pages may include a dictionary page (we only know once we read them), and individual pages may be encoded differently (e.g. mixing PLAIN with DELTA_LENGTH_BYTE_ARRAY); metadata.uncompressed_size will vary accordingly. So it is a bit difficult to make a general statement for optimizations.

I think that the DELTA_LENGTH_BYTE_ARRAY encoding is the solution for this; it has a better compression ratio and is faster to convert to and from arrow.

For reference, I took the equation above from looking at the writer (here and here).
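
For illustration only, under the (strong) assumptions that all data pages in the chunk are PLAIN-encoded byte arrays and there is no dictionary page, the estimate from the question above could be written as the sketch below; given the caveats above it over-estimates by the page headers and breaks down for other encodings. The parameter names stand in for the metadata fields mentioned in the question.

    // Hedged sketch of the capacity estimate discussed above, assuming PLAIN-only pages
    // (4-byte length prefix per value) and no dictionary page.
    fn estimated_values_capacity(uncompressed_size: i64, num_values: i64) -> usize {
        // page headers are included in uncompressed_size, so this slightly over-estimates
        (uncompressed_size - 4 * num_values).max(0) as usize
    }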

ldn9638 commented Dec 10, 2021

Thanks for your reply. I'll study this further.

@jorgecarleitao jorgecarleitao merged commit 2b86bc9 into main Dec 10, 2021
@jorgecarleitao jorgecarleitao deleted the bench_required branch December 10, 2021 18:52
Labels
enhancement An improvement to an existing feature
Development

Successfully merging this pull request may close these issues.

reading of parquet binary type, is very slow:
2 participants