parquet: Speed up `BitReader`/`DeltaBitPackDecoder` #325

kornholi · 2021-05-19T00:28:21Z

This PR removes some reference counting in `BitReader` and a few allocations in `DeltaBitPackDecoder`.

At least for the datasets I tested with, delta-encoded integer columns decode around 50% faster. A `SELECT AVG(foo)` query through DataFusion was about 30% faster as well.

From a quick test, avoiding temporary `BufferPtr`s in `BitReader` speeds up reading delta-packed int columns by over 30%.

From a quick test, avoiding some allocations in `DeltaBitPackDecoder` makes decoding around 10% faster overall.
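
To illustrate the kind of reference-counting overhead described above, here is a minimal sketch. It is not the actual `BitReader` code: `SharedBytes`, `start_from`, and both `read_u64_le_*` functions are hypothetical stand-ins, and the sketch assumes an `Rc`-backed buffer. Reading through a temporary handle bumps and drops the reference count on every call, while reading through a borrowed slice does not.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a reference-counted byte buffer
/// (not the parquet crate's real type).
struct SharedBytes {
    data: Rc<Vec<u8>>,
    start: usize,
}

impl SharedBytes {
    fn new(bytes: Vec<u8>) -> Self {
        Self { data: Rc::new(bytes), start: 0 }
    }

    /// Borrow the visible bytes without touching the reference count.
    fn as_slice(&self) -> &[u8] {
        &self.data[self.start..]
    }

    /// Create a new handle at `offset`; this increments the reference count.
    fn start_from(&self, offset: usize) -> SharedBytes {
        SharedBytes { data: Rc::clone(&self.data), start: self.start + offset }
    }
}

/// Read up to 8 bytes as a little-endian u64, zero-padding a short tail.
fn read_u64_le(bytes: &[u8]) -> u64 {
    let mut tmp = [0u8; 8];
    let n = bytes.len().min(8);
    tmp[..n].copy_from_slice(&bytes[..n]);
    u64::from_le_bytes(tmp)
}

/// Slower pattern: a temporary handle is created and dropped per read,
/// so the reference count is incremented and decremented each time.
fn read_u64_le_via_clone(buf: &SharedBytes, offset: usize) -> u64 {
    let tmp = buf.start_from(offset);
    read_u64_le(tmp.as_slice())
}

/// Faster pattern: borrow a sub-slice, no reference counting involved.
fn read_u64_le_via_slice(buf: &SharedBytes, offset: usize) -> u64 {
    read_u64_le(&buf.as_slice()[offset..])
}
```

Both functions return the same value; the difference is only in the bookkeeping done per call, which can add up in a hot decoding loop.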

Dandandan · 2021-05-20T15:30:43Z

FYI @nevi-me @jorgecarleitao @yordan-pavlov

yordan-pavlov · 2021-05-20T18:38:53Z

@kornholi thank you for looking into `BitReader` and `DeltaBitPackDecoder`. These performance improvements will combine very well with the change I have been working on in #200. Like you, I have found the current implementation of the parquet reader to be fairly inefficient in places: for example, it unnecessarily creates clones of `ByteBufferPtr` and allocates `Vec`s. I can't wait to see what the combined performance improvements will be.

alamb · 2021-05-21T18:38:19Z

I restarted the CI checks on this PR, as the failure in the Windows tests seemed unrelated to your changes.

alamb

I think this looks good to me, but I would feel better if @sunchao took a look at it before we merge it.

alamb · 2021-05-21T18:45:14Z

parquet/src/encodings/decoding.rs

@@ -427,6 +425,7 @@ impl<T: DataType> DeltaBitPackDecoder<T> {
            );
            assert!(loaded == self.values_current_mini_block);
        } else {
+            self.deltas_in_mini_block.clear();


I don't understand the need for this change -- was calling clear() a major bottleneck? Or was it having to reinitialize the entire deltas_in_mini_block to default() in the self.use_batch branch?

kornholi

In this case, the resize is expensive even though it optimizes down to mostly a memset (there were only 4 elements in the array in my tests). It makes around a 5% difference in throughput.
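
For readers following along, here is a minimal sketch of the two ways a scratch buffer can be refilled. It is not the actual `DeltaBitPackDecoder` code: `MiniBlockScratch` and its methods are hypothetical, and the `resize` variant is only an assumption about what a zero-fill-then-overwrite pattern looks like. The `clear()` variant keeps the existing capacity and writes each slot once.

```rust
/// Hypothetical scratch state reused across mini-blocks
/// (not the parquet crate's real struct).
struct MiniBlockScratch {
    deltas: Vec<i64>,
}

impl MiniBlockScratch {
    fn new() -> Self {
        Self { deltas: Vec::new() }
    }

    /// Zero-fill the buffer to the target length, then overwrite each slot.
    /// Every element is written twice: once by `resize`, once by the loop.
    fn load_with_resize(&mut self, src: &[i64]) {
        self.deltas.clear();
        self.deltas.resize(src.len(), 0);
        for (slot, &delta) in self.deltas.iter_mut().zip(src) {
            *slot = delta;
        }
    }

    /// Drop the old contents but keep the allocation, then push new values.
    /// Every element is written once and no reallocation happens as long as
    /// the new length fits in the existing capacity.
    fn load_with_clear(&mut self, src: &[i64]) {
        self.deltas.clear();
        for &delta in src {
            self.deltas.push(delta);
        }
    }
}
```

Both methods leave the buffer with the same contents, so the choice between them only affects how much work is done per mini-block.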

sunchao

LGTM. Thanks @kornholi !

* parquet: Avoid temporary `BufferPtr`s in `BitReader`

  From a quick test, this speeds up reading delta-packed int columns by over 30%.

* parquet: Avoid some allocations in `DeltaBitPackDecoder`

  From a quick test, it seems to decode around 10% faster overall.

Co-authored-by: Kornelijus Survila <kornholijo@gmail.com>

kornholi added 2 commits May 18, 2021 19:01

parquet: Avoid temporary BufferPtrs in BitReader

a973f15

From a quick test, this speeds up reading delta-packed int columns by over 30%.

parquet: Avoid some allocations in DeltaBitPackDecoder

10d8390

From a quick test, it seems to decode around 10% faster overall.

kornholi force-pushed the pq-bitreader-allocs branch from 753019c to 10d8390 May 19, 2021 01:01

alamb added the parquet (Changes to the parquet crate) and enhancement (Any new improvement worthy of an entry in the changelog) labels May 21, 2021

alamb reviewed May 21, 2021


sunchao approved these changes May 24, 2021


sunchao merged commit b2de544 into apache:master May 24, 2021

kornholi deleted the pq-bitreader-allocs branch May 24, 2021 03:05

alamb mentioned this pull request May 24, 2021

Implement biweekly releases for arrow-rs, parquet-rs #292

Closed


alamb added the cherry-picked label Jun 4, 2021

alamb mentioned this pull request Jun 4, 2021

Cherry pick parquet: Speed up BitReader/DeltaBitPackDecoder to active_release #408

Merged

alamb mentioned this pull request Jun 10, 2021

Add changelog and bump version for proposed 4.3.0 release #444

Merged
