Simplify null mask preservation in parquet reader #2116
Conversation
I intend to run benchmarks for this shortly to confirm no regression
Codecov Report

@@            Coverage Diff             @@
##           master    #2116      +/-   ##
==========================================
+ Coverage   83.73%   83.77%   +0.04%
==========================================
  Files         225      225
  Lines       59412    59474      +62
==========================================
+ Hits        49748    49826      +78
+ Misses       9664     9648      -16
Force-pushed from d4a689a to 4b40729
For some reason this represents a performance regression... More investigation needed 🤔
I hadn't realized this 😂 I have one question: the DefinitionLevelBuffer type is decided in build_primitive_reader, which runs before any data has been read. So I think the type of packed decoder is already known before reading data.
I ran the integration tests on my modified version, and it fails at skipping the first records! 😭
Correct, this is just a plumbing exercise to get that knowledge into the decoder at construction time
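To illustrate what "at construction time" buys here, below is a minimal, self-contained sketch. The types are invented stand-ins rather than the crate's real ones; the point is only that the decoder carries an explicit packed flag from the moment it is built, instead of having to discover its mode from the buffer it writes into once data starts arriving.

```rust
/// Hypothetical stand-in for the two ways definition levels can be buffered.
enum LevelBuffer {
    /// Levels kept directly as a packed null mask
    Packed(Vec<bool>),
    /// Levels kept as full i16 values
    Full(Vec<i16>),
}

/// Hypothetical decoder that carries the decision explicitly.
struct LevelDecoder {
    packed: bool, // known at construction, before any page is read
}

impl LevelDecoder {
    fn new(packed: bool) -> Self {
        Self { packed }
    }

    fn decode_into(&self, out: &mut LevelBuffer) {
        // The decoder no longer has to peek at `out` to discover its own mode,
        // which is what made "skip the first records" awkward before.
        match (self.packed, out) {
            (true, LevelBuffer::Packed(mask)) => mask.push(true),
            (false, LevelBuffer::Full(levels)) => levels.push(1),
            _ => panic!("decoder mode and buffer variant disagree"),
        }
    }
}

fn main() {
    // Nullable leaf column: the flag says "packed" up front.
    let decoder = LevelDecoder::new(true);
    let mut buffer = LevelBuffer::Packed(Vec::new());
    decoder.decode_into(&mut buffer);

    // A nested column would instead get a Full buffer and an unpacked decoder.
    let mut nested = LevelBuffer::Full(Vec::new());
    LevelDecoder::new(false).decode_into(&mut nested);
}
```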
Perplexingly if you run the arrow_reader benchmark from the crate root, this does not represent a performance regression, but if you run it from within the parquet crate, it does... I'm not really sure what to make of this
@@ -195,7 +195,6 @@ where
///
/// `values` will be contiguously populated with the non-null values. Note that if the column
/// is not required, this may be less than either `batch_size` or the number of levels read
#[inline]
It would appear that this can result in sub-optimal inlining behaviour; in particular, when compiling the parquet crate there is a noticeable performance degradation. Unfortunately the inlined code is so mangled that I've been unable to determine exactly what is going on, but I may revisit this at a later date
as in "when you leave `#[inline]` the benchmarks get slower"?
When you leave `#[inline]`, certain benchmarks get slower when compiled from the parquet crate, although there is no difference when compiled from the workspace level... It is incredibly strange...
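For readers unfamiliar with the attribute under discussion, here is a tiny, hypothetical example (not the crate's code) of the kind of small, hot method where `#[inline]` can matter. The attribute is only a hint, but adding or removing it can change which call sites the optimizer chooses to inline, which may be part of why the benchmark results differ depending on how the crate is compiled.

```rust
pub struct Counter(u64);

impl Counter {
    // With the hint, the compiler is strongly encouraged to inline this into
    // hot loops; without it, inlining is left entirely to optimizer heuristics.
    #[inline]
    pub fn bump(&mut self) -> u64 {
        self.0 += 1;
        self.0
    }
}

fn main() {
    let mut c = Counter(0);
    // In a loop like this, whether `bump` is inlined can be measurable.
    for _ in 0..1_000 {
        c.bump();
    }
    println!("total = {}", c.0);
}
```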
self.def_levels_buffer
    .as_ref()
    .map(|buf| buf.typed_data())
self.def_levels_buffer.as_ref().map(|buf| buf.typed_data())
it is not entirely clear to me why the formatting changed on these lines -- not that it is a bad change, but it seems like it wasn't a semantic change either 🤷
I don't know either...
@@ -277,25 +288,25 @@ enum LevelDecoderInner {
impl ColumnLevelDecoder for ColumnLevelDecoderImpl {
    type Slice = [i16];

    fn new(max_level: i16, encoding: Encoding, data: ByteBufferPtr) -> Self {
        let bit_width = num_required_bits(max_level as u64);
    fn set_data(&mut self, encoding: Encoding, data: ByteBufferPtr) {
I am not an expert in this area, but the new code structure seems to make sense to me
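For context, a rough sketch of the restructuring shown in the hunk above, with simplified names and without the `Encoding` parameter: everything derivable from the column metadata (such as the bit width from the max level) is fixed at construction, and `set_data` is then called once per page on the same decoder instead of rebuilding it each time data arrives.

```rust
// Stand-in for parquet's ByteBufferPtr.
type Bytes = Vec<u8>;

trait LevelDecoder {
    /// Feed the decoder the level data for the next page.
    fn set_data(&mut self, data: Bytes);
}

struct I16LevelDecoder {
    bit_width: u8,       // derived from max_level once, at construction
    data: Option<Bytes>, // replaced on every call to set_data
}

impl I16LevelDecoder {
    fn new(max_level: i16) -> Self {
        // Rough equivalent of num_required_bits: bits needed to hold max_level.
        let bit_width = (16 - (max_level as u16).leading_zeros()) as u8;
        Self { bit_width, data: None }
    }
}

impl LevelDecoder for I16LevelDecoder {
    fn set_data(&mut self, data: Bytes) {
        self.data = Some(data);
    }
}

fn main() {
    // One decoder per column chunk; each page just swaps in new data.
    let mut decoder = I16LevelDecoder::new(1);
    decoder.set_data(vec![0b0000_0011]);
    decoder.set_data(vec![0b0000_0001]);
    println!("bit width = {}", decoder.bit_width);
    println!("last page size = {:?}", decoder.data.as_ref().map(|d| d.len()));
}
```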
let def_levels = (desc.max_def_level() > 0)
    .then(|| DefinitionLevelBuffer::new(&desc, null_mask_only));
    .then(|| DefinitionLevelBuffer::new(&desc, packed_null_mask(&desc)));
Is this the key change in this PR -- that the decision to use a null mask is pushed down to this level?
Correct 👍
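To make the pushed-down decision concrete, here is a hedged sketch of the kind of predicate involved. The real `packed_null_mask` takes the parquet column descriptor; the condition below (a nullable leaf with no repetition, i.e. `max_def_level == 1` and `max_rep_level == 0`) is my reading of when definition levels can be kept directly as a packed null mask, and may not match the crate's exact check.

```rust
// Simplified stand-in for the column descriptor.
struct ColumnDesc {
    max_def_level: i16,
    max_rep_level: i16,
}

// Assumed condition: the column is a nullable leaf with no repeated/nested
// ancestors, so a definition level of 1 simply means "value present" and the
// levels collapse to a validity bitmask.
fn packed_null_mask(desc: &ColumnDesc) -> bool {
    desc.max_def_level == 1 && desc.max_rep_level == 0
}

fn main() {
    // Nullable top-level primitive column: levels are just a validity mask.
    assert!(packed_null_mask(&ColumnDesc { max_def_level: 1, max_rep_level: 0 }));
    // Nested or repeated column: full i16 levels are required.
    assert!(!packed_null_mask(&ColumnDesc { max_def_level: 2, max_rep_level: 1 }));
    println!("ok");
}
```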
CI Failure should be fixed by #2121
Benchmark runs are scheduled for baseline = 5e3facf and contender = a9fa1b4. a9fa1b4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?

Part of #2107

Rationale for this change

The original logic added in #1054 is very confusing, as it determines whether to use a packed decoder based on the type of `DefinitionLevelBuffer` passed to `DefinitionLevelBufferDecoder`. Not only is this confusing, but it creates a problem when skipping the first records in a column chunk, as the type of decoder is not known until data has been read 😱

This largely dated from a time when `GenericRecordReader` was generic over the levels in addition to the values. In the end I removed this prior to merge as it was unnecessary complexity.

What changes are included in this PR?

Explicitly construct the decoders in `GenericRecordReader` and pass them to `GenericColumnReader::new_with_decoders`. This allows adding an additional constructor parameter to `DefinitionLevelBufferDecoder` to instruct it whether to decode a packed null mask or not.

Are there any user-facing changes?

No, all these traits are crate private.
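A simplified sketch of the wiring described above. The type and constructor names follow the PR description, but the struct bodies and signatures are invented for illustration; the real types carry considerably more state.

```rust
struct DefinitionLevelBufferDecoder {
    packed: bool, // the new constructor parameter described above
}

impl DefinitionLevelBufferDecoder {
    fn new(max_level: i16, packed: bool) -> Self {
        let _ = max_level; // the real decoder also derives state from this
        Self { packed }
    }
}

struct GenericColumnReader {
    def_level_decoder: Option<DefinitionLevelBufferDecoder>,
}

impl GenericColumnReader {
    /// Mirrors the role of `new_with_decoders`: decoders arrive ready-made.
    fn new_with_decoders(def_level_decoder: Option<DefinitionLevelBufferDecoder>) -> Self {
        Self { def_level_decoder }
    }
}

struct GenericRecordReader;

impl GenericRecordReader {
    fn build_column_reader(max_def_level: i16, packed_null_mask: bool) -> GenericColumnReader {
        // Decoder construction now happens here, where the null-mask decision
        // is already known, rather than inside the column reader after data
        // has started flowing.
        let decoder = (max_def_level > 0)
            .then(|| DefinitionLevelBufferDecoder::new(max_def_level, packed_null_mask));
        GenericColumnReader::new_with_decoders(decoder)
    }
}

fn main() {
    let reader = GenericRecordReader::build_column_reader(1, true);
    let packed = reader.def_level_decoder.map(|d| d.packed).unwrap_or(false);
    println!("packed null mask decoding: {packed}");
}
```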