Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify null mask preservation in parquet reader #2116

Merged
merged 6 commits into from
Jul 22, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jul 20, 2022

Which issue does this PR close?

Part of #2107

Rationale for this change

The original logic added in #1054 is very confusing as it determines whether to use a packed decoder, based on the type of DefinitionLevelBuffer passed to DefinitionLevelBufferDecoder. Not only is this confusing, but it creates a problem when skipping the first records in a column chunk, as the type of decoder is not known until data has been read 😱

This largely dated from a time when GenericRecordReader was generic over the levels in addition to values. In the end I removed this prior to merge as it was unnecessary complexity.

What changes are included in this PR?

Explicitly construct the decoders in GenericRecordReader, and passes them to GenericColumnReader::new_with_decoders. This allows adding an additional constructor parameter to DefinitionLevelBufferDecoder to instruct it whether to decode packed or not.

Are there any user-facing changes?

No, all these traits are crate private

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 20, 2022
@tustvold
Copy link
Contributor Author

I intend to run benchmarks for this shortly to confirm no regression

@codecov-commenter
Copy link

codecov-commenter commented Jul 20, 2022

Codecov Report

Merging #2116 (042aa9b) into master (efd3152) will increase coverage by 0.04%.
The diff coverage is 97.20%.

❗ Current head 042aa9b differs from pull request most recent head 4b40729. Consider uploading reports for the commit 4b40729 to get more accurate results

@@            Coverage Diff             @@
##           master    #2116      +/-   ##
==========================================
+ Coverage   83.73%   83.77%   +0.04%     
==========================================
  Files         225      225              
  Lines       59412    59474      +62     
==========================================
+ Hits        49748    49826      +78     
+ Misses       9664     9648      -16     
Impacted Files Coverage Δ
arrow/src/array/mod.rs 100.00% <ø> (ø)
arrow/src/buffer/mutable.rs 89.13% <ø> (ø)
arrow/src/compute/kernels/temporal.rs 94.10% <ø> (ø)
arrow/src/csv/reader.rs 89.98% <0.00%> (+0.09%) ⬆️
integration-testing/src/lib.rs 0.00% <0.00%> (ø)
parquet/src/column/reader/decoder.rs 61.33% <60.00%> (-2.05%) ⬇️
arrow/src/json/reader.rs 84.52% <64.28%> (-0.07%) ⬇️
arrow/src/compute/kernels/cast.rs 95.79% <85.71%> (ø)
...rquet/src/arrow/record_reader/definition_levels.rs 86.44% <87.87%> (+1.25%) ⬆️
parquet/src/arrow/array_reader/byte_array.rs 85.79% <90.90%> (-0.57%) ⬇️
... and 44 more

@tustvold tustvold force-pushed the simplify-null-mask-preservation branch from d4a689a to 4b40729 Compare July 20, 2022 22:36
@tustvold
Copy link
Contributor Author

For some reason this represents a performance regression... More investigation needed 🤔

@tustvold tustvold marked this pull request as draft July 20, 2022 22:38
@Ted-Jiang
Copy link
Member

Ted-Jiang commented Jul 21, 2022

when skipping the first records in a column chunk, as the type of decoder is not known until data has been read.

I haven't realize this 😂

I have one question: as

determines whether to use a packed decoder, based on the type of DefinitionLevelBuffer passed to DefinitionLevelBufferDecoder

but DefinitionLevelBuffer type is based on null_mask_only in previous. and null_mask_only is build at

let null_mask_only = field.def_level == 1 && field.nullable;

which in build_primitive_reader which not started read data.
So i think the type of packed decoder is known before read data.

@Ted-Jiang
Copy link
Member

i run IT in my modified version, it fail at skip first !😭
But works well on read first.

@tustvold
Copy link
Contributor Author

So i think the type of packed decoder is known before read data.

Correct, this is just a plumbing exercise to get that knowledge into the decoder at construction time

@tustvold tustvold marked this pull request as ready for review July 21, 2022 14:06
@tustvold
Copy link
Contributor Author

Perplexingly if you run the arrow_reader benchmark from the crate root, this does not represent a performance regression, but if you run it from within the parquet crate, it does... I'm not really sure what to make of this

@@ -195,7 +195,6 @@ where
///
/// `values` will be contiguously populated with the non-null values. Note that if the column
/// is not required, this may be less than either `batch_size` or the number of levels read
#[inline]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would appear that this can result in sub-optimal inlining behaviour, in particular when compiling the parquet crate there is a noticeable performance degredation. Unfortunately the inlined code is so mangled that I've been unable to determine exactly what is going on, but I may revisit this at a later date

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in "when you leave inline the benchmarks get slower"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you leave inline certain benchmarks get slower when compiled from the parquet crate, although there is no difference when compiled from the workspace level... It is incredibly strange...

@alamb alamb changed the title Simplify null mask preservation Simplify null mask preservation in parquet reader Jul 21, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice improvement to me. However, I am not an expert in this code so perhaps @sunchao or @viirya or @nevi-me would like to take a look

self.def_levels_buffer
.as_ref()
.map(|buf| buf.typed_data())
self.def_levels_buffer.as_ref().map(|buf| buf.typed_data())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not entirely clear to me why the formatting changed on these lines -- not that it is a bad change, but it seems like it wasn't a semantic change either 🤷

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know either...

@@ -195,7 +195,6 @@ where
///
/// `values` will be contiguously populated with the non-null values. Note that if the column
/// is not required, this may be less than either `batch_size` or the number of levels read
#[inline]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in "when you leave inline the benchmarks get slower"?

@@ -277,25 +288,25 @@ enum LevelDecoderInner {
impl ColumnLevelDecoder for ColumnLevelDecoderImpl {
type Slice = [i16];

fn new(max_level: i16, encoding: Encoding, data: ByteBufferPtr) -> Self {
let bit_width = num_required_bits(max_level as u64);
fn set_data(&mut self, encoding: Encoding, data: ByteBufferPtr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert in this area, but the new code structure seems to make sense to me

let def_levels = (desc.max_def_level() > 0)
.then(|| DefinitionLevelBuffer::new(&desc, null_mask_only));
.then(|| DefinitionLevelBuffer::new(&desc, packed_null_mask(&desc)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this is the key change in this PR? that the decision to use a null mask is pushed down to this level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct 👍

@alamb
Copy link
Contributor

alamb commented Jul 21, 2022

CI Failure should be fixed by #2121

@tustvold tustvold merged commit a9fa1b4 into apache:master Jul 22, 2022
@ursabot
Copy link

ursabot commented Jul 22, 2022

Benchmark runs are scheduled for baseline = 5e3facf and contender = a9fa1b4. a9fa1b4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants