Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve parquet performance: Skip levels computation for required struct arrays in parquet #1035

Merged
merged 3 commits into from
Jan 11, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #1034.

Rationale for this change

See ticket

What changes are included in this PR?

Changes StructArrayReader to not compute definition, repetition and validity buffers for required struct arrays.

Are there any user-facing changes?

Technically this alters the precise semantics of ArrayReader which is currently a public trait. I think this is potentially unintentional (#1032), but even then the documented purpose of these methods is for parent ArrayReader to handle nulls and repeated arrays. However, such a parent cannot exist in the altered edge-case, as otherwise the definition/repetition levels of the struct array would be non-zero.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 12, 2021
@tustvold tustvold force-pushed the struct-reader-levels-computation branch from 4533e29 to 944c722 Compare December 12, 2021 11:26
@codecov-commenter
Copy link

codecov-commenter commented Dec 12, 2021

Codecov Report

Merging #1035 (dce4801) into master (3dca969) will decrease coverage by 0.02%.
The diff coverage is 84.84%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1035      +/-   ##
==========================================
- Coverage   82.31%   82.29%   -0.03%     
==========================================
  Files         168      168              
  Lines       49420    49423       +3     
==========================================
- Hits        40681    40673       -8     
- Misses       8739     8750      +11     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader.rs 75.55% <84.84%> (-1.11%) ⬇️
arrow/src/array/transform/mod.rs 84.73% <0.00%> (-0.14%) ⬇️
parquet_derive/src/parquet_field.rs 66.43% <0.00%> (+0.68%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3dca969...dce4801. Read the comment docs.

@alamb alamb changed the title Skip levels computation for required struct arrays (#1034) Improve parquet performance: Skip levels computation for required struct arrays in parquet Dec 20, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally had a chance to review this PR, and now that I see what it is doing it looks correct to me. 👍 thank you @tustvold

cc @nevi-me our resident array reader / struct array level expert and @liurenjie1024 as the original author

let mut def_level_data_buffer = MutableBuffer::new(buffer_size);
def_level_data_buffer.resize(buffer_size, 0);
// Now we can build array data
let mut array_data = ArrayDataBuilder::new(self.data_type.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe calling this variable array_data_builder would be clearer as it not ArrayData?

@alamb
Copy link
Contributor

alamb commented Dec 20, 2021

@tustvold note there are some conflicts that need to be resolved in this PR

@alamb alamb merged commit 76fde56 into apache:master Jan 11, 2022
@alamb
Copy link
Contributor

alamb commented Jan 11, 2022

Since no one else has any comments, merging this in

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parquet Performance Optimization: StructArrayReader Redundant Level & Bitmap Computation
3 participants