Improve parquet performance: Skip levels computation for required struct arrays in parquet #1035
Codecov Report
@@ Coverage Diff @@
## master #1035 +/- ##
==========================================
- Coverage 82.31% 82.29% -0.03%
==========================================
Files 168 168
Lines 49420 49423 +3
==========================================
- Hits 40681 40673 -8
- Misses 8739 8750 +11
I finally had a chance to review this PR, and now that I see what it is doing it looks correct to me. 👍 thank you @tustvold
cc @nevi-me our resident array reader / struct array level expert and @liurenjie1024 as the original author
parquet/src/arrow/array_reader.rs
Outdated
let mut def_level_data_buffer = MutableBuffer::new(buffer_size);
def_level_data_buffer.resize(buffer_size, 0);
// Now we can build array data
let mut array_data = ArrayDataBuilder::new(self.data_type.clone())
Maybe calling this variable array_data_builder would be clearer, as it is not an ArrayData?
@tustvold note there are some conflicts that need to be resolved in this PR.
Since no one else has any comments, merging this in.
Which issue does this PR close?
Closes #1034.
Rationale for this change
See ticket
What changes are included in this PR?
Changes StructArrayReader to not compute definition, repetition and validity buffers for required struct arrays.
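To illustrate the idea, here is a minimal, standalone sketch of the fast path. It is not the actual StructArrayReader code: `StructLevels`, `compute_struct_levels`, and all of their parameters are hypothetical names invented for this example, and the slow path (taking the element-wise minimum of the child definition levels) is an assumption about how such a computation might look rather than a copy of the crate's logic.

```rust
/// Hypothetical container for the buffers a struct reader might expose.
/// `None` means "not computed because it cannot be needed".
#[derive(Debug, PartialEq)]
struct StructLevels {
    def_levels: Option<Vec<i16>>,
    rep_levels: Option<Vec<i16>>,
    validity: Option<Vec<bool>>,
}

/// Hypothetical helper: derive struct-level buffers from child levels.
/// `struct_def_level` / `struct_rep_level` are the maximum definition and
/// repetition levels of the struct column itself.
fn compute_struct_levels(
    struct_def_level: i16,
    struct_rep_level: i16,
    child_def_levels: &[Vec<i16>],
    child_rep_levels: Option<&[i16]>,
    len: usize,
) -> StructLevels {
    // Fast path: both levels are zero, so the struct is required and has no
    // nullable or repeated ancestor. No parent can need these buffers.
    if struct_def_level == 0 && struct_rep_level == 0 {
        return StructLevels {
            def_levels: None,
            rep_levels: None,
            validity: None,
        };
    }

    // Slow path (assumed logic): start from the struct's own level and
    // clamp by each child's definition level.
    let mut def_levels = vec![struct_def_level; len];
    for child in child_def_levels {
        for (d, c) in def_levels.iter_mut().zip(child) {
            *d = (*d).min(*c);
        }
    }

    // A slot is valid when the struct itself is defined at that position.
    let validity = def_levels.iter().map(|d| *d >= struct_def_level).collect();

    // Repetition levels only matter when the struct sits under a repeated field.
    let rep_levels = if struct_rep_level > 0 {
        child_rep_levels.map(|r| r.to_vec())
    } else {
        None
    };

    StructLevels {
        def_levels: Some(def_levels),
        rep_levels,
        validity: Some(validity),
    }
}
```

In this sketch the required case returns immediately without allocating or scanning any child buffers, which is the saving the PR is after.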
Are there any user-facing changes?
Technically this alters the precise semantics of ArrayReader, which is currently a public trait. I think this is potentially unintentional (#1032), but even then the documented purpose of these methods is for a parent ArrayReader to handle nulls and repeated arrays. However, such a parent cannot exist in the altered edge case, as otherwise the definition/repetition levels of the struct array would be non-zero.