Do not write ColumnIndex for null columns when not writing page statistics #6011
Thanks @etseidl - I will check this one out shortly.
Thank you for this contribution @etseidl. I reviewed the code and logic carefully.
I took the liberty of merging this PR up from main so the CI would pass.
@@ -260,6 +260,12 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
        // Used for level information
        encodings.insert(Encoding::RLE);

        // Disable column_index_builder if not collecting page statistics.
I think the reason this currently works for columns without nulls is that the ColumnIndex builder is marked as invalid at the end of a page if there are no page statistics. However, when the column has no data, this code is not run:
arrow-rs/parquet/src/column/writer/mod.rs, line 661 in 9b34950:
self.column_index_builder.to_invalid();
So the TL;DR is that I think this change looks good to me.
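To make the reasoning above concrete, here is a minimal, self-contained sketch of the behaviour as described in this thread. It is a toy model, not the arrow-rs implementation: `StatsLevel`, `ToyIndexBuilder`, `ToyColumnWriter`, `index_written` and the `apply_fix` flag are all hypothetical names made up for illustration, and only the `valid()` / `to_invalid()` idea is taken from the discussion.

```rust
/// Hypothetical stand-in for the writer's statistics setting.
#[derive(Clone, Copy, PartialEq, Debug)]
enum StatsLevel {
    None,
    Chunk,
    Page,
}

/// Toy index builder: records one "is this page all null?" flag per page
/// until it is invalidated. Not the real ColumnIndexBuilder.
struct ToyIndexBuilder {
    null_pages: Vec<bool>,
    invalid: bool,
}

impl ToyIndexBuilder {
    fn new() -> Self {
        Self { null_pages: Vec::new(), invalid: false }
    }
    fn valid(&self) -> bool {
        !self.invalid
    }
    fn to_invalid(&mut self) {
        self.invalid = true;
    }
    fn append(&mut self, null_page: bool) {
        self.null_pages.push(null_page);
    }
    /// Returns None when the builder was invalidated, i.e. no index is written.
    fn build(self) -> Option<Vec<bool>> {
        if self.invalid { None } else { Some(self.null_pages) }
    }
}

/// Toy column writer. `apply_fix` toggles the behaviour proposed in this PR.
struct ToyColumnWriter {
    stats: StatsLevel,
    builder: ToyIndexBuilder,
}

impl ToyColumnWriter {
    fn new(stats: StatsLevel, apply_fix: bool) -> Self {
        let mut builder = ToyIndexBuilder::new();
        // The fix: invalidate up front unless page statistics are requested.
        if apply_fix && stats != StatsLevel::Page {
            builder.to_invalid();
        }
        Self { stats, builder }
    }

    /// Called at the end of each page.
    fn finish_page(&mut self, all_null: bool) {
        if !self.builder.valid() {
            return;
        }
        if all_null {
            // All-null pages never reach the "no page statistics" branch below.
            self.builder.append(true);
        } else if self.stats == StatsLevel::Page {
            self.builder.append(false);
        } else {
            // No page statistics available: give up on the index.
            self.builder.to_invalid();
        }
    }

    fn close(self) -> Option<Vec<bool>> {
        self.builder.build()
    }
}

/// Writes the given pages and reports whether an index would be emitted.
fn index_written(stats: StatsLevel, apply_fix: bool, pages_all_null: &[bool]) -> bool {
    let mut writer = ToyColumnWriter::new(stats, apply_fix);
    for &all_null in pages_all_null {
        writer.finish_page(all_null);
    }
    writer.close().is_some()
}
```

In this model, `index_written(StatsLevel::None, false, &[true])` is true (the all-null column still gets an index without the fix), while the same call with `apply_fix` set to true is false. Invalidating at construction time keeps the decision in one place instead of relying on a per-page branch that all-null pages never reach.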
        writer.write_batch(&data, Some(&def_levels), None).unwrap();
        writer.flush_data_pages().unwrap();

        let column_close_result = writer.close().unwrap();
I verified that this test covers the code by removing the change and confirming that the test then fails as follows:
assertion failed: column_close_result.column_index.is_none()
thread 'column::writer::tests::test_no_column_index_when_stats_disabled' panicked at parquet/src/column/writer/mod.rs:3044:9:
assertion failed: column_close_result.column_index.is_none()
stack backtrace:
0: rust_begin_unwind
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
1: core::panicking::panic_fmt
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
2: core::panicking::panic
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:146:5
3: parquet::column::writer::tests::test_no_column_index_when_stats_disabled
at ./src/column/writer/mod.rs:3044:9
4: parquet::column::writer::tests::test_no_column_index_when_stats_disabled::{{closure}}
at ./src/column/writer/mod.rs:3024:50
5: core::ops::function::FnOnce::call_once
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
6: core::ops::function::FnOnce::call_once
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I double-checked and I believe this PR contains no breaking API changes, so I am merging it directly to main (for inclusion in 52.0.0).
Thanks again @etseidl
Which issue does this PR close?
Closes #6010.
Rationale for this change
The ColumnIndex for an all-nulls column will be written even when page statistics are disabled. This is not the case for columns that have non-null values.
What changes are included in this PR?
This PR calls to_invalid() on the constructed ColumnIndexBuilder when page statistics are not requested (statistics_enabled != EnabledStatistics::Page). This prevents the null-column statistics from being written in update_column_offset_index().
arrow-rs/parquet/src/column/writer/mod.rs
Lines 643 to 650 in bed3746
Are there any user-facing changes?
No