Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify ColumnReader::read_batch #1995

Merged
merged 1 commit into from
Jul 5, 2022
Merged

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jul 3, 2022

Which issue does this PR close?

Miscellaneous cleanup whilst working on #1792

Closes #1996
Closes #1997

Rationale for this change

The existing code had redundant result returns, duplicated logic, and confusing semantics.

What changes are included in this PR?

Changes the semantics of ColumnReader::read_batch to error if called on data without the necessary levels buffers, and to return the number of levels read even for columns that have max_def_level == 0 && max_rep_level == 0. This is technically a breaking change, although I think it makes the API a lot less confusing.

Cleanups some additional code

Are there any user-facing changes?

Yes

@tustvold tustvold added the api-change Changes to the arrow API label Jul 3, 2022
@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Jul 3, 2022

use super::RecordReader;

struct TestPageReader {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just duplicated InMemoryPageReader

@@ -163,26 +163,18 @@ where
}
}

/// Reads a batch of values of at most `batch_size`.
/// Reads a batch of values of at most `batch_size`, returning a tuple containing the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the breaking change, we could make this non-breaking by always returning 0 if max_def_level == 0 && max_rep_level == 0, but I personally think this is a little bit confusing. This make the behaviour consistent, instead of potentially varying based on the column definition.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't fully understand the subtle difference between corresponding number of levels, i.e, the total number of values including nulls, empty lists, etc.. and the actual number of levels read. but it seems like a good change to me and I would defer to you for what the most appropriate semantics should be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference arises because if a column is repeated / nullable, the corresponding level data is omitted. The result is that whilst there is still the same number of levels, it technically won't read any level data

}
}

struct TestPageReader {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another duplicate of InMemoryPageReader

@@ -1300,44 +1272,10 @@ mod tests {
);
}

if def_levels.is_none() && rep_levels.is_none() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we see the breaking change in behaviour

"Must call `add_rep_levels() first!`"
);

self.num_values = def_levels.len() as u32;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic was ill-formed, as it prevented having def levels without rep levels. Removing this sanity check is harmless, especially as this is just used for testing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.read(levels, levels_read..levels_read + iter_batch_size)?;

if num_def_levels != iter_batch_size {
return Err(general_err!("insufficient definition levels read from column - expected {}, got {}", iter_batch_size, num_def_levels));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, and similar checks below, will prevent #1997

Comment on lines -261 to -262
if num_def_levels != 0
&& num_rep_levels != 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the source of #1997 - the value of 0 is used as a sentinel for no levels present, which prevents detecting the case of no actual values left

@codecov-commenter
Copy link

codecov-commenter commented Jul 3, 2022

Codecov Report

Merging #1995 (d7a306f) into master (7ae97c9) will increase coverage by 0.00%.
The diff coverage is 92.30%.

@@           Coverage Diff           @@
##           master    #1995   +/-   ##
=======================================
  Coverage   83.58%   83.58%           
=======================================
  Files         222      222           
  Lines       57495    57467   -28     
=======================================
- Hits        48056    48035   -21     
+ Misses       9439     9432    -7     
Impacted Files Coverage Δ
parquet/src/arrow/arrow_reader.rs 96.87% <ø> (ø)
parquet/src/column/reader.rs 68.31% <81.81%> (-0.30%) ⬇️
parquet/src/arrow/array_reader/byte_array.rs 85.71% <100.00%> (ø)
...et/src/arrow/array_reader/byte_array_dictionary.rs 83.91% <100.00%> (ø)
parquet/src/arrow/array_reader/null_array.rs 88.00% <100.00%> (ø)
parquet/src/arrow/array_reader/primitive_array.rs 89.65% <100.00%> (ø)
parquet/src/arrow/record_reader/mod.rs 94.73% <100.00%> (+1.16%) ⬆️
parquet/src/util/test_common/page_util.rs 88.54% <100.00%> (-0.12%) ⬇️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.23%) ⬇️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ae97c9...d7a306f. Read the comment docs.

Miscellaneous parquet cleanups
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert in this area of the code, but I read this PR carefully and it makes sense to me and it was easier to read. Thank you @tustvold

cc @sunchao @nevi-me

@@ -163,26 +163,18 @@ where
}
}

/// Reads a batch of values of at most `batch_size`.
/// Reads a batch of values of at most `batch_size`, returning a tuple containing the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I don't fully understand the subtle difference between corresponding number of levels, i.e, the total number of values including nulls, empty lists, etc.. and the actual number of levels read. but it seems like a good change to me and I would defer to you for what the most appropriate semantics should be

@@ -1036,7 +1005,7 @@ mod tests {
} else {
0
};
let max_rep_level = if def_levels.is_some() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

"Must call `add_rep_levels() first!`"
);

self.num_values = def_levels.len() as u32;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -218,32 +218,32 @@ where
/// The implementation has side effects. It will create a new buffer to hold those
/// definition level values that have already been read into memory but not counted
/// as record values, e.g. those from `self.num_values` to `self.values_written`.
pub fn consume_def_levels(&mut self) -> Result<Option<Buffer>> {
Ok(match self.def_levels.as_mut() {
pub fn consume_def_levels(&mut self) -> Option<Buffer> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this file are cleanups to make the signature infallible when it always returns Ok, right? (and it is also an API change, and perhaps also fixes clippy)

@tustvold tustvold merged commit c757829 into apache:master Jul 5, 2022
@alamb alamb changed the title Simplify ColumnReader::read_batch Simplify ColumnReader::read_batch Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Infinite Loop possible in ColumnReader::read_batch For Corrupted Files Fix New Clippy Lints
3 participants