Simplify ColumnReader::read_batch
#1995
use super::RecordReader;

struct TestPageReader {
This just duplicates InMemoryPageReader
@@ -163,26 +163,18 @@ where
    }
}

- /// Reads a batch of values of at most `batch_size`.
+ /// Reads a batch of values of at most `batch_size`, returning a tuple containing the
This is the breaking change. We could make it non-breaking by always returning 0 if max_def_level == 0 && max_rep_level == 0, but I personally think that is a little confusing. This change makes the behaviour consistent, instead of potentially varying based on the column definition.
To be honest, I don't fully understand the subtle difference between the "corresponding number of levels, i.e., the total number of values including nulls, empty lists, etc." and the actual number of levels read, but it seems like a good change to me, and I would defer to you on what the most appropriate semantics should be.
The difference arises because if a column is not repeated / nullable, the corresponding level data is omitted. The result is that whilst there is still the same number of levels, the reader technically won't read any level data.
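A toy sketch may help make the distinction concrete. This is not the parquet crate's code: `count_levels_and_values` is a hypothetical helper, and it assumes a flat (non-repeated) column in which a slot is non-null exactly when its definition level equals `max_def_level`.

```rust
/// Hypothetical helper, for illustration only.
/// Returns `(number_of_levels, number_of_non_null_values)` for one page.
fn count_levels_and_values(
    max_def_level: i16,
    def_levels: Option<&[i16]>,
    num_stored_values: usize,
) -> (usize, usize) {
    match def_levels {
        // Nullable column: one level per slot, nulls included.
        Some(levels) => {
            let non_null = levels.iter().filter(|&&l| l == max_def_level).count();
            (levels.len(), non_null)
        }
        // Required column: no level data is stored, but every stored value
        // still corresponds to exactly one (implicit) level.
        None => (num_stored_values, num_stored_values),
    }
}
```

In the second arm no level data is ever read, yet the count of levels is still well defined, which is the number the new return value is meant to report.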
    }
}

struct TestPageReader {
Another duplicate of InMemoryPageReader
@@ -1300,44 +1272,10 @@ mod tests {
        );
    }

    if def_levels.is_none() && rep_levels.is_none() {
Here we see the breaking change in behaviour
    "Must call `add_rep_levels() first!`"
);

self.num_values = def_levels.len() as u32;
This logic was ill-formed, as it prevented having def levels without rep levels. Removing this sanity check is harmless, especially as this is just used for testing
Double checked DataPageBuilder is used for testing:
https://docs.rs/arrow/17.0.0/arrow/?search=DataPageBuilder
👍
    .read(levels, levels_read..levels_read + iter_batch_size)?;

if num_def_levels != iter_batch_size {
    return Err(general_err!("insufficient definition levels read from column - expected {}, got {}", iter_batch_size, num_def_levels));
This, and similar checks below, will prevent #1997
if num_def_levels != 0
    && num_rep_levels != 0
This is the source of #1997 - the value of 0 is used as a sentinel for no levels present, which prevents detecting the case of no actual values left
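The ambiguity can be shown with a small, self-contained sketch. All names here (`read_levels_sentinel`, `read_levels_explicit`, `LevelsRead`) are hypothetical and are not taken from the parquet crate; the point is only that a bare `0` cannot distinguish "this column has no levels" from "there are no levels left".

```rust
/// Explicit outcome of trying to read levels for one batch (hypothetical type).
#[derive(Debug, PartialEq)]
enum LevelsRead {
    /// The column has no levels at all.
    NotPresent,
    /// Levels exist and this many were read; `Read(0)` means nothing is left.
    Read(usize),
}

/// Sentinel style: 0 means "no levels present" AND "nothing left to read".
fn read_levels_sentinel(levels: Option<&[i16]>, offset: usize, batch: usize) -> usize {
    match levels {
        None => 0, // sentinel for "column has no levels"
        Some(l) => l.len().saturating_sub(offset).min(batch),
    }
}

/// Explicit style: the two cases map to different variants.
fn read_levels_explicit(levels: Option<&[i16]>, offset: usize, batch: usize) -> LevelsRead {
    match levels {
        None => LevelsRead::NotPresent,
        Some(l) => LevelsRead::Read(l.len().saturating_sub(offset).min(batch)),
    }
}
```

With the sentinel version, a caller that checks `!= 0` before validating the count will silently skip the exhausted case, which is the shape of the problem described above.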
Codecov Report
@@ Coverage Diff @@
## master #1995 +/- ##
=======================================
Coverage 83.58% 83.58%
=======================================
Files 222 222
Lines 57495 57467 -28
=======================================
- Hits 48056 48035 -21
+ Misses 9439 9432 -7
Miscellaneous parquet cleanups
@@ -1036,7 +1005,7 @@ mod tests {
} else {
    0
};
let max_rep_level = if def_levels.is_some() {
👍
@@ -218,32 +218,32 @@ where
/// The implementation has side effects. It will create a new buffer to hold those
/// definition level values that have already been read into memory but not counted
/// as record values, e.g. those from `self.num_values` to `self.values_written`.
- pub fn consume_def_levels(&mut self) -> Result<Option<Buffer>> {
-     Ok(match self.def_levels.as_mut() {
+ pub fn consume_def_levels(&mut self) -> Option<Buffer> {
The changes in this file are cleanups that make the signature infallible where the function always returned Ok, right? (It is also an API change, and perhaps also fixes a clippy lint.)
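As a rough illustration of that kind of cleanup, the sketch below contrasts the two signatures on a toy type. `LevelBuffer` and both method names are hypothetical, `Vec<i16>` stands in for the real `Buffer`, and the body is simplified to a plain `std::mem::take` rather than the real buffer-splitting logic.

```rust
/// Toy stand-in for a record reader's level storage (hypothetical type).
struct LevelBuffer {
    def_levels: Option<Vec<i16>>,
}

impl LevelBuffer {
    /// Before: the function can never fail, yet every caller must handle `Err`.
    fn consume_def_levels_fallible(&mut self) -> Result<Option<Vec<i16>>, String> {
        Ok(self.def_levels.as_mut().map(|levels| std::mem::take(levels)))
    }

    /// After: the signature says what the body already guaranteed.
    fn consume_def_levels(&mut self) -> Option<Vec<i16>> {
        // Hand back the buffered levels and leave an empty buffer behind.
        self.def_levels.as_mut().map(|levels| std::mem::take(levels))
    }
}
```

Callers of the second form no longer need `?` or `unwrap()`, which is why the change is source-breaking even though the behaviour is the same.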
Which issue does this PR close?
Miscellaneous cleanup whilst working on #1792
Closes #1996
Closes #1997
Rationale for this change
The existing code had redundant result returns, duplicated logic, and confusing semantics.
What changes are included in this PR?
Changes the semantics of ColumnReader::read_batch to error if called on data without the necessary levels buffers, and to return the number of levels read even for columns that have max_def_level == 0 && max_rep_level == 0. This is technically a breaking change, although I think it makes the API a lot less confusing.
Cleans up some additional code
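The semantics described above can be sketched roughly as follows. This is a simplified model, not the actual `ColumnReader::read_batch`: the function name, parameters, error type, and the `(values_read, levels_read)` return order are all assumptions made for illustration, and repetition levels are left out entirely.

```rust
/// Hypothetical model of the new behaviour for a flat column.
/// Returns `(values_read, levels_read)`.
fn read_batch_sketch(
    max_def_level: i16,
    page_def_levels: &[i16], // empty when max_def_level == 0
    page_num_values: usize,  // stored values, used when there are no levels
    batch_size: usize,
    def_levels_out: Option<&mut Vec<i16>>,
) -> Result<(usize, usize), String> {
    if max_def_level == 0 {
        // No level data exists, but each value still counts as one level,
        // so the level count is reported instead of a hard-coded 0.
        let n = batch_size.min(page_num_values);
        return Ok((n, n));
    }

    // Levels exist: the caller must supply somewhere to put them.
    let out = def_levels_out
        .ok_or_else(|| "must specify definition levels".to_string())?;

    let n = batch_size.min(page_def_levels.len());
    let levels = &page_def_levels[..n];
    out.extend_from_slice(levels);

    // Only slots at the maximum definition level hold an actual value.
    let values = levels.iter().filter(|&&l| l == max_def_level).count();
    Ok((values, n))
}
```

The first branch is the part that changes observable results for required columns; the `ok_or_else` is the part that turns a previously silent case into an error.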
Are there any user-facing changes?
Yes