Use Vec in ColumnReader (#5177) #5193
Conversation
@@ -1016,34 +991,22 @@ mod tests {

#[test]
fn test_read_batch_values_only() {
    test_read_batch_int32(16, &mut [0; 10], None, None); // < batch_size
We no longer have this complexity, as the buffers are sized as needed
let mut max_levels = values.capacity().min(max_records);
if let Some(ref levels) = def_levels {
    max_levels = max_levels.min(levels.capacity());
}
if let Some(ref levels) = rep_levels {
    max_levels = max_levels.min(levels.capacity())
}
This is the fix for #5150: we no longer truncate the output based on the capacity of the output buffers
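For illustration, here is a contrived sketch (not the crate's actual code; the function names are mine) of the behavioural difference: the removed lines clamped the read to the pre-allocated capacity of the caller's buffers, whereas the Vec-backed buffers now simply grow to hold whatever is decoded.

```rust
// Contrived stand-alone sketch of the old clamping behaviour implemented by
// the removed lines above, versus the new grow-on-demand model.
fn old_clamp(values: &Vec<i32>, def_levels: Option<&Vec<i16>>, max_records: usize) -> usize {
    // old: anything beyond the smallest pre-allocated capacity was silently dropped
    let mut max_levels = values.capacity().min(max_records);
    if let Some(levels) = def_levels {
        max_levels = max_levels.min(levels.capacity());
    }
    max_levels
}

fn new_append(values: &mut Vec<i32>, decoded: &[i32], max_records: usize) -> usize {
    // new: append up to max_records values and let Vec reallocate as needed
    let to_read = decoded.len().min(max_records);
    values.extend_from_slice(&decoded[..to_read]);
    to_read
}
```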
fn truncate_buffer(&mut self, len: usize) {
    assert_eq!(self.buffer.len(), len * self.byte_length);
}
byte_length: Option<usize>,
This change is necessary so that we can use mem::take
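For context, a minimal sketch of the pattern this enables, assuming the buffer struct can now derive Default (the type name below is hypothetical, not the crate's):

```rust
use std::mem;

// Hypothetical fixed-length buffer: making byte_length an Option means an
// "empty" Default value exists, which is exactly what mem::take requires.
#[derive(Default)]
struct FixedLenBuffer {
    buffer: Vec<u8>,
    byte_length: Option<usize>,
}

fn take_filled(buf: &mut FixedLenBuffer) -> FixedLenBuffer {
    // swaps Default::default() into place and hands back the filled buffer
    mem::take(buf)
}
```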
@@ -320,7 +287,7 @@ mod tests {
);

// Can recreate with new dictionary as keys empty
assert!(matches!(&buffer, DictionaryBuffer::Dict { .. }));
assert!(matches!(&buffer, DictionaryBuffer::Values { .. }));
This is the result of using std::mem::take, and doesn't impact the higher-level dictionary preservation behaviour
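A toy illustration of why the assertion flips from Dict to Values (the enum below only mirrors the shape shown in the diff; it is not the crate's definition): mem::take leaves Default::default() behind, and if the default is the Values variant, a freshly taken buffer matches Values.

```rust
use std::mem;

// Sketch only: a two-variant buffer whose default is the Values variant.
enum DictionaryBufferSketch {
    Dict { keys: Vec<i32> },
    Values { values: Vec<i32> },
}

impl Default for DictionaryBufferSketch {
    fn default() -> Self {
        Self::Values { values: Vec::new() }
    }
}

fn main() {
    let mut buffer = DictionaryBufferSketch::Dict { keys: vec![1, 2, 3] };
    let _taken = mem::take(&mut buffer);
    // what is left behind is the default, i.e. the Values variant
    assert!(matches!(&buffer, DictionaryBufferSketch::Values { .. }));
}
```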
@@ -111,7 +50,7 @@ impl<T: Copy + Default> ValuesBuffer for Vec<T> {
    levels_read: usize,
    valid_mask: &[u8],
) {
    assert!(self.len() >= read_offset + levels_read);
    self.resize(read_offset + levels_read, T::default());
This change is necessary because we no longer have the get_output_slice / set_len behaviour which over-allocates at the start
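A simplified, self-contained sketch of the pad_nulls idea for Vec<T> (a bool slice stands in for the packed validity bitmask, and null slots are filled with T::default() here for clarity; the real implementation may handle them differently): grow the buffer to cover every level, then scatter the densely packed values backwards into the valid slots.

```rust
fn pad_nulls<T: Copy + Default>(
    values: &mut Vec<T>,
    read_offset: usize,
    values_read: usize,
    levels_read: usize,
    valid: &[bool], // simplified stand-in for the &[u8] bitmask in the diff
) {
    // grow on demand instead of asserting the caller over-allocated up front
    values.resize(read_offset + levels_read, T::default());

    // scatter the values_read dense values into the slots marked valid,
    // working backwards so nothing is overwritten before it has been moved
    let mut value_pos = read_offset + values_read;
    for level_pos in (read_offset..read_offset + levels_read).rev() {
        if valid[level_pos - read_offset] {
            value_pos -= 1;
            values[level_pos] = values[value_pos];
        } else {
            values[level_pos] = T::default();
        }
    }
}

fn main() {
    let mut values = vec![10_i32, 20, 30];        // three non-null values, densely packed
    let valid = [true, false, true, false, true]; // five levels, two of them null
    pad_nulls(&mut values, 0, 3, 5, &valid);
    assert_eq!(values, vec![10, 0, 20, 0, 30]);
}
```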
    max_records: usize,
    out: &mut Self::Buffer,
    num_records: usize,
    num_levels: usize,
We still have to provide num_levels, as parquet's HYBRID_RLE encoding doesn't know how many values it contains 🤦
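A toy decoder (not the real parquet code) showing why the count has to come from outside: each run only records its own length, and bit-packed groups are padded, so the stream itself cannot tell the reader where the real data ends.

```rust
// Toy model of an RLE-style level stream: a list of (value, run_length) runs.
// The stream never stores the total number of logical values, so the caller
// must supply num_levels to know when to stop (and to ignore padding).
fn decode_levels(runs: &[(i16, usize)], num_levels: usize) -> Vec<i16> {
    let mut out = Vec::with_capacity(num_levels);
    'outer: for &(value, len) in runs {
        for _ in 0..len {
            if out.len() == num_levels {
                break 'outer;
            }
            out.push(value);
        }
    }
    out
}

fn main() {
    // three 1s, then a run of 0s that includes padding; only the caller
    // knows that just five levels are real
    let runs = [(1_i16, 3), (0_i16, 8)];
    assert_eq!(decode_levels(&runs, 5), vec![1, 1, 1, 0, 0]);
}
```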
@@ -17,69 +17,8 @@
This module is crate-private, and so none of these changes are user-visible
@@ -423,7 +405,7 @@ impl RepetitionLevelDecoderImpl {
}

impl ColumnLevelDecoder for RepetitionLevelDecoderImpl {
    type Slice = [i16];
    type Buffer = Vec<i16>;
These traits are not user-visible; however, this change propagates out of ColumnReader and therefore is user-visible
I intend to run the benchmarks prior to merge, as there are some non-trivial changes to capacity allocation that may have performance impacts.
@@ -1750,7 +1750,7 @@ mod tests {
    assert_eq!(num_levels, 513);

    let expected: Vec<i64> = (1..514).collect();
    assert_eq!(&buffer[..513], &expected);
I think this is a major usability enhancement: you no longer need to be careful to keep track of how much of the values buffer actually contains data
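A contrived, stand-alone illustration of the difference (the numbers echo the test above; nothing here is the crate's code): previously the caller had to slice by a separately tracked count, now Vec::len() is the source of truth.

```rust
fn main() {
    let expected: Vec<i64> = (1..514).collect();

    // old style: an over-allocated buffer where only the first
    // `levels_read` entries were meaningful
    let levels_read = 513;
    let mut old_buffer = vec![0_i64; 1024];
    old_buffer[..levels_read].copy_from_slice(&expected);
    assert_eq!(&old_buffer[..levels_read], &expected[..]);

    // new style: the buffer's length is exactly what was read
    let new_buffer: Vec<i64> = expected.clone();
    assert_eq!(new_buffer, expected);
}
```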
    usize::MAX
}

fn count_nulls(&self, range: Range<usize>, _max_level: i16) -> usize {
This is rolled into read_def_levels
Thank you for this @tustvold -- I started reviewing it and I will try and find the time to complete the review in the next day or two.
Can you please run some performance benchmarks to make sure there is no inadvertent regression introduced?
I also suggest we give it a day or two to let other people comment if they would like a chance to review it
Clicked the wrong button; I haven't completed the review
Nice cleanup, as long as the benchmarks are good this LGTM 🥳
@@ -227,13 +226,13 @@ impl<I: OffsetSizeTrait> ColumnValueDecoder for ByteArrayColumnValueDecoder<I> {
    Ok(())
}

fn read(&mut self, out: &mut Self::Slice, range: Range<usize>) -> Result<usize> {
fn read(&mut self, out: &mut Self::Buffer, num_values: usize) -> Result<usize> {
So previously it seems you could pick where to start the read (i.e., range.start); now you must use skip_values to skip values?
Hmm, actually I also don't see how range.start is used to skip values. Maybe there is just no skip?
It's a weird quirk of how this API allowed using slices that didn't track their position; this is no longer necessary
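For illustration only, a simplified stand-in for the decoder trait (not the crate's actual definition) contrasting the two shapes:

```rust
use std::ops::Range;

#[allow(dead_code)]
trait ValueDecoderSketch {
    type Buffer;

    // old shape: the caller passed a pre-sized slice plus a range, and
    // range.start was really "where to write in the slice", not a skip
    fn read_old(&mut self, out: &mut [i32], range: Range<usize>) -> usize;

    // new shape: append up to num_values onto a Vec-backed buffer...
    fn read(&mut self, out: &mut Self::Buffer, num_values: usize) -> usize;

    // ...and skipping decoded values is its own explicit operation
    fn skip_values(&mut self, num_values: usize) -> usize;
}
```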
This currently represents a non-trivial regression for primitive columns with large numbers of nulls; I am investigating this now. Edit: I cannot reproduce this regression on my local machine, which is somewhat complicating fixing this.
Well, having bashed my head against these phantom regressions for an embarrassingly long time, it transpires the benchmarks are just exceptionally noisy 🤦 As written, this PR does not represent a consistent performance regression as far as I can ascertain.
Which issue does this PR close?
Part of #5177
Closes #5150
Rationale for this change
See the ticket; this pushes Vec into ColumnReader, avoiding a whole host of issues related to buffer capacity
What changes are included in this PR?
Are there any user-facing changes?
Most of the changes are to crate-private interfaces, but this is a breaking change to ColumnReader, which is public
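A rough sketch of how the reworked public API might be driven with plain Vec buffers; the read_records signature and return tuple shown here follow the discussion above and may not match the merged code precisely.

```rust
use parquet::column::reader::ColumnReader;
use parquet::errors::Result;

// Sketch only: treat the signature as illustrative rather than definitive.
fn read_int32_column(reader: &mut ColumnReader) -> Result<()> {
    let mut def_levels: Vec<i16> = Vec::new();
    let mut values: Vec<i32> = Vec::new();

    if let ColumnReader::Int32ColumnReader(typed) = reader {
        let (records, values_read, levels_read) =
            typed.read_records(1024, Some(&mut def_levels), None, &mut values)?;
        // the Vecs' lengths now reflect exactly what was read
        assert_eq!(values.len(), values_read);
        assert_eq!(def_levels.len(), levels_read);
        println!("read {records} records");
    }
    Ok(())
}
```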