Use Vec in ColumnReader (#5177) #5193

Merged 2 commits into apache:master on Dec 15, 2023

Conversation

@tustvold (Contributor) commented Dec 8, 2023

Which issue does this PR close?

Part of #5177
Closes #5150

Rationale for this change

See the ticket; this pushes Vec into ColumnReader, avoiding a whole host of issues related to buffer capacity.

What changes are included in this PR?

Are there any user-facing changes?

Most of the changes are to crate-private interfaces, but this is a breaking change to ColumnReader, which is public.

@github-actions bot added the parquet (Changes to the parquet crate) label Dec 8, 2023
@@ -1016,34 +991,22 @@ mod tests {

#[test]
fn test_read_batch_values_only() {
test_read_batch_int32(16, &mut [0; 10], None, None); // < batch_size
tustvold (Contributor, Author):

We no longer have this complexity, as the buffers are sized as needed

Comment on lines -226 to -232
let mut max_levels = values.capacity().min(max_records);
if let Some(ref levels) = def_levels {
max_levels = max_levels.min(levels.capacity());
}
if let Some(ref levels) = rep_levels {
max_levels = max_levels.min(levels.capacity())
}
tustvold (Contributor, Author):

This is the fix for #5150: we no longer truncate the output based on the capacity of the output buffers.
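To make the before/after concrete, here is a minimal sketch (my own simplification, not the crate's actual code) of the two behaviours:

```rust
// Old behaviour (simplified): the number of levels read was clamped to the
// smallest capacity() among the output buffers, so a caller passing a Vec
// with a small capacity got silently truncated records (#5150).
fn levels_to_read_old(
    values_capacity: usize,
    def_levels_capacity: Option<usize>,
    max_records: usize,
) -> usize {
    let mut max_levels = values_capacity.min(max_records);
    if let Some(cap) = def_levels_capacity {
        max_levels = max_levels.min(cap);
    }
    max_levels
}

// New behaviour (simplified): only the requested record count matters; the
// Vec outputs are grown with resize/extend as values are decoded.
fn levels_to_read_new(max_records: usize) -> usize {
    max_records
}

fn main() {
    assert_eq!(levels_to_read_old(4, None, 100), 4); // silently truncated
    assert_eq!(levels_to_read_new(100), 100); // reads what was asked for
}
```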

fn truncate_buffer(&mut self, len: usize) {
assert_eq!(self.buffer.len(), len * self.byte_length);
}
byte_length: Option<usize>,
tustvold (Contributor, Author):

This change is necessary so that we can use mem::take
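For readers unfamiliar with the pattern, a hedged sketch of why this matters (hypothetical struct, not the actual decoder): std::mem::take swaps a field with its Default value, so any state that must survive the take, such as the element width, has to live outside the buffer.

```rust
use std::mem;

// Hypothetical decoder shape, illustrating why retained state must sit
// outside the buffer that gets taken.
struct Decoder {
    buffer: Vec<u8>,
    // Defaults to None, so the struct is constructible before the schema
    // supplies the element width; it survives take_buffer untouched.
    byte_length: Option<usize>,
}

impl Decoder {
    fn take_buffer(&mut self) -> Vec<u8> {
        // Moves the bytes out without copying, leaving Vec::default() behind.
        mem::take(&mut self.buffer)
    }
}

fn main() {
    let mut d = Decoder { buffer: vec![1, 2, 3], byte_length: Some(4) };
    let taken = d.take_buffer();
    assert_eq!(taken, vec![1, 2, 3]);
    assert!(d.buffer.is_empty());
    assert_eq!(d.byte_length, Some(4)); // retained across the take
}
```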

@@ -320,7 +287,7 @@ mod tests {
);

// Can recreate with new dictionary as keys empty
- assert!(matches!(&buffer, DictionaryBuffer::Dict { .. }));
+ assert!(matches!(&buffer, DictionaryBuffer::Values { .. }));
tustvold (Contributor, Author):

This is the result of using std::mem::take, and doesn't impact the higher-level dictionary preservation behaviour.
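A self-contained illustration of why the expected variant flips (hypothetical enum mirroring DictionaryBuffer, not the actual type): mem::take leaves Default::default() behind, and if the default is the Values variant, a freshly taken buffer matches Values rather than Dict.

```rust
use std::mem;

// Hypothetical stand-in for DictionaryBuffer.
enum Buffer {
    Dict { keys: Vec<u32> },
    Values { values: Vec<u32> },
}

impl Default for Buffer {
    fn default() -> Self {
        Buffer::Values { values: Vec::new() }
    }
}

fn main() {
    let mut buffer = Buffer::Dict { keys: vec![0, 1, 2] };
    let taken = mem::take(&mut buffer);
    assert!(matches!(taken, Buffer::Dict { .. }));
    // What remains is the default, i.e. Values -- hence the test change.
    assert!(matches!(buffer, Buffer::Values { .. }));
}
```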

@@ -111,7 +50,7 @@ impl<T: Copy + Default> ValuesBuffer for Vec<T> {
levels_read: usize,
valid_mask: &[u8],
) {
- assert!(self.len() >= read_offset + levels_read);
+ self.resize(read_offset + levels_read, T::default());
tustvold (Contributor, Author):

This change is necessary because we no longer have the get_output_slice / set_len behaviour which over-allocates at the start
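In other words (a simplified sketch under my reading of the diff, not the crate's implementation), the buffer now grows itself to the exact size the null padding needs, instead of requiring the caller to have over-allocated:

```rust
// Simplified pad_nulls shape: before, the caller had to pre-size the Vec
// (assert!(self.len() >= read_offset + levels_read)); now the Vec grows to
// exactly the needed length, default-filling the new slots.
fn pad_nulls<T: Copy + Default>(buf: &mut Vec<T>, read_offset: usize, levels_read: usize) {
    buf.resize(read_offset + levels_read, T::default());
    // ...the decoded values would then be scattered into the slots marked
    // valid by valid_mask...
}

fn main() {
    let mut buf = vec![1i32, 2, 3]; // 3 values already read
    pad_nulls(&mut buf, 3, 4);      // 4 more levels, no pre-allocation needed
    assert_eq!(buf.len(), 7);
}
```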

@tustvold added the api-change (Changes to the arrow API) label Dec 8, 2023
max_records: usize,
out: &mut Self::Buffer,
num_records: usize,
num_levels: usize,
tustvold (Contributor, Author):

We still have to provide num_levels, as Parquet's HYBRID_RLE encoding doesn't know how many values it contains 🤦
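A hedged sketch of the structural reason (deliberately simplified: single-byte headers and literal byte values, where the real format uses ULEB128 headers and bit-packing): every run header carries its own length, but nothing in the stream records the total value count, so the caller must supply it.

```rust
// Simplified hybrid-RLE-style decoder: the low header bit selects the run
// kind and the remaining bits give that run's length, but there is no total
// count anywhere in the stream -- hence the external `num_levels`.
fn decode(data: &[u8], num_levels: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(num_levels);
    let mut pos = 0;
    while out.len() < num_levels && pos < data.len() {
        let header = data[pos];
        pos += 1;
        let run_len = (header >> 1) as usize;
        if header & 1 == 0 {
            // Repeated run: one value, repeated run_len times.
            out.extend(std::iter::repeat(data[pos]).take(run_len));
            pos += 1;
        } else {
            // "Packed" run (literal bytes in this simplification).
            out.extend_from_slice(&data[pos..pos + run_len]);
            pos += run_len;
        }
    }
    out.truncate(num_levels); // a final run may overshoot the requested count
    out
}

fn main() {
    // header 0b0110 = repeated run of 3 (value 7); 0b0101 = packed run of 2.
    let data = [0b0110, 7, 0b0101, 8, 9];
    assert_eq!(decode(&data, 5), vec![7, 7, 7, 8, 9]);
}
```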

@@ -17,69 +17,8 @@

tustvold (Contributor, Author):

This module is crate-private, and so none of these changes are user-visible

@@ -423,7 +405,7 @@ impl RepetitionLevelDecoderImpl {
}

impl ColumnLevelDecoder for RepetitionLevelDecoderImpl {
- type Slice = [i16];
+ type Buffer = Vec<i16>;
tustvold (Contributor, Author):

These traits are not user-visible; however, this change propagates out of ColumnReader and therefore is user-visible.
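A sketch of why the change leaks out despite the trait being crate-private (hypothetical shapes, not the actual definitions): the public reader's method signatures mention the decoder's associated type, so swapping [i16] slices for Vec<i16> changes a public signature.

```rust
pub trait ColumnLevelDecoder {
    // Previously `type Slice = [i16]` (unsized, passed as `&mut [i16]`);
    // now an owned, growable buffer.
    type Buffer;
    fn read(&mut self, out: &mut Self::Buffer, num_levels: usize);
}

pub struct GenericReader<D: ColumnLevelDecoder> {
    decoder: D,
}

impl<D: ColumnLevelDecoder> GenericReader<D> {
    // The public signature mentions D::Buffer, so changing the associated
    // type is a user-visible breaking change even though the trait itself
    // is crate-private in the real code.
    pub fn read_levels(&mut self, out: &mut D::Buffer, num_levels: usize) {
        self.decoder.read(out, num_levels);
    }
}

// Toy decoder so the sketch runs end to end.
struct Zeros;
impl ColumnLevelDecoder for Zeros {
    type Buffer = Vec<i16>;
    fn read(&mut self, out: &mut Vec<i16>, num_levels: usize) {
        out.extend(std::iter::repeat(0).take(num_levels));
    }
}

fn main() {
    let mut reader = GenericReader { decoder: Zeros };
    let mut levels = Vec::new(); // previously a pre-sized &mut [i16]
    reader.read_levels(&mut levels, 3);
    assert_eq!(levels, vec![0, 0, 0]);
}
```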

@tustvold (Contributor, Author) commented Dec 8, 2023

I intend to run the benchmarks prior to merge, as there are some non-trivial changes to capacity allocation that may have performance impacts.

@@ -1750,7 +1750,7 @@ mod tests {
assert_eq!(num_levels, 513);

let expected: Vec<i64> = (1..514).collect();
assert_eq!(&buffer[..513], &expected);
tustvold (Contributor, Author):

I think this is a major usability enhancement: you no longer need to be careful to keep track of how much of the values buffers actually contain data.
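A before/after sketch of the caller's side (illustrative, using plain Vec operations in place of the actual read_records call):

```rust
fn main() {
    // Before: the output slice stayed at its allocated length, so the caller
    // had to remember num_levels and slice manually:
    //     assert_eq!(&buffer[..num_levels], &expected);
    //
    // After: the reader appends into a Vec, so buffer.len() is itself the
    // number of decoded values and no manual bookkeeping is needed.
    let mut buffer: Vec<i64> = Vec::new();
    buffer.extend(1..514); // stand-in for read_records filling the buffer
    let expected: Vec<i64> = (1..514).collect();
    assert_eq!(buffer, expected);
}
```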

usize::MAX
}

fn count_nulls(&self, range: Range<usize>, _max_level: i16) -> usize {
tustvold (Contributor, Author):

This is rolled into read_def_levels

alamb previously approved these changes Dec 8, 2023

@alamb (Contributor) left a comment:

Thank you for this @tustvold -- I started reviewing it and I will try and find the time to complete the review in the next day or two.

Can you please run some performance benchmarks to make sure there is no inadvertent regression introduced?

I also suggest we give it a day or two to let other people comment if they would like a chance to review it

cc @sunchao @zeevm @Dandandan

@alamb dismissed their stale review December 8, 2023 18:50

clicked wrong button, I haven't completed review

@Dandandan (Contributor) left a comment:

Nice cleanup, as long as the benchmarks are good this LGTM 🥳

@@ -227,13 +226,13 @@ impl<I: OffsetSizeTrait> ColumnValueDecoder for ByteArrayColumnValueDecoder<I> {
Ok(())
}

- fn read(&mut self, out: &mut Self::Slice, range: Range<usize>) -> Result<usize> {
+ fn read(&mut self, out: &mut Self::Buffer, num_values: usize) -> Result<usize> {
@viirya (Member) commented Dec 8, 2023:

So previously it seems you could pick where to start the read (i.e., range.start); now you must use skip_values to skip values?

viirya (Member):

Hmm, actually I also don't see how range.start is used to skip values. Maybe there is just no skip?

tustvold (Contributor, Author):

It's a weird quirk of how this API allowed using slices that didn't track their position; this is no longer necessary.
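To spell out the quirk (a hedged sketch, not the actual trait methods): a bare slice carries no notion of how much of it is already filled, so the old signature needed a Range to say where to write; a Vec tracks its own fill level, so the new signature only needs a count, and skipping becomes an explicit skip_values call.

```rust
use std::ops::Range;

// Old shape (illustrative): the caller computes the unfilled region.
fn read_old(out: &mut [i32], range: Range<usize>) -> usize {
    for slot in &mut out[range.clone()] {
        *slot = 42; // stand-in for decoded values
    }
    range.len()
}

// New shape (illustrative): the Vec's own len() is the write position.
fn read_new(out: &mut Vec<i32>, num_values: usize) -> usize {
    out.extend(std::iter::repeat(42).take(num_values));
    num_values
}

fn main() {
    let mut old_buf = vec![0; 8];
    read_old(&mut old_buf, 3..6); // caller tracks the position
    let mut new_buf = vec![42; 3];
    read_new(&mut new_buf, 3); // position is implicit in len()
    assert_eq!(new_buf.len(), 6);
}
```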

@tustvold (Contributor, Author) commented Dec 11, 2023

This currently represents a non-trivial regression for primitive columns with large numbers of nulls; I am investigating this now.

Edit: I cannot reproduce this regression on my local machine, which somewhat complicates fixing it.

@tustvold (Contributor, Author):

Well, having bashed my head against these phantom regressions for an embarrassingly long time, it transpires the benchmarks are just exceptionally noisy 🤦 As written, this PR does not represent a consistent performance regression, as far as I can ascertain.

@tustvold merged commit 9a1e8b5 into apache:master on Dec 15, 2023
48 checks passed
Labels: api-change (Changes to the arrow API), parquet (Changes to the parquet crate)
Development

Successfully merging this pull request may close these issues.

GenericColumnReader::read_records Yields Truncated Records
4 participants