Fix NullArrayReader (#1245) #1246

Merged: 1 commit, Jan 29, 2022
parquet/src/arrow/array_reader.rs (+4, -0)

@@ -214,6 +214,10 @@ where
// save definition and repetition buffers
self.def_levels_buffer = self.record_reader.consume_def_levels()?;
self.rep_levels_buffer = self.record_reader.consume_rep_levels()?;

// Must consume bitmap buffer
self.record_reader.consume_bitmap_buffer()?;
Contributor (PR author) commented:
Why this is even getting computed at all is definitely a valid question, but at the same time a workload where the speed of reading arrays of nulls is the bottleneck feels decidedly... niche 😆

Contributor replied:

One place it comes up is where you have multiple Parquet files representing a "table", with one or more columns being relatively sparse. If you have a file dropped every hour, then it may be that some of the files have a null column while others have a few values.

It doesn't seem likely it would be the bottleneck (the other columns in the file probably would be), but that's at least how it's come up for us.
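
An illustrative sketch of that multi-file scenario (hypothetical paths and batch size, not from this PR), reading hourly files where the sparse column may be entirely null in some of them:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical hourly files forming one logical "table"; in some hours
    // the sparse column is entirely null, in others it holds a few values.
    for path in ["data/2022-01-29-00.parquet", "data/2022-01-29-01.parquet"] {
        let reader = SerializedFileReader::new(File::open(path)?)?;
        let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(reader));
        for batch in arrow_reader.get_record_reader(1024)? {
            // Before this fix, a batch from an all-null column left stale
            // bitmap state behind in the record reader.
            println!("read {} rows", batch?.num_rows());
        }
    }
    Ok(())
}
```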

Contributor (PR author) replied:

Yeah, I didn't mean to suggest NullArrayReader itself is niche, but rather there is probably a limited benefit to optimising it to not compute the bitmask unnecessarily 😄


self.record_reader.reset();
Ok(Arc::new(array))
}
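The added consume_bitmap_buffer call drains the null bitmap that the record reader accumulates while reading. A minimal sketch (a simplified, hypothetical model, not the actual RecordReader) of why leaving it unconsumed is a bug:

```rust
// Simplified stand-in for the record reader's buffering behaviour:
// the null bitmap grows on every read and must be drained per batch.
struct RecordReader {
    bitmap: Vec<bool>, // one entry per buffered record
}

impl RecordReader {
    fn read_records(&mut self, n: usize) {
        // Reading appends to the internal bitmap buffer.
        self.bitmap.extend(std::iter::repeat(true).take(n));
    }

    /// Drains the buffered bitmap, leaving it empty for the next batch.
    fn consume_bitmap_buffer(&mut self) -> Vec<bool> {
        std::mem::take(&mut self.bitmap)
    }
}

fn main() {
    let mut rr = RecordReader { bitmap: Vec::new() };
    rr.read_records(2);
    // Without calling consume_bitmap_buffer() here, the state from the
    // first batch leaks into the second one:
    rr.read_records(2);
    assert_eq!(rr.bitmap.len(), 4); // 4 entries, but only 2 belong to this batch
    assert_eq!(rr.consume_bitmap_buffer().len(), 4);
    assert!(rr.bitmap.is_empty());
}
```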
parquet/src/arrow/arrow_reader.rs (+42, -0)

@@ -327,6 +327,48 @@ mod tests {
compare_batch_json(&mut record_batch_reader, projected_json_values, max_len);
}

#[test]
fn test_null_column_reader_test() {
    let mut file = tempfile::tempfile().unwrap();

    let schema = "
        message message {
            OPTIONAL INT32 int32;
        }
    ";
    let schema = Arc::new(parse_message_type(schema).unwrap());

    // Seven values in total, all null (definition level 0)
    let def_levels = vec![vec![0, 0, 0], vec![0, 0, 0, 0]];
    generate_single_column_file_with_data::<Int32Type>(
        &[vec![], vec![]],
        Some(&def_levels),
        file.try_clone().unwrap(), // Cannot use &mut File (#1163)
        schema,
        Some(Field::new("int32", ArrowDataType::Null, true)),
        &Default::default(),
    )
    .unwrap();

    file.rewind().unwrap();

    let parquet_reader = SerializedFileReader::try_from(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(parquet_reader));
    let record_reader = arrow_reader.get_record_reader(2).unwrap();

    let batches = record_reader.collect::<ArrowResult<Vec<_>>>().unwrap();

    // With a batch size of 2, the 7 null rows come back as batches of 2, 2, 2 and 1
    assert_eq!(batches.len(), 4);
    for batch in &batches[0..3] {
        assert_eq!(batch.num_rows(), 2);
        assert_eq!(batch.num_columns(), 1);
        assert_eq!(batch.column(0).null_count(), 2);
    }

    assert_eq!(batches[3].num_rows(), 1);
    assert_eq!(batches[3].num_columns(), 1);
    assert_eq!(batches[3].column(0).null_count(), 1);
}
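
For reference, a standalone sketch (not part of this PR) of what a Null-typed column contains: a NullArray stores only a length, since every slot is null by definition, which is why NullArrayReader never needs to materialise a value buffer:

```rust
use arrow::array::{Array, NullArray};
use arrow::datatypes::DataType;

fn main() {
    // A NullArray carries no value or validity buffers, only a length.
    let arr = NullArray::new(7);
    assert_eq!(arr.len(), 7);
    assert_eq!(arr.data_type(), &DataType::Null);
}
```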

#[test]
fn test_primitive_single_column_reader_test() {
run_single_column_reader_tests::<BoolType, BooleanArray, _, BoolType>(