-
Notifications
You must be signed in to change notification settings - Fork 875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add explicit column mask for selection in parquet: ProjectionMask
(#1701)
#1716
Conversation
284724e
to
489d54a
Compare
// comes from. For instance: the leaf `a.b.c.d` would have a link back to `a`: | ||
// -- a <-----+ | ||
// -- -- b | | ||
// -- -- -- c | | ||
// -- -- -- -- d | ||
leaf_to_base: Vec<TypePtr>, | ||
leaf_to_base: Vec<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change to store the root index, as opposed to a copy of the root ptr makes it easier to convert from a root mask to a leaf mask
/// | ||
/// i.e. `[0, 1, 2]` will construct the same mask as `[1, 0, 0, 2]` | ||
pub fn leaves( | ||
schema: &SchemaDescriptor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mask could theoretically carry this along and use it as a sanity check, but I'm inclined to think if a user constructs a mask and then uses it on a different schema, it's not something we can reasonably be expected to handle sensibly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as a useful error is produced, I agree this is fine behavior
@@ -155,24 +100,24 @@ fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Result<Schema> { | |||
Ok(message) => message | |||
.header_as_schema() | |||
.map(arrow::ipc::convert::fb_to_schema) | |||
.ok_or(ArrowError("the message is not Arrow Schema".to_string())), | |||
.ok_or(arrow_err!("the message is not Arrow Schema")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by cleanup to move to arrow_err! macro
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using arrow_err
is not obviously better to me: it is the same verbosity but now requires looking / knowing what the arrow_err!
macro does
;0 But it is not worse either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's consistent with how errors are handled elsewhere in the crate, with arrow_err!, general_err!, etc...
Codecov Report
@@ Coverage Diff @@
## master #1716 +/- ##
==========================================
- Coverage 83.32% 83.31% -0.01%
==========================================
Files 195 196 +1
Lines 56023 55961 -62
==========================================
- Hits 46681 46625 -56
+ Misses 9342 9336 -6
Continue to review full report at Codecov.
|
As a happy side-effect this actually found and fixed a bug in the handling of nested projection pushdown in ParquetRecordBatchStream |
Some(projection) => { | ||
if let Some(col) = projection.iter().find(|x| **x >= num_columns) { | ||
return Err(general_err!( | ||
"column projection {} outside bounds of schema 0..{}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check was actually incorrect as it was checking against the arrow schema not the parquet schema. I think this demonstrates the footgun prone nature of the old API
ProjectionMask
(#1701)
/// | ||
/// i.e. `[0, 1, 2]` will construct the same mask as `[1, 0, 0, 2]` | ||
pub fn leaves( | ||
schema: &SchemaDescriptor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as a useful error is produced, I agree this is fine behavior
parquet/src/arrow/mod.rs
Outdated
/// A [`ProjectionMask`] identifies a set of columns within a potentially nested schema to project | ||
#[derive(Debug, Clone)] | ||
pub struct ProjectionMask { | ||
/// A mask of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment seems to be truncated.
Also, since we have Bitmap
and all the associated handling in Arrow, I wonder if it is worth using that (though a Vec<bool>
is nice and simple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describing in the docstring that a mask of None
means All
is probably also a good idea as well as which schema a ProjectionMask
s indexes refer to (I think Parquet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since we have Bitmap and all the associated handling in Arrow, I wonder if it is worth using that (though a Vec is nice and simple
Let's stick with simple, and maybe if/when we promote this construct to arrow-rs we can switch to using Bitmap
|
||
/// Create a [`ProjectionMask`] which selects only the specified leaf columns | ||
/// | ||
/// Note: repeated or out of order indices will not impact the final mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice -- so the idea is that you enforce masks having in-order indicies by wrapping them in a ProjectionMask
which enforces this invariant during construction 👍
Self { mask: None } | ||
} | ||
|
||
/// Create a [`ProjectionMask`] which selects only the specified leaf columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain (or provide a link to something that explains) leaves and roots and what "order" they are in. I think it refers to the parquet schema (or maybe the arrow schema and types within Structs / LIsts / others nested types?)
@@ -51,74 +52,18 @@ pub fn parquet_to_arrow_schema( | |||
) -> Result<Schema> { | |||
parquet_to_arrow_schema_by_columns( | |||
parquet_schema, | |||
0..parquet_schema.columns().len(), | |||
ProjectionMask::all(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
parquet_schema: &SchemaDescriptor, | ||
column_indices: T, | ||
mask: ProjectionMask, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this needs an owned mask or if it could be taken by reference
mask: ProjectionMask, | |
mask: &ProjectionMask, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently yes, it gets moved into the Visitor. Theoretically it could borrow and have lifetimes, but in most cases I suspect we have the mask by value anyway.
Edit: This might be an argument to move to arrow Bitmap, as that is internally refcounted... Future PR me thinks
@@ -155,24 +100,24 @@ fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Result<Schema> { | |||
Ok(message) => message | |||
.header_as_schema() | |||
.map(arrow::ipc::convert::fb_to_schema) | |||
.ok_or(ArrowError("the message is not Arrow Schema".to_string())), | |||
.ok_or(arrow_err!("the message is not Arrow Schema")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using arrow_err
is not obviously better to me: it is the same verbosity but now requires looking / knowing what the arrow_err!
macro does
;0 But it is not worse either
@@ -1188,9 +1112,9 @@ mod tests { | |||
// required int64 leaf5; | |||
|
|||
let parquet_schema = SchemaDescriptor::new(Arc::new(parquet_group_type)); | |||
let mask = ProjectionMask::leaves(&parquet_schema, [3, 0, 4]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is much nicer to read / reason about
ProjectionMask
(#1701)ProjectionMask
(#1701)
Which issue does this PR close?
Closes #1701.
Closes #1653
Rationale for this change
The current API is confusing, surfacing errors at runtime, and is liable to accidental misuse - apache/datafusion#2453 and apache/datafusion#2543.
What changes are included in this PR?
This adds an explicit
ColumnMask
that replaces the iterators of indices. This makes for a cleaner API that should hopefully make it harder to accidentally misuse.In particular it also adds the ability to construct a RecordReader based on a mask of root columns, as opposed to leaf columns. This is the core behind apache/datafusion#2543
Are there any user-facing changes?
Yes, this changes the arrow projection API