-
Notifications
You must be signed in to change notification settings - Fork 819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ArrayDataLayout, port validation (#1799) #3818
Conversation
f4baea8
to
578d8a4
Compare
arrow-buffer/src/buffer/offset.rs
Outdated
&self.0 | ||
} | ||
|
||
/// Returns the inner [`Buffer`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Returns the inner [`Buffer`] | |
/// Returns the inner [`ScalarBuffer`] |
arrow-data/src/data/boolean.rs
Outdated
/// | ||
/// Panics if | ||
/// - `nulls` and `values` are different lengths | ||
/// - `data_type` is not compatible with `T` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what is T
here? Maybe?
/// - `data_type` is not compatible with `T` | |
/// - `data_type` is not compatible with `PhysicalType::Boolean` |
use arrow_buffer::buffer::{BooleanBuffer, NullBuffer}; | ||
use arrow_schema::DataType; | ||
|
||
#[derive(Debug, Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is also good to add layout link? https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout?
arrow-data/src/data/run.rs
Outdated
@@ -116,22 +170,67 @@ impl<E: RunEnd> RunArrayData<E> { | |||
/// # Safety | |||
/// | |||
/// - `data_type` must be valid for this layout | |||
/// - `run_ends` must contain monotonically increasing, positive values `<= child.len()` | |||
/// - `run_ends` must contain monotonically increasing, positive values `<= len` | |||
/// - `run_ends.len() == child.len()` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not clear what child
refers to. The variable child
in below function is built using scalar buffer and its length will not be equal to run_ends.len()
.
Also, there are two children. So it'll be clear to mention which child is being referred in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tustvold -- I went through this PR carefully.
I think this PR is related to #1799. I think it is important to understand the context for this PR. Specifically, as I understand it the rationale for this work is to provide an easy and zero cost conversion between arrow-rs and arrow2 arrays (see jorgecarleitao/arrow2#1429 for more details) by unifying the lowest levels using the ArrayData enumeration
Am I right in concluding that extracting ArrayDataLayout
allow the data validation logic to be used in the new structure?
While this does create non trivial tech-debt (as now there are multiple ways to get at the lowest level arrow data), I think with some more documentation and providing some good examples we'll be in good shape to migrate them slowly over.
I am not sure how important it is to leave this PR open for longer to ensure anyone who might have an opinion has a chance to comment -- When you are ready to switch to the new ArrayDataEnumeration that certainly should have some comment time.
cc @ritchie46 and @sundy-li
arrow-buffer/src/buffer/boolean.rs
Outdated
@@ -139,4 +139,9 @@ impl BooleanBuffer { | |||
pub fn inner(&self) -> &Buffer { | |||
&self.buffer | |||
} | |||
|
|||
/// Returns the inner [`Buffer`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Returns the inner [`Buffer`] | |
/// Returns the inner [`Buffer`], consuming self |
arrow-buffer/src/buffer/offset.rs
Outdated
&self.0 | ||
} | ||
|
||
/// Returns the inner [`ScalarBuffer`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Returns the inner [`ScalarBuffer`] | |
/// Returns the inner [`ScalarBuffer`], consuming self |
arrow-buffer/src/buffer/scalar.rs
Outdated
&self.buffer | ||
} | ||
|
||
/// Returns the inner [`Buffer`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Returns the inner [`Buffer`] | |
/// Returns the inner [`Buffer`], consuming self |
buffers: Buffers::two(self.offsets.inner().inner(), &self.values), | ||
child_data: &[], | ||
} | ||
} | ||
} | ||
|
||
/// ArrayData for [fixed-size arrays](https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout) of bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the right link?
/// The children of this RunArrayData | ||
/// 1: the run ends | ||
/// 2: the values | ||
children: Box<[ArrayData; 2]>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is there a reason to store the run ends in children[0]
as well as in run_ends
? Aka why the redundancy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that we can return a slice of children in ArrayDataLayout, this type is a little funky because there is an impedence mismatch between the arrow representation and the arrow-rs representation
arrow-data/src/data/mod.rs
Outdated
@@ -33,13 +33,17 @@ use crate::equal; | |||
mod buffers; | |||
pub use buffers::*; | |||
|
|||
#[allow(unused)] // Private until ready (#1176) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it is the glue logic to avoid needing to change everything at once. Eventually it will be removed. It will never be made public.
As this PR is almost entirely crate-private I intend to get this in, and we can continue to iterate on it |
Benchmark runs are scheduled for baseline = 2a3fd96 and contender = 495682a. 495682a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #.
Rationale for this change
Switching to the ArrayData enumeration requires a substantial amount of glue logic to convert between the
struct
andenum
layouts. This PR therefore addsstruct
view of the enum variants, for the purposes of validation, etc..What changes are included in this PR?
Are there any user-facing changes?