-
Notifications
You must be signed in to change notification settings - Fork 837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Scalar/Datum abstraction (#1047) #4393
Conversation
@@ -93,6 +93,17 @@ impl BooleanArray { | |||
Self { values, nulls } | |||
} | |||
|
|||
/// Create a new [`BooleanArray`] with length `len` consisting only of nulls | |||
pub fn new_null(len: usize) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a change to make the example work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split out into - #4402
@@ -321,16 +321,6 @@ fn filter_array( | |||
// actually filter | |||
_ => downcast_primitive_array! { | |||
values => Ok(Arc::new(filter_primitive(values, predicate))), | |||
DataType::Decimal128(p, s) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by fix I noticed whilst implementing this, downcast_primitive_array already handles the decimal case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might make sense to put up as a new PR
/// b_scalar: bool, | ||
/// ) -> BooleanArray { | ||
/// let (array, scalar) = match (a_scalar, b_scalar) { | ||
/// (true, true) | (false, false) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we can see a nice property of the encoding as scalars, the case where they are both scalars is identical to the case where they are both arrays
be967a2
to
04cc386
Compare
Mailing list thread - https://lists.apache.org/thread/f5q4q1g6hgqgrvbv67llv87hpfxg34vc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is I think this is really nice -- thank you @tustvold
BTW for anyone else reading this, I think one of the major reasons @tustvold is tackling this now is so that as we add scalar kernels for the various timestamp / duration / interval types we don't make the existing problem worse.
I focused on this example, which shows how someone would use the kernels with scalar values:
/// // Comparison of an array and a scalar
/// let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
/// let b = Int32Array::from(vec![1]);
/// let r = eq(&a, &Scalar::new(&b)).unwrap();
/// let values: Vec<_> = r.values().iter().collect();
/// assert_eq!(values, &[true, false, false, false, false]);
I really like the construction of a Scalar
that wraps b
and signals to the kernels that the 1 element array should be treated differently, rather than just implicitly treating 1 element arrays differently
I also think the use of a trait for Datum
is important for uses like DataFusion, which will likely need some sort of owned variant of Scalar (e.g. to represent the 1
in an expression such as column + 1
). Given that Datum
is a trait means we can create such an abstraction.
It might make sense to add an OwnedScalar
in arrow-rs for convenience (OwnedScalar(ArrayRef)
) but we can sort that out later
Questions:
Have we thought about how we will migrate the existing kernels? Given the construction above, we could perhaps leave the old signatures around for a while and deprecate them and call through to the new variants
@@ -321,16 +321,6 @@ fn filter_array( | |||
// actually filter | |||
_ => downcast_primitive_array! { | |||
values => Ok(Arc::new(filter_primitive(values, predicate))), | |||
DataType::Decimal128(p, s) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might make sense to put up as a new PR
/// assert_eq!(values, &[true, false, false, false, false]); | ||
pub trait Datum { | ||
/// Returns the value for this [`Datum`] and a boolean indicating if the value is scalar | ||
fn get(&self) -> (&dyn Array, bool); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make sense to have two separate methods so that if we wanted to add more information to the Datum we could do it in a backwards compatible way by adding a new method to the trait
fn get(&self) -> &dyn Array;
fn is_scalar(&self) -> bool {
false
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think given that neither piece of information can be used in isolation, it makes sense to return them both together. We can still always add further trait methods to expose additional information in future
My hope is that we can simply replace |
Another thing to potentially consider at the same time is whether we want to roll in any notion of selection vector support (#4095). I will think on this more |
At the very least the "Datum" notion allows for the inclusion of a Selection Vector at a later time -- perhaps like pub trait Datum {
...
/// If there are certain rows that should be ignored by any kernels
/// Defaults to None (all rows)
fn selection(&self) -> Option<&BoolBuffer> { None }
} |
#4465 contains a POC of using this abstraction to implement scalar kernels, I'm therefore confident that this is a sensible abstraction |
fn get(&self) -> (&dyn Array, bool) { | ||
(self, false) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so an Array with length 1 will also return false indicating it is not a scalar? Only if we explicitly wrap it with a Scalar
so we can get correct output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was suggested this might be less confusing for users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, then what's difference that to return false in Datum implementation for Array if the Array is length 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is Datum for Scalar, I mean Datum for dyn Array, then so an Array with length 1 will be treated as scalar without Scalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An array with length 1 won't be treated as scalar, it will only be treated as scalar if wrapped in Scalar
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that is what I wanted to ask, I think. Not all arrays with length 1 are all scalars, but only wrapped with Scalar
they are scalars.
The view maybe be reversed. Not array is treated as scalar but scalar is treated as array. When any one wants to talk with this crate, this crate only understands language in array. So if you want to mention a scalar, you need to fit it into an array and let this crate know it behaves like a scalar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some downsides, but I think you already mentioned in the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An array with length 1 won't be treated as scalar, it will only be treated as scalar if wrapped in Scalar
FWIW I think the only practical difference would be that add(arr1, arr2)
will fail if arr1
has one row (but is not marked as a scalar) and arr2
had some other number of rows (like 100).
I think @tustvold also considered simply treating any arrays that had 1
row as a scalar but felt (as do I) that making it explicit would make for a less confusing experience . Or maybe that was only my opinion 😆
When any one wants to talk with this crate, this crate only understands language in array. So if you want to mention a scalar, you need to fit it into an array and let this crate know it behaves like a scalar.
I think this is an excellent description 👍
I intend to merge this after I cut the arrow 43 release |
|
||
impl<'a> Datum for Scalar<'a> { | ||
fn get(&self) -> (&dyn Array, bool) { | ||
(self.0, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to consider datatype of the wrapped Array? A complex type array with length 1 is also a scalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If wrapped in Scalar
then yes, this is consistent with DF's ScalarValue which can contain struct or list elements
Which issue does this PR close?
Closes #1047
Relates to #3999
Relates to #2837
Relates to #2766
Rationale for this change
This is a proposal on how to represent scalars within arrow-rs kernels, with a view to providing a more consistent experience for users, and allowing us to consolidate a non-trivial amount of dispatch logic currently found in downstreams.
There are a couple of aspects worth highlighting
Arrays are Preeminent
Arrow is a specification for arrays and not scalars. As such this approach makes a conscious decision to not define a parallel specification for representing scalars. This has a number of advantages:
It does come with some obvious downsides compared to a first-class scalar approach
However, it is my assertion that preserving the arrow-ness of the representation is more important than either of these
Non-Owning
Related to the above, data is only stored in arrays.
Datum
andScalar
solely act as wrappers to influence the dispatch within kernels, you cannot store a scalar value. Similarly the return type of a kernel will always be an array.This reflects both the desire to keep arrays as the canonical representation, and also to discourage use of the scalar abstraction within data structures. We want to encourage the use of arrays, not constructions like
Vec<Scalar>
orHashMap<Scalar, _>
. Permitting such usage not only creates confusion, but performing type-erasure per field in this way is grossly inefficient both from a memory and performance standpoint.Type-Erasure / Dynamic Dispatch
Both
Scalar
andDatum
are abstractions relying on type-erasure. This reflects a couple of goals:&dyn Array
The implied assumption is that the overheads of this dynamic dispatch will be irrelevant when amortised over the number of values in an array. Or to phrase it differently, we are explicitly not optimising for performance of purely scalar operations. IMO such operations are outside the remit of a vectorised execution engine.
What changes are included in this PR?
Adds
Scalar
andDatum
along with examples of their usageAre there any user-facing changes?
No, this just adds the new types