Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Scalar/Datum abstraction (#1047) #4393

Merged
merged 4 commits into from
Jul 4, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jun 9, 2023

Which issue does this PR close?

Closes #1047
Relates to #3999
Relates to #2837
Relates to #2766

Rationale for this change

This is a proposal on how to represent scalars within arrow-rs kernels, with a view to providing a more consistent experience for users, and allowing us to consolidate a non-trivial amount of dispatch logic currently found in downstreams.

There are a couple of aspects worth highlighting

Arrays are Preeminent

Arrow is a specification for arrays and not scalars. As such this approach makes a conscious decision to not define a parallel specification for representing scalars. This has a number of advantages:

  • Can use the existing stories for serialization, FFI, display, casting, etc...
  • Avoids ambiguity over how to represent scalars with non-native encodings, e.g. dictionaries, run-end encoded, etc...
  • Allows sharing more logic, by not bifurcating the types
  • Supports all array types with no additional effort

It does come with some obvious downsides compared to a first-class scalar approach

  • Less efficient representation
  • Potentially more confusing / less ergonomic to use

However, it is my assertion that preserving the arrow-ness of the representation is more important than either of these

Non-Owning

Related to the above, data is only stored in arrays. Datum and Scalar solely act as wrappers to influence the dispatch within kernels, you cannot store a scalar value. Similarly the return type of a kernel will always be an array.

This reflects both the desire to keep arrays as the canonical representation, and also to discourage use of the scalar abstraction within data structures. We want to encourage the use of arrays, not constructions like Vec<Scalar> or HashMap<Scalar, _>. Permitting such usage not only creates confusion, but performing type-erasure per field in this way is grossly inefficient both from a memory and performance standpoint.

Type-Erasure / Dynamic Dispatch

Both Scalar and Datum are abstractions relying on type-erasure. This reflects a couple of goals:

  • Designing generics for many operations, like arithmetic, is extremely hard and represents a significant API commitment, type-erasure presents a much smaller API surface that is therefore easier to evolve without breaking changes
  • Avoiding generics helps to reduce code bloat from monomorphisation and hopefully improve compile times
  • Is consistent with the vast majority of kernels which use &dyn Array

The implied assumption is that the overheads of this dynamic dispatch will be irrelevant when amortised over the number of values in an array. Or to phrase it differently, we are explicitly not optimising for performance of purely scalar operations. IMO such operations are outside the remit of a vectorised execution engine.

What changes are included in this PR?

Adds Scalar and Datum along with examples of their usage

Are there any user-facing changes?

No, this just adds the new types

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 9, 2023
@@ -93,6 +93,17 @@ impl BooleanArray {
Self { values, nulls }
}

/// Create a new [`BooleanArray`] with length `len` consisting only of nulls
pub fn new_null(len: usize) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a change to make the example work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split out into - #4402

@@ -321,16 +321,6 @@ fn filter_array(
// actually filter
_ => downcast_primitive_array! {
values => Ok(Arc::new(filter_primitive(values, predicate))),
DataType::Decimal128(p, s) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by fix I noticed whilst implementing this, downcast_primitive_array already handles the decimal case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might make sense to put up as a new PR

/// b_scalar: bool,
/// ) -> BooleanArray {
/// let (array, scalar) = match (a_scalar, b_scalar) {
/// (true, true) | (false, false) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we can see a nice property of the encoding as scalars, the case where they are both scalars is identical to the case where they are both arrays

@tustvold tustvold force-pushed the scalar-experiments branch from be967a2 to 04cc386 Compare June 9, 2023 16:30
@tustvold
Copy link
Contributor Author

tustvold commented Jun 9, 2023

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR is I think this is really nice -- thank you @tustvold

BTW for anyone else reading this, I think one of the major reasons @tustvold is tackling this now is so that as we add scalar kernels for the various timestamp / duration / interval types we don't make the existing problem worse.

I focused on this example, which shows how someone would use the kernels with scalar values:

/// // Comparison of an array and a scalar
/// let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
/// let b = Int32Array::from(vec![1]);
/// let r = eq(&a, &Scalar::new(&b)).unwrap();
/// let values: Vec<_> = r.values().iter().collect();
/// assert_eq!(values, &[true, false, false, false, false]);

I really like the construction of a Scalar that wraps b and signals to the kernels that the 1 element array should be treated differently, rather than just implicitly treating 1 element arrays differently

I also think the use of a trait for Datum is important for uses like DataFusion, which will likely need some sort of owned variant of Scalar (e.g. to represent the 1 in an expression such as column + 1). Given that Datum is a trait means we can create such an abstraction.

It might make sense to add an OwnedScalar in arrow-rs for convenience (OwnedScalar(ArrayRef)) but we can sort that out later

Questions:
Have we thought about how we will migrate the existing kernels? Given the construction above, we could perhaps leave the old signatures around for a while and deprecate them and call through to the new variants

@@ -321,16 +321,6 @@ fn filter_array(
// actually filter
_ => downcast_primitive_array! {
values => Ok(Arc::new(filter_primitive(values, predicate))),
DataType::Decimal128(p, s) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might make sense to put up as a new PR

/// assert_eq!(values, &[true, false, false, false, false]);
pub trait Datum {
/// Returns the value for this [`Datum`] and a boolean indicating if the value is scalar
fn get(&self) -> (&dyn Array, bool);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to have two separate methods so that if we wanted to add more information to the Datum we could do it in a backwards compatible way by adding a new method to the trait

fn get(&self) -> &dyn Array;
fn is_scalar(&self) -> bool {
 false 
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think given that neither piece of information can be used in isolation, it makes sense to return them both together. We can still always add further trait methods to expose additional information in future

@tustvold
Copy link
Contributor Author

Have we thought about how we will migrate the existing kernels

My hope is that we can simply replace &dyn Array with &dyn Datum and have everything work without it requiring downstreams to modify their code, but I haven't tested this extensively yet

@tustvold
Copy link
Contributor Author

Another thing to potentially consider at the same time is whether we want to roll in any notion of selection vector support (#4095). I will think on this more

@alamb
Copy link
Contributor

alamb commented Jun 14, 2023

At the very least the "Datum" notion allows for the inclusion of a Selection Vector at a later time -- perhaps like

pub trait Datum {
...
  /// If there are certain rows that should be ignored by any kernels
  /// Defaults to None (all rows)
  fn selection(&self) -> Option<&BoolBuffer> { None }
}

@tustvold
Copy link
Contributor Author

#4465 contains a POC of using this abstraction to implement scalar kernels, I'm therefore confident that this is a sensible abstraction

@tustvold tustvold marked this pull request as ready for review June 29, 2023 13:29
fn get(&self) -> (&dyn Array, bool) {
(self, false)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so an Array with length 1 will also return false indicating it is not a scalar? Only if we explicitly wrap it with a Scalar so we can get correct output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was suggested this might be less confusing for users

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, then what's difference that to return false in Datum implementation for Array if the Array is length 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is Datum for Scalar, I mean Datum for dyn Array, then so an Array with length 1 will be treated as scalar without Scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An array with length 1 won't be treated as scalar, it will only be treated as scalar if wrapped in Scalar?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that is what I wanted to ask, I think. Not all arrays with length 1 are all scalars, but only wrapped with Scalar they are scalars.

The view maybe be reversed. Not array is treated as scalar but scalar is treated as array. When any one wants to talk with this crate, this crate only understands language in array. So if you want to mention a scalar, you need to fit it into an array and let this crate know it behaves like a scalar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some downsides, but I think you already mentioned in the description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An array with length 1 won't be treated as scalar, it will only be treated as scalar if wrapped in Scalar

FWIW I think the only practical difference would be that add(arr1, arr2) will fail if arr1 has one row (but is not marked as a scalar) and arr2 had some other number of rows (like 100).

I think @tustvold also considered simply treating any arrays that had 1 row as a scalar but felt (as do I) that making it explicit would make for a less confusing experience . Or maybe that was only my opinion 😆

When any one wants to talk with this crate, this crate only understands language in array. So if you want to mention a scalar, you need to fit it into an array and let this crate know it behaves like a scalar.

I think this is an excellent description 👍

@tustvold
Copy link
Contributor Author

I intend to merge this after I cut the arrow 43 release


impl<'a> Datum for Scalar<'a> {
fn get(&self) -> (&dyn Array, bool) {
(self.0, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to consider datatype of the wrapped Array? A complex type array with length 1 is also a scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If wrapped in Scalar then yes, this is consistent with DF's ScalarValue which can contain struct or list elements

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Scalar / Datum support to compute kernels
3 participants