
Add Scalar / Datum support to compute kernels #1047

Closed
alamb opened this issue Dec 16, 2021 · 9 comments · Fixed by #4393
Labels: enhancement (Any new improvement worthy of an entry in the changelog)

@alamb (Contributor) commented Dec 16, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When implementing analytics, users often want to perform operations like array + constant or array + array.

The Rust implementation of Arrow is strongly typed 🙌, including the compute kernels, but data is often passed around as Arc<dyn Array> / ArrayRef, so the types can be dynamic. This places the burden on the user of the library to determine the exact types of the inputs in order to call the appropriate kernels.

Take, for example, comparing an array to find all elements that equal the number 5, where the input array may have Int64 or UInt64 values.

In the current version of arrow, you would need to write code like:

use arrow::array::{Array, ArrayRef, BooleanArray, Int64Array, UInt64Array};
use arrow::compute::eq_scalar;
use arrow::datatypes::DataType;
use arrow::error::{ArrowError, Result};

fn find_5(array: ArrayRef) -> Result<BooleanArray> {
    match array.data_type() {
        DataType::Int64 => eq_scalar(array.as_any().downcast_ref::<Int64Array>().unwrap(), 5),
        DataType::UInt64 => eq_scalar(array.as_any().downcast_ref::<UInt64Array>().unwrap(), 5),
        // ... one arm for every other supported type
        other => Err(ArrowError::ComputeError(format!("unsupported type {:?}", other))),
    }
}

This ends up being macroized, and it is not ideal because the user had to do dynamic type dispatch anyway, so there is no runtime performance benefit to the strongly typed kernels.

Describe the solution you'd like

It would be nice to be able to call a single function and let the Rust library dynamically (at runtime) pick the correct kernel to call.

As described and suggested by @jorgecarleitao and @nevi-me in #984 (comment), this could go all the way and take the form of the C++ Datum model:

fn eq(left: Datum, right: Datum) -> BooleanArray {
...
}

Where Datum looks something like

enum Datum {
  Scalar(ScalarValue),
  Array(ArrayRef)
}

ScalarValue would look something like ScalarValue from DataFusion (https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/scalar.rs) or Scalar from arrow2 (https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar).
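
For illustration, a minimal sketch of what such a type-erased scalar could look like (the variant set is illustrative, loosely mirroring DataFusion's ScalarValue, where None encodes a typed null):

// Illustrative subset only; a real ScalarValue would need a variant
// per supported DataType.
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Boolean(Option<bool>),   // None encodes a null of this type
    Int64(Option<i64>),
    Float64(Option<f64>),
    Utf8(Option<String>),
    // ...
}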

Describe alternatives you've considered
The alternative is the continued proliferation of individual kernels such as eq, eq_utf8, etc.

Additional context
There is a lot of additional discussion in #984.

cc @matthewmturner @Dandandan @jimexist @houqp

@alamb added the enhancement label Dec 16, 2021

@houqp (Member) commented Dec 24, 2021

I support adding a higher-level, easy-to-use API for non-performance-critical code, but it would be good to keep the specialized kernels around so users can still leverage them when type info is available at the call site.

@alamb (Contributor, Author) commented Dec 26, 2021

but it would be good to still keep the specialized kernels around so users can still leverage them when type info is available at the call site.

The reason I was saying the specialized kernels might not be totally useful was, in my mind, that the overhead of a type dispatch could be almost entirely amortized -- if we are comparing 1000 elements in an array, an extra check on the array's type might not be a huge deal.

I don't see anything wrong with leaving all the specialized kernels available too (maybe it would help inlining 🤔).

@tustvold (Contributor) commented:

I believe this overlaps with #2842.

@tustvold self-assigned this May 9, 2023

@tustvold (Contributor) commented Jun 1, 2023

So I've been playing around with this, and the major challenge is avoiding a huge amount of API churn / boilerplate.

Take the signature

add_dyn(a: &dyn Array, b: &dyn Array) -> Result<ArrayRef>

It's not clear how to convert this to a Datum-based model. One option would be:

add_dyn(a: Datum<'_>, b: Datum<'_>) -> Result<ArrayRef>

Where Datum is something like

enum Datum<'a> {
    Array(&'a dyn Array),
    Scalar(&'a dyn Scalar)
}

But this has a couple of issues:

  • Callsites now have to explicitly wrap their arguments in Datum
  • There is no way to return a scalar

Making Datum a trait doesn't help here either, because the coherence rules prevent blanket implementations for both T: Scalar and T: Array.
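
For concreteness, a minimal sketch of that coherence problem (trait names are simplified stand-ins, not the arrow-rs API):

trait Scalar {}
trait Array {}

trait AsDatum {
    fn is_scalar(&self) -> bool;
}

// The first blanket impl is fine on its own...
impl<T: Scalar> AsDatum for T {
    fn is_scalar(&self) -> bool { true }
}

// ...but the second is rejected, because nothing prevents a type from
// implementing both `Scalar` and `Array`:
// error[E0119]: conflicting implementations of trait `AsDatum`
impl<T: Array> AsDatum for T {
    fn is_scalar(&self) -> bool { false }
}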

Another option would be to make the methods generic over impl Into<Datum>, but this also has downsides (see the sketch below):

  • It runs into the same blanket-impl issues as the Datum trait
  • Kernels become generic, resulting in significant additional codegen at every call site
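
A sketch of that generic option, reusing the simplified Scalar trait from the previous sketch (the body is elided; the point is the signature):

use arrow::array::{Array, ArrayRef};
use arrow::error::Result;

trait Scalar {} // simplified stand-in, as above

enum Datum<'a> {
    Array(&'a dyn Array),
    Scalar(&'a dyn Scalar),
}

// `impl Into` makes `add` generic: each distinct pair of concrete argument
// types at a call site monomorphizes another copy of the function body.
fn add<'a>(left: impl Into<Datum<'a>>, right: impl Into<Datum<'a>>) -> Result<ArrayRef> {
    let (_left, _right) = (left.into(), right.into());
    // ... dispatch on the variants, then on the element types
    todo!()
}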

Taking a step back, I had a potentially controversial thought: why not just treat a single-element array as a scalar?

This would have some pretty compelling advantages:

  • No changes to type signatures necessary
  • Unary kernels like casting or aggregation just work with no modification
  • Complete type coverage for no effort

The obvious downside is that the representation is not very memory efficient. I think the question boils down to what the purpose of the scalar representation is:

  1. To allow more efficient kernels where one side is known to be a scalar, e.g. scalar comparison, etc...
  2. To provide an efficient type-erased representation for row-oriented operations like grouping
  3. To provide efficient scalar operations

My 2 cents is that 2. is a use-case better served by the row representation, and 3. is beyond the scope of a vectorized execution engine, so 1. is the target for this feature. As such, I think this is a perfectly acceptable approach: the overhead of the slightly less efficient representation will be more than outweighed by the cost of the dynamic dispatch alone.
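
For illustration, a minimal sketch of how a comparison kernel might special-case a length-1 input under this model (eq_datum is a hypothetical name, the sketch covers Int64 only, and the all-null result for a null scalar is omitted for brevity):

use arrow::array::{Array, BooleanArray, Int64Array};
use arrow::compute::{eq, eq_scalar};
use arrow::error::{ArrowError, Result};

// A length-1 array on either side is treated as a scalar and broadcast;
// equal lengths fall through to the existing element-wise kernel.
fn eq_datum(left: &Int64Array, right: &Int64Array) -> Result<BooleanArray> {
    match (left.len(), right.len()) {
        (_, 1) if right.is_valid(0) => eq_scalar(left, right.value(0)),
        (1, _) if left.is_valid(0) => eq_scalar(right, left.value(0)),
        (l, r) if l == r => eq(left, right),
        _ => Err(ArrowError::ComputeError("length mismatch".to_string())),
    }
}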

What do people think?

@alamb (Contributor, Author) commented Jun 1, 2023

Taking a step back I had a potentially controversial thought, why not just treat a single element array as a scalar array?

For what it is worth, I think this is what DuckDB does (at least, this is how I interpret this slide from the 22 - DuckDB Internals (CMU Advanced Databases / Spring 2023) lecture):

[Screenshot: slide from the DuckDB Internals lecture]

I think the biggest potential downside of "use a single-row array to mean a scalar value" could be that it is confusing as an API. The two input array sizes have to match, except when one of them (would it always be the right argument?) has a single row, in which case it takes a special fast path 🤔 But it wouldn't be clear from the call signature whether there is a special fast path or not.
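
For example (hypothetical; eq here is whichever comparison kernel adopts the convention):

use arrow::array::Int64Array;

let a = Int64Array::from(vec![1, 2, 3]);
let b = Int64Array::from(vec![5]); // a "scalar"? or just a short array?

// With "length 1 means scalar", eq(&a, &b) silently broadcasts `b`;
// with strict semantics it would be a length mismatch. Nothing in the
// call signature distinguishes the two behaviours.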

Edit: I have another idea I am typing up

@alamb (Contributor, Author) commented Jun 1, 2023

What if we made a 'new' set of kernels (that just dispatched to the existing ones, where they exist)?

This seems like the ideal user experience from my perspective:

let scalar = Int64Scalar::new(42);
let array = Int64Array::from(....);

// Call the "add" kernels in a new module
let result: Datum = arrow::compute::datum::add(&scalar, &array);

Maybe the types could look like this:

enum Datum<'a> {
    Array(&'a dyn Array),
    Scalar(&'a dyn Scalar),
    OwnedArray(Arc<dyn Array>),
    OwnedScalar(Arc<dyn Scalar>),
}

impl<'a> Into<Datum<'a>> for &'a Int64Scalar {
  ..
}

fn add(left: impl Into<Datum<'_>>, right: impl Into<Datum<'_>>) -> Datum<'static> {
  ..
}

Perhaps?

@tustvold (Contributor) commented Jun 1, 2023

dispatched to the existing ones

The existing ones don't support scalars with logical types; that's the whole issue 😅. We have to change the way we support scalars, and I'm leaning towards the option that doesn't involve bifurcating the API...

except when one

There aren't a huge number of binary kernels, so I think we could get pretty good coverage quite quickly -- certainly orders of magnitude faster than porting all of the existing kernels to support some notion of scalar.

Is this not something that could be overcome with a bit of documentation? From the UX side, I can see an argument that having to treat scalars differently is actually a worse UX... It has certainly led to a fairly substantial amount of logic in DF...

Into<Datum>

This runs into the blanket impl issue: there isn't a way to provide blanket impls for the concrete types, which then forces peculiar things like (&array as &dyn Array)

Edit: I've done some further experiments and had a great chat with @alamb; will write up more tomorrow. I think there is a nicer way to do this.

@alamb (Contributor, Author) commented Jun 1, 2023

This runs into the blanket impl issue: there isn't a way to provide blanket impls for the concrete types, which then forces peculiar things like (&array as &dyn Array)

I was imagining implementing Into<Datum> explicitly for all the concrete types...
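
Something like the following, assuming the Datum<'a> enum sketched earlier in the thread (the impls would in practice likely be macro-generated, one per concrete type):

use arrow::array::Int64Array;

// Explicit, per-type conversions: no blanket impls, so no coherence
// conflict, at the cost of one impl per concrete array/scalar type.
impl<'a> From<&'a Int64Array> for Datum<'a> {
    fn from(array: &'a Int64Array) -> Self {
        Datum::Array(array)
    }
}
// ... repeated for every other array type, and for each scalar type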

@sundy-li (Contributor) commented Jun 2, 2023

For what it is worth, I think this is what DuckDB does (at least, this is how I interpret this slide from the 22 - DuckDB Internals (CMU Advanced Databases / Spring 2023) lecture)

If the array is already in Flat format, it needs to construct an extra Selection vector to iterate the array, which may be a little more overhead than iterating the array itself.

We used an enum to represent the vector; it's hard to add support for another dictionary vector.

#[derive(Debug, Clone, PartialEq, EnumAsInner)]
pub enum ValueRef<'a, T: ValueType> {
    Scalar(T::ScalarRef<'a>),
    Column(T::Column),
}

The main drawback is code bloat, so we use a code generator to help write the vectorized methods; the generated file is ~6000 LOC.
