-
Notifications
You must be signed in to change notification settings - Fork 835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Scalar
/ Datum
support to compute kernels
#1047
Comments
I support adding higher level easy to use API for non-performance critical code, but it would be good to still keep the specialized kernels around so users can still leverage them when type info is available at the call site. |
The reason I was saying the specialized kernels might not be totally useful was, my mind, that the overhead of a type dispatch could be almost entirely amortized -- like if we are comparing 1000 elements in an array, an extra check on the array's type might not be a huge deal I don't see anything wrong with leaving all the specialized kernels available too (maybe it would help inlining 🤔 ) |
I believe this overlaps with #2842 |
So I've been playing around with this and the major challenge is avoiding a huge amount of API churn / boilerplate Take the signature
Its not clear how to convert this to a Datum based model. One option would be
Where
But this has a couple of issues
Making Another option would be to make the methods generic, with
Taking a step back I had a potentially controversial thought, why not just treat a single element array as a scalar array? This would have some pretty compelling advantages:
The obvious downside is the representation is not very memory efficient. I think the question boils down to what is the purpose of the scalar representation, is it:
My 2 cents is that 2. is a use-case better served by the row representation, and 3. is beyond the scope of a vectorized execution engine, and therefore 1. is the target for this feature. As such I think this is perfectly acceptable approach. The overheads of the slightly less efficient representation will be more than outweighed by the costs of the dynamic dispatch alone. What do people think? |
For what it is worth I think this is what DuckDB does (at least this is how I interpret this slide from the 22 - DuckDB Internals (CMU Advanced Databases / Spring 2023) lecture I think the biggest potential downside of the "use a single row to mean a scalar value" could be that it is confusing as an API. The two input array sizes have to match except when one of them (would it always be the right argument?) had a single row, in which case it would have a special fast path 🤔 But it wouldn't be clear from the call signature if there was a special fast path or not. Edit: I have another idea I am typing up |
What if we made a 'new' set of kernels (that just dispatched to the existing ones, if they existed) This seems like the ideal user experience from my perspective: let scalar = Int64Scalar::new(42);
let array = Int64Array::from(....);
// Call the "add" kernels in a new module
let result: Datum = arrow::compute::datum::add(&scalar, &array); Maybe the types could look like this:
Perhaps? |
The existing ones don't support scalars with logical types, that's the whole issue 😅. We have to change the way we support scalars, and I'm leaning towards the one that doesn't involve bifurcating the API...
There aren't a huge number of binary kernels, so I think we could get pretty good coverage quite quickly. Certainly orders of magnitude faster than porting all of the existing kernels to support some notion of scalar. Is this not something that could be overcome with a bit of documentation, from the UX side I can see an argument that having to treat scalars differently is actually a worse UX... It has certainly led to a fairly substantial amount of logic in DF...
This runs into the blanket impl issue, there isn't a way to provide blanket impls for the concrete types which then forces peculiar things like Edit: done some further experiments and had a great chat with @alamb will write up more tomorrow, I think there is a nicer way to do this |
I was imagining implementing |
If the array is already in Flat format, it needs to construct an extra We used an enum to represent the vector, it's hard to add support for another dictionary vector.
The main drawback is code flood, so we use the code generator to help write vectorized methods, the generated file is ~6000 LOC . |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When implementing analytics, users often want to do operations like
array
+constant
orarray + array
The Rust implementation of Arrow is strongly typed 🙌 , including the compute kernels, but often data is passed around in
Arc<dyn Array>
/ArrayRef
so the types can be dynamic. This then places the burden on the user of the library to determine the exact types of the inputs in order to call the appropriate kernelsLet's take for example, trying to compare an array to find all elements that contain the number
5
and your input an array may haveInt64
orInt32
valuesIn the current version of arrow, in order to do so you would need to write code like
This ends up being macroized and is non ideal because the user had to do dynamic type dispatch anyways, so there is no runtime performance benefit to strongly typed kernels
Describe the solution you'd like
It would be nice to be able to call a single function and let the rust library dynamically (at runtime) pick the correct kernel to call.
As described and suggested by @jorgecarleitao and @nevi-me in #984 (comment) this could go all the way and take the form of following the C++
Datum
modelWhere
Datum
looks something likeAnd
ScalarValue
which would look something likeScalarValue
from DataFusion https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/scalar.rs orScalar
from arrow2 https://github.com/jorgecarleitao/arrow2/tree/main/src/scalarDescribe alternatives you've considered
The alternative is a continued proliferation of individual kernels such as
eq
,eq_utf8
, etcAdditional context
There is a lot of additional discussion on #984
cc @matthewmturner @Dandandan @jimexist @houqp
The text was updated successfully, but these errors were encountered: