-
Notifications
You must be signed in to change notification settings - Fork 809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add scalar comparison kernels for DictionaryArray #984
Add scalar comparison kernels for DictionaryArray #984
Conversation
@alamb I started the work on this. Would you be able to give it a quick look when you get the chance to make sure its going in the right direction? |
let values = $left | ||
.values() | ||
.as_any() | ||
.downcast_ref::<StringArray>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values array can be anything (not always a StringArray
) -- perhaps this would be a good place to use the dyn_XX
kernels -- to compare the values array with $right
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from my understanding of the dyn kernels those cant be used when comparing to constant right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 yes you are correct -- we would need to add dyn_XX_lit
type kernels, but that seems a bit overkill for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the primary use case for this PR was comparing dict array to constant then maybe it makes sense for me to do a separate PR for that first and then come back to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the primary use case for this PR was comparing dict array to constant then maybe it makes sense for me to do a separate PR for that first and then come back to this?
I think focusing on the usecase of comparing dict array to constant is the best choice for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! Will start with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb ive been reviewing this but i think i might be missing something. my understanding is that my code above is for getting the dictionary values, which can be of any type (of course above im only handling StringArray
).
let values = $left
.values()
.as_any()
.downcast_ref::<StringArray>()
.unwrap()
But then you mention using the new dyn_xx
kernels / creating dyn_xx_lit
kernels. Since theres no actual compute being done here, what would the dyn
kernels be used for? Or were you referring to using the kernels to replace more than just that section of code?
to me it looks like i need a macro to downcast DictionaryArray.values()
into whatever type the values are, and then i could use something like dyn_xx_lit
on that in order to get the comparison results. Is this roughly what you had in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very sorry for confusing this conversation with mentioning dyn_xx_lit
.
What I was (inarticulately) trying to say was that once you have eq_dict_scalar
(and you will likely also need eq_dict_scalar_utf8
) we will end up with several different ways to compare an array to a scalar, depending on the array type
So I was thinking ahead to adding functions like
fn eq_scalar_utf8_dyn(array: dyn &Array, right: &str) -> Boolean {
// do dispatch to the right kernel based on type of array
}
But definitely not for this PR
let comparison = (0..$left.len()).map(|i| unsafe { | ||
let key = $left.keys().value_unchecked(i).to_usize().unwrap(); | ||
$op(values.value_unchecked(key), $right) | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one of the main points of this ticket is to avoid the call here to vaues.value_unchecked
I like to think about the goal in by thinking "what would happen with DictionaryArray with 1000000 entries but a dictionary of size 1?" -- the way you have this PR, I think we would call $op
1000000 times. The idea is to call $op
1 time.
So the pattern I think we are looking, at least for the constant kernels is:
In pseudo code:
let values = dict_array.values();
let comparison_result_on_values = apply_op_to_values();
let result = dict_array.keys().iter().map(|index| comparison_result_on_values[index]).collect()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for explanation. I am looking into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb im struggling with the second step in your pseudocode given that my understanding is that the values could be of any ArrowPrimativeType
. Would you be able to provide guidance on how to handle that? I've been playing with different macros and iteration options on the underlying buffers, but i feel like im missing some fundamental understanding about how to work with dynamic data type like this or how to use ArrayData
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔
Yes this is definitely tricky. Maybe taking a step back, and think about the usecase: comparing DictionaryArrays to literals.
For example, if you look at the comparison kernels (for eq
) , https://docs.rs/arrow/6.3.0/arrow/compute/kernels/comparison/index.html we find;
eq_scalar
eq_bool_scalar
eq_utf8_scalar
With each being typed based on the type of scalar (because the arrays are typed)
The issue with a DictionaryArray
is that it could have numbers, bool, strings, etc. so we can't have a single entrypoint as we do with other types of arrays
So i am thinking we would need something like
eq_dict_scalar // numeric
eq_dict_bool_scalar // boolean
eq_dict_utf8_scalar // strings
where each of those kernels would be able to downcast the array appropriately.
However, having three functions for each dict kernel seems somewhat crazy.
That is where my dyn idea was coming from. If we are going to add three new kernels for each operator (eq
, lt
, etc) we could perhaps add
eq_dyn_scalar // numeric
eq_dyn_bool_scalar // boolean
eq_dyn_utf8_scalar // strings
etc
Which handle DictionaryArray as well as dispatching to the other eq_scalar
, eq_bool_scalar
, eq_utf8_scalar
as appropriate.
Does that make sense? I can try and sketch out the interface this weekend sometime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for explanation! yes, it does make sense. i think i was trying to do too much in my macros / functions which was causing my confusion. i think if i can get one of the below to work that should give me my baseline to do the rest.
eq_dict_scalar // numeric
eq_dict_bool_scalar // boolean
eq_dict_utf8_scalar // strings
@alamb this took me much longer than it should have, and im still not even sure if its idiomatic rust / arrow but i think i might be close to getting the one thing i noticed is that it doesnt seem there is way to create a Either way, would you be able to see if this update is going in the right direction? If so, then I can extend to utf8 kernels etc. UPDATE: |
You can create
There isn't an equivalent syntax for other scalar types (e.g. |
The type system is very tricky -- let me see if I can sketch something out to help |
@alamb i made some updates and the scalar test is now passing! i think i did it right...maybe some stylistic changes to be made though. If you're okay with this implementation then i can add the other ops for scalar. i had started on the utf8_scalar macro but im still trying to figure out if i can reuse |
Codecov Report
@@ Coverage Diff @@
## master #984 +/- ##
==========================================
- Coverage 82.32% 82.31% -0.01%
==========================================
Files 168 168
Lines 48717 49051 +334
==========================================
+ Hits 40107 40378 +271
- Misses 8610 8673 +63
Continue to review full report at Codecov.
|
|
||
// Safety: | ||
// `i < $left.len()` | ||
let comparison: Vec<bool> = (0..array.len()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't think the values()
(the dictionary size) has to the same as the size of the overall array 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if youre referring to the safety comment i just hadnt removed that yet
pub fn eq_dict_scalar<T>( | ||
left: &DictionaryArray<T>, | ||
right: T::Native, | ||
) -> Result<BooleanArray> | ||
where | ||
T: ArrowNumericType, | ||
{ | ||
#[cfg(not(feature = "simd"))] | ||
println!("{}", std::any::type_name::<T>()); | ||
return compare_dict_op_scalar!(left, T, right, |a, b| a == b); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matthewmturner this is what I was trying to say.
I think the way you have this function with a single T
generic parameter means one could not compare a DictionaryArray<Int8>
(aka that has keys / indexes of Int8
) that had values of type DataType::Unt16
Here is a sketch of how this might work:
/// Perform `left == right` operation on a [`DictionaryArray`] and a numeric scalar value.
pub fn eq_dict_scalar<T, K>(
left: &DictionaryArray<K>,
right: T::Native,
) -> Result<BooleanArray>
where
T: ArrowNumericType,
K: ArrowNumericType,
{
// compare to the dictionary values (e.g if the dictionary is {A,
// B} and the keys are {1,0,1,1} that represents the values B, A,
// B, B.
//
// So we compare just the dictionary {A, B} values to `right` and
//
// TODO macro-ize this
let dictionary_comparison = match left.values().data_type() {
DataType::Int8 => {
eq_scalar(as_primitive_array::<T>(left.values()), right)
}
// TODO fill in Int16, Int32, etc
_ => unimplemented!("Should error: dictionary did not store values of type T")
}?;
// Required for safety below
assert_eq!(dictionary_comparison.len(), left.values().len());
// Now, look up the dictionary for each output
let result: BooleanArray = left.keys()
.iter()
.map(|key| {
// figure out how the dictionary element at this index
// compared to the scalar
key.map(|key| {
// safety: the original array's indices were valid
// `(0 .. left.values().len()` and dictionary_comparisoon
// is the same size, checked above
unsafe {
// it would be nice to avoid checking the conversion each time
let key = key.to_usize().expect("Dictionary index not usize");
dictionary_comparison.value_unchecked(key)
}
})
})
.collect();
Ok(result)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx much for putting this together and the explanation. I'll work on implementing it!
I wonder if a SIMD implementation could be done as well |
If we take the approach I tried to sketch in #984 (comment) i think the comparisons themselves could be vectorized (ideally by falling back to the existing vectorized kernel) |
@alamb ive started updating to your proposed approach but now when testing i get an error that a type annotation is needed for type parameter T. Im still playing around with it but ran out of time for today. Does that mean that the scalar needs to be casted to an
my naive view is that we are casting to a or said differently, if we're using |
@alamb I thought the benefit of dictionary comparison would be that it would enable vectorization of key comparison (especially useful when values are strings), since the number of keys would usually be much larger than the number of values (although vectorization of value comparison is a nice fallback in case the dictionary isn't sorted and so a binary search wouldn't work). In your proposed approach vectorization of key comparison could still happen, but is left to the compiler (instead of using explicit vectorization from existing comparison kernels). Or have I misunderstood? |
Yes that is my understanding @yordan-pavlov
I guess I was imagining that the key comparisons would be exactly as vectorized as the existing comparison kernels; Specifically, in my sketch above there is a call to eq_scalar(as_primitive_array::<T>(left.values()), right) |
@matthewmturner I plan to answer your question in more detail tomorrow. |
@matthewmturner -- I was able to make your branch compile using this change, but that seems quite non-ideal diff --git a/arrow/src/compute/kernels/comparison.rs b/arrow/src/compute/kernels/comparison.rs
index f98e15d549..59d960220d 100644
--- a/arrow/src/compute/kernels/comparison.rs
+++ b/arrow/src/compute/kernels/comparison.rs
@@ -2135,7 +2135,7 @@ mod tests {
builder.append_null().unwrap();
builder.append(223).unwrap();
let array = builder.finish();
- let a_eq = eq_dict_scalar(&array, 123).unwrap();
+ let a_eq = eq_dict_scalar::<UInt8Type, UInt8Type>(&array, 123).unwrap();
assert_eq!(
a_eq,
BooleanArray::from(vec![Some(true), None, Some(false)]) I had been hoping something like this would work let a_eq = eq_dict_scalar(&array, 123u8).unwrap(); (as in ensure that the 123 was typed as a u8) But the compiler still says it needs type annotations. So while the direct usability of @tustvold or @carols10cents do you have any ideas how we might avoid having to add type annotations to the call of |
I am still thinking about this one |
So I messed around with this some more. The key thing I kept hitting was that the comparisons to the dictionary values effectively needed to be "dynamic" Here is one approach that we might be able to use (and skip the specialized /// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimtiveArrays, and DictionaryArrays that have primitive values
pub fn eq_dyn_scalar<T>(
left: &dyn Array,
right: T::Native,
) -> Result<BooleanArray>
where
T: ArrowNumericType,
{
#[cfg(not(feature = "simd"))]
println!("{}", std::any::type_name::<T>());
match left.data_type() {
DataType::UInt8 => {
// horrible (?) way to get a u8
let right: u8 = right.to_usize()
.and_then(|right| right.try_into().ok())
.ok_or_else(|| ArrowError::ComputeError(format!("Can not convert {:?} to u8 for comparison with UInt8Array", right)))?;
eq_scalar::<UInt8Type>(as_primitive::<UInt8Type>(left), right)
}
DataType::UInt16 => {
// horrible (?) way to get a u16
let right: u16 = right.to_usize()
.and_then(|right| right.try_into().ok())
.ok_or_else(|| ArrowError::ComputeError(format!("Can not convert {:?} to u16 for comparison with UInt16Array", right)))?;
eq_scalar::<UInt16Type>(as_primitive::<UInt16Type>(left), right)
}
// TODO other primitive array types
DataType::Dictionary(key_type, value_type) => {
match key_type.as_ref() {
DataType::UInt8 => {
let left = as_dictionary::<UInt8Type>(left);
unpack_dict_comparison(left, eq_dyn_scalar::<T>(left.values().as_ref(), right)?)
}
// TODO fill out the rest of the key types here
_ => todo!()
}
}
// TODO macroize / fill out rest of primitive dispatch
_ => todo!()
}
} The downside is you still need type annotations at the callsite: #[test]
fn test_dict_eq_scalar() {
...
let array = builder.finish();
// still need the UInt8Type annotations
let a_eq = eq_dyn_scalar::<UInt8Type>(&array, 123u8).unwrap();
assert_eq!(
a_eq,
BooleanArray::from(vec![Some(true), None, Some(false)])
);
} |
The other thing I can think of, that seems a bit of a hack, but might be ok would be to take something like /// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimtiveArrays, and DictionaryArrays that have primitive values
pub fn eq_dyn_scalar<T>(
left: &dyn Array,
right: T
) -> Result<BooleanArray>
where
T : TryInto<i128> + Copy + std::fmt::Debug
{
let right: i128 = right
.try_into()
.map_err(|_| ArrowError::ComputeError(format!("Can not convert scalar {:?} to i128", right)))?;
match left.data_type() {
DataType::UInt8 => {
let right: u8 = right
.try_into()
.map_err(|_| ArrowError::ComputeError(format!("Can not convert {:?} to u8 for comparison with UInt8Array", right)))?;
eq_scalar::<UInt8Type>(as_primitive::<UInt8Type>(left), right)
}
DataType::UInt16 => {
let right: u16 = right
.try_into()
.map_err(|_| ArrowError::ComputeError(format!("Can not convert {:?} to u16 for comparison with UInt16Array", right)))?;
eq_scalar::<UInt16Type>(as_primitive::<UInt16Type>(left), right)
}
// TODO macroize + fill out the other primitive array types here
DataType::Dictionary(key_type, value_type) => {
match key_type.as_ref() {
DataType::UInt8 => {
let left = as_dictionary::<UInt8Type>(left);
unpack_dict_comparison(left, eq_dyn_scalar(left.values().as_ref(), right)?)
}
// TODO fill out the rest of the key types here
_ => todo!()
}
}
_ => todo!()
}
}
/// unpacks the results of comparing left.values (as a boolean)
///
/// TODO add example
///
fn unpack_dict_comparison<K>(
left: &DictionaryArray<K>,
dict_comparison: BooleanArray,
) -> Result<BooleanArray>
where
K: ArrowNumericType,
{
assert_eq!(dict_comparison.len(), left.values().len());
let result: BooleanArray = left
.keys()
.iter()
.map(|key| {
key.map(|key| unsafe {
// safety lengths were verified above
let key = key.to_usize().expect("Dictionary index not usize");
dict_comparison.value_unchecked(key)
})
})
.collect();
Ok(result)
} Which then finally allows a call to #[test]
fn test_dict_eq_scalar() {
...
let array = builder.finish();
// YAY! No type annotations!
let a_eq = eq_dyn_scalar(&array, 123).unwrap();
assert_eq!(
a_eq,
BooleanArray::from(vec![Some(true), None, Some(false)])
);
} |
IMO we need a |
@jorgecarleitao -- Indeed, this is an excellent idea. 🤔 Following the existing pattern in the comparison kernels, we will likely end up with several kernels such as
Which might be nice as the user's rust code could leave the scalar @jorgecarleitao when you say "inspiration" do you mean it would be ok to incorporate https://github.com/jorgecarleitao/arrow2/tree/main/src/scalar into arrow-rs? I have somewhat lost track of what was needed IP clearance wise. Maybe since arrow2 is apache licensed, it is fine? |
@shepmaster this is a very cool idea -- in fact I think that is much better than taking an |
For reference (for anyone not following the gist), a struct DictionaryArray<K> {
indices: PrimitiveArray<K>
values: Arc<dyn Array>
} that is itemwise represented as indices
.into_iter()
.map(|optional_index| optional_index.map(|index| "values[index]")) where @alamb 's insight is that we can compare a dictionary with a scalar of the same (dynamic) type as |
"inspiration" because
|
Right -- I don't understand the implications of this statement (like if code is apache licensed, that means it can be incorporated into other projects, right?). So I don't really understand the IP clearance need but I remember it was complicated 🤷 On the topic of |
@alamb my understanding is that any meaningful chunk of code with any license needs to be cleared if it's not original creation of the dev written for the Apache Arrow project. I believe @jorgecarleitao referred to this: that trait is not IP cleared yet, thus it needs the legal process :/ |
@alamb @jorgecarleitao im not familiar enough with the rest of the code base, but would adding a The approach from @shepmaster seemed quite nice as well however would still require users setting the type of the scalar - is that acceptable? |
One approach could be to remove all the rust native scalars and replace with I think keeping with the existing style of separate kernels for separate rust native types for this PR is probable,
I defer to @jorgecarleitao on his plans, as he is the primary author of most of arrow2
I think it is acceptable (and a good idea) personally. It would result in perhaps something like: /// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimtiveArrays, and DictionaryArrays that have primitive numeric values
pub fn eq_dyn_scalar<T>(
left: &dyn Array,
right: T
) -> Result<BooleanArray>
where
T : IntoArrowNumericType
{
...
} ? |
@alamb ok! I can open a new issue for adding the |
Sounds good @matthewmturner . In terms of
What can I do to be most helpful for this project? |
BTW for some context, the comparison kernels for dictionaries are very important (though not critically urgent) for our usecase in IOx. as we have large amounts of dictionary string data. Thus given you are working on this I can reallocate some non trivial amount of my time to help. Thank you for all the effort so far @matthewmturner |
Yes, what @alippai wrote. Given the number of differences that we would need to apply, re-writing it shouldn't be very difficult. My comment was really just to point out that |
@alamb thank you for the additional context and all the guidance youve provided so far. just to expand on and summarize your plan to make sure im not missing any steps, can you confirm these are the steps?
Im happy to, and have enjoyed, working on this. but I dont want to slow down anything on the IOx side as a result of me getting up to speed on the different pieces here. If you're okay (and aligned on the steps above) with it we could continue as we have been on this, and ill start on the first step and ping you as I have questions / for guidance. If this isnt moving at a quick enough pace then we can work on plan for how you can assist on the implementation to speed it up? |
It's been a while, but if I recall correctly, the idea was for compute kernels in arrow to follow a similar approach to C++ where we have a The C++ impl has a It would have been desirable to move |
Hi @matthewmturner -- I will respond tomorrow - I ran out of time today |
Yes I think those are the steps I was proposing. Thank you @matthewmturner
Sounds like a good plan! Thank you! |
That is a great idea @nevi-me -- I couldn't find an existing ticket so I wrote a new one here: #1047 Maybe we can work towards that vision, starting with the equality kernels (we'll need all the pieces and tests, even when we have a fully dynamic dispatch, I suspect) |
@matthewmturner let me know if it would help to file a ticket for |
@alamb Sure that would be great! |
@shepmaster one quick question on your proposal. What would be the difference between using that approach compared to implementing the
@alamb FYI |
If it makes sense to use However, I don’t think that path is good as you’d still need to provide explicit types: what is the concrete type you are converting into? |
Actually, i think im mistaken and mixing up types and traits. I dont think we need to do a conversion. Sry about that. |
@matthewmturner - I filed #1068 to track the Before we spend time creating polished PRs for Once we had that basic framework in place then we could fill out the implementation in individual PRs. If it would help, I could try and take a crack at creating such a PR (likely based on this one) |
One area that still needs to be determined in C++ is interfaces for custom comparisons:
Maybe worth planning for this as well. |
Thanks @bkmgit -- perhaps you can file a ticket for the item ? |
I think we have had all the discussion on this ticket and have the follow up items tracked in other tickets / issues. Thus closing this PR down |
Which issue does this PR close?
Closes #869
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?