Add scalar comparison kernels for DictionaryArray #984

Closed
Changes from 2 commits
113 changes: 111 additions & 2 deletions arrow/src/compute/kernels/comparison.rs
@@ -27,8 +27,9 @@ use crate::buffer::{bitwise_bin_op_helper, buffer_unary_not, Buffer, MutableBuff
use crate::compute::binary_boolean_kernel;
use crate::compute::util::combine_option_bitmap;
use crate::datatypes::{
-    ArrowNumericType, DataType, Float32Type, Float64Type, Int16Type, Int32Type,
-    Int64Type, Int8Type, UInt16Type, UInt32Type, UInt64Type, UInt8Type,
+    ArrowNativeType, ArrowNumericType, ArrowPrimitiveType, DataType, Dictionary,
+    Float32Type, Float64Type, Int16Type, Int32Type, Int64Type, Int8Type, UInt16Type,
+    UInt32Type, UInt64Type, UInt8Type,
};
use crate::error::{ArrowError, Result};
use crate::util::bit_util;
@@ -200,6 +201,42 @@ macro_rules! compare_op_scalar_primitive {
}};
}

macro_rules! compare_dict_op_scalar {
    ($left:expr, $right:expr, $op:expr) => {{
        let null_bit_buffer = $left
            .data()
            .null_buffer()
            .map(|b| b.bit_slice($left.offset(), $left.len()));

        let values = $left
            .values()
            .as_any()
            .downcast_ref::<StringArray>()
Contributor

The values array can be anything (not always a StringArray). Perhaps this would be a good place to use the dyn_XX kernels to compare the values array with $right.

Contributor Author

From my understanding of the dyn kernels, those can't be used when comparing to a constant, right?

Contributor (@alamb, Nov 29, 2021)

🤔 yes you are correct -- we would need to add dyn_XX_lit type kernels, but that seems a bit overkill for this PR

Contributor Author

if the primary use case for this PR was comparing dict array to constant then maybe it makes sense for me to do a separate PR for that first and then come back to this?

Contributor

> if the primary use case for this PR was comparing dict array to constant then maybe it makes sense for me to do a separate PR for that first and then come back to this?

I think focusing on the use case of comparing a dict array to a constant is the best choice for now.

Contributor Author

Ok! Will start with that.

Contributor Author

@alamb I've been reviewing this, but I think I might be missing something. My understanding is that my code above gets the dictionary values, which can be of any type (of course, above I'm only handling StringArray).

        let values = $left
            .values()
            .as_any()
            .downcast_ref::<StringArray>()
            .unwrap()

But then you mention using the new dyn_xx kernels / creating dyn_xx_lit kernels. Since there's no actual compute being done here, what would the dyn kernels be used for? Or were you referring to using the kernels to replace more than just that section of code?

To me it looks like I need a macro to downcast DictionaryArray.values() into whatever type the values are, and then I could use something like dyn_xx_lit on that to get the comparison results. Is this roughly what you had in mind?
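
The downcast question above can be illustrated with a minimal sketch in plain Rust. It assumes a type-erased `&dyn Any` as a stand-in for the dictionary's values array; `values_as_strings` is a hypothetical helper, not part of arrow. It shows the failure mode the reviewer pointed at: when the values are not strings, the downcast yields nothing, so returning an error is safer than calling `.unwrap()`.

```rust
use std::any::Any;

/// Hypothetical helper: try to view type-erased dictionary values as strings.
/// Returns an error instead of panicking when the values have another type.
fn values_as_strings(values: &dyn Any) -> Result<&Vec<String>, String> {
    values
        .downcast_ref::<Vec<String>>()
        .ok_or_else(|| "dictionary values are not strings".to_string())
}
```

Calling this with a boxed `Vec<String>` returns `Ok`, while a boxed `Vec<i32>` returns `Err` rather than panicking, which is the behaviour a kernel returning `Result<BooleanArray>` would presumably want.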

Contributor

I am very sorry for confusing this conversation by mentioning dyn_xx_lit.

What I was (inarticulately) trying to say was that once you have eq_dict_scalar (and you will likely also need eq_dict_scalar_utf8), we will end up with several different ways to compare an array to a scalar, depending on the array type.

So I was thinking ahead to adding functions like

fn eq_scalar_utf8_dyn(array: &dyn Array, right: &str) -> Result<BooleanArray> {
    // dispatch to the right kernel based on the type of `array`
    unimplemented!()
}

But definitely not for this PR

            .unwrap();

        // Safety:
        // `i < $left.len()`
        let comparison = (0..$left.len()).map(|i| unsafe {
            let key = $left.keys().value_unchecked(i).to_usize().unwrap();
            $op(values.value_unchecked(key), $right)
        });
Contributor

I think one of the main points of this ticket is to avoid the call to values.value_unchecked here.

I like to think about the goal by asking "what would happen with a DictionaryArray with 1000000 entries but a dictionary of size 1?" The way you have this PR, I think we would call $op 1000000 times. The idea is to call $op 1 time.

So the pattern I think we are looking for, at least for the constant kernels, is:

In pseudocode:

let values = dict_array.values();
let comparison_result_on_values = apply_op_to_values();
let result = dict_array.keys().iter().map(|index| comparison_result_on_values[index]).collect()
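
The pseudocode above can be sketched in plain Rust as follows. This is a simplified model under stated assumptions, not the arrow API: `DictColumn` and `compare_dict_scalar` are hypothetical names, keys are `Option<usize>` with `None` standing in for a null row, and values are always strings.

```rust
/// Simplified model of a dictionary-encoded string column (not the arrow API):
/// `keys[i]` indexes into `values`; `None` marks a null row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<String>,
}

/// Compare every row against `scalar`, calling `op` once per dictionary value
/// rather than once per row.
fn compare_dict_scalar<F>(col: &DictColumn, scalar: &str, op: F) -> Vec<Option<bool>>
where
    F: Fn(&str, &str) -> bool,
{
    // Step 1: evaluate the predicate on the (usually small) values array.
    let value_results: Vec<bool> = col
        .values
        .iter()
        .map(|v| op(v.as_str(), scalar))
        .collect();

    // Step 2: gather the precomputed results through the keys; nulls stay null.
    // An out-of-range key would panic here; a real kernel must rule that out.
    col.keys
        .iter()
        .map(|key| key.map(|k| value_results[k]))
        .collect()
}
```

With a dictionary of size 1 and 1000000 keys, `op` runs once and the remaining work is an index lookup per row. The trade-off is that `op` also runs for dictionary values that no key references.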

Contributor Author

Makes sense, thanks for explanation. I am looking into this.

Contributor Author

@alamb I'm struggling with the second step in your pseudocode, given that my understanding is that the values could be of any ArrowPrimitiveType. Would you be able to provide guidance on how to handle that? I've been playing with different macros and iteration options on the underlying buffers, but I feel like I'm missing some fundamental understanding of how to work with dynamic data types like this or how to use ArrayData.

Contributor (@alamb, Dec 2, 2021)

🤔

Yes, this is definitely tricky. Maybe take a step back and think about the use case: comparing DictionaryArrays to literals.

For example, if you look at the comparison kernels (for eq), https://docs.rs/arrow/6.3.0/arrow/compute/kernels/comparison/index.html, we find:

eq_scalar
eq_bool_scalar
eq_utf8_scalar

Each is typed based on the type of the scalar (because the arrays are typed).

The issue with a DictionaryArray is that it could have numbers, bools, strings, etc., so we can't have a single entry point as we do with other types of arrays.

So I am thinking we would need something like

eq_dict_scalar // numeric 
eq_dict_bool_scalar // boolean
eq_dict_utf8_scalar // strings

where each of those kernels would be able to downcast the array appropriately.

However, having three functions for each dict kernel seems somewhat crazy.

That is where my dyn idea was coming from. If we are going to add three new kernels for each operator (eq, lt, etc) we could perhaps add

eq_dyn_scalar // numeric 
eq_dyn_bool_scalar // boolean
eq_dyn_utf8_scalar // strings

etc

These would handle DictionaryArray as well as dispatch to the other eq_scalar, eq_bool_scalar, eq_utf8_scalar kernels as appropriate.

Does that make sense? I can try and sketch out the interface this weekend sometime
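
As a rough illustration of the dispatch idea in this comment, here is a sketch in plain Rust. Everything in it is an assumption for illustration: the `Column` enum is a hypothetical stand-in for arrow's dynamically typed arrays, and this `eq_dyn_utf8_scalar` is not the signature that arrow eventually adopted. It only shows one entry point that handles plain string columns and dictionary-encoded ones, and rejects mismatched types.

```rust
/// Hypothetical, simplified column model (not the arrow `Array` trait).
enum Column {
    Utf8(Vec<Option<String>>),
    Int64(Vec<Option<i64>>),
    /// Dictionary-encoded column: keys index into a nested values column.
    Dictionary {
        keys: Vec<Option<usize>>,
        values: Box<Column>,
    },
}

/// Sketch of a single string-scalar entry point that dispatches on column type.
fn eq_dyn_utf8_scalar(array: &Column, right: &str) -> Result<Vec<Option<bool>>, String> {
    match array {
        // Plain string column: compare row by row, keeping nulls as nulls.
        Column::Utf8(rows) => Ok(rows
            .iter()
            .map(|v| v.as_deref().map(|s| s == right))
            .collect()),
        // Dictionary column: compare the values once, then gather through the keys.
        Column::Dictionary { keys, values } => {
            let on_values = eq_dyn_utf8_scalar(values, right)?;
            Ok(keys.iter().map(|k| k.and_then(|i| on_values[i])).collect())
        }
        // A string scalar cannot be compared with a numeric column.
        Column::Int64(_) => {
            Err("cannot compare a numeric column with a string scalar".to_string())
        }
    }
}
```

Numeric and boolean entry points would follow the same shape with a different scalar type, which is how three `dyn` functions per operator could cover dictionary and non-dictionary inputs alike.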

Contributor Author

Thanks for the explanation! Yes, it does make sense. I think I was trying to do too much in my macros / functions, which was causing my confusion. I think if I can get one of the below to work, that should give me a baseline for the rest.

eq_dict_scalar // numeric 
eq_dict_bool_scalar // boolean
eq_dict_utf8_scalar // strings

        // same as $left.len()
        let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) };

        let data = unsafe {
            ArrayData::new_unchecked(
                DataType::Boolean,
                $left.len(),
                None,
                null_bit_buffer,
                0,
                vec![Buffer::from(buffer)],
                vec![],
            )
        };
        Ok(BooleanArray::from(data))
    }};
}
/// Evaluate `op(left, right)` for [`PrimitiveArray`]s using a specified
/// comparison function.
pub fn no_simd_compare_op<T, F>(
@@ -693,6 +730,14 @@ pub fn eq_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
compare_op_scalar!(left, right, |a, b| a == b)
}

/// Perform `left == right` operation on [`DictionaryArray`] and a scalar.
pub fn eq_dict_scalar<OffsetSize: ArrowPrimitiveType>(
    left: &DictionaryArray<OffsetSize>,
    right: &str,
) -> Result<BooleanArray> {
    compare_dict_op_scalar!(left, right, |a, b| a == b)
}

#[inline]
fn binary_boolean_op<F>(
left: &BooleanArray,
@@ -802,6 +847,14 @@ pub fn neq_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
compare_op_scalar!(left, right, |a, b| a != b)
}

/// Perform `left != right` operation on [`DictionaryArray`] and a scalar.
pub fn neq_dict_scalar<OffsetSize: ArrowPrimitiveType>(
    left: &DictionaryArray<OffsetSize>,
    right: &str,
) -> Result<BooleanArray> {
    compare_dict_op_scalar!(left, right, |a, b| a != b)
}

/// Perform `left < right` operation on [`StringArray`] / [`LargeStringArray`].
pub fn lt_utf8<OffsetSize: StringOffsetSizeTrait>(
left: &GenericStringArray<OffsetSize>,
@@ -818,6 +871,14 @@ pub fn lt_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
compare_op_scalar!(left, right, |a, b| a < b)
}

/// Perform `left < right` operation on [`DictionaryArray`] and a scalar.
pub fn lt_dict_scalar<OffsetSize: ArrowPrimitiveType>(
    left: &DictionaryArray<OffsetSize>,
    right: &str,
) -> Result<BooleanArray> {
    compare_dict_op_scalar!(left, right, |a, b| a < b)
}

/// Perform `left <= right` operation on [`StringArray`] / [`LargeStringArray`].
pub fn lt_eq_utf8<OffsetSize: StringOffsetSizeTrait>(
left: &GenericStringArray<OffsetSize>,
@@ -834,6 +895,14 @@ pub fn lt_eq_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
compare_op_scalar!(left, right, |a, b| a <= b)
}

/// Perform `left <= right` operation on [`DictionaryArray`] and a scalar.
pub fn lt_eq_dict_scalar<OffsetSize: ArrowPrimitiveType>(
    left: &DictionaryArray<OffsetSize>,
    right: &str,
) -> Result<BooleanArray> {
    compare_dict_op_scalar!(left, right, |a, b| a <= b)
}

/// Perform `left > right` operation on [`StringArray`] / [`LargeStringArray`].
pub fn gt_utf8<OffsetSize: StringOffsetSizeTrait>(
left: &GenericStringArray<OffsetSize>,
@@ -850,6 +919,14 @@ pub fn gt_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
compare_op_scalar!(left, right, |a, b| a > b)
}

/// Perform `left > right` operation on [`DictionaryArray`] and a scalar.
pub fn gt_dict_scalar<OffsetSize: ArrowPrimitiveType>(
    left: &DictionaryArray<OffsetSize>,
    right: &str,
) -> Result<BooleanArray> {
    compare_dict_op_scalar!(left, right, |a, b| a > b)
}

/// Perform `left >= right` operation on [`StringArray`] / [`LargeStringArray`].
pub fn gt_eq_utf8<OffsetSize: StringOffsetSizeTrait>(
left: &GenericStringArray<OffsetSize>,
@@ -866,6 +943,14 @@ pub fn gt_eq_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
compare_op_scalar!(left, right, |a, b| a >= b)
}

/// Perform `left >= right` operation on [`DictionaryArray`] and a scalar.
pub fn gt_eq_dict_scalar<OffsetSize: ArrowPrimitiveType>(
    left: &DictionaryArray<OffsetSize>,
    right: &str,
) -> Result<BooleanArray> {
    compare_dict_op_scalar!(left, right, |a, b| a >= b)
}

/// Helper function to perform boolean lambda function on values from two arrays using
/// SIMD.
#[cfg(feature = "simd")]
@@ -2032,6 +2117,30 @@ mod tests {
);
}

    #[test]
    fn test_dict_eq_scalar() {
        let a: DictionaryArray<Int8Type> =
            vec!["hi", "hello", "world"].into_iter().collect();
        let a_eq = eq_dict_scalar(&a, "hello").unwrap();
        assert_eq!(a_eq, BooleanArray::from(vec![false, true, false]));
    }

    #[test]
    fn test_dict_neq_scalar() {
        let a: DictionaryArray<Int8Type> =
            vec!["hi", "hello", "world"].into_iter().collect();
        let a_eq = neq_dict_scalar(&a, "hello").unwrap();
        assert_eq!(a_eq, BooleanArray::from(vec![true, false, true]));
    }

    #[test]
    fn test_dict_lt_scalar() {
        let a: DictionaryArray<Int8Type> =
            vec!["hi", "hello", "world"].into_iter().collect();
        let a_eq = lt_dict_scalar(&a, "hi").unwrap();
        assert_eq!(a_eq, BooleanArray::from(vec![false, true, false]));
    }

macro_rules! test_utf8_scalar {
($test_name:ident, $left:expr, $right:expr, $op:expr, $expected:expr) => {
#[test]