Define eq_dyn_scalar API #1074

Merged: 2 commits into apache:master on Jan 1, 2022

matthewmturner
Contributor

Which issue does this PR close?

Working on this in relation to #984 and #1068 with the end goal being to finalize how we want eq_dyn_scalar to work.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

github-actions bot added the arrow label (Changes to the arrow crate) on Dec 21, 2021
@matthewmturner
Contributor Author

@alamb I've started work on defining the eq_dyn_scalar API. I've tried to combine the different points we discussed and you proposed over the course of #984. Do you think this is going in the right direction?

alamb left a comment

looks good @matthewmturner

One challenge is going to be making code that is maintainable and not just swaths of copy/paste (but different type code). It will be a fun exercise

builder.append_null().unwrap();
builder.append(223).unwrap();
let array = builder.finish();
let a_eq = eq_dyn_scalar(&array, 123).unwrap();

looking good to me 👍

right
))
})?;
eq_scalar::<UInt8Type>(as_primitive_array::<UInt8Type>(left), right)

👍

alamb commented Dec 21, 2021

cc @shepmaster , @jorgecarleitao and @alippai

@matthewmturner
Contributor Author

@alamb yes agree. That's what I'm going to work on next, the first commit was basically just what you said (copy/paste) with minimal cleaning to show the high level api. Now going to try making macros that can be used for all dyn_scalar kernels.

Just to confirm, we're still expecting type annotation on the scalar right? I.e 123u8

alamb commented Dec 21, 2021

Just to confirm, we're still expecting type annotation on the scalar right? I.e 123u8

I am hoping that we can avoid those annotations, to be honest

@matthewmturner
Contributor Author

@alamb I think I might be getting close. I used your TryInto<i128> idea to enable using the kernel without type annotations.

I'm not sure why I'm hitting a recursion limit error though; I will need to do some more work on that.

Do you think this is going in the right direction?
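
The TryInto<i128> idea can be illustrated with a minimal standard-library sketch. It does not use the arrow crate: `eq_i8_scalar` and the `&[Option<i8>]` column (where `None` models a null) are hypothetical stand-ins, chosen only to show why an unsuffixed literal such as `123` is accepted.

```rust
use std::convert::{TryFrom, TryInto};

/// Hypothetical stand-in for an equality kernel over an Int8 column.
/// `None` models a null slot; this is not the arrow crate's API.
fn eq_i8_scalar<T>(values: &[Option<i8>], right: T) -> Result<Vec<Option<bool>>, String>
where
    T: TryInto<i128>,
{
    // Widen first: any built-in integer type can attempt a conversion to i128.
    let wide: i128 = right
        .try_into()
        .map_err(|_| String::from("Can not convert scalar to i128"))?;
    // Then narrow to the column's native type, failing if the value is out of range.
    let right = i8::try_from(wide)
        .map_err(|_| format!("Can not convert scalar {} to i8", wide))?;
    // Nulls stay null; other slots are compared against the narrowed scalar.
    Ok(values.iter().map(|v| v.map(|x| x == right)).collect())
}
```

Under these assumptions, `eq_i8_scalar(&col, 123)` compiles without a suffix because the literal falls back to `i32`, which converts to `i128`. A float scalar would be rejected at compile time, since the standard library has no `TryFrom<f64>` for `i128`.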

@matthewmturner matthewmturner marked this pull request as draft December 22, 2021 00:40
@@ -898,6 +898,126 @@ pub fn gt_eq_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
compare_op_scalar!(left, right, |a, b| a >= b)
}

macro_rules! dyn_cmp_scalar {
($LEFT: expr, $RIGHT: expr, $T: ident, $OP: ident, $TT: tt) => {{

Should $T and $TT be ty?

type_name::<$T>(),
))
})?;
$OP::<$TT>(left, right)

$OP::<$TT> could probably be fused as one expr macro argument

}

/// Perform `left == right` operation on an array and a numeric scalar
/// value. Supports PrimtiveArrays, and DictionaryArrays that have primitive values

Suggested change
/// value. Supports PrimtiveArrays, and DictionaryArrays that have primitive values
/// value. Supports PrimitiveArrays, and DictionaryArrays that have primitive values

where
K: ArrowNumericType,
{
assert_eq!(dict_comparison.len(), left.values().len());

Why an assertion as opposed to an error or a "no this does not match?"

I think it is an invariant (namely that left is the dictionary and dict_comparison is the result of comparing those values).

Perhaps we could rename the left parameter to dict to make this clearer
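
The invariant described here can be sketched without the arrow crate. `Dict`, `unpack_dict_comparison`, and `eq_dict_scalar` below are hypothetical simplifications: each distinct dictionary value is compared once, and the per-value results are then looked up by key.

```rust
/// Hypothetical, simplified dictionary column: each key indexes into `values`,
/// and a `None` key models a null row.
struct Dict {
    keys: Vec<Option<usize>>,
    values: Vec<i32>,
}

/// Expands a per-value comparison result into one result per row.
fn unpack_dict_comparison(dict: &Dict, dict_comparison: &[bool]) -> Vec<Option<bool>> {
    // Invariant: `dict_comparison` was produced by comparing `dict.values`,
    // so a length mismatch is a bug in the caller rather than bad user input.
    assert_eq!(dict_comparison.len(), dict.values.len());
    dict.keys
        .iter()
        .map(|key| key.map(|k| dict_comparison[k]))
        .collect()
}

/// Compares each distinct value once, then maps the results through the keys.
fn eq_dict_scalar(dict: &Dict, right: i32) -> Vec<Option<bool>> {
    let dict_comparison: Vec<bool> = dict.values.iter().map(|v| *v == right).collect();
    unpack_dict_comparison(dict, &dict_comparison)
}
```

Naming the parameter `dict` rather than `left`, as suggested, makes it visible at the call site that the two lengths being asserted belong to the same dictionary.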


macro_rules! dyn_compare_scalar {
($LEFT: expr, $RIGHT: expr, $OP: ident) => {{
let right = $RIGHT.try_into().map_err(|_| {

I'm missing where this right value is used...

matthewmturner commented Dec 22, 2021

I just hadn't yet updated where $RIGHT is used in all the match arms to use right instead

where
T: IntoArrowNumericType + TryInto<i128> + Copy + std::fmt::Debug,
{
dyn_compare_scalar!(left, right, eq_scalar)

I admit this is a drive-by review, but I'm not seeing the benefit of the macros here yet. They don't do any repetition reduction. It looks like dyn_compare_scalar could be inlined and dyn_cmp_scalar could be a regular function.
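
As a rough illustration of the "regular function" alternative, here is a sketch that uses only the standard library. `cmp_scalar`, `eq_scalar_u8`, and the slice-of-`Option` column are hypothetical, not the PR's code: the native type becomes a type parameter and the operator an ordinary closure argument, which covers what the `$T` and `$OP` macro arguments were doing.

```rust
use std::convert::TryFrom;

/// Generic function in place of a macro: `N` is the column's native type
/// and `op` is the comparison to apply. `None` models a null slot.
fn cmp_scalar<N, F>(left: &[Option<N>], right: i128, op: F) -> Result<Vec<Option<bool>>, String>
where
    N: TryFrom<i128> + Copy,
    F: Fn(N, N) -> bool,
{
    // Narrow the widened scalar to the native type once, up front.
    let right = N::try_from(right)
        .map_err(|_| format!("scalar {} is out of range for the column type", right))?;
    Ok(left.iter().map(|v| v.map(|x| op(x, right))).collect())
}

/// A concrete kernel is then a one-line wrapper rather than a macro expansion.
fn eq_scalar_u8(left: &[Option<u8>], right: i128) -> Result<Vec<Option<bool>>, String> {
    cmp_scalar(left, right, |a, b| a == b)
}
```

Whether this fits the real kernels depends on how the existing `eq_scalar` functions are parameterized, so treat it as a shape to aim for rather than a drop-in replacement.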

codecov-commenter commented Dec 23, 2021

Codecov Report

Merging #1074 (e8c5e46) into master (2ad99ec) will decrease coverage by 0.02%.
The diff coverage is 69.38%.


@@            Coverage Diff             @@
##           master    #1074      +/-   ##
==========================================
- Coverage   82.34%   82.31%   -0.03%     
==========================================
  Files         168      168              
  Lines       49479    49577      +98     
==========================================
+ Hits        40743    40810      +67     
- Misses       8736     8767      +31     
Impacted Files                             Coverage Δ
arrow/src/compute/kernels/comparison.rs    90.87% <69.38%> (-2.60%) ⬇️
arrow/src/datatypes/datatype.rs            66.38% <0.00%> (-0.43%) ⬇️
parquet_derive/src/parquet_field.rs        65.98% <0.00%> (-0.23%) ⬇️
arrow/src/array/transform/mod.rs           85.69% <0.00%> (+0.13%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2ad99ec...e8c5e46.

matthewmturner commented Dec 23, 2021

I've made a number of updates to simplify this, such as de-macroing a level, and have finally made our test pass. Of course this would need considerable cleanup, but I think it could demonstrate an approach to making this work. To be fair, I've thought I was close about 5 times now and have been wrong each time, so I'm definitely looking to get more insight from others.

@alamb as always, your feedback is greatly appreciated.

@shepmaster thank you for the earlier review. I've cleaned up based on some of your feedback, with the goal of demonstrating in simpler terms (i.e. fewer macros) what we're trying to do. I'm still quite new to Rust, so I'm not sure how to manage inlining dyn_compare_scalar. I've started to read up on it, but it will definitely take more time to get an understanding. If you could provide any additional color on that it would be greatly appreciated; otherwise I can follow up once I have a better understanding. I would also be interested in what @alamb has to say about it.

matthewmturner commented Dec 23, 2021

Also, as a note, I added IntoArrowNumericType just for testing purposes here; of course it wouldn't actually be added there or in this PR. I also wasn't sure how to access the associated type Arrow of that trait (i.e. in a match), so I added a method to get the relevant arrow type.

alamb left a comment

Looking really close @matthewmturner -- 🙏 -- thank you. I do wonder about having both IntoArrowNumericType and TryInto<i128>; I feel like they are redundant somehow

Comment on lines 918 to 920
$LEFT.as_any().downcast_ref::<Int8Array>().ok_or_else(|| {
ArrowError::CastError(String::from("Left array cannot be cast"))
})?;

I think you can use https://docs.rs/arrow/6.4.0/arrow/array/fn.as_primitive_array.html to simplify this

So something like

Suggested change
$LEFT.as_any().downcast_ref::<Int8Array>().ok_or_else(|| {
ArrowError::CastError(String::from("Left array cannot be cast"))
})?;
as_primitive_array::<Int8Type>($LEFT);


I think these functions require an Arc<dyn Array> / ArrayRef, which would mean the end user needs to wrap the array before passing it to the kernel. Or do you think it's okay to add the Arc in the kernel?
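
For context on this trade-off, here is a small standard-library sketch. The `Array` trait and `Int8Array` struct are hypothetical miniatures, not the arrow crate's types. It shows that downcasting through `as_any` only needs a borrowed `&dyn Array`, and that a caller holding an `Arc<dyn Array>` can still borrow from it.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical miniature of an array trait that exposes `as_any` for downcasting.
trait Array {
    fn as_any(&self) -> &dyn Any;
}

struct Int8Array(Vec<i8>);

impl Array for Int8Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Borrowing `&dyn Array` is enough to downcast; no `Arc` is required.
fn first_i8(left: &dyn Array) -> Result<Option<i8>, String> {
    let left = left
        .as_any()
        .downcast_ref::<Int8Array>()
        .ok_or_else(|| String::from("Left array cannot be cast"))?;
    Ok(left.0.first().copied())
}

/// A caller that already holds an `Arc<dyn Array>` can borrow through it.
fn first_i8_shared(left: &Arc<dyn Array>) -> Result<Option<i8>, String> {
    // `&**left` dereferences the reference and the Arc, yielding `&dyn Array`.
    first_i8(&**left)
}
```

In this sketch a `&dyn Array` signature serves both kinds of caller, whereas an `Arc<dyn Array>` signature forces every caller to allocate or clone an `Arc` first. Which helper signatures the arrow crate offers at a given version is a separate question.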

Comment on lines 1044 to 1047
let left = left
.as_any()
.downcast_ref::<DictionaryArray<UInt8Type>>()
.unwrap();

You might be able to use https://docs.rs/arrow/6.4.0/arrow/array/fn.as_dictionary_array.html here too:

Suggested change
let left = left
.as_any()
.downcast_ref::<DictionaryArray<UInt8Type>>()
.unwrap();
let left = as_dictionary_array::<UInt8Type>(left);

/// value. Supports PrimitiveArrays, and DictionaryArrays that have primitive values
pub fn eq_dyn_scalar<T>(left: &dyn Array, right: T) -> Result<BooleanArray>
where
T: IntoArrowNumericType + TryInto<i128> + Copy + std::fmt::Debug,

If T is going to have TryInto<i128>, I wonder if we still need IntoArrowNumericType at all? I think it may not be necessary any more.

My thinking is that since dyn_compare_scalar converts right into i128 immediately there are then conversion rules back to all of the primitive types needed to call eq_scalar

Yes agree, I was starting to think the same. Will try it out.

matthewmturner commented Dec 23, 2021

@alamb to confirm, are you expecting this kernel to handle boolean and utf8 comparisons as well? Or is that only available with the future scalar API?

matthewmturner commented Dec 24, 2021

@alamb to confirm, we aren't handling floats with this kernel, right? My understanding is that our current T wouldn't work with them.

I've also done some cleaning up. Can you let me know if you think this is OK? In particular:

  1. It requires the user of the kernel to wrap the array in an Arc so that we can use the as_xx_array functions. We can put that in the kernel code itself if you don't think users should have to do that.
  2. I provided two signatures for dyn_compare_scalar: one for primitive and one for dictionary.

@matthewmturner
Contributor Author

I've added eq_dyn_utf8_scalar as well.

alamb left a comment

Thank you @matthewmturner -- I think this is looking great ❤️

DataType::Dictionary(key_type, _) => {
return dyn_compare_utf8_scalar!(&left, right, key_type, eq_utf8_scalar);
}
_ => {

this probably wants to match on the DataType::Utf8 and DataType::LargeUtf8 but otherwise looks good to me
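
A sketch of what explicit matching could look like follows. The `DataType` enum here is a hypothetical, trimmed-down stand-in defined locally, and `utf8_kernel_for` only reports which path would be taken rather than running a comparison.

```rust
/// Hypothetical, trimmed-down data type enum for illustration only.
#[derive(Debug)]
enum DataType {
    Int32,
    Utf8,
    LargeUtf8,
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Returns which comparison path a string-scalar kernel would take.
fn utf8_kernel_for(data_type: &DataType) -> Result<&'static str, String> {
    match data_type {
        // `&**value_type` borrows the boxed value type as a plain `&DataType`.
        DataType::Dictionary(_key_type, value_type) => match &**value_type {
            DataType::Utf8 | DataType::LargeUtf8 => Ok("dictionary"),
            other => Err(format!("Unsupported dictionary value type {:?}", other)),
        },
        DataType::Utf8 => Ok("utf8"),
        DataType::LargeUtf8 => Ok("large_utf8"),
        // Anything else is rejected instead of falling into a catch-all string path.
        other => Err(format!("Kernel only supports string arrays, got {:?}", other)),
    }
}
```

With the string types named explicitly, an unsupported input such as `Int32` produces an error instead of reaching a downcast that assumes a string array.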

fn test_eq_dyn_scalar() {
let array = Int32Array::from(vec![6, 7, 8, 8, 10]);
let array = Arc::new(array);
let a_eq = eq_dyn_scalar(array, 8).unwrap();

200

fn test_eq_dyn_utf8_scalar() {
let array = StringArray::from(vec!["abc", "def", "xyz"]);
let array = Arc::new(array);
let a_eq = eq_dyn_utf8_scalar(array, "xyz").unwrap();

🎉

alamb commented Dec 26, 2021

How shall we proceed -- get this one polished up (maybe also add eq_dyn_scalar_bool?) and then start hammering on the other kernels (e.g. neq, lt, gt, etc?)

@matthewmturner
Contributor Author

How shall we proceed -- get this one polished up (maybe also add eq_dyn_scalar_bool?) and then start hammering on the other kernels (e.g. neq, lt, gt, etc?)

Sure sounds good. Should I just add those on this PR?

@matthewmturner
Contributor Author

actually - are bool values currently supported with DictionaryArray?

I get the following when trying to make one:

a value of type `array_dictionary::DictionaryArray<datatypes::types::Int8Type>` cannot be built from an iterator over elements of type `bool`
the trait `FromIterator<bool>` is not implemented for `array_dictionary::DictionaryArray<datatypes::types::Int8Type>`

I also don't see any tests for a DictionaryArray that has bool as its value type.

@matthewmturner matthewmturner marked this pull request as ready for review December 26, 2021 17:57
@matthewmturner
Contributor Author

@alamb if you're okay with the latest changes and can merge, then I'll prioritize doing another PR with the other comparison functions.

Thank you for all the guidance you've provided on this!

DataType::Int8 => {
let right: i8 = right.try_into().map_err(|_| {
ArrowError::ComputeError(String::from(
"Can not convert scalar to i128",

Can not convert scalar to i128 -> Can not convert scalar to i8

It affects the messages below as well
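
One way to keep such messages from drifting out of sync with the target type is to derive the type name instead of typing it by hand. The sketch below uses only the standard library; `narrow_scalar` is a hypothetical helper, not part of the PR.

```rust
use std::any::type_name;
use std::convert::TryFrom;

/// Hypothetical helper: narrows an `i128` scalar to `N` and names `N` in the
/// error, so the message cannot disagree with the conversion being attempted.
fn narrow_scalar<N: TryFrom<i128>>(right: i128) -> Result<N, String> {
    N::try_from(right)
        .map_err(|_| format!("Can not convert scalar {} to {}", right, type_name::<N>()))
}
```

Each match arm could then call `narrow_scalar::<i8>(right)`, `narrow_scalar::<u16>(right)`, and so on. Note that `type_name` is documented as a diagnostic aid whose exact output is not guaranteed, which is usually acceptable for an error string.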

_ => Err(ArrowError::ComputeError(
"Kernel only supports PrimitiveArray or DictionaryArray with Primitive values".to_string(),
))
}
}

/// Perform `left == right` operation on an array and a string scalar
/// value. Supports StringArrays, and DictionaryArrays that have string values
pub fn eq_dyn_utf8_scalar(left: Arc<dyn Array>, right: &str) -> Result<BooleanArray> {

love it

eq_bool_scalar(left, right)
}
_ => Err(ArrowError::ComputeError(
"Kernel only supports BooleanArray".to_string(),

Yeah -- users can call cast if they want to convert their array into boolean first

alamb commented Dec 28, 2021

Sure sounds good. Should I just add those on this PR?

I recommend doing the code in multiple PRs to keep the reviews smaller.

actually - are bool values currently supported with DictionaryArray?

I think they would be valid per the arrow spec, but I don't think they would be very useful -- A dictionary array of bools will be (much) less efficient both in term of space and CPU than a BooleanArray. This is probably why there is no array builder for them https://docs.rs/arrow/6.4.0/arrow/array/struct.DictionaryArray.html?search=dictionarybuilder

I think it is fine to leave dictionary of bools unimplemented and people can implement them if they want
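
The space argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is a simplification: it assumes bit-packed boolean values and one-byte dictionary keys, and it ignores validity bitmaps and buffer padding. Both function names are hypothetical.

```rust
/// Bytes needed for a bit-packed boolean buffer of `len` values (1 bit each).
fn packed_bool_bytes(len: usize) -> usize {
    (len + 7) / 8
}

/// Bytes for a dictionary encoding with one-byte keys over the two distinct
/// values `false` and `true`: one key byte per row plus the packed values.
fn bool_dictionary_bytes(len: usize) -> usize {
    let key_bytes = len; // one byte per row
    let value_bytes = packed_bool_bytes(2); // `false` and `true`
    key_bytes + value_bytes
}
```

Under these assumptions, 1000 rows take 125 bytes bit-packed versus 1001 bytes dictionary-encoded, roughly eight times more, before counting the extra key lookup per row.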

alamb left a comment

The code looks great to me. I had a few small suggestions, but otherwise all good 👍

))),
}
}};
($LEFT: expr, $RIGHT: expr, $KT: ident, $OP: ident) => {{

Suggested change
($LEFT: expr, $RIGHT: expr, $KT: ident, $OP: ident) => {{
/// Applies `LEFT OP RIGHT` when `LEFT` is a `DictionaryArray` with keys of type `KT`
($LEFT: expr, $RIGHT: expr, $KT: ident, $OP: ident) => {{

@@ -898,6 +900,305 @@ pub fn gt_eq_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
compare_op_scalar!(left, right, |a, b| a >= b)
}

macro_rules! dyn_compare_scalar {
($LEFT: expr, $RIGHT: expr, $OP: ident) => {{

Suggested change
($LEFT: expr, $RIGHT: expr, $OP: ident) => {{
/// Applies `LEFT OP RIGHT` when `LEFT` is a `PrimitiveArray`
($LEFT: expr, $RIGHT: expr, $OP: ident) => {{

@matthewmturner
Contributor Author

@alamb updates made!

@matthewmturner
Contributor Author

seems im getting hit with that from_str_unchecked issue

alamb commented Dec 30, 2021

seems im getting hit with that from_str_unchecked issue

I think it has been resolved if you merge up from master again

github-actions bot added the arrow-flight (Changes to the arrow-flight crate) and parquet (Changes to the parquet crate) labels on Dec 30, 2021
@matthewmturner
Contributor Author

I'm having issues with rebasing. For whatever reason, when I rebase onto upstream master I'm losing all my recent commits. Still looking into it.

@matthewmturner
Contributor Author

After wandering through git hell for a bit, I think I figured it out. Hopefully CI passes and we're good here.

alamb left a comment

Looks great @matthewmturner -- shall I merge this as is, or do you want to apply the suggestions from @liukun4515 first (fix error messages)?

Then, it seems like the next step might be to file tickets (I can do so if you wish) for the other kernels (lt_dyn_scalar, gt_dyn_scalar, etc)

@matthewmturner
Contributor Author

@alamb odd, I had made updates for those suggestions. They must have been lost in my rebase.

I've made the fixes.

i can make an issue for it - and im hoping to work on implementing it this weekend.

do you have a preference for how small (i.e how many kernels) youd like each pr / issue to be?

alamb commented Dec 31, 2021

i can make an issue for it - and im hoping to work on implementing it this weekend.

Thanks @matthewmturner ❤️

do you have a preference for how small (i.e how many kernels) youd like each pr / issue to be?

My personal bias is one PR per kernel (as that makes it easiest to review) but I am also happy to review a single PR too

@liukun4515
Contributor

i can make an issue for it - and im hoping to work on implementing it this weekend.

Thanks @matthewmturner ❤️

do you have a preference for how small (i.e how many kernels) youd like each pr / issue to be?

My personal bias is one PR per kernel (as that makes it easiest to review) but I am also happy to review a single PR too

I think one PR per kernel is better.
A single PR including all the updates may be large, which is not friendly to reviewers.
@matthewmturner

liukun4515 left a comment

LGTM

@matthewmturner
Contributor Author

@alamb @liukun4515 one PR per kernel it is :)

@alamb alamb merged commit 0d825c1 into apache:master Jan 1, 2022
alamb commented Jan 1, 2022

Thanks again @matthewmturner and @liukun4515 -- this is great.

Labels
arrow: Changes to the arrow crate
arrow-flight: Changes to the arrow-flight crate
parquet: Changes to the parquet crate
5 participants