-
Notifications
You must be signed in to change notification settings - Fork 808
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ord): Support equality of StructArray #5217
feat(ord): Support equality of StructArray #5217
Conversation
And there are some points that I am not sure about
Maybe another question, how can I make these test codes shorter? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @my-vegetable-has-exploded -- this is looking pretty close to me
Can you please add tests for
distinct
/not_distinct
- A negative test that some operation like
lt
orlt_eq
returns an error (not a panic) for struct arrays? - A negative test that a struct array like
{a: int, b:int}
doesn't returntrue
when compared to a struct array with a prefix like `{a:int}
Also @tustvold do you have any suggestions for what benchmarks to run this on?
I think the output should be the union of all the null buffers. I'll try to review this in the next couple of days |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I think the nullbuffer for subarrays is only valid for the subarray itself. Take the Example Layout in the documentation as an example(https://arrow.apache.org/docs/format/Columnar.html#struct-layout), if use the union of all the null buffers, the second slot also gets null, which is a little different from my understanding.
|
Correct, but the semantic of these kernels is any comparison against a null results in null output for that position |
Sure.
I was wondering if it would be better to use Lines 226 to 284 in 9e060dc
|
That would be a different kernel then. We definitely could/should support distinct/not_distinct for StructArray also, the difference with standard equality is how nulls are handled. Distinct follow the intuitive notions of equality, the equality kernels follow the SQL formulation of equality and the somewhat perverse null semantics it has 😅 |
I feel like I have caught your drift this time. Because the comparison between None and any value is Unknown, So {null, 2} is also not comparable. Thanks, I will change my code based on this suggestion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a review, I like where this is headed but I don't think the null mask handling is quite right yet.
FWIW I'm not sure that separating out the null mask and values comparison makes sense, instead I would expect the logic to just recurse across the fields and union the null masks of the results (if any), with a little bit of extra logic to handle any null mask in the struct array proper.
arrow-ord/src/cmp.rs
Outdated
if l_t.is_nested() { | ||
if !l_t.equals_datatype(r_t) { | ||
return Err(ArrowError::InvalidArgumentError(format!( | ||
"Invalid comparison operation: {l_t} {op} {r_t}" | ||
))); | ||
} | ||
match (l_t, op) { | ||
(Struct(_), Op::Equal | Op::NotEqual | Op::Distinct | Op::NotDistinct) => {} | ||
_ => { | ||
return Err(ArrowError::InvalidArgumentError(format!( | ||
"Invalid comparison operation: {l_t} {op} {r_t}" | ||
))); | ||
} | ||
} | ||
} else if r_t != l_t { | ||
return Err(ArrowError::InvalidArgumentError(format!( | ||
"Invalid comparison operation: {l_t} {op} {r_t}" | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if l_t.is_nested() { | |
if !l_t.equals_datatype(r_t) { | |
return Err(ArrowError::InvalidArgumentError(format!( | |
"Invalid comparison operation: {l_t} {op} {r_t}" | |
))); | |
} | |
match (l_t, op) { | |
(Struct(_), Op::Equal | Op::NotEqual | Op::Distinct | Op::NotDistinct) => {} | |
_ => { | |
return Err(ArrowError::InvalidArgumentError(format!( | |
"Invalid comparison operation: {l_t} {op} {r_t}" | |
))); | |
} | |
} | |
} else if r_t != l_t { | |
return Err(ArrowError::InvalidArgumentError(format!( | |
"Invalid comparison operation: {l_t} {op} {r_t}" | |
))); | |
} | |
if !l_t.equals_datatype(r_t) { | |
return Err(ArrowError::InvalidArgumentError(format!( | |
"Invalid comparison operation: {l_t} {op} {r_t}" | |
))); | |
} |
arrow-ord/src/cmp.rs
Outdated
let l_t = l.data_type(); | ||
let r_t = r.data_type(); | ||
let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0); | ||
let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0); | ||
// for [not]Distinct, the result is never null | ||
match op { | ||
Op::Distinct | Op::NotDistinct => { | ||
return Ok(None); | ||
} | ||
_ => {} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let l_t = l.data_type(); | |
let r_t = r.data_type(); | |
let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0); | |
let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0); | |
// for [not]Distinct, the result is never null | |
match op { | |
Op::Distinct | Op::NotDistinct => { | |
return Ok(None); | |
} | |
_ => {} | |
} | |
if matches!(op, Op::Distinct | Op::NotDistinct) { | |
// for [not]Distinct, the result is never null | |
return Ok(None) | |
} | |
let l_t = l.data_type(); | |
let r_t = r.data_type(); | |
let l_nulls = l.logical_nulls().filter(|n| n.null_count() > 0); | |
let r_nulls = r.logical_nulls().filter(|n| n.null_count() > 0); |
arrow-ord/src/cmp.rs
Outdated
// when one of field is equal, the result is false for not equal | ||
// so we use neg to reverse the result of equal when handle not equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just pass the operator into compare_op_values?
arrow-ord/src/cmp.rs
Outdated
.columns() | ||
.iter() | ||
.zip(r.columns().iter()) | ||
.map(|(col_l, col_r)| compare_op_values(Op::Equal, col_l, l_s, col_r, r_s, len)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this will correctly handle the null masks for a Distinct?
arrow-ord/src/cmp.rs
Outdated
Some(vec![true, false, true, true].into()), | ||
)); | ||
let right_a = Arc::new(Int32Array::new( | ||
vec![0, 1, 2, 3].into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vec![0, 1, 2, 3].into(), | |
vec![0, 72, 2, 3].into(), |
This helps verify the null mask comparison is correct, and not relying on the values comparison
], | ||
Buffer::from([0b0111]), | ||
)); | ||
let right_struct = StructArray::from(( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let right_struct = StructArray::from(( | |
// right [{a: 0, b: 0}, {a: NULL, b: 1}, {a: 2, b: 2}, {a: 3, b: 3} ] | |
let right_struct = StructArray::from(( |
)); | ||
let field_a = Arc::new(Field::new("a", DataType::Int32, true)); | ||
let field_b = Arc::new(Field::new("b", DataType::Int32, true)); | ||
let left_struct = StructArray::from(( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let left_struct = StructArray::from(( | |
// [{a: 0, b: 0}, {a: NULL, b: 1}, {a: 2, b: 20}, {a: 3, b: 3}] | |
let left_struct = StructArray::from(( |
I wanted to do that at first. The main reason is that I found it's hard to reuse those codes Lines 225 to 285 in 3cd6da0
If there is a better way to organize those codes, I'd like to have a try! Thanks a lot! |
It should be possible to just call compare_op recursively |
I'll have a play later today/tomorrow and see if I can't simplify this a bit |
Thanks a lot,I’m sorry to add to your workload. |
Hey @tustvold & @my-vegetable-has-exploded , do we know the status of this PR now? It's been open for a bit and it seems there has been another PR for the same issue in the meantime, #5423, so wondering if efforts should be focused on a single PR? Otherwise can keep both open but mark this as draft as there hasn't been movement for a bit? |
Sorry this is partly on me, I'm somewhat struggling to keep up with all the various things going on. I think my preference is towards something along the lines of #5672 which would allow us to handle StructArray more comprehensively in the comparison kernels, instead of having non-trivial logic just for the case of equality. I think let's mark this as a draft and I will try to find sometime next week to sort something out in this space |
Which issue does this PR close?
Closes #5199
Rationale for this change
What changes are included in this PR?
values()
to functioncompare_op_struct_values()
Are there any user-facing changes?