-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ArrayEqual for UnionArray #1469
Changes from 7 commits
25be3f3
f47151b
d896a9b
017a1ae
48f070f
9621ceb
11564d0
c4ee09a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,171 @@ | ||
// Licensed to the Apache Software Foundation (ASF) under one | ||
// or more contributor license agreements. See the NOTICE file | ||
// distributed with this work for additional information | ||
// regarding copyright ownership. The ASF licenses this file | ||
// to you under the Apache License, Version 2.0 (the | ||
// "License"); you may not use this file except in compliance | ||
// with the License. You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, | ||
// software distributed under the License is distributed on an | ||
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
// KIND, either express or implied. See the License for the | ||
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
use crate::datatypes::Field; | ||
use crate::{ | ||
array::ArrayData, buffer::Buffer, datatypes::DataType, datatypes::UnionMode, | ||
}; | ||
|
||
use super::{ | ||
equal_range, equal_values, utils::child_logical_null_buffer, utils::equal_nulls, | ||
}; | ||
|
||
// Checks if corresponding slots in two UnionArrays are same data types | ||
fn equal_types( | ||
lhs_fields: &[Field], | ||
rhs_fields: &[Field], | ||
lhs_type_ids: &[i8], | ||
rhs_type_ids: &[i8], | ||
) -> bool { | ||
let lhs_slots_types = lhs_type_ids | ||
.iter() | ||
.map(|type_id| lhs_fields.get(*type_id as usize).unwrap().data_type()); | ||
|
||
let rhs_slots_types = rhs_type_ids | ||
.iter() | ||
.map(|type_id| rhs_fields.get(*type_id as usize).unwrap().data_type()); | ||
|
||
lhs_slots_types.zip(rhs_slots_types).all(|(l, r)| l == r) | ||
} | ||
|
||
fn equal_dense( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Much like equal_sparse below, by the time we are calling this method we have already established equality of the type_ids, I therefore wonder if we can simply compute equality of the offsets arrays, followed by computing equality of the underlying child arrays. This appears to be similar to what is done for list_equal 🤔 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Even type_ids equality is established, lhs and rhs start positions might be different. So we cannot simply compare offsets arrays and child arrays for dense case. |
||
lhs: &ArrayData, | ||
rhs: &ArrayData, | ||
lhs_type_ids: &[i8], | ||
rhs_type_ids: &[i8], | ||
lhs_offsets: &[i32], | ||
rhs_offsets: &[i32], | ||
) -> bool { | ||
let offsets = lhs_offsets.iter().zip(rhs_offsets.iter()); | ||
|
||
lhs_type_ids | ||
.iter() | ||
.zip(rhs_type_ids.iter()) | ||
.zip(offsets) | ||
.all(|((l_type_id, r_type_id), (l_offset, r_offset))| { | ||
let lhs_values = &lhs.child_data()[*l_type_id as usize]; | ||
let rhs_values = &rhs.child_data()[*r_type_id as usize]; | ||
|
||
equal_values( | ||
lhs_values, | ||
rhs_values, | ||
None, | ||
None, | ||
*l_offset as usize, | ||
*r_offset as usize, | ||
1, | ||
) | ||
}) | ||
} | ||
|
||
#[allow(clippy::too_many_arguments)] | ||
fn equal_sparse( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. By the time we are calling this method we have established that
As such I'm surprised this can't just do a standard equality comparison of the child data, which will take into null buffers of the children There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is related to above question. As field and type_id can be changed in position, but the resulting union array is still the same, we cannot directly compare them, but can only compare child array corresponding to type_id. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suppose it really does come down to "what does it mean for two Can it ever be true that Given my reading of the cpp Maybe we could follow the cpp implementation -- both for consistency as well as for simplicity? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, if it is the case, we can simplify the comparison here. What I have in mind and implemented here can be thought as logically equal. That means corresponding slots of two union arrays are equal, but their physical layout may be different -- type_ids and fields can be different separately. If we only want to compare the physical layout, then fields and type_ids are required to be exactly equal. Thanks for pointing out the cpp implementation. We should follow it, I agree. I will revise this change for that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
lhs: &ArrayData, | ||
rhs: &ArrayData, | ||
lhs_nulls: Option<&Buffer>, | ||
rhs_nulls: Option<&Buffer>, | ||
lhs_type_ids: &[i8], | ||
rhs_type_ids: &[i8], | ||
lhs_start: usize, | ||
rhs_start: usize, | ||
) -> bool { | ||
lhs_type_ids | ||
.iter() | ||
.zip(rhs_type_ids.iter()) | ||
.enumerate() | ||
.all(|(index, (l_type_id, r_type_id))| { | ||
let lhs_values = &lhs.child_data()[*l_type_id as usize]; | ||
let rhs_values = &rhs.child_data()[*r_type_id as usize]; | ||
|
||
// merge the null data | ||
let lhs_merged_nulls = child_logical_null_buffer(lhs, lhs_nulls, lhs_values); | ||
let rhs_merged_nulls = child_logical_null_buffer(rhs, rhs_nulls, rhs_values); | ||
|
||
equal_range( | ||
lhs_values, | ||
rhs_values, | ||
lhs_merged_nulls.as_ref(), | ||
rhs_merged_nulls.as_ref(), | ||
lhs_start + index, | ||
rhs_start + index, | ||
1, | ||
) | ||
}) | ||
} | ||
|
||
pub(super) fn union_equal( | ||
lhs: &ArrayData, | ||
rhs: &ArrayData, | ||
lhs_nulls: Option<&Buffer>, | ||
rhs_nulls: Option<&Buffer>, | ||
lhs_start: usize, | ||
rhs_start: usize, | ||
len: usize, | ||
) -> bool { | ||
let lhs_type_ids = lhs.buffer::<i8>(0); | ||
let rhs_type_ids = rhs.buffer::<i8>(0); | ||
|
||
match (lhs.data_type(), rhs.data_type()) { | ||
( | ||
DataType::Union(lhs_fields, UnionMode::Dense), | ||
DataType::Union(rhs_fields, UnionMode::Dense), | ||
) => { | ||
let lhs_offsets = lhs.buffer::<i32>(1); | ||
let rhs_offsets = rhs.buffer::<i32>(1); | ||
|
||
let lhs_type_id_range = &lhs_type_ids[lhs_start..lhs_start + len]; | ||
let rhs_type_id_range = &rhs_type_ids[rhs_start..rhs_start + len]; | ||
|
||
let lhs_offsets_range = &lhs_offsets[lhs_start..lhs_start + len]; | ||
let rhs_offsets_range = &rhs_offsets[rhs_start..rhs_start + len]; | ||
|
||
// nullness is kept in the parent UnionArray, so we compare its nulls here | ||
equal_types(lhs_fields, rhs_fields, lhs_type_ids, rhs_type_ids) | ||
&& equal_nulls(lhs, rhs, lhs_nulls, rhs_nulls, lhs_start, rhs_start, len) | ||
&& equal_dense( | ||
lhs, | ||
rhs, | ||
lhs_type_id_range, | ||
rhs_type_id_range, | ||
lhs_offsets_range, | ||
rhs_offsets_range, | ||
) | ||
} | ||
( | ||
DataType::Union(lhs_fields, UnionMode::Sparse), | ||
DataType::Union(rhs_fields, UnionMode::Sparse), | ||
) => { | ||
let lhs_type_id_range = &lhs_type_ids[lhs_start..lhs_start + len]; | ||
let rhs_type_id_range = &rhs_type_ids[rhs_start..rhs_start + len]; | ||
|
||
equal_types(lhs_fields, rhs_fields, lhs_type_ids, rhs_type_ids) | ||
&& equal_sparse( | ||
lhs, | ||
rhs, | ||
lhs_nulls, | ||
rhs_nulls, | ||
lhs_type_id_range, | ||
rhs_type_id_range, | ||
lhs_start, | ||
rhs_start, | ||
) | ||
} | ||
_ => unimplemented!( | ||
"Logical equality not yet implemented between dense and sparse union arrays" | ||
), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to be equal the two arrays must have the same schema and therefore
lhs_fields == rhs_fields
I think therefore this could simply just compare the type_id arrays for equality directly?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, even type_ids are different, isn't it still possible that the union arrays are equal?
For example, the fields are int, int, float, string. If lhs type_ids is 1, but rhs type_id is 0, they are both int. We don't know if the actual element at child arrays are equal or not.
So here I compare if the type_ids are the same type or not, not its values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And two union arrays could be equal even if two fields/type_ids are different. For example, we can change the field position and type_id, but the resulting union array is still the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of two fields with the same type, I believe these must still have distinct field names and therefore there is a logical difference between values stored in one vs the other. I'm not sure about this though, possibly worth an upstream clarification 🤔
On the case of different field ordering the arrays can never compare equal as they have different data types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
equal_types
was removed now.