-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix PartialOrd for ScalarValue::List/FixSizeList/LargeList #8253
Conversation
datafusion/common/src/utils.rs
Outdated
@@ -390,6 +390,16 @@ pub fn arrays_into_list_array( | |||
)) | |||
} | |||
|
|||
/// Get the child arrays from a `ArrayRef`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jayzhan211
I'm thinking of naming it nested instead of children, what do you think?
And would be super helpful to have a doc test or example input/output in the comments
datafusion/common/src/utils.rs
Outdated
@@ -390,6 +390,16 @@ pub fn arrays_into_list_array( | |||
)) | |||
} | |||
|
|||
/// Get the child arrays from a `ArrayRef`. | |||
pub fn array_into_children_array_vec(list_arr: &ArrayRef) -> Vec<ArrayRef> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this handles nulls and sliced arrays incorrectly, as it ignores the offsets and nulls on the parent. Perhaps you could just use the comparison kernels on the lists directly, instead of partially decomposing them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold compare_op
does not work on nested, any other comparison kernel I can consider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Err(InvalidArgumentError("Invalid comparison operation: List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) < List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })"))
let lt_res =
arrow::compute::kernels::cmp::lt(&arr1, &arr2);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which case do we need to support this on scalar value at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which case do we need to support this on scalar value at all?
Probably none. I support it since it has supported before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me split it to two function for now.
754f39c
to
8b6d60f
Compare
Three of them all have the same logic now, I think it is fine we only have unit test for one of them (List). |
datafusion/common/src/scalar.rs
Outdated
@@ -3458,6 +3436,7 @@ impl ScalarType<i64> for TimestampNanosecondType { | |||
} | |||
|
|||
#[cfg(test)] | |||
#[cfg(feature = "parquet")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To enable running test. See #8250
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I look at the linked issue, but I still don't exactly understand what this file has to do with the parquet
feature. It looks a design smell, but maybe I'm missing some context. @alamb, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also puzzled by this but haven't spent the time to figure out what in scalar.rs
is referring to parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove this cfg prior to merging
#[cfg(feature = "parquet")] |
} else if let Some(arr) = arr.as_fixed_size_list_opt() { | ||
arr.value(0) | ||
} else { | ||
unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer internal error here
unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen") | |
internal_err!("Since only List / LargeList / FixedSizeList are supported, this should never happen") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think unreachable
is the better choice in this case, otherwise when should we use unreachable 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'This was likely caused by a bug in DataFusion's code and we would welcome that you file a bug report in our issue tracker'.
I rechecked the Internal Error definition, which is for an unobserved bug report. Because here is an if-else branch, it would be more proper for internal error 🤔 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should have internal_err for an unobserved bug report. If we can't ensure the value we will get, I think internal_err
is appropriate, but in this case, we have type check already, so I think it is ok to just panic if we got to this point. The code should never reach that line unless rust compiler or arrow::DataType is broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree panic'ing is find at this case as the types are checked in the match arms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jayzhan211 and @Weijun-H -- I think this PR is ready to go and is a nice improvement
Ideally we would remove the #cfg
prior to merging as suggested by @ozankabak in https://github.com/apache/arrow-datafusion/pull/8253/files#r1406116258
if list_arr1.len() != list_arr2.len() { | ||
return None; | ||
|
||
let arr1 = first_array_for_list(arr1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code would probably be faster and simpler if it used the single lt_eq
kernel: https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.lt_eq.html
However, i see this just follows the existing logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With lt_eq, I think we still need to differentiate lt and eq, with either eq or lt.
datafusion/common/src/scalar.rs
Outdated
@@ -3458,6 +3436,7 @@ impl ScalarType<i64> for TimestampNanosecondType { | |||
} | |||
|
|||
#[cfg(test)] | |||
#[cfg(feature = "parquet")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove this cfg prior to merging
#[cfg(feature = "parquet")] |
} else if let Some(arr) = arr.as_fixed_size_list_opt() { | ||
arr.value(0) | ||
} else { | ||
unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree panic'ing is find at this case as the types are checked in the match arms
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
8b6d60f
to
74e60c1
Compare
I merged up from main, and once all tests pass I think this PR will be good to merge |
Thanks again @jayzhan211 |
* list cmp Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * remove cfg Signed-off-by: jayzhan211 <jayzhan211@gmail.com> --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Bugs in #8221
FixedsizeList should converted by
as_fixed_size_list
notas_list_array
. However, I just found we can get the child data so we dont need to downcast array via their type.Rationale for this change
What changes are included in this PR?
Are these changes tested?
Existing test
test_list_partial_cmp
I removed test that array length is greater than 1 since ScalarValue::List should contains length 1 array.
Are there any user-facing changes?