Fix PartialOrd for ScalarValue::List/FixSizeList/LargeList #8253

jayzhan211 · 2023-11-17T15:01:48Z

Which issue does this PR close?

Bugs in #8221

FixedSizeList should be converted with as_fixed_size_list, not as_list_array. However, I just found that we can get the child data directly, so we don't need to downcast the array based on its type.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Existing test test_list_partial_cmp
I removed the test where the array length is greater than 1, since ScalarValue::List should contain an array of length 1.

Are there any user-facing changes?

comphead · 2023-11-17T16:46:13Z

datafusion/common/src/utils.rs

@@ -390,6 +390,16 @@ pub fn arrays_into_list_array(
    ))
 }

+/// Get the child arrays from a `ArrayRef`.


Thanks @jayzhan211
I'm thinking of naming it nested instead of children, what do you think?

It would also be super helpful to have a doc test or example input/output in the comments.

tustvold · 2023-11-17T18:51:57Z

datafusion/common/src/utils.rs

@@ -390,6 +390,16 @@ pub fn arrays_into_list_array(
    ))
 }

+/// Get the child arrays from a `ArrayRef`.
+pub fn array_into_children_array_vec(list_arr: &ArrayRef) -> Vec<ArrayRef> {


I think this handles nulls and sliced arrays incorrectly, as it ignores the offsets and nulls on the parent. Perhaps you could just use the comparison kernels on the lists directly, instead of partially decomposing them?
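A minimal sketch of the concern raised above, using a toy std-only model rather than arrow-rs types. `ToyListArray`, `raw_child`, `value`, and `sample` are hypothetical names introduced only for illustration. The model assumes a sliced list array keeps its full child buffer and only narrows its offsets and validity, so reading the child buffer directly returns data from rows outside the slice and cannot see null rows.

```rust
/// Toy stand-in for a list array (NOT the arrow-rs `ListArray` type).
/// Row `i` covers `values[offsets[i]..offsets[i + 1]]`;
/// `validity[i] == false` marks row `i` as null.
#[derive(Debug, Clone)]
struct ToyListArray {
    values: Vec<i64>,    // flattened child values shared by all rows
    offsets: Vec<usize>, // length = number of rows + 1
    validity: Vec<bool>, // length = number of rows
}

impl ToyListArray {
    /// Slice the parent: the child buffer is kept whole (cloned here for
    /// simplicity), only the offsets and validity are narrowed.
    fn slice(&self, start: usize, len: usize) -> ToyListArray {
        ToyListArray {
            values: self.values.clone(),
            offsets: self.offsets[start..=start + len].to_vec(),
            validity: self.validity[start..start + len].to_vec(),
        }
    }

    /// Naive access: hand back the raw child buffer, ignoring the parent's
    /// offsets and nulls.
    fn raw_child(&self) -> &[i64] {
        &self.values
    }

    /// Parent-aware access: honor both the validity and the offsets of the row.
    fn value(&self, row: usize) -> Option<&[i64]> {
        if !self.validity[row] {
            return None;
        }
        Some(&self.values[self.offsets[row]..self.offsets[row + 1]])
    }
}

/// Three rows: [1, 2], null, [3, 4, 5].
fn sample() -> ToyListArray {
    ToyListArray {
        values: vec![1, 2, 3, 4, 5],
        offsets: vec![0, 2, 2, 5],
        validity: vec![true, false, true],
    }
}
```

In this model, slicing `sample()` down to its last row and calling `raw_child()` still yields all five child values, while `value(0)` yields only the three values that belong to that row; slicing to the middle row gives `None` from `value(0)` but the full buffer from `raw_child()`.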

@tustvold compare_op does not work on nested types. Is there any other comparison kernel I could consider?

Err(InvalidArgumentError("Invalid comparison operation: List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) < List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })"))

let lt_res = arrow::compute::kernels::cmp::lt(&arr1, &arr2);

In which case do we need to support this on scalar value at all?

In which case do we need to support this on scalar value at all?

Probably none. I kept supporting it because it was supported before.

Let me split it into two functions for now.

jayzhan211 · 2023-11-23T01:24:37Z

All three of them share the same logic now, so I think it is fine to have a unit test for only one of them (List).

ozankabak · 2023-11-27T06:37:24Z

datafusion/common/src/scalar.rs

@@ -3458,6 +3436,7 @@ impl ScalarType<i64> for TimestampNanosecondType {
 }

 #[cfg(test)]
+#[cfg(feature = "parquet")]


Why do we have this?

To enable running the tests. See #8250

I looked at the linked issue, but I still don't exactly understand what this file has to do with the parquet feature. It looks like a design smell, but maybe I'm missing some context. @alamb, WDYT?

I am also puzzled by this but haven't spent the time to figure out what in scalar.rs is referring to parquet

I think we should remove this cfg prior to merging

Suggested change

#[cfg(feature = "parquet")]

Weijun-H · 2023-11-27T09:05:21Z

datafusion/common/src/scalar.rs

+                    } else if let Some(arr) = arr.as_fixed_size_list_opt() {
+                        arr.value(0)
+                    } else {
+                        unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen")


I prefer an internal error here

Suggested change

unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen")

internal_err!("Since only List / LargeList / FixedSizeList are supported, this should never happen")

But I think unreachable is the better choice in this case; otherwise, when should we ever use unreachable? 😕

'This was likely caused by a bug in DataFusion's code and we would welcome that you file a bug report in our issue tracker'.

I rechecked the Internal Error definition: it is meant for reporting a bug that has not been observed before. Since this is an if-else branch, an internal error would be more appropriate here 🤔 .

I agree we should use internal_err for reporting an unobserved bug. If we can't be sure which value we will get, I think internal_err is appropriate, but in this case we already have a type check, so I think it is OK to just panic if we get to this point. The code should never reach that line unless the Rust compiler or arrow::DataType is broken.

I agree panicking is fine in this case, as the types are checked in the match arms
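To make the two options in this thread concrete, here is a hedged std-only sketch. `ToyArray` and both functions are hypothetical stand-ins, not DataFusion or arrow-rs APIs, and a plain `String` error stands in for what `internal_err!` would produce. The first function surfaces the "impossible" branch as an error the caller must handle; the second panics via `unreachable!`, relying on the caller having already checked the type.

```rust
/// Hypothetical stand-in for the array kinds discussed in the thread.
#[derive(Debug, Clone)]
enum ToyArray {
    List(Vec<i64>),
    LargeList(Vec<i64>),
    FixedSizeList(Vec<i64>),
    Int(i64),
}

/// Recoverable style: an unexpected kind becomes an `Err`, so every caller
/// has to propagate or handle it.
fn first_values_checked(arr: &ToyArray) -> Result<&[i64], String> {
    match arr {
        ToyArray::List(v) | ToyArray::LargeList(v) | ToyArray::FixedSizeList(v) => {
            Ok(v.as_slice())
        }
        other => Err(format!("Internal error: unexpected array kind {:?}", other)),
    }
}

/// Panicking style: the caller guarantees that only list kinds reach this
/// point, so the remaining arm documents an invariant instead of an error path.
fn first_values_unchecked(arr: &ToyArray) -> &[i64] {
    match arr {
        ToyArray::List(v) | ToyArray::LargeList(v) | ToyArray::FixedSizeList(v) => {
            v.as_slice()
        }
        ToyArray::Int(_) => unreachable!("only list kinds are passed here"),
    }
}
```

The trade-off mirrors the discussion: the checked variant changes the return type and pushes error handling onto every caller, while the unchecked variant keeps the signature simple at the cost of a panic if the invariant is ever broken.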

alamb

Thank you @jayzhan211 and @Weijun-H -- I think this PR is ready to go and is a nice improvement

Ideally we would remove the #[cfg] prior to merging, as suggested by @ozankabak in https://github.com/apache/arrow-datafusion/pull/8253/files#r1406116258

alamb · 2023-12-05T21:16:45Z

datafusion/common/src/scalar.rs

-                    if list_arr1.len() != list_arr2.len() {
-                        return None;
+
+                let arr1 = first_array_for_list(arr1);


This code would probably be faster and simpler if it used the single lt_eq kernel: https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.lt_eq.html

However, I see this just follows the existing logic

Even with lt_eq, I think we would still need a second kernel call (either eq or lt) to tell less-than apart from equal.
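A small std-only sketch of why a single lt_eq result is not enough to produce an `Ordering`. The `toy_*` functions are hypothetical element-wise stand-ins for the arrow comparison kernels, and `ordering_from_lt_and_eq` is an illustrative helper, not the code in this PR. Skipping null slots (`None`) is an assumption of this sketch.

```rust
use std::cmp::Ordering;

/// Toy element-wise "kernels" over plain slices; `None` would mark a null slot.
fn toy_lt(a: &[i64], b: &[i64]) -> Vec<Option<bool>> {
    a.iter().zip(b).map(|(x, y)| Some(x < y)).collect()
}

fn toy_eq(a: &[i64], b: &[i64]) -> Vec<Option<bool>> {
    a.iter().zip(b).map(|(x, y)| Some(x == y)).collect()
}

fn toy_lt_eq(a: &[i64], b: &[i64]) -> Vec<Option<bool>> {
    a.iter().zip(b).map(|(x, y)| Some(x <= y)).collect()
}

/// Derive a lexicographic ordering from element-wise `lt` and `eq` results.
/// The first position that is strictly less or strictly greater decides.
fn ordering_from_lt_and_eq(lt: &[Option<bool>], eq: &[Option<bool>]) -> Option<Ordering> {
    if lt.len() != eq.len() {
        return None;
    }
    for (l, e) in lt.iter().zip(eq) {
        match (*l, *e) {
            (Some(true), _) => return Some(Ordering::Less),
            (Some(false), Some(false)) => return Some(Ordering::Greater),
            _ => {} // equal or null at this position: keep scanning
        }
    }
    Some(Ordering::Equal)
}
```

With `toy_lt_eq` alone, comparing `[1, 2]` against `[1, 2]` and against `[1, 3]` produces the same all-true result, so the equal and less-than cases cannot be told apart without a second comparison.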

alamb · 2023-12-05T21:17:18Z

datafusion/common/src/scalar.rs

@@ -3458,6 +3436,7 @@ impl ScalarType<i64> for TimestampNanosecondType {
 }

 #[cfg(test)]
+#[cfg(feature = "parquet")]


I think we should remove this cfg prior to merging

Suggested change

#[cfg(feature = "parquet")]

alamb · 2023-12-05T21:17:52Z

datafusion/common/src/scalar.rs

+                    } else if let Some(arr) = arr.as_fixed_size_list_opt() {
+                        arr.value(0)
+                    } else {
+                        unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen")


I agree panicking is fine in this case, as the types are checked in the match arms

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

alamb · 2023-12-06T22:34:21Z

I merged up from main, and once all tests pass I think this PR will be good to merge

alamb · 2023-12-07T15:23:05Z

Thanks again @jayzhan211

* list cmp

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* remove cfg

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

jayzhan211 changed the title ~~Fix PartialOrd in ScalarValue::FixSizeList~~ Fix PartialOrd for ScalarValue::FixSizeList Nov 17, 2023

comphead reviewed Nov 17, 2023

tustvold reviewed Nov 17, 2023

jayzhan211 requested review from comphead and tustvold November 21, 2023 01:12

This was referenced Nov 22, 2023

support LargeList for arrow_cast, support ScalarValue::LargeList #8290

Merged

Minor: refactor List PartialOrd #8303

Closed

jayzhan211 changed the title ~~Fix PartialOrd for ScalarValue::FixSizeList~~ Fix PartialOrd for ScalarValue::List/FixSizeList/LargeList Nov 23, 2023

jayzhan211 marked this pull request as draft November 23, 2023 00:55

jayzhan211 force-pushed the fixsizelist-cmp branch from 754f39c to 8b6d60f Compare November 23, 2023 01:21

jayzhan211 marked this pull request as ready for review November 23, 2023 01:24

ozankabak reviewed Nov 27, 2023

Weijun-H reviewed Nov 27, 2023

jayzhan211 mentioned this pull request Nov 28, 2023

Convert list array and non-list array to scalars #7862

Closed

alamb approved these changes Dec 5, 2023

alamb mentioned this pull request Dec 5, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 4, 2023 #8420

Closed

7 tasks

jayzhan211 added 2 commits December 6, 2023 07:26

list cmp

aee7c4c

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

remove cfg

74e60c1

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

jayzhan211 force-pushed the fixsizelist-cmp branch from 8b6d60f to 74e60c1 Compare December 5, 2023 23:30

jayzhan211 requested a review from alamb December 5, 2023 23:31

Merge remote-tracking branch 'apache/main' into fixsizelist-cmp

f3bf927

alamb merged commit 5e8b0e0 into apache:main Dec 7, 2023
22 checks passed

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed
