Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PartialOrd for ScalarValue::List/FixSizeList/LargeList #8253

Merged
merged 3 commits into from
Dec 7, 2023

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Nov 17, 2023

Which issue does this PR close?

Bugs in #8221

FixedsizeList should converted by as_fixed_size_list not as_list_array. However, I just found we can get the child data so we dont need to downcast array via their type.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Existing test test_list_partial_cmp
I removed test that array length is greater than 1 since ScalarValue::List should contains length 1 array.

Are there any user-facing changes?

@jayzhan211 jayzhan211 changed the title Fix PartialOrd in ScalarValue::FixSizeList Fix PartialOrd for ScalarValue::FixSizeList Nov 17, 2023
@@ -390,6 +390,16 @@ pub fn arrays_into_list_array(
))
}

/// Get the child arrays from a `ArrayRef`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211
I'm thinking of naming it nested instead of children, what do you think?

And would be super helpful to have a doc test or example input/output in the comments

@@ -390,6 +390,16 @@ pub fn arrays_into_list_array(
))
}

/// Get the child arrays from a `ArrayRef`.
pub fn array_into_children_array_vec(list_arr: &ArrayRef) -> Vec<ArrayRef> {
Copy link
Contributor

@tustvold tustvold Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this handles nulls and sliced arrays incorrectly, as it ignores the offsets and nulls on the parent. Perhaps you could just use the comparison kernels on the lists directly, instead of partially decomposing them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold compare_op does not work on nested, any other comparison kernel I can consider?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err(InvalidArgumentError("Invalid comparison operation: List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) < List(Field { name: \"item\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })"))
                    let lt_res =
                            arrow::compute::kernels::cmp::lt(&arr1, &arr2);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case do we need to support this on scalar value at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case do we need to support this on scalar value at all?

Probably none. I support it since it has supported before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me split it to two function for now.

@jayzhan211 jayzhan211 changed the title Fix PartialOrd for ScalarValue::FixSizeList Fix PartialOrd for ScalarValue::List/FixSizeList/LargeList Nov 23, 2023
@jayzhan211 jayzhan211 marked this pull request as draft November 23, 2023 00:55
@jayzhan211
Copy link
Contributor Author

Three of them all have the same logic now, I think it is fine we only have unit test for one of them (List).

@jayzhan211 jayzhan211 marked this pull request as ready for review November 23, 2023 01:24
@@ -3458,6 +3436,7 @@ impl ScalarType<i64> for TimestampNanosecondType {
}

#[cfg(test)]
#[cfg(feature = "parquet")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To enable running test. See #8250

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I look at the linked issue, but I still don't exactly understand what this file has to do with the parquet feature. It looks a design smell, but maybe I'm missing some context. @alamb, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also puzzled by this but haven't spent the time to figure out what in scalar.rs is referring to parquet

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this cfg prior to merging

Suggested change
#[cfg(feature = "parquet")]

} else if let Some(arr) = arr.as_fixed_size_list_opt() {
arr.value(0)
} else {
unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer internal error here

Suggested change
unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen")
internal_err!("Since only List / LargeList / FixedSizeList are supported, this should never happen")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think unreachable is the better choice in this case, otherwise when should we use unreachable 😕

Copy link
Member

@Weijun-H Weijun-H Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'This was likely caused by a bug in DataFusion's code and we would welcome that you file a bug report in our issue tracker'.

I rechecked the Internal Error definition, which is for an unobserved bug report. Because here is an if-else branch, it would be more proper for internal error 🤔 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should have internal_err for an unobserved bug report. If we can't ensure the value we will get, I think internal_err is appropriate, but in this case, we have type check already, so I think it is ok to just panic if we got to this point. The code should never reach that line unless rust compiler or arrow::DataType is broken.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree panic'ing is find at this case as the types are checked in the match arms

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 and @Weijun-H -- I think this PR is ready to go and is a nice improvement

Ideally we would remove the #cfg prior to merging as suggested by @ozankabak in https://github.com/apache/arrow-datafusion/pull/8253/files#r1406116258

if list_arr1.len() != list_arr2.len() {
return None;

let arr1 = first_array_for_list(arr1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code would probably be faster and simpler if it used the single lt_eq kernel: https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.lt_eq.html

However, i see this just follows the existing logic

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With lt_eq, I think we still need to differentiate lt and eq, with either eq or lt.

@@ -3458,6 +3436,7 @@ impl ScalarType<i64> for TimestampNanosecondType {
}

#[cfg(test)]
#[cfg(feature = "parquet")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this cfg prior to merging

Suggested change
#[cfg(feature = "parquet")]

} else if let Some(arr) = arr.as_fixed_size_list_opt() {
arr.value(0)
} else {
unreachable!("Since only List / LargeList / FixedSizeList are supported, this should never happen")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree panic'ing is find at this case as the types are checked in the match arms

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@alamb
Copy link
Contributor

alamb commented Dec 6, 2023

I merged up from main, and once all tests pass I think this PR will be good to merge

@alamb alamb merged commit 5e8b0e0 into apache:main Dec 7, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 7, 2023

Thanks again @jayzhan211

appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
* list cmp

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* remove cfg

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants