Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change ScalarValue::List to store ArrayRef #7629

Merged
merged 74 commits into from
Oct 16, 2023

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Sep 23, 2023

Which issue does this PR close?

Closes #7352
Closes #996

Rationale for this change

What changes are included in this PR?

Old ScalarValue::List is replaced with ListArray. convert_array_to_scalar_vec and scalars_to_list_array are two core utilities function that transform between Vec and ListArray.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels Sep 23, 2023
@jayzhan211 jayzhan211 changed the title Arrayref for scalarvaluelist Make ScalarValue an ArrayRef Wrapper Sep 23, 2023
@jayzhan211 jayzhan211 force-pushed the arrayref-for-scalarvaluelist branch from dc280bf to 73e78c0 Compare September 25, 2023 14:42
@jayzhan211 jayzhan211 changed the title Make ScalarValue an ArrayRef Wrapper Change ScalarValue::List to store ArrayRef Sep 29, 2023
@jayzhan211 jayzhan211 force-pushed the arrayref-for-scalarvaluelist branch 3 times, most recently from a87d75b to 8ce6e93 Compare October 4, 2023 00:58
@jayzhan211 jayzhan211 marked this pull request as ready for review October 4, 2023 14:35
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Oct 4, 2023

Ready for review! Although this PR is quite large, I hope it is easy to read. @tustvold @alamb

@alamb
Copy link
Contributor

alamb commented Oct 4, 2023

I plan to give this a look later today or tomorrow. Thanks @jayzhan211

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- this is an epic PR. I am sorry for the delay in reviewing. It is hard for me to find enough contiguous time to review large PRs like this.

I left some comments. Other than the todo!() in PartialOrd (which would result in a panic) I think this PR is basically ready to go.

datafusion/common/src/utils.rs Outdated Show resolved Hide resolved
datafusion/common/src/scalar.rs Outdated Show resolved Hide resolved
datafusion/common/src/scalar.rs Outdated Show resolved Hide resolved
@@ -1653,42 +1619,66 @@ impl ScalarValue {
Ok(array)
}

fn iter_to_array_list(
scalars: impl IntoIterator<Item = ScalarValue>,
// This function does not contains nulls but empty array instead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does "function not contain nulls" mean?

Copy link
Contributor Author

@jayzhan211 jayzhan211 Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullBuffer for ListArray is None in this function, you can't get NULL in the array from this function, only an empty array is possible.

I tried to have one function for iter_to_array_list_without_nulls and iter_to_array_list but failed since one needs nulls, and another needs an empty array.

datafusion/common/src/scalar.rs Show resolved Hide resolved
datafusion/physical-expr/src/aggregate/bit_and_or_xor.rs Outdated Show resolved Hide resolved
in_data.data_type()
)
array_agg: Vec<Vec<ScalarValue>>,
) -> Result<Vec<Vec<Vec<ScalarValue>>>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to document what the three levels of Vec mean semantically ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't understand either 😢 Let me take a chance to learn more about aggregate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the comment from merge_ordered_arrays for now. I think there might be some space to improve OrderSensitiveArrayAggAccumulator, but it is better to do it in another PR.

@@ -424,59 +427,6 @@ fn scalar_values_error_serialization() {
Some(vec![]),
vec![Field::new("item", DataType::Int16, true)].into(),
),
// Should fail due to inconsistent types in the list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why were these tests removed? Do they pass now? If so, perhaps we can move them to the positive cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is not possible to build inconsistent types with ListArray, therefore I just removed the whole test.

@jayzhan211 jayzhan211 marked this pull request as draft October 11, 2023 13:44
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 force-pushed the arrayref-for-scalarvaluelist branch from 96f6f7d to 577612b Compare October 11, 2023 14:23
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@alamb
Copy link
Contributor

alamb commented Oct 12, 2023

@jayzhan211 please mark this PR as ready for review when you are ready for me to give it another look.

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review October 13, 2023 00:43
// in_data is Vec<ScalarValue> where ScalarValue does not include ScalarValue::List
for in_data in array_agg.into_iter() {
let ordering = in_data.into_iter().map(|struct_vals| {
if let ScalarValue::Struct(Some(orderings), _) = struct_vals {
Copy link
Contributor Author

@jayzhan211 jayzhan211 Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we have 3 levels of Vec in orderings is because we got Vec\<ScalarValue> in Struct, but orderings seems to have only one ScalarValue, orderings.len() == 1 based on the test.

Is it reasonable to have >1 ScalarValue for orderings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @mustafasrepo or @metegenez can offer some guidance here

@jayzhan211 jayzhan211 requested a review from alamb October 13, 2023 12:16
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- this is an epic PR and reflects a lot of work. Thank you for the contribution

// in_data is Vec<ScalarValue> where ScalarValue does not include ScalarValue::List
for in_data in array_agg.into_iter() {
let ordering = in_data.into_iter().map(|struct_vals| {
if let ScalarValue::Struct(Some(orderings), _) = struct_vals {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @mustafasrepo or @metegenez can offer some guidance here

@jayzhan211
Copy link
Contributor Author

Thank you @jayzhan211 -- this is an epic PR and reflects a lot of work. Thank you for the contribution

Thanks for your review!

@alamb
Copy link
Contributor

alamb commented Oct 16, 2023

I merged this PR up from main and plan to merge it when it passes CI.

I will also make a follow on PR to clean up the to/from proto serialization.

@alamb alamb merged commit cb2d03c into apache:main Oct 16, 2023
22 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 16, 2023

Thanks again @jayzhan211 -- epic work.

@Dandandan
Copy link
Contributor

Awesome, thanks @jayzhan211

@alamb
Copy link
Contributor

alamb commented Oct 26, 2023

I believe we found another bug here: #7938

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change ScalarValue::List to store ArrayRef Optimize ScalarValue list variant using Array trait object
3 participants