-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement equality = and inequality <> support for StringView #10985
feat: Implement equality = and inequality <> support for StringView #10985
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pr looks good to me, left some minor comments
datafusion/common/src/scalar/mod.rs
Outdated
@@ -1715,6 +1715,22 @@ impl ScalarValue { | |||
)?; | |||
Arc::new(array) | |||
} | |||
DataType::Utf8View => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just build_array_string!(StringViewArray, Utf8View)
?
Can you also move this case closer to where we handle Utf8
and LargeUtf8
?
@@ -932,6 +932,7 @@ fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> | |||
(LargeUtf8, Utf8) => Some(LargeUtf8), | |||
(Utf8, LargeUtf8) => Some(LargeUtf8), | |||
(LargeUtf8, LargeUtf8) => Some(LargeUtf8), | |||
(Utf8View, Utf8View) => Some(Utf8View), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we handle (Utf8View, Utf8)
, (Utf8, Utf8View)
etc?
It's possible that we read a StringViewArray from parquet, and compare it against a Utf8
. What do you think the result type should look like? @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should permit to/coercion from any "string" / "binary" type to the associated view type.
It is probably good to make sure the coercion does the cheap direction when possible
For example, we shouldn't be converting StringViewArray
--> StringArray
as that will copy all the strings. Instead we should convert StringArray
--> StringViewArray
which is relatively cheap (we just have to copy the views)
I think this feature -- coercion -- is probably worth its own ticket and doesn't need to be done in this PR. I'll file a follow on ticket in the morning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I handled the Utf8View
coercion in this PR, but it would be better to handle the BinaryView
in a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with (Utf8View, Utf8)
is that it can be quite common to have the following queries:
select * from t where c1 <> 'small';
Where the c1 is read to be StringViewArray
, then we need to decide what is the result type (in order to implement the coercion). I agree with @alamb that StringViewArray
is generally cheaper than StringArray
, so we might just use Utf8View
if at least one operand is using Utf8View
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I handled the
Utf8View
coercion in this PR, but it would be better to handle theBinaryView
in a follow-up.
Filed #10996 to track Binary view support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with
(Utf8View, Utf8)
is that it can be quite common to have the following queries:select * from t where c1 <> 'small';Where the c1 is read to be
StringViewArray
, then we need to decide what is the result type (in order to implement the coercion). I agree with @alamb thatStringViewArray
is generally cheaper thanStringArray
, so we might just useUtf8View
if at least one operand is usingUtf8View
.
I wrote a bunch of tests in #10997 and the coercion mostly seems to work well. There is one exception which i think is minor and i will file a ticket about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Weijun-H -- I think once @XiangpengHao 's comments are addressed this PR will be good to go. 🙏
@@ -932,6 +932,7 @@ fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> | |||
(LargeUtf8, Utf8) => Some(LargeUtf8), | |||
(Utf8, LargeUtf8) => Some(LargeUtf8), | |||
(LargeUtf8, LargeUtf8) => Some(LargeUtf8), | |||
(Utf8View, Utf8View) => Some(Utf8View), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree we should permit to/coercion from any "string" / "binary" type to the associated view type.
It is probably good to make sure the coercion does the cheap direction when possible
For example, we shouldn't be converting StringViewArray
--> StringArray
as that will copy all the strings. Instead we should convert StringArray
--> StringViewArray
which is relatively cheap (we just have to copy the views)
I think this feature -- coercion -- is probably worth its own ticket and doesn't need to be done in this PR. I'll file a follow on ticket in the morning
Andrew X | ||
|
||
query ?? | ||
select * from test where column1 = 'Andrew'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this test to run, we need to add (Utf8View, Utf8) => Utf8View
, see #10985 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @XiangpengHao -- I think this pR is a good step forward and so I think we should merge it into the string-view
branch to keep making progress.
I will do so and file some follow on issues to track remaining work (specifically for coercion and BinaryView)
…velopment branch (#11402) * Update `string-view` branch to arrow-rs main (#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * feat: Implement equality = and inequality <> support for StringView (#10985) * feat: Implement equality = and inequality <> support for StringView * chore: Add tests for the StringView * chore * chore: Update tests for NULL * fix: Used build_array_string! * chore: Update string_coercion function to handle Utf8View type in binary.rs * chore: add tests * chore: ci * Add more StringView comparison test coverage (#10997) * Add more StringView comparison test coverage * add reference * Add another test showing casting on columns works correctly * feat: Implement equality = and inequality <> support for BinaryView (#11004) * feat: Implement equality = and inequality <> support for BinaryView Signed-off-by: Chojan Shang <psiace@apache.org> * chore: make fmt happy Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org> * Implement support for LargeString and LargeBinary for StringView and BinaryView (#11034) * implement large binary * add tests for large string * better comments for string coercion * Improve filter predicates with `Utf8View` literals (#11043) * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Improve type coercion logic in TypeCoercionRewriter * chore * chore: Update test * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs * Remove arrow-patch --------- Signed-off-by: Chojan Shang <psiace@apache.org> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Chojan Shang <psiace@apache.org> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
…velopment branch (apache#11402) * Update `string-view` branch to arrow-rs main (apache#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * feat: Implement equality = and inequality <> support for StringView (apache#10985) * feat: Implement equality = and inequality <> support for StringView * chore: Add tests for the StringView * chore * chore: Update tests for NULL * fix: Used build_array_string! * chore: Update string_coercion function to handle Utf8View type in binary.rs * chore: add tests * chore: ci * Add more StringView comparison test coverage (apache#10997) * Add more StringView comparison test coverage * add reference * Add another test showing casting on columns works correctly * feat: Implement equality = and inequality <> support for BinaryView (apache#11004) * feat: Implement equality = and inequality <> support for BinaryView Signed-off-by: Chojan Shang <psiace@apache.org> * chore: make fmt happy Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org> * Implement support for LargeString and LargeBinary for StringView and BinaryView (apache#11034) * implement large binary * add tests for large string * better comments for string coercion * Improve filter predicates with `Utf8View` literals (apache#11043) * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Improve type coercion logic in TypeCoercionRewriter * chore * chore: Update test * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs * Remove arrow-patch --------- Signed-off-by: Chojan Shang <psiace@apache.org> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Chojan Shang <psiace@apache.org> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
…velopment branch (apache#11402) * Update `string-view` branch to arrow-rs main (apache#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * feat: Implement equality = and inequality <> support for StringView (apache#10985) * feat: Implement equality = and inequality <> support for StringView * chore: Add tests for the StringView * chore * chore: Update tests for NULL * fix: Used build_array_string! * chore: Update string_coercion function to handle Utf8View type in binary.rs * chore: add tests * chore: ci * Add more StringView comparison test coverage (apache#10997) * Add more StringView comparison test coverage * add reference * Add another test showing casting on columns works correctly * feat: Implement equality = and inequality <> support for BinaryView (apache#11004) * feat: Implement equality = and inequality <> support for BinaryView Signed-off-by: Chojan Shang <psiace@apache.org> * chore: make fmt happy Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org> * Implement support for LargeString and LargeBinary for StringView and BinaryView (apache#11034) * implement large binary * add tests for large string * better comments for string coercion * Improve filter predicates with `Utf8View` literals (apache#11043) * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Improve type coercion logic in TypeCoercionRewriter * chore * chore: Update test * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs * Remove arrow-patch --------- Signed-off-by: Chojan Shang <psiace@apache.org> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Chojan Shang <psiace@apache.org> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
…velopment branch (apache#11402) * Update `string-view` branch to arrow-rs main (apache#10966) * Pin to arrow main * Fix clippy with latest arrow * Uncomment test that needs new arrow-rs to work * Update datafusion-cli Cargo.lock * Update Cargo.lock * tapelo * feat: Implement equality = and inequality <> support for StringView (apache#10985) * feat: Implement equality = and inequality <> support for StringView * chore: Add tests for the StringView * chore * chore: Update tests for NULL * fix: Used build_array_string! * chore: Update string_coercion function to handle Utf8View type in binary.rs * chore: add tests * chore: ci * Add more StringView comparison test coverage (apache#10997) * Add more StringView comparison test coverage * add reference * Add another test showing casting on columns works correctly * feat: Implement equality = and inequality <> support for BinaryView (apache#11004) * feat: Implement equality = and inequality <> support for BinaryView Signed-off-by: Chojan Shang <psiace@apache.org> * chore: make fmt happy Signed-off-by: Chojan Shang <psiace@apache.org> --------- Signed-off-by: Chojan Shang <psiace@apache.org> * Implement support for LargeString and LargeBinary for StringView and BinaryView (apache#11034) * implement large binary * add tests for large string * better comments for string coercion * Improve filter predicates with `Utf8View` literals (apache#11043) * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Improve type coercion logic in TypeCoercionRewriter * chore * chore: Update test * refactor: Improve type coercion logic in TypeCoercionRewriter * refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs * Remove arrow-patch --------- Signed-off-by: Chojan Shang <psiace@apache.org> Co-authored-by: Alex Huang <huangweijun1001@gmail.com> Co-authored-by: Chojan Shang <psiace@apache.org> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Note: targets
string-view
branchWhich issue does this PR close?
Closes #10919
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?