feat: Implement equality = and inequality <> support for StringView #10985

Merged
merged 8 commits into from
Jun 19, 2024

Conversation

Weijun-H
Member

@Weijun-H Weijun-H commented Jun 18, 2024

Note: targets string-view branch

Which issue does this PR close?

Closes #10919

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Jun 18, 2024
Contributor

@XiangpengHao XiangpengHao left a comment

The PR looks good to me; I left some minor comments.

@@ -1715,6 +1715,22 @@ impl ScalarValue {
)?;
Arc::new(array)
}
DataType::Utf8View => {
Contributor

Maybe just build_array_string!(StringViewArray, Utf8View)?
Can you also move this case closer to where we handle Utf8 and LargeUtf8?

@@ -932,6 +932,7 @@ fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType>
(LargeUtf8, Utf8) => Some(LargeUtf8),
(Utf8, LargeUtf8) => Some(LargeUtf8),
(LargeUtf8, LargeUtf8) => Some(LargeUtf8),
(Utf8View, Utf8View) => Some(Utf8View),
Contributor

Should we handle (Utf8View, Utf8), (Utf8, Utf8View), etc.?
It's possible that we read a StringViewArray from Parquet and compare it against a Utf8. What do you think the result type should be? @alamb

Contributor

I agree we should permit coercion from any "string" / "binary" type to the associated view type.

It is probably good to make sure the coercion goes in the cheap direction when possible.

For example, we shouldn't be converting StringViewArray --> StringArray, as that will copy all the strings. Instead we should convert StringArray --> StringViewArray, which is relatively cheap (we just have to create the views).

I think this feature -- coercion -- is probably worth its own ticket and doesn't need to be done in this PR. I'll file a follow-on ticket in the morning.
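
To illustrate why the StringArray --> StringViewArray direction is cheap, here is a minimal sketch that builds views over an existing offset-based buffer. It is not the arrow-rs implementation: `OffsetStrings`, `View`, and `make_views` are hypothetical names, and the layout is a simplified version of the Arrow view format (strings of at most 12 bytes are stored inline, longer ones keep a 4-byte prefix plus a buffer index and offset).

```rust
/// Hypothetical stand-in for an offset-based string array (in the spirit of
/// `StringArray`): all string bytes live in one buffer, delimited by offsets.
struct OffsetStrings {
    data: Vec<u8>,
    offsets: Vec<u32>, // length is number of strings + 1
}

/// Simplified view (assumption: mirrors the Arrow view layout only conceptually).
#[derive(Debug, Clone, Copy, PartialEq)]
enum View {
    /// Strings of at most 12 bytes are stored directly in the view.
    Inline { len: u32, bytes: [u8; 12] },
    /// Longer strings keep a 4-byte prefix and point into a shared buffer.
    Ref { len: u32, prefix: [u8; 4], buffer: u32, offset: u32 },
}

/// Build one view per string. The source buffer is only read: for strings
/// longer than 12 bytes, just the 4-byte prefix is copied.
fn make_views(src: &OffsetStrings) -> Vec<View> {
    src.offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0] as usize, w[1] as usize);
            let s = &src.data[start..end];
            let len = s.len() as u32;
            if s.len() <= 12 {
                let mut bytes = [0u8; 12];
                bytes[..s.len()].copy_from_slice(s);
                View::Inline { len, bytes }
            } else {
                let mut prefix = [0u8; 4];
                prefix.copy_from_slice(&s[..4]);
                // Buffer index 0: the single existing data buffer is reused.
                View::Ref { len, prefix, buffer: 0, offset: w[0] }
            }
        })
        .collect()
}
```

Going the other way (views --> offsets) would require gathering every string's bytes into a fresh contiguous buffer, which is the copy this comment warns about.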

Member Author

I handled the Utf8View coercion in this PR, but it would be better to handle the BinaryView in a follow-up.

Contributor

The problem with (Utf8View, Utf8) is that it can be quite common to have the following queries:

select * from t where c1 <> 'small';

Here c1 is read as a StringViewArray, so we need to decide what the result type is (in order to implement the coercion). I agree with @alamb that StringViewArray is generally cheaper than StringArray, so we might just use Utf8View if at least one operand is Utf8View.
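
A minimal sketch of that rule, assuming the result type is Utf8View whenever either operand is Utf8View. `StrType` is a hypothetical stand-in for the string variants of arrow's `DataType`, and the (LargeUtf8, Utf8View) outcome is an assumption that this thread does not settle:

```rust
/// Hypothetical stand-in for the string variants of arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Pick the common type for comparing two string-typed operands.
/// Returns `Option` only to mirror the shape of the function in the diff.
fn string_coercion(lhs: StrType, rhs: StrType) -> Option<StrType> {
    use StrType::*;
    match (lhs, rhs) {
        // Assumption: a view on either side wins, so the existing view array
        // is never converted back to an offset-based array.
        (Utf8View, _) | (_, Utf8View) => Some(Utf8View),
        // Existing behaviour shown in the diff: widen to LargeUtf8.
        (LargeUtf8, _) | (_, LargeUtf8) => Some(LargeUtf8),
        (Utf8, Utf8) => Some(Utf8),
    }
}
```

With this rule, `c1 <> 'small'` with a Utf8View column and a Utf8 literal would coerce the literal rather than the column.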

Contributor

> I handled the Utf8View coercion in this PR, but it would be better to handle the BinaryView in a follow-up.

Filed #10996 to track Binary view support

Contributor

> The problem with (Utf8View, Utf8) is that it can be quite common to have the following queries:
>
> select * from t where c1 <> 'small';
>
> Where the c1 is read to be StringViewArray, then we need to decide what is the result type (in order to implement the coercion). I agree with @alamb that StringViewArray is generally cheaper than StringArray, so we might just use Utf8View if at least one operand is using Utf8View.

I wrote a bunch of tests in #10997, and the coercion mostly seems to work well. There is one exception, which I think is minor, and I will file a ticket about it.

Contributor

@alamb alamb left a comment

Thanks @Weijun-H -- I think once @XiangpengHao's comments are addressed this PR will be good to go. 🙏

Andrew X

query ??
select * from test where column1 = 'Andrew';
Contributor

I think for this test to run, we need to add (Utf8View, Utf8) => Utf8View, see #10985 (comment)

Contributor

@alamb alamb left a comment

Thank you @XiangpengHao -- I think this PR is a good step forward, so I think we should merge it into the string-view branch to keep making progress.

I will do so and file some follow-on issues to track the remaining work (specifically for coercion and BinaryView).

@alamb alamb merged commit 507d978 into apache:string-view Jun 19, 2024
26 checks passed
alamb added a commit that referenced this pull request Jul 16, 2024
…velopment branch (#11402)

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* feat: Implement equality = and inequality <> support for StringView (#10985)

* feat: Implement equality = and inequality <> support for StringView

* chore: Add tests for the StringView

* chore

* chore: Update tests for NULL

* fix: Used build_array_string!

* chore: Update string_coercion function to handle Utf8View type in binary.rs

* chore: add tests

* chore: ci

* Add more StringView comparison test coverage (#10997)

* Add more StringView comparison test coverage

* add reference

* Add another test showing casting on columns works correctly

* feat: Implement equality = and inequality <> support for BinaryView (#11004)

* feat: Implement equality = and inequality <> support for BinaryView

Signed-off-by: Chojan Shang <psiace@apache.org>

* chore: make fmt happy

Signed-off-by: Chojan Shang <psiace@apache.org>

---------

Signed-off-by: Chojan Shang <psiace@apache.org>

* Implement support for LargeString and LargeBinary for StringView and BinaryView (#11034)

* implement large binary

* add tests for large string

* better comments for string coercion

* Improve filter predicates with `Utf8View` literals (#11043)

* refactor: Improve type coercion logic in TypeCoercionRewriter

* refactor: Improve type coercion logic in TypeCoercionRewriter

* chore

* chore: Update test

* refactor: Improve type coercion logic in TypeCoercionRewriter

* refactor: Remove unused import and update code formatting in unwrap_cast_in_comparison.rs

* Remove arrow-patch

---------

Signed-off-by: Chojan Shang <psiace@apache.org>
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Co-authored-by: Chojan Shang <psiace@apache.org>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
…velopment branch (apache#11402)

xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
…velopment branch (apache#11402)

wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
…velopment branch (apache#11402)
