Fix: `substr()` on StringView column's behavior is inconsistent with the old version #12383

2010YOUY01 · 2024-09-08T10:34:48Z

Which issue does this PR close?

N.A.

Rationale for this change

Recently native StringView type support is implemented on substr() string function, but its behaviour is not consistent with the older implementation on StringArray string physical column type
The function semantics on the original StringArray type is consistent with postgres https://www.postgresql.org/docs/9.1/functions-string.html

The postgres version of substr(s, start, count)'s behavior is: start is 1-indexed, then we have a maybe out of bounds , and 0-indexed character range [start-1, start-1+count) , the valid char range is [0, s.chars().count()), and this function will take the intersection of two.

They handle negative start index differently and also handle invalid range differently

DataFusion CLI v41.0.0
> select substr('foo', -2,4), substr(arrow_cast('foo', 'Utf8View'), -2, 4);
+----------------------------------------+---------------------------------------------------------------------+
| substr(Utf8("foo"),Int64(-2),Int64(4)) | substr(arrow_cast(Utf8("foo"),Utf8("Utf8View")),Int64(-2),Int64(4)) |
+----------------------------------------+---------------------------------------------------------------------+
| f                                      | foo                                                                 |
+----------------------------------------+---------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.104 seconds.

> select substr('', -10, 5)='', substr(arrow_cast('', 'Utf8View'), -10, 5)='';
+-------------------------------------------------+------------------------------------------------------------------------------+
| substr(Utf8(""),Int64(-10),Int64(5)) = Utf8("") | substr(arrow_cast(Utf8(""),Utf8("Utf8View")),Int64(-10),Int64(5)) = Utf8("") |
+-------------------------------------------------+------------------------------------------------------------------------------+
| true                                            |                                                                              |
+-------------------------------------------------+------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.010 seconds.

What changes are included in this PR?

substr's behavior is already consistent with postgres if input is StringArray
This PR changes implementation to keep StringView version also consistent with postgres

Are these changes tested?

Added more sqllogictests

Are there any user-facing changes?

No

2010YOUY01 · 2024-09-08T10:39:53Z

datafusion/functions/src/unicode/substr.rs

    match args.len() {
        1 => {
-            for (idx, (raw, start)) in string_view_array
-                .views()
+            for ((str_opt, raw_view), start_opt) in string_view_array


The old implementation will treat empty string in view as NULL (within make_and_append_view()), so it handled NULL case correctly but handled the empty string case wrong.
Since it's not possible to use view to check if the current element is null, so here we also have to iterate on the string array to check whether each element is NULL

Another potential thing that might be faster is to iterate on the null array directly (rather than the Option<&str>) so the code doesn't have to figure out where the &str comes from

Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.

Another potential thing that might be faster is to iterate on the null array directly (rather than the Option<&str>) so the code doesn't have to figure out where the &str comes from

Great point, there are also other potential optimizations like specializing not null columns, scalar literal arguments, and ASCII case. So I prefer only to make a simple and correct implementation in this PR, and evaluate those optimizations as a future task

alamb · 2024-09-09T11:14:00Z

FYI @Kev1n8 who I think added substring support in #12044

alamb

Thank you very much @2010YOUY01 and @dmitrybugakov for the review

alamb · 2024-09-09T11:17:01Z

datafusion/functions/src/unicode/substr.rs

    match args.len() {
        1 => {
-            for (idx, (raw, start)) in string_view_array
-                .views()
+            for ((str_opt, raw_view), start_opt) in string_view_array


Another potential thing that might be faster is to iterate on the null array directly (rather than the Option<&str>) so the code doesn't have to figure out where the &str comes from

alamb · 2024-09-09T11:18:10Z

datafusion/functions/src/unicode/substr.rs

@@ -118,15 +118,37 @@ pub fn substr(args: &[ArrayRef]) -> Result<ArrayRef> {
    }
 }

-// Return the exact byte index for [start, end), set count to -1 to ignore count
-fn get_true_start_end(input: &str, start: usize, count: i64) -> (usize, usize) {
+// Convert the given `start` and `count` to valid byte indices within `input` string


😍 for the comments

Kev1n8

Thanks for fixing my mistakes and all the improvements!

Kev1n8 · 2024-09-09T13:06:43Z

datafusion/functions/src/unicode/substr.rs

    match args.len() {
        1 => {
-            for (idx, (raw, start)) in string_view_array
-                .views()
+            for ((str_opt, raw_view), start_opt) in string_view_array


Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.

alamb · 2024-09-09T13:20:18Z

Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.

Agreed -- thanks @2010YOUY01 (and sorry I didn't catch this in review either)

In general I wonder if there is a better way to test the string functions (since many of them now have multiple implementations for different input type combinations, but each implementation should get the same answer)

Rachelint · 2024-09-09T16:37:47Z

datafusion/functions/src/unicode/substr.rs

+    let sub_view = if substr_len > 12 {
+        let view = ByteView::from(*raw);
+        make_view(substr.as_bytes(), view.buffer_index, view.offset + start)
    } else {
-        let sub_view = if substr_len > 12 {
-            let view = ByteView::from(*raw);
-            make_view(substr.as_bytes(), view.buffer_index, view.offset + start)
-        } else {
-            // inline value does not need block id or offset
-            make_view(substr.as_bytes(), 0, 0)
-        };
-        views_buffer.push(sub_view);
-        null_builder.append_non_null();
-    }
+        // inline value does not need block id or offset
+        make_view(substr.as_bytes(), 0, 0)


Seems the inline check here is duplicated with the one within make_view?
I guess inline len(12) may be the internal logic in string view, and we should not expose and use it in datafusion?

I would say the point we are checking size here is to help with the operation of "modifying the views directly". But I agree that maybe we could add more API upstream in arrow to hide the logic.

I would say the point we are checking size here is to help with the operation of "modifying the views directly". But I agree that maybe we could add more API upstream in arrow to hide the logic.

Maybe we can just let it run as same as the > 12 branch? ByteView::from(*raw) seems cheap, maybe it is not more expensive than a if else? I am not sure.

Oh I see what you mean. I think ByteVIew is not equal to inlined_view. The former one indicates the str whose size > 12, while the inlined str is represented by a simple u128.

Maybe we can just let it run as same as the > 12 branch?

It might actually work since from<u128> for ByteView would load meaningless(which are actually the str) prefix, buffer_index, offset, which does no harm. But I think it would be ambiguous. It's a good suggestion tho.

Oh I see what you mean. I think ByteVIew is not equal to inlined_view. The former one indicates the str whose size > 12, while the inlined str is represented by a simple u128.

Maybe we can just let it run as same as the > 12 branch?

It might actually work since from<u128> for ByteView would load meaningless prefix, buffer_index, offset, which does no harm. But I think it would be ambiguous. It's a good suggestion tho.

Ok, it is indeed more clear to distinguish inlined and not inlned.

Rachelint · 2024-09-09T16:40:50Z

It is really cool, and also fix the potential problem in #12395 !

2010YOUY01 · 2024-09-10T09:54:55Z

Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.

Agreed -- thanks @2010YOUY01 (and sorry I didn't catch this in review either)

In general I wonder if there is a better way to test the string functions (since many of them now have multiple implementations for different input type combinations, but each implementation should get the same answer)

One thing to do is to duplicate existing SQL test cases and convert StringArray columns to StringView columns, I think not every PR for string function native StringView support did this 🤔
I can do a sample PR and open several issues for string functions missing StringView SQL level tests

alamb · 2024-09-10T10:37:28Z

One thing to do is to duplicate existing SQL test cases and convert StringArray columns to StringView columns, I think not every PR for string function native StringView support did this 🤔
I can do a sample PR and open several issues for string functions missing StringView SQL level tests

Thanks @2010YOUY01

The potential downside of doing that is that it will not evolve as we add new string functions (e.g. we add a new string function that doesn't handle StringView, how will we know?)

I wonder if we can potentially do something more automated like have the same queries and expected files and run them against different test cases.

I think we could do this in sqllogictest by separating the table definitions into different .slt files and then includeing the same queries file. Here is an example:

https://github.com/apache/datafusion/blob/02eab80cd62e02fcb68dee8b99d63aaac680a66c/datafusion/sqllogictest/test_files/join_disable_repartition_joins.slt#L26-L25

alamb · 2024-09-10T10:48:31Z

I filed #12415 to track the idea of a more systematic testing

alamb · 2024-09-10T10:48:41Z

Thanks again @2010YOUY01 @dmitrybugakov @Kev1n8 and @Rachelint

Fix substr()

55114e8

github-actions bot added sqllogictest SQL Logic Tests (.slt) functions labels Sep 8, 2024

2010YOUY01 commented Sep 8, 2024

View reviewed changes

demetribu approved these changes Sep 9, 2024

View reviewed changes

alamb approved these changes Sep 9, 2024

View reviewed changes

Kev1n8 reviewed Sep 9, 2024

View reviewed changes

Kev1n8 mentioned this pull request Sep 9, 2024

Improve performance of trim for string view (10%) #12395

Merged

Rachelint reviewed Sep 9, 2024

View reviewed changes

alamb mentioned this pull request Sep 10, 2024

Automate testing / ensuring that string functions get the same answer for String, LargeString, StringView, DictionaryString, etc #12415

Closed

alamb merged commit 376a0b8 into apache:main Sep 10, 2024
25 checks passed

alamb mentioned this pull request Sep 16, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 16, 2024 #12494

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: `substr()` on StringView column's behavior is inconsistent with the old version #12383

Fix: `substr()` on StringView column's behavior is inconsistent with the old version #12383

2010YOUY01 commented Sep 8, 2024 •

edited

Loading

2010YOUY01 Sep 8, 2024

alamb Sep 9, 2024

Kev1n8 Sep 9, 2024

2010YOUY01 Sep 10, 2024

alamb commented Sep 9, 2024

alamb left a comment

alamb Sep 9, 2024

alamb Sep 9, 2024

Kev1n8 left a comment

Kev1n8 Sep 9, 2024

alamb commented Sep 9, 2024

Rachelint Sep 9, 2024

Kev1n8 Sep 9, 2024

Rachelint Sep 9, 2024 •

edited

Loading

Kev1n8 Sep 9, 2024 •

edited

Loading

Rachelint Sep 9, 2024

Rachelint commented Sep 9, 2024

2010YOUY01 commented Sep 10, 2024

alamb commented Sep 10, 2024

alamb commented Sep 10, 2024

alamb commented Sep 10, 2024 •

edited

Loading

Fix: substr() on StringView column's behavior is inconsistent with the old version #12383

Fix: substr() on StringView column's behavior is inconsistent with the old version #12383

Conversation

2010YOUY01 commented Sep 8, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Sep 9, 2024

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Kev1n8 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Sep 9, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Rachelint Sep 9, 2024 • edited Loading

Choose a reason for hiding this comment

Kev1n8 Sep 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Rachelint commented Sep 9, 2024

2010YOUY01 commented Sep 10, 2024

alamb commented Sep 10, 2024

alamb commented Sep 10, 2024

alamb commented Sep 10, 2024 • edited Loading

Fix: `substr()` on StringView column's behavior is inconsistent with the old version #12383

Fix: `substr()` on StringView column's behavior is inconsistent with the old version #12383

2010YOUY01 commented Sep 8, 2024 •

edited

Loading

Rachelint Sep 9, 2024 •

edited

Loading

Kev1n8 Sep 9, 2024 •

edited

Loading

alamb commented Sep 10, 2024 •

edited

Loading