-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: substr()
on StringView column's behavior is inconsistent with the old version
#12383
Conversation
match args.len() { | ||
1 => { | ||
for (idx, (raw, start)) in string_view_array | ||
.views() | ||
for ((str_opt, raw_view), start_opt) in string_view_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old implementation will treat empty string in view as NULL
(within make_and_append_view()
), so it handled NULL
case correctly but handled the empty string case wrong.
Since it's not possible to use view to check if the current element is null, so here we also have to iterate on the string array to check whether each element is NULL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another potential thing that might be faster is to iterate on the null array directly (rather than the Option<&str>
) so the code doesn't have to figure out where the &str
comes from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another potential thing that might be faster is to iterate on the null array directly (rather than the
Option<&str>
) so the code doesn't have to figure out where the&str
comes from
Great point, there are also other potential optimizations like specializing not null columns, scalar literal arguments, and ASCII case. So I prefer only to make a simple and correct implementation in this PR, and evaluate those optimizations as a future task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @2010YOUY01 and @dmitrybugakov for the review
match args.len() { | ||
1 => { | ||
for (idx, (raw, start)) in string_view_array | ||
.views() | ||
for ((str_opt, raw_view), start_opt) in string_view_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another potential thing that might be faster is to iterate on the null array directly (rather than the Option<&str>
) so the code doesn't have to figure out where the &str
comes from
@@ -118,15 +118,37 @@ pub fn substr(args: &[ArrayRef]) -> Result<ArrayRef> { | |||
} | |||
} | |||
|
|||
// Return the exact byte index for [start, end), set count to -1 to ignore count | |||
fn get_true_start_end(input: &str, start: usize, count: i64) -> (usize, usize) { | |||
// Convert the given `start` and `count` to valid byte indices within `input` string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍 for the comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing my mistakes and all the improvements!
match args.len() { | ||
1 => { | ||
for (idx, (raw, start)) in string_view_array | ||
.views() | ||
for ((str_opt, raw_view), start_opt) in string_view_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for correcting my mistakes. I really appreciate it! I'll be more careful with my code next time.
Agreed -- thanks @2010YOUY01 (and sorry I didn't catch this in review either) In general I wonder if there is a better way to test the string functions (since many of them now have multiple implementations for different input type combinations, but each implementation should get the same answer) |
let sub_view = if substr_len > 12 { | ||
let view = ByteView::from(*raw); | ||
make_view(substr.as_bytes(), view.buffer_index, view.offset + start) | ||
} else { | ||
let sub_view = if substr_len > 12 { | ||
let view = ByteView::from(*raw); | ||
make_view(substr.as_bytes(), view.buffer_index, view.offset + start) | ||
} else { | ||
// inline value does not need block id or offset | ||
make_view(substr.as_bytes(), 0, 0) | ||
}; | ||
views_buffer.push(sub_view); | ||
null_builder.append_non_null(); | ||
} | ||
// inline value does not need block id or offset | ||
make_view(substr.as_bytes(), 0, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the inline check here is duplicated with the one within make_view
?
I guess inline len(12) may be the internal logic in string view, and we should not expose and use it in datafusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say the point we are checking size here is to help with the operation of "modifying the views directly". But I agree that maybe we could add more API upstream in arrow to hide the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say the point we are checking size here is to help with the operation of "modifying the views directly". But I agree that maybe we could add more API upstream in arrow to hide the logic.
Maybe we can just let it run as same as the > 12
branch? ByteView::from(*raw)
seems cheap, maybe it is not more expensive than a if else
? I am not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see what you mean. I think ByteVIew
is not equal to inlined_view
. The former one indicates the str
whose size > 12, while the inlined str
is represented by a simple u128
.
Maybe we can just let it run as same as the > 12 branch?
It might actually work since from<u128> for ByteView
would load meaningless(which are actually the str
) prefix, buffer_index, offset
, which does no harm. But I think it would be ambiguous. It's a good suggestion tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see what you mean. I think
ByteVIew
is not equal to inlined_view. The former one indicates the str whose size > 12, while the inlined str is represented by a simpleu128
.Maybe we can just let it run as same as the > 12 branch?
It might actually work since
from<u128> for ByteView
would load meaninglessprefix, buffer_index, offset
, which does no harm. But I think it would be ambiguous. It's a good suggestion tho.
Ok, it is indeed more clear to distinguish inlined and not inlned.
It is really cool, and also fix the potential problem in #12395 ! |
One thing to do is to duplicate existing SQL test cases and convert |
Thanks @2010YOUY01 The potential downside of doing that is that it will not evolve as we add new string functions (e.g. we add a new string function that doesn't handle StringView, how will we know?) I wonder if we can potentially do something more automated like have the same queries and expected files and run them against different test cases. I think we could do this in sqllogictest by separating the table definitions into different .slt files and then |
I filed #12415 to track the idea of a more systematic testing |
Thanks again @2010YOUY01 @dmitrybugakov @Kev1n8 and @Rachelint |
Which issue does this PR close?
N.A.
Rationale for this change
Recently native
StringView
type support is implemented onsubstr()
string function, but its behaviour is not consistent with the older implementation onStringArray
string physical column typeThe function semantics on the original
StringArray
type is consistent with postgres https://www.postgresql.org/docs/9.1/functions-string.htmlThe postgres version of
substr(s, start, count)
's behavior is:start
is 1-indexed, then we have a maybe out of bounds , and 0-indexed character range[start-1, start-1+count)
, the valid char range is[0, s.chars().count())
, and this function will take the intersection of two.They handle negative start index differently and also handle invalid range differently
What changes are included in this PR?
substr
's behavior is already consistent with postgres if input isStringArray
This PR changes implementation to keep
StringView
version also consistent with postgresAre these changes tested?
Added more sqllogictests
Are there any user-facing changes?
No