Faster character_length() string function for ASCII-only case #12356
Conversation
// String characters are variable-length encoded in UTF-8; counting the
// number of chars requires expensive decoding. However, checking whether the
// string is ASCII-only is relatively cheap.
// If strings are ASCII-only, count bytes instead.
👍
That sounds simple and efficient
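As a rough illustration of the quoted comment, here is a std-only sketch (not the PR's code) with a hypothetical char_length helper: str::is_ascii is a cheap byte scan, and for ASCII-only input the byte length equals the character count, so the UTF-8 decode in chars().count() can be skipped.

```rust
// Minimal standalone sketch of the idea in the quoted comment.
fn char_length(s: &str) -> usize {
    if s.is_ascii() {
        s.len() // 1 byte per character for ASCII
    } else {
        s.chars().count() // decode UTF-8 and count code points
    }
}

fn main() {
    assert_eq!(char_length("DataFusion"), 10);
    assert_eq!(char_length("数据融合"), 4);
}
```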
T::Native::from_usize(string.chars().count())
    .expect("should not fail as string.chars will always return integer")
if is_array_ascii_only {
    T::Native::from_usize(string.len()).expect(
Wondering whether we should use usize_as to avoid dealing with Options, if we're sure the value always fits.
I don't have any strong preference here -- this seems ok to me, but keeping the code cleaner with usize_as also seems fine.
Yes, I think conversion can't fail here
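A small sketch of the two conversion styles being discussed, assuming arrow-rs's ArrowNativeType trait with its fallible from_usize and infallible usize_as (illustrative only, not the PR's code):

```rust
use arrow::datatypes::ArrowNativeType;

fn len_checked<N: ArrowNativeType>(s: &str) -> N {
    // Panics only if the length does not fit the native type, which cannot
    // happen for a single string in practice.
    N::from_usize(s.len()).expect("string length fits in the native type")
}

fn len_unchecked<N: ArrowNativeType>(s: &str) -> N {
    // Relies on the same invariant, but skips the Option handling entirely.
    N::usize_as(s.len())
}

fn main() {
    assert_eq!(len_checked::<i32>("ascii only"), 10);
    assert_eq!(len_unchecked::<i64>("ascii only"), 10);
}
```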
This PR aims to show what the change will look like and to decide whether we should continue with other string functions like lower()
I think it is a good idea. Thank you @2010YOUY01 for working on this
Though the speedup for char_length() is not big,
Well, I suppose it depends on what you mean by "big" 😆 -- I think this benchmark shows this PR is 40% faster for some strings, right?
character_length_StringArray_ascii_str_len_128 1.00 123.9±7.62µs ? ?/sec 1.42 175.8±7.33µs ? ?/sec
I guess it's because the character-counting implementation is already relatively CPU friendly, but according to Velox's benchmarks we can expect a bigger speedup for string functions like lower()/upper()
I think something else that Velox / Photon likely do is to mark which batches are ASCII-only on creation and then propagate that flag through.
The current implementation of arrow seems to actually examine all the bytes in the array:
https://docs.rs/arrow-array/53.0.0/src/arrow_array/array/byte_array.rs.html#262
I will leave some additional comments on #12356 as well
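For illustration only, a plain-Rust sketch of what that flag propagation might look like, with a hypothetical StringBatch type that is neither DataFusion nor Arrow code: the ASCII-only property is computed once when the batch is built and reused by every kernel instead of being re-scanned.

```rust
struct StringBatch {
    values: Vec<String>,
    ascii_only: bool, // computed once at construction, reused by every kernel
}

impl StringBatch {
    fn new(values: Vec<String>) -> Self {
        let ascii_only = values.iter().all(|s| s.is_ascii());
        Self { values, ascii_only }
    }

    fn char_lengths(&self) -> Vec<usize> {
        self.values
            .iter()
            .map(|s| if self.ascii_only { s.len() } else { s.chars().count() })
            .collect()
    }
}

fn main() {
    let batch = StringBatch::new(vec!["abc".into(), "hello".into()]);
    assert_eq!(batch.char_lengths(), vec![3, 5]);
}
```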
let mut rng = StdRng::seed_from_u64(42);
let rng_ref = &mut rng;

let corpus = "DataFusionДатаФусион数据融合📊🔥"; // includes UTF-8 characters encoded with 1~4 bytes
💯
ДатаФусион )))
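A tiny illustration (not from the PR) of why that corpus exercises the slow path: byte length and character count diverge as soon as multi-byte UTF-8 sequences appear, so the benchmark covers both branches.

```rust
fn main() {
    let ascii = "DataFusion";
    let mixed = "DataFusionДатаФусион数据融合📊🔥"; // 1-, 2-, 3- and 4-byte code points
    assert_eq!(ascii.len(), ascii.chars().count()); // fast path is valid
    assert_ne!(mixed.len(), mixed.chars().count()); // must decode chars
}
```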
Which issue does this PR close?
Part of #12306
Rationale for this change
See the issue.
This PR aims to show what the change will look like and to decide whether we should continue with other string functions like lower().
Though the speedup for char_length() is not big, I guess that's because the character-counting implementation is already relatively CPU friendly; according to Velox's benchmarks we can expect a bigger speedup for string functions like lower()/upper().
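Since the rationale asks whether to continue with lower(), here is a hedged sketch of how the same ASCII fast path might look there, using std's to_ascii_lowercase() as the cheap branch (illustrative only; not code from this PR or Velox):

```rust
// Hypothetical helper, not DataFusion code: ASCII-only strings can use the
// byte-wise to_ascii_lowercase() and skip full Unicode case mapping.
fn fast_lower(s: &str, ascii_only: bool) -> String {
    if ascii_only {
        s.to_ascii_lowercase() // byte-wise, no UTF-8 decoding or case tables
    } else {
        s.to_lowercase() // full Unicode-aware lowering
    }
}

fn main() {
    assert_eq!(fast_lower("DataFusion", true), "datafusion");
    assert_eq!(fast_lower("ДатаФусион", false), "датафусион");
}
```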
What changes are included in this PR?
character_length() now first checks whether the string array is ASCII-only; if so, it counts bytes directly instead of decoding characters one by one.
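A hedged sketch of this change, assuming arrow-rs's StringArray and its value_data() accessor for the contiguous UTF-8 buffer (names per my reading of arrow-array; the PR's actual kernel differs and also handles nulls). The ASCII check is a single linear scan over the value buffer, after which every row's length is just its byte count.

```rust
use arrow::array::{Array, StringArray};

fn character_lengths(array: &StringArray) -> Vec<i32> {
    // One pass over the whole value buffer decides whether the fast path applies.
    let ascii_only = array.value_data().is_ascii();
    (0..array.len())
        .map(|i| {
            let s = array.value(i);
            if ascii_only {
                s.len() as i32 // byte count == char count for ASCII
            } else {
                s.chars().count() as i32 // fall back to UTF-8 decoding
            }
        })
        .collect()
}

fn main() {
    let array = StringArray::from(vec!["DataFusion", "数据融合"]);
    assert_eq!(character_lengths(&array), vec![10, 4]);
}
```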
Are these changes tested?
The current sqllogictests already cover the non-ASCII case for the char_length() function.
Are there any user-facing changes?
No