Faster character_length() string function for ASCII-only case #12356

Merged (3 commits) on Sep 7, 2024

2010YOUY01
Contributor

Which issue does this PR close?

Part of #12306

Rationale for this change

See the issue.

This PR shows what the change looks like so we can decide whether to continue with other string functions like lower().
The speedup for char_length() is not big. I guess that's because counting characters is already relatively CPU-friendly, but according to Velox's benchmarks we can expect a larger speedup for string functions like lower()/upper().

What changes are included in this PR?

  1. Inside character_length(), first check whether the current string array is ASCII-only; if so, count bytes directly instead of decoding characters one by one.
  2. A micro-benchmark (the first commit is benchmark-only).
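
A minimal sketch of item 1, assuming plain `&str` slices instead of Arrow string arrays. The function name `character_lengths` and the slice-based signature are illustrative only and are not the code in this PR.

```rust
/// Character count for every string in a batch, with an ASCII fast path.
/// Illustrative only: the PR itself works on Arrow string arrays.
fn character_lengths(batch: &[&str]) -> Vec<usize> {
    // One cheap pass over the raw bytes decides the strategy for the whole batch.
    let is_batch_ascii = batch.iter().all(|s| s.is_ascii());

    batch
        .iter()
        .map(|s| {
            if is_batch_ascii {
                // In ASCII every character is exactly one byte.
                s.len()
            } else {
                // General case: decode UTF-8 and count code points.
                s.chars().count()
            }
        })
        .collect()
}
```

The check is made once per batch and not once per string, so the per-row loop stays branch-predictable.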

Result

group                                                  after                                  before
-----                                                  -----                                  ------
character_length_StringArray_ascii_str_len_128         1.00    123.9±7.62µs        ? ?/sec    1.42    175.8±7.33µs        ? ?/sec
character_length_StringArray_ascii_str_len_32          1.00    81.8±12.05µs        ? ?/sec    1.71    139.9±2.66µs        ? ?/sec
character_length_StringArray_ascii_str_len_4096        1.00      3.1±0.03ms        ? ?/sec    1.02      3.1±0.19ms        ? ?/sec
character_length_StringArray_ascii_str_len_8           1.00     66.0±3.38µs        ? ?/sec    1.35     89.0±2.00µs        ? ?/sec
character_length_StringArray_utf8_str_len_128          1.00   313.5±24.58µs        ? ?/sec    1.03   323.0±10.86µs        ? ?/sec
character_length_StringArray_utf8_str_len_32           1.00    236.5±3.47µs        ? ?/sec    1.04    245.8±3.84µs        ? ?/sec
character_length_StringArray_utf8_str_len_4096         1.00      6.0±0.78ms        ? ?/sec    1.03      6.2±1.23ms        ? ?/sec
character_length_StringArray_utf8_str_len_8            1.00    162.5±6.09µs        ? ?/sec    1.03   167.4±13.90µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_128     1.00   143.4±28.39µs        ? ?/sec    1.31   187.4±52.47µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_32      1.00    104.1±6.05µs        ? ?/sec    1.35   140.0±31.13µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_4096    1.00      2.6±0.04ms        ? ?/sec    1.22      3.1±0.27ms        ? ?/sec
character_length_StringViewArray_ascii_str_len_8       1.02     92.9±0.57µs        ? ?/sec    1.00    91.2±22.02µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_128      1.01   323.8±17.53µs        ? ?/sec    1.00    319.0±4.83µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_32       1.02   248.8±29.66µs        ? ?/sec    1.00   244.0±19.25µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_4096     1.00      5.9±0.42ms        ? ?/sec    1.03      6.1±0.43ms        ? ?/sec
character_length_StringViewArray_utf8_str_len_8        1.01    171.9±6.96µs        ? ?/sec    1.00   169.5±10.64µs        ? ?/sec

Are these changes tested?

The existing sqllogictest suite already covers the char_length() function on non-ASCII input.

Are there any user-facing changes?

No

// String characters are variable length encoded in UTF-8, counting the
// number of chars requires expensive decoding, however checking if the
// string is ASCII only is relatively cheap.
// If strings are ASCII only, count bytes instead.
Contributor

👍
That sounds simple and efficient

- T::Native::from_usize(string.chars().count())
-     .expect("should not fail as string.chars will always return integer")
+ if is_array_ascii_only {
+     T::Native::from_usize(string.len()).expect(
Contributor

I wonder whether we should use usize_as to avoid dealing with the Option, given that we are sure the value is always a valid usize.

Contributor

I don't have any strong preference here -- this seems ok to me, but keeping the code cleaner with usize_as also seems fine.

Contributor Author

Yes, I think conversion can't fail here
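
The trade-off in this thread can be sketched as follows. The trait `LengthNative` is a hypothetical stand-in defined only for this example; its two methods are named after `from_usize` and `usize_as` from the discussion, and the helper names `checked_len` and `cast_len` are also made up.

```rust
/// Hypothetical stand-in for the native integer type of the result array.
trait LengthNative: Sized {
    /// Checked conversion: `None` if `n` does not fit in `Self`.
    fn from_usize(n: usize) -> Option<Self>;
    /// Unchecked conversion: a plain `as` cast that truncates on overflow.
    fn usize_as(n: usize) -> Self;
}

impl LengthNative for i32 {
    fn from_usize(n: usize) -> Option<Self> {
        i32::try_from(n).ok()
    }
    fn usize_as(n: usize) -> Self {
        n as i32
    }
}

/// Checked style: the caller has to unwrap an Option.
fn checked_len<T: LengthNative>(s: &str) -> T {
    T::from_usize(s.len()).expect("string length should fit in the result type")
}

/// Cast style: no Option to handle, but an overflow would go unnoticed.
fn cast_len<T: LengthNative>(s: &str) -> T {
    T::usize_as(s.len())
}
```

Both styles return the same value whenever the length fits in the target type; they differ only in how an out-of-range length would surface.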

Contributor

@alamb left a comment

> This PR shows what the change looks like so we can decide whether to continue with other string functions like lower().

I think it is a good idea. Thank you @2010YOUY01 for working on this

> The speedup for char_length() is not big.

Well, I suppose it depends on what you mean by "big" 😆 -- I think this benchmark shows this PR is 40% faster for some strings, right?

character_length_StringArray_ascii_str_len_128 1.00 123.9±7.62µs ? ?/sec 1.42 175.8±7.33µs ? ?/sec
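
For reference, the arithmetic behind that row can be written out as a small sketch. The helper names `speedup` and `time_saved` are made up for this example; the inputs are the before/after means from the row above.

```rust
/// How many times faster the new code is: before / after.
fn speedup(before_us: f64, after_us: f64) -> f64 {
    before_us / after_us
}

/// Fraction of the original running time that is saved: 1 - after / before.
fn time_saved(before_us: f64, after_us: f64) -> f64 {
    1.0 - after_us / before_us
}
```

With 175.8µs before and 123.9µs after, `speedup` gives about 1.42 (the factor printed in the table), while `time_saved` gives about 0.30, so "40% faster" here means 1.42x the throughput, not 40% less time.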

> I guess that's because counting characters is already relatively CPU-friendly, but according to Velox's benchmarks we can expect a larger speedup for string functions like lower()/upper().

I think something else that Velox / Photon likely do is mark which batches are ASCII-only on creation and then propagate that flag through.

The current implementation of arrow seems to actually examine all the bytes in the array:

https://docs.rs/arrow-array/53.0.0/src/arrow_array/array/byte_array.rs.html#262
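
A rough sketch of the "mark on creation, then propagate" idea. The `StringBatch` type and its methods are hypothetical; this is not how arrow, Velox, or Photon are implemented.

```rust
/// Hypothetical string batch that records ASCII-ness once, at construction.
struct StringBatch {
    values: Vec<String>,
    /// `true` means every value is known to be ASCII; `false` means unknown or mixed.
    known_ascii: bool,
}

impl StringBatch {
    /// Scans the bytes once when the batch is built.
    fn new(values: Vec<String>) -> Self {
        let known_ascii = values.iter().all(|s| s.is_ascii());
        Self { values, known_ascii }
    }

    /// Uppercases the batch and carries the flag along where that is safe.
    fn upper(&self) -> Self {
        if self.known_ascii {
            // ASCII in, ASCII out: byte-wise uppercase, no rescan needed.
            Self {
                values: self.values.iter().map(|s| s.to_ascii_uppercase()).collect(),
                known_ascii: true,
            }
        } else {
            // Unicode-aware path: the output is rescanned by `new`.
            Self::new(self.values.iter().map(|s| s.to_uppercase()).collect())
        }
    }

    /// Character counts that reuse the cached flag instead of re-checking the bytes.
    fn char_lengths(&self) -> Vec<usize> {
        self.values
            .iter()
            .map(|s| if self.known_ascii { s.len() } else { s.chars().count() })
            .collect()
    }
}
```

In this sketch, `upper()` followed by `char_lengths()` on an ASCII batch scans the bytes for ASCII-ness only once, at construction.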

I will leave some additional comments on #12356 as well

let mut rng = StdRng::seed_from_u64(42);
let rng_ref = &mut rng;

let corpus = "DataFusionДатаФусион数据融合📊🔥"; // includes utf8 encoding with 1~4 bytes
Contributor

💯

Contributor

ДатаФусион )))
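
The "1~4 bytes" claim about the benchmark corpus quoted above can be checked with a small sketch. The helper name `utf8_widths` is made up for this example.

```rust
use std::collections::BTreeSet;

/// Distinct UTF-8 encoded widths (in bytes) of the characters in `s`.
fn utf8_widths(s: &str) -> BTreeSet<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For the corpus string this returns the set {1, 2, 3, 4}: Latin letters take 1 byte, Cyrillic 2, the CJK characters 3, and the emoji 4.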

@Dandandan Dandandan merged commit 82fb5b9 into apache:main Sep 7, 2024
26 checks passed