Faster `character_length()` string function for ASCII-only case #12356

2010YOUY01 · 2024-09-06T12:12:12Z

Which issue does this PR close?

Rationale for this change

See the issue.

This PR shows what the change looks like, so we can decide whether to continue with other string functions such as lower().
The speedup for char_length() is not big. I guess that is because counting characters is already relatively CPU friendly, but according to Velox's benchmarks we can expect a larger speedup for string functions like lower()/upper().

What changes are included in this PR?

Inside character_length(), first check whether the current string array is ASCII-only; if so, count bytes directly instead of decoding characters one by one.
Micro benchmark (the first commit is bench only)
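
For readers who want the idea without opening the diff, here is a minimal std-only sketch of the fast path described above. It works on a plain slice of `&str` rather than Arrow arrays, and the function name `character_lengths` is made up for illustration; it is not the code in this PR.

```rust
/// Count the characters of every string in a batch.
/// Illustrative only: operates on a slice instead of an Arrow string array.
fn character_lengths(strings: &[&str]) -> Vec<usize> {
    // One cheap pass over the raw bytes: ASCII bytes are always below 0x80.
    let is_batch_ascii_only = strings.iter().all(|s| s.is_ascii());

    strings
        .iter()
        .map(|s| {
            if is_batch_ascii_only {
                // In ASCII every character is exactly one byte,
                // so the byte length equals the character count.
                s.len()
            } else {
                // General case: decode UTF-8 and count scalar values.
                s.chars().count()
            }
        })
        .collect()
}
```

The ASCII check is done once per batch rather than once per string, so the per-row closure stays branch-predictable. A mixed batch such as `["abc", "数据融合"]` falls back to decoding for every row.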

Result

group                                                  after                                  before
-----                                                  -----                                  ------
character_length_StringArray_ascii_str_len_128         1.00    123.9±7.62µs        ? ?/sec    1.42    175.8±7.33µs        ? ?/sec
character_length_StringArray_ascii_str_len_32          1.00    81.8±12.05µs        ? ?/sec    1.71    139.9±2.66µs        ? ?/sec
character_length_StringArray_ascii_str_len_4096        1.00      3.1±0.03ms        ? ?/sec    1.02      3.1±0.19ms        ? ?/sec
character_length_StringArray_ascii_str_len_8           1.00     66.0±3.38µs        ? ?/sec    1.35     89.0±2.00µs        ? ?/sec
character_length_StringArray_utf8_str_len_128          1.00   313.5±24.58µs        ? ?/sec    1.03   323.0±10.86µs        ? ?/sec
character_length_StringArray_utf8_str_len_32           1.00    236.5±3.47µs        ? ?/sec    1.04    245.8±3.84µs        ? ?/sec
character_length_StringArray_utf8_str_len_4096         1.00      6.0±0.78ms        ? ?/sec    1.03      6.2±1.23ms        ? ?/sec
character_length_StringArray_utf8_str_len_8            1.00    162.5±6.09µs        ? ?/sec    1.03   167.4±13.90µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_128     1.00   143.4±28.39µs        ? ?/sec    1.31   187.4±52.47µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_32      1.00    104.1±6.05µs        ? ?/sec    1.35   140.0±31.13µs        ? ?/sec
character_length_StringViewArray_ascii_str_len_4096    1.00      2.6±0.04ms        ? ?/sec    1.22      3.1±0.27ms        ? ?/sec
character_length_StringViewArray_ascii_str_len_8       1.02     92.9±0.57µs        ? ?/sec    1.00    91.2±22.02µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_128      1.01   323.8±17.53µs        ? ?/sec    1.00    319.0±4.83µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_32       1.02   248.8±29.66µs        ? ?/sec    1.00   244.0±19.25µs        ? ?/sec
character_length_StringViewArray_utf8_str_len_4096     1.00      5.9±0.42ms        ? ?/sec    1.03      6.1±0.43ms        ? ?/sec
character_length_StringViewArray_utf8_str_len_8        1.01    171.9±6.96µs        ? ?/sec    1.00   169.5±10.64µs        ? ?/sec

Are these changes tested?

The existing sqllogictest suite already covers char_length() for the non-ASCII case.

Are there any user-facing changes?

No

comphead · 2024-09-06T16:05:12Z

datafusion/functions/src/unicode/character_length.rs

+    // String characters are variable length encoded in UTF-8, counting the
+    // number of chars requires expensive decoding, however checking if the
+    // string is ASCII only is relatively cheap.
+    // If strings are ASCII only, count bytes instead.


👍
That sounds simple and efficient

comphead · 2024-09-06T16:10:11Z

datafusion/functions/src/unicode/character_length.rs

-                T::Native::from_usize(string.chars().count())
-                    .expect("should not fail as string.chars will always return integer")
+                if is_array_ascii_only {
+                    T::Native::from_usize(string.len()).expect(


I wonder whether we should use usize_as to avoid handling the Option, if we are sure the conversion always succeeds.

I don't have any strong preference here -- this seems ok to me, but keeping the code cleaner with usize_as also seems fine.

Yes, I think conversion can't fail here
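
To make the trade-off in this thread concrete, here is a small std-only sketch of the two conversion styles. The trait `OffsetLike` is a hypothetical stand-in that only mirrors the shape of the `from_usize` / `usize_as` pair discussed above; it is not arrow's actual trait, and the helpers `len_checked` and `len_unchecked` are made up for illustration.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a native integer type with two ways
/// of converting from `usize`.
trait OffsetLike: Sized {
    /// Checked conversion: `None` if the value does not fit.
    fn from_usize(v: usize) -> Option<Self>;
    /// Unchecked conversion using `as` (truncates on overflow).
    fn usize_as(v: usize) -> Self;
}

impl OffsetLike for i32 {
    fn from_usize(v: usize) -> Option<Self> {
        i32::try_from(v).ok()
    }
    fn usize_as(v: usize) -> Self {
        v as i32
    }
}

impl OffsetLike for i64 {
    fn from_usize(v: usize) -> Option<Self> {
        i64::try_from(v).ok()
    }
    fn usize_as(v: usize) -> Self {
        v as i64
    }
}

/// Style before the review: checked conversion plus `expect`.
fn len_checked<T: OffsetLike>(s: &str) -> T {
    T::from_usize(s.len()).expect("string length fits in the offset type")
}

/// Style suggested in the review: direct cast, no `Option` to unwrap.
fn len_unchecked<T: OffsetLike>(s: &str) -> T {
    T::usize_as(s.len())
}
```

Both helpers return the same value whenever the length fits in the target type. The cast version is shorter, but it would silently truncate if that assumption were ever violated, which is why the thread first checks that the conversion cannot fail.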

alamb

This PR wants to show how the change will look like and decide if we should continue with other string functions like lower()

I think it is a good idea. Thank you @2010YOUY01 for working on this

Though speed up for char_length() is not big,

Well, I suppose it depends on what you mean by "big" 😆 -- I think this benchmark shows this PR is 40% faster for some strings, right?

character_length_StringArray_ascii_str_len_128 1.00 123.9±7.62µs ? ?/sec 1.42 175.8±7.33µs ? ?/sec

I guess it's because counting character's implementation is also relatively more CPU friendly, but according to Velox's benchmarks we can expect more speed up for string functions like lower()/upper()

I think something else that Velox / Photon likely do is to mark which batches are ASCII-only on creation and then propagate that flag through

The current implementation of arrow seems to actually examine all the bytes in the array:

https://docs.rs/arrow-array/53.0.0/src/arrow_array/array/byte_array.rs.html#262

I will leave some additional comments on #12356 as well
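
As a rough illustration of the "mark on creation, then propagate" idea mentioned in this comment, here is a std-only sketch. The `StringBatch` type and its methods are hypothetical; nothing here reflects how Velox, Photon, or arrow actually store such a flag.

```rust
/// Hypothetical batch of strings that remembers whether it is ASCII-only.
struct StringBatch {
    values: Vec<String>,
    /// True only if every value is known to be ASCII.
    is_ascii: bool,
}

impl StringBatch {
    /// Scan the bytes once, at creation time, and cache the result.
    fn new(values: Vec<String>) -> Self {
        let is_ascii = values.iter().all(|s| s.is_ascii());
        StringBatch { values, is_ascii }
    }

    /// Uppercase every value, reusing the cached flag instead of rescanning.
    fn to_uppercase(&self) -> StringBatch {
        let values = self
            .values
            .iter()
            .map(|s| {
                if self.is_ascii {
                    s.to_ascii_uppercase()
                } else {
                    s.to_uppercase()
                }
            })
            .collect();
        // ASCII input uppercases to ASCII output, so `true` carries over.
        // For non-ASCII input the flag stays `false`, which is conservative:
        // it means "not known to be ASCII", not "definitely non-ASCII".
        StringBatch { values, is_ascii: self.is_ascii }
    }

    /// Character counts, using the cached flag to pick the fast path.
    fn character_lengths(&self) -> Vec<usize> {
        self.values
            .iter()
            .map(|s| if self.is_ascii { s.len() } else { s.chars().count() })
            .collect()
    }
}
```

With a cached flag, a chain such as uppercase followed by character_length would scan for non-ASCII bytes only once, at batch creation, instead of once per function call.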

alamb · 2024-09-06T20:23:18Z

datafusion/functions/src/unicode/character_length.rs

-                T::Native::from_usize(string.chars().count())
-                    .expect("should not fail as string.chars will always return integer")
+                if is_array_ascii_only {
+                    T::Native::from_usize(string.len()).expect(


I don't have any strong preference here -- this seems ok to me, but keeping the code cleaner with usize_as also seems fine.

alamb · 2024-09-06T20:24:39Z

datafusion/functions/benches/character_length.rs

+    let mut rng = StdRng::seed_from_u64(42);
+    let rng_ref = &mut rng;
+
+    let corpus = "DataFusionДатаФусион数据融合📊🔥"; // includes utf8 encoding with 1~4 bytes


ДатаФусион )))
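
The "1~4 bytes" remark on the benchmark corpus can be checked with a short std-only sketch. The helper name `utf8_widths` is made up for illustration and is not part of the benchmark.

```rust
/// Return the UTF-8 encoded width, in bytes, of each character in `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}
```

Calling `utf8_widths("DД数📊")` on one character from each script in the corpus yields widths 1, 2, 3, and 4. This is why the byte length and the character count differ for the non-ASCII benchmark inputs.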

2010YOUY01 added 2 commits September 6, 2024 19:22

charcter_length() benchmark

c29603b

char_length() ascii fast path

eae8d7e

github-actions bot added the functions label Sep 6, 2024

comphead reviewed Sep 6, 2024

alamb approved these changes Sep 6, 2024

alamb mentioned this pull request Sep 6, 2024

ASCII fast path for some String scalar functions #12306

Open

6 tasks

use usize_as

3e75f48

This was referenced Sep 7, 2024

Optimize lower()/upper() string function with ASCII fast path #12365

Open

Optimize strpos() string function with ASCII fast path #12366

Closed

Optimize substr() string function with ASCII fast path #12367

Closed

Dandandan approved these changes Sep 7, 2024

Dandandan merged commit 82fb5b9 into apache:main Sep 7, 2024
26 checks passed

alamb mentioned this pull request Sep 11, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 9, 2024 #12391

Closed

5 tasks

2010YOUY01 mentioned this pull request Sep 12, 2024

Optimize reverse() string function with ASCII fast path #12445

Open
