feat(c++): Support the UTF-8 to UTF-16 with SIMD #1990

pandalee99 · 2024-12-25T14:24:26Z

What does this PR do?

This PR adds support for UTF-8 to UTF-16 conversion, with the aim of accelerating it using SIMD.

std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian)

The logic of converting UTF-8 to UTF-16 isn't that complicated, but there are still many optimizations that I haven't come up with yet.

So, I'll first design a version that's a bit faster than the original one, and then think about how to make further optimizations.
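For readers who want the conversion logic spelled out, here is a minimal scalar sketch. It is not the code from this PR: the name `utf8ToUtf16Sketch` is hypothetical, it emits native-endian code units (so it ignores the `is_little_endian` handling), and it assumes mostly well-formed input. It checks lead and continuation bytes but does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not the PR's function: decodes UTF-8 into native-endian
// UTF-16 code units, one code point at a time.
std::u16string utf8ToUtf16Sketch(const std::string &utf8) {
  std::u16string out;
  out.reserve(utf8.size());  // UTF-16 never needs more units than UTF-8 bytes
  const size_t n = utf8.size();
  size_t i = 0;
  while (i < n) {
    const uint8_t lead = static_cast<uint8_t>(utf8[i]);
    uint32_t cp = 0;
    size_t len = 0;
    // The lead byte determines the sequence length (1 to 4 bytes).
    if (lead < 0x80) {
      cp = lead;
      len = 1;
    } else if ((lead & 0xE0) == 0xC0) {
      cp = lead & 0x1F;
      len = 2;
    } else if ((lead & 0xF0) == 0xE0) {
      cp = lead & 0x0F;
      len = 3;
    } else if ((lead & 0xF8) == 0xF0) {
      cp = lead & 0x07;
      len = 4;
    } else {
      throw std::invalid_argument("invalid UTF-8 lead byte");
    }
    if (i + len > n) {
      throw std::invalid_argument("truncated UTF-8 sequence");
    }
    // Each continuation byte (10xxxxxx) contributes 6 payload bits.
    for (size_t k = 1; k < len; ++k) {
      const uint8_t c = static_cast<uint8_t>(utf8[i + k]);
      if ((c & 0xC0) != 0x80) {
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      }
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp > 0x10FFFF) {
      throw std::invalid_argument("code point out of Unicode range");
    }
    i += len;
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Code points above U+FFFF become a high/low surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}
```

For example, the four bytes `F0 9F 98 80` (U+1F600) decode to the surrogate pair `0xD83D 0xDE00`, while ASCII bytes map one-to-one to code units.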

Judging from the tests, the logic is correct:

[----------] 9 tests from UTF8ToUTF16Test
[ RUN      ] UTF8ToUTF16Test.BasicConversion
[       OK ] UTF8ToUTF16Test.BasicConversion (0 ms)
[ RUN      ] UTF8ToUTF16Test.EmptyString
[       OK ] UTF8ToUTF16Test.EmptyString (0 ms)
[ RUN      ] UTF8ToUTF16Test.SurrogatePairs
[       OK ] UTF8ToUTF16Test.SurrogatePairs (0 ms)
[ RUN      ] UTF8ToUTF16Test.BoundaryValues
[       OK ] UTF8ToUTF16Test.BoundaryValues (0 ms)
[ RUN      ] UTF8ToUTF16Test.SpecialCharacters
[       OK ] UTF8ToUTF16Test.SpecialCharacters (0 ms)
[ RUN      ] UTF8ToUTF16Test.LittleEndian
[       OK ] UTF8ToUTF16Test.LittleEndian (0 ms)
[ RUN      ] UTF8ToUTF16Test.BigEndian
[       OK ] UTF8ToUTF16Test.BigEndian (0 ms)
[ RUN      ] UTF8ToUTF16Test.RoundTripConversion
[       OK ] UTF8ToUTF16Test.RoundTripConversion (0 ms)

In terms of performance, execution is significantly faster than with serial processing.

Actually, this code doesn't use SIMD instruction sets such as AVX2, nor does it really apply SIMD processing. The main reason is that UTF-8 is a variable-length encoding: a character occupies 1 to 4 bytes, so each byte has to be inspected to find where a character starts and how long it is. Without a uniform length, it is hard to process the bytes in parallel directly, and the conversion is difficult to break down into structures that map straight onto SIMD operations.

There are also some code style changes to make the writing uniform.
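Although variable-length sequences resist direct vectorization, runs of pure ASCII do have a uniform length, and a common pattern in transcoders is to detect such runs several bytes at a time. The sketch below illustrates that idea with plain 64-bit words rather than SIMD intrinsics. It is an assumption about one possible fast path, not a description of what this PR implements, and the name `copyAsciiPrefix` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: appends the ASCII run of `utf8` that starts at `pos`
// to `out`, and returns the index of the first byte that was not consumed.
size_t copyAsciiPrefix(const std::string &utf8, size_t pos, std::u16string &out) {
  const size_t n = utf8.size();
  // Test 8 bytes at once: a byte is ASCII exactly when its high bit is clear.
  while (pos + 8 <= n) {
    uint64_t chunk;
    std::memcpy(&chunk, utf8.data() + pos, sizeof(chunk));
    if (chunk & 0x8080808080808080ULL) {
      break;  // at least one non-ASCII byte in this chunk
    }
    for (size_t k = 0; k < 8; ++k) {
      // ASCII bytes widen directly to UTF-16 code units.
      out.push_back(static_cast<char16_t>(static_cast<unsigned char>(utf8[pos + k])));
    }
    pos += 8;
  }
  // Finish byte by byte up to the first non-ASCII byte or the end of input.
  while (pos < n && static_cast<unsigned char>(utf8[pos]) < 0x80) {
    out.push_back(static_cast<char16_t>(static_cast<unsigned char>(utf8[pos])));
    ++pos;
  }
  return pos;
}
```

A converter could call such a helper first and fall back to byte-by-byte decoding from the returned index whenever it reaches a multi-byte sequence.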

Related issues

Close #1964

Does this PR introduce any user-facing change?

Does this PR introduce any public API change?
Does this PR introduce any binary protocol compatibility change?

Benchmark

pandalee99 added 3 commits December 24, 2024 20:03

make utf8 to utf16

4b4eaf0

code style

38baad7

fix

d63097a

pandalee99 requested review from chaokunyang and PragmaTwice as code owners December 25, 2024 14:24

pandalee99 added 3 commits December 25, 2024 23:06

fix bug

672868e

clang-format

cd2df8d

fix

c60dff8
