feat(c++): Support the UTF-8 to UTF-16 with SIMD #1990

Open
wants to merge 6 commits into
base: main

pandalee99 (Contributor) commented Dec 25, 2024

What does this PR do?

Adds support for UTF-8 to UTF-16 conversion in C++, with the aim of using SIMD to accelerate it.

std::string utf16ToUtf8(const std::u16string &utf16, bool is_little_endian)

The logic of converting UTF-8 to UTF-16 isn't that complicated, but there are still many optimizations that I haven't worked out yet.

So, I'll first design a version that's a bit faster than the original one, and then think about how to make further optimizations.
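For readers who want the conversion logic spelled out, here is a minimal scalar sketch. It is not the code in this PR: the name `utf8ToUtf16Sketch` is hypothetical, the output is in native byte order (there is no `is_little_endian` parameter), and malformed input is replaced with U+FFFD. It does not reject overlong encodings or UTF-8-encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration, not the function added by this PR.
// Decodes UTF-8 into UTF-16 code units in native byte order.
std::u16string utf8ToUtf16Sketch(const std::string &utf8) {
  const char16_t kReplacement = static_cast<char16_t>(0xFFFD);
  std::u16string out;
  out.reserve(utf8.size()); // UTF-16 never needs more units than UTF-8 bytes
  const std::size_t n = utf8.size();
  std::size_t i = 0;
  while (i < n) {
    const auto b0 = static_cast<unsigned char>(utf8[i]);
    std::uint32_t cp = 0;
    std::size_t len = 0; // 0 means "invalid lead byte"
    if (b0 < 0x80) {
      cp = b0;
      len = 1;
    } else if ((b0 & 0xE0) == 0xC0) {
      cp = b0 & 0x1F;
      len = 2;
    } else if ((b0 & 0xF0) == 0xE0) {
      cp = b0 & 0x0F;
      len = 3;
    } else if ((b0 & 0xF8) == 0xF0) {
      cp = b0 & 0x07;
      len = 4;
    }
    bool ok = len != 0 && i + len <= n;
    // Each continuation byte must look like 10xxxxxx and adds 6 bits.
    for (std::size_t k = 1; ok && k < len; ++k) {
      const auto b = static_cast<unsigned char>(utf8[i + k]);
      if ((b & 0xC0) != 0x80) {
        ok = false;
      } else {
        cp = (cp << 6) | (b & 0x3F);
      }
    }
    if (ok && cp > 0x10FFFF) {
      ok = false; // beyond the Unicode range
    }
    if (!ok) {
      out.push_back(kReplacement);
      ++i; // resynchronize one byte at a time
      continue;
    }
    i += len;
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Supplementary plane: split into a high and a low surrogate.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}
```

Every iteration has to inspect the lead byte before it knows how many bytes to consume, which is what makes this loop hard to vectorize directly.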

Judging from the tests, the logic is correct:

[----------] 9 tests from UTF8ToUTF16Test
[ RUN      ] UTF8ToUTF16Test.BasicConversion
[       OK ] UTF8ToUTF16Test.BasicConversion (0 ms)
[ RUN      ] UTF8ToUTF16Test.EmptyString
[       OK ] UTF8ToUTF16Test.EmptyString (0 ms)
[ RUN      ] UTF8ToUTF16Test.SurrogatePairs
[       OK ] UTF8ToUTF16Test.SurrogatePairs (0 ms)
[ RUN      ] UTF8ToUTF16Test.BoundaryValues
[       OK ] UTF8ToUTF16Test.BoundaryValues (0 ms)
[ RUN      ] UTF8ToUTF16Test.SpecialCharacters
[       OK ] UTF8ToUTF16Test.SpecialCharacters (0 ms)
[ RUN      ] UTF8ToUTF16Test.LittleEndian
[       OK ] UTF8ToUTF16Test.LittleEndian (0 ms)
[ RUN      ] UTF8ToUTF16Test.BigEndian
[       OK ] UTF8ToUTF16Test.BigEndian (0 ms)
[ RUN      ] UTF8ToUTF16Test.RoundTripConversion
[       OK ] UTF8ToUTF16Test.RoundTripConversion (0 ms)

In terms of performance, execution speed improves significantly compared to serial processing:
[image: benchmark results]

Note that this code doesn't actually use SIMD instruction sets such as AVX2 to do the processing. The main reason is that UTF-8 is a variable-length encoding: each character takes 1 to 4 bytes, and the length is only known after inspecting its lead byte. Without a uniform element width, it is hard to split the input into independent lanes and process every byte in parallel, so the conversion still has to analyze the input byte by byte.
This PR also includes some code style changes for consistency.
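One common way around the variable-length problem is to handle only the uniform case in bulk: runs of ASCII bytes, where each byte maps to exactly one UTF-16 unit. The sketch below shows that idea with plain 64-bit words rather than AVX2 intrinsics. It is an illustration under that assumption, not what this PR implements, and the helper names `isAscii8` and `copyAsciiPrefix` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: true if all 8 bytes starting at p are ASCII,
// i.e. none of them has its high bit set.
inline bool isAscii8(const char *p) {
  std::uint64_t chunk;
  std::memcpy(&chunk, p, sizeof(chunk)); // safe unaligned load
  return (chunk & 0x8080808080808080ULL) == 0;
}

// Hypothetical helper: widens the leading ASCII run of `utf8` into `out`
// and returns the number of bytes consumed. The caller would continue
// with a scalar decoder from that offset.
std::size_t copyAsciiPrefix(const std::string &utf8, std::u16string &out) {
  const char *data = utf8.data();
  const std::size_t n = utf8.size();
  std::size_t i = 0;
  // Bulk path: test 8 bytes with one mask, then widen them.
  while (i + 8 <= n && isAscii8(data + i)) {
    for (std::size_t k = 0; k < 8; ++k) {
      out.push_back(static_cast<char16_t>(
          static_cast<unsigned char>(data[i + k])));
    }
    i += 8;
  }
  // Tail: finish the ASCII run one byte at a time.
  while (i < n && static_cast<unsigned char>(data[i]) < 0x80) {
    out.push_back(static_cast<char16_t>(static_cast<unsigned char>(data[i])));
    ++i;
  }
  return i;
}
```

The high-bit test does not depend on byte order, so the same mask works on little-endian and big-endian hosts. A real SIMD version would apply the same check to 16 or 32 bytes per step.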

Related issues

Close #1964

Does this PR introduce any user-facing change?

  • Does this PR introduce any public API change?
  • Does this PR introduce any binary protocol compatibility change?

Benchmark
