feat(c++): Support the UTF-8 to UTF-16 with SIMD #1990
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What does this PR do?
To support the utf8 utf16 and using simd to accelerate the optimization
The logic of converting UTF-8 to UTF-16 isn't that complicated. but there are still lots of optimizations that I haven't come up with yet.
So, I'll first design a version that's a bit faster than the original one, and then think about how to make further optimizations.
Judging from the tests, the logic is correct:
And from the performance perspective, it's improved compared to serial processing:
The speed of execution has been significantly improved
Actually, this code doesn't use libraries like AVX2 or really apply SIMD to process. The main reason is that the structure of UTF-8 encoding is complex and not fixed. It involves multi-byte encoding, and we need to analyze it byte by byte when dealing with different bytes. So, without clear rules and a uniform length, it becomes really hard to directly parallelize the processing of each byte. During the process of converting UTF-8 to UTF-16, we have to handle characters of different lengths, ranging from 1 to 4 bytes, which makes it difficult to break it down into structures that can be directly applied to SIMD operations.
There are also some code style changes, uniform writing
Related issues
Close #1964
Does this PR introduce any user-facing change?
Benchmark