Consider using SimdUnicode.UTF8.GetPointerToFirstInvalidByte for UTF-8 validation #103781

Open
lemire opened this issue Jun 20, 2024 · 2 comments

lemire commented Jun 20, 2024

The runtime has a great and fast function for UTF-8 validation: Utf8Utility.GetPointerToFirstInvalidByte. But we might be able to do better.

We implemented the 'lookup' UTF-8 validation algorithm in C#.

The algorithm is used by Oracle GraalVM and by the Node.js and Bun JavaScript runtimes. For example, Node.js can validate Arabic or Chinese strings at 17 GB/s on a 2 GHz Intel server (from JavaScript).

We adapted it so that we can match exactly the functionality of Utf8Utility.GetPointerToFirstInvalidByte with a function called SimdUnicode.UTF8.GetPointerToFirstInvalidByte. It is available on GitHub at simdutf/SimdUnicode. We have good tests, and decent benchmarks. We use .NET's excellent runtime dispatching functionality to select the best function (SSE4.2, AVX2, AVX-512, fallback, NEON). We used @EgorBo's Disasmo to help tune the code, although we make no claim that it is optimal (it probably is not).
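To make the contract concrete, here is a minimal scalar sketch of what "first invalid byte" means. It is written in Python purely as a language-neutral illustration; it is not the SimdUnicode code and not the runtime's implementation. The function name `first_invalid_offset` is hypothetical, and the assumption (as we understand the behaviour of both functions) is that the result points to the start of the first ill-formed sequence, or to the end of the buffer when the input is valid.

```python
def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first ill-formed UTF-8 sequence,
    or len(data) if the whole input is valid (hypothetical helper)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:  # ASCII: one byte
            i += 1
            continue
        # Sequence length and the allowed range of the second byte.
        if 0xC2 <= b0 <= 0xDF:
            length, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xE0:
            length, lo, hi = 3, 0xA0, 0xBF  # rejects overlong 3-byte forms
        elif b0 == 0xED:
            length, lo, hi = 3, 0x80, 0x9F  # rejects UTF-16 surrogates
        elif 0xE1 <= b0 <= 0xEF:
            length, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF0:
            length, lo, hi = 4, 0x90, 0xBF  # rejects overlong 4-byte forms
        elif 0xF1 <= b0 <= 0xF3:
            length, lo, hi = 4, 0x80, 0xBF
        elif b0 == 0xF4:
            length, lo, hi = 4, 0x80, 0x8F  # rejects values above U+10FFFF
        else:
            return i  # stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF
        if i + length > n:
            return i  # truncated sequence at the end of the input
        if not (lo <= data[i + 1] <= hi):
            return i
        for k in range(2, length):
            if not (0x80 <= data[i + k] <= 0xBF):
                return i
        i += length
    return n


print(first_invalid_offset(b"hello"))    # 5: valid, so the offset is the length
print(first_invalid_offset(b"ab\xff"))   # 2: 0xFF can never appear in UTF-8
```

A SIMD implementation has to report the same offset as a byte-at-a-time check like this one, which is what makes a drop-in replacement possible.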

Intel Ice Lake results:

| data set | SimdUnicode AVX-512 (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 29 | 12 | 2.4 x |
| Arabic-Lipsum | 12 | 2.3 | 5.2 x |
| Chinese-Lipsum | 12 | 3.9 | 3.0 x |
| Emoji-Lipsum | 12 | 0.9 | 13 x |
| Hebrew-Lipsum | 12 | 2.3 | 5.2 x |
| Hindi-Lipsum | 12 | 2.1 | 5.7 x |
| Japanese-Lipsum | 10 | 3.5 | 2.9 x |
| Korean-Lipsum | 10 | 1.3 | 7.7 x |
| Latin-Lipsum | 76 | 76 | --- |
| Russian-Lipsum | 12 | 1.2 | 10 x |

Twitter.json
 SimdUnicode ▏   29 GB/s █████████████████████████
.NET Runtime ▏   12 GB/s ██████████▎

Arabic-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  2.3 GB/s ████▊

Chinese-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  3.9 GB/s ████████▏

Emoji-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  0.9 GB/s █▉

Japanese-Lipsum
 SimdUnicode ▏   10 GB/s █████████████████████████
.NET Runtime ▏  3.5 GB/s ████████▊
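For readers checking the numbers: the speed-up column is simply the SimdUnicode throughput divided by the .NET throughput, shown to two significant digits. The short Python sketch below recomputes a few Ice Lake rows from the figures quoted above; the helper name `speedup` is ours, and because the inputs are themselves rounded, recomputed ratios can differ slightly from ratios taken on unrounded measurements.

```python
def speedup(simd_gbps: float, baseline_gbps: float) -> float:
    """Ratio of the two throughputs (hypothetical helper)."""
    return simd_gbps / baseline_gbps


# (SimdUnicode GB/s, .NET GB/s) taken from the Ice Lake results above.
ice_lake = {
    "Twitter.json": (29, 12),
    "Arabic-Lipsum": (12, 2.3),
    "Emoji-Lipsum": (12, 0.9),
}

for name, (simd, dotnet) in ice_lake.items():
    # ".2g" keeps two significant digits, matching the reported column.
    print(f"{name}: {speedup(simd, dotnet):.2g} x")
```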

Apple M2 results:

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 25 | 14 | 1.8 x |
| Arabic-Lipsum | 7.4 | 3.5 | 2.1 x |
| Chinese-Lipsum | 7.4 | 4.8 | 1.5 x |
| Emoji-Lipsum | 7.4 | 2.5 | 3.0 x |
| Hebrew-Lipsum | 7.4 | 3.5 | 2.1 x |
| Hindi-Lipsum | 7.3 | 3.0 | 2.4 x |
| Japanese-Lipsum | 7.3 | 4.6 | 1.6 x |
| Korean-Lipsum | 7.4 | 1.8 | 4.1 x |
| Latin-Lipsum | 87 | 38 | 2.3 x |
| Russian-Lipsum | 7.4 | 2.7 | 2.7 x |

On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over five times
faster than the standard library.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 14 | 8.7 | 1.4 x |
| Arabic-Lipsum | 4.2 | 2.0 | 2.1 x |
| Chinese-Lipsum | 4.2 | 2.6 | 1.6 x |
| Emoji-Lipsum | 4.2 | 0.8 | 5.3 x |
| Hebrew-Lipsum | 4.2 | 2.0 | 2.1 x |
| Hindi-Lipsum | 4.2 | 1.6 | 2.6 x |
| Japanese-Lipsum | 4.2 | 2.4 | 1.8 x |
| Korean-Lipsum | 4.2 | 1.3 | 3.2 x |
| Latin-Lipsum | 42 | 17 | 2.5 x |
| Russian-Lipsum | 4.2 | 0.95 | 4.4 x |

On a Qualcomm 8cx gen3 (Windows Dev Kit 2023), we get roughly the same relative performance
boost as the Neoverse V1.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 17 | 10 | 1.7 x |
| Arabic-Lipsum | 5.0 | 2.3 | 2.2 x |
| Chinese-Lipsum | 5.0 | 2.9 | 1.7 x |
| Emoji-Lipsum | 5.0 | 0.9 | 5.5 x |
| Hebrew-Lipsum | 5.0 | 2.3 | 2.2 x |
| Hindi-Lipsum | 5.0 | 1.9 | 2.6 x |
| Japanese-Lipsum | 5.0 | 2.7 | 1.9 x |
| Korean-Lipsum | 5.0 | 1.5 | 3.3 x |
| Latin-Lipsum | 50 | 20 | 2.5 x |
| Russian-Lipsum | 5.0 | 1.2 | 5.2 x |

On a Neoverse N1 (Graviton 2), our validation function is up to over three times
faster than the standard library.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 7.8 | 5.7 | 1.4 x |
| Arabic-Lipsum | 2.5 | 0.9 | 2.8 x |
| Chinese-Lipsum | 2.5 | 1.8 | 1.4 x |
| Emoji-Lipsum | 2.5 | 0.7 | 3.6 x |
| Hebrew-Lipsum | 2.5 | 0.9 | 2.7 x |
| Hindi-Lipsum | 2.3 | 1.0 | 2.3 x |
| Japanese-Lipsum | 2.4 | 1.7 | 1.4 x |
| Korean-Lipsum | 2.5 | 1.0 | 2.5 x |
| Latin-Lipsum | 23 | 13 | 1.8 x |
| Russian-Lipsum | 2.3 | 0.7 | 3.3 x |

Importantly, there is no patent involved, and no licensing issue. We are eager for reviews, feedback and so forth.

Note that we have other fast Unicode algorithms that could be implemented in C#, including fast transcoding functions. UTF-8 validation is simply the simplest non-trivial case.

This is joint work with @Nick-Nuon.

Further reading: Validating gigabytes of Unicode strings per second… in C#? (blog post)

dotnet-issue-labeler (bot) added the needs-area-label label on Jun 20, 2024.
dotnet-policy-service (bot) added the untriaged label on Jun 20, 2024.
GrabYourPitchforks added the area-System.Text.Encoding label and removed the needs-area-label label on Jun 20, 2024.

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

tarekgh added this to the Future milestone on Jun 20, 2024.
tarekgh removed the untriaged label on Jun 20, 2024.
lemire commented Sep 21, 2024

The SimdUnicode C# code has been ported to Mojo (Chris Lattner's new programming language), and it is now part of the Mojo standard library: https://github.com/modularml/mojo/blob/nightly/stdlib/src/utils/_utf8_validation.mojo
