Consider using SimdUnicode.UTF8.GetPointerToFirstInvalidByte for UTF-8 validation #103781

lemire · 2024-06-20T19:55:47Z

The runtime has a great and fast function for UTF-8 validation: Utf8Utility.GetPointerToFirstInvalidByte. But we might be able to do better.

We implemented in C# the 'lookup' UTF-8 validation algorithm from

Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021

The algorithm is used by Oracle GraalVM and by the Node.js and Bun JavaScript runtimes. For example, Node.js is capable of validating Arabic or Chinese strings at 17 GB/s on a 2 GHz Intel server (from JavaScript).

We adapted it so that we can match exactly the functionality of Utf8Utility.GetPointerToFirstInvalidByte with a function called SimdUnicode.UTF8.GetPointerToFirstInvalidByte. It is available on GitHub at simdutf/SimdUnicode. We have good tests, and decent benchmarks. We use .NET's excellent runtime dispatching functionality to select the best function (SSE4.2, AVX2, AVX-512, fallback, NEON). We used @EgorBo's Disasmo to help tune the code, although we make no claim that it is optimal (it probably is not).
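For readers unfamiliar with the contract being matched, here is a minimal scalar sketch in Python of what "first invalid byte" means. It is not the SIMD lookup algorithm and not the SimdUnicode or .NET code; the function name `first_invalid_utf8_offset` is hypothetical, and it assumes the result should be the offset where the first ill-formed sequence starts (or the input length when everything is valid).

```python
def first_invalid_utf8_offset(data: bytes) -> int:
    """Return the offset where the first ill-formed UTF-8 sequence starts,
    or len(data) if the whole input is valid (illustrative reference only)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:               # ASCII: one byte
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range of the
        # first continuation byte; the narrowed ranges reject overlong forms,
        # surrogates (U+D800..U+DFFF) and code points above U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F
        else:
            return i                # stray continuation byte, C0/C1, or F5..FF
        if i + need >= n:
            return i                # sequence truncated by end of input
        if not (lo <= data[i + 1] <= hi):
            return i
        for k in range(2, need + 1):
            if not (0x80 <= data[i + k] <= 0xBF):
                return i
        i += need + 1
    return n


if __name__ == "__main__":
    print(first_invalid_utf8_offset(b"ab\xffcd"))  # → 2
```

A SIMD implementation has to produce the same answer as a byte-at-a-time checker like this one while processing 16 to 64 bytes per step, which is why an exact-match scalar reference is useful for testing.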

Intel Ice Lake results:

data set	SimdUnicode AVX-512 (GB/s)	.NET speed (GB/s)	speed up
Twitter.json	29	12	2.4 x
Arabic-Lipsum	12	2.3	5.2 x
Chinese-Lipsum	12	3.9	3.0 x
Emoji-Lipsum	12	0.9	13 x
Hebrew-Lipsum	12	2.3	5.2 x
Hindi-Lipsum	12	2.1	5.7 x
Japanese-Lipsum	10	3.5	2.9 x
Korean-Lipsum	10	1.3	7.7 x
Latin-Lipsum	76	76	---
Russian-Lipsum	12	1.2	10 x


Twitter.json
 SimdUnicode ▏   29 GB/s █████████████████████████
.NET Runtime ▏   12 GB/s ██████████▎

Arabic-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  2.3 GB/s ████▊

Chinese-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  3.9 GB/s ████████▏

Emoji-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  0.9 GB/s █▉

Japanese-Lipsum
 SimdUnicode ▏   10 GB/s █████████████████████████
.NET Runtime ▏  3.5 GB/s ████████▊

Apple M2 results:

data set	SimdUnicode speed (GB/s)	.NET speed (GB/s)	speed up
Twitter.json	25	14	1.8 x
Arabic-Lipsum	7.4	3.5	2.1 x
Chinese-Lipsum	7.4	4.8	1.5 x
Emoji-Lipsum	7.4	2.5	3.0 x
Hebrew-Lipsum	7.4	3.5	2.1 x
Hindi-Lipsum	7.3	3.0	2.4 x
Japanese-Lipsum	7.3	4.6	1.6 x
Korean-Lipsum	7.4	1.8	4.1 x
Latin-Lipsum	87	38	2.3 x
Russian-Lipsum	7.4	2.7	2.7 x

On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over five times
faster than the standard library.

data set	SimdUnicode speed (GB/s)	.NET speed (GB/s)	speed up
Twitter.json	14	8.7	1.4 x
Arabic-Lipsum	4.2	2.0	2.1 x
Chinese-Lipsum	4.2	2.6	1.6 x
Emoji-Lipsum	4.2	0.8	5.3 x
Hebrew-Lipsum	4.2	2.0	2.1 x
Hindi-Lipsum	4.2	1.6	2.6 x
Japanese-Lipsum	4.2	2.4	1.8 x
Korean-Lipsum	4.2	1.3	3.2 x
Latin-Lipsum	42	17	2.5 x
Russian-Lipsum	4.2	0.95	4.4 x

On a Qualcomm 8cx gen3 (Windows Dev Kit 2023), we get roughly the same relative performance
boost as the Neoverse V1.

data set	SimdUnicode speed (GB/s)	.NET speed (GB/s)	speed up
Twitter.json	17	10	1.7 x
Arabic-Lipsum	5.0	2.3	2.2 x
Chinese-Lipsum	5.0	2.9	1.7 x
Emoji-Lipsum	5.0	0.9	5.5 x
Hebrew-Lipsum	5.0	2.3	2.2 x
Hindi-Lipsum	5.0	1.9	2.6 x
Japanese-Lipsum	5.0	2.7	1.9 x
Korean-Lipsum	5.0	1.5	3.3 x
Latin-Lipsum	50	20	2.5 x
Russian-Lipsum	5.0	1.2	5.2 x

On a Neoverse N1 (Graviton 2), our validation function is up to over three times
faster than the standard library.

data set	SimdUnicode speed (GB/s)	.NET speed (GB/s)	speed up
Twitter.json	7.8	5.7	1.4 x
Arabic-Lipsum	2.5	0.9	2.8 x
Chinese-Lipsum	2.5	1.8	1.4 x
Emoji-Lipsum	2.5	0.7	3.6 x
Hebrew-Lipsum	2.5	0.9	2.7 x
Hindi-Lipsum	2.3	1.0	2.3 x
Japanese-Lipsum	2.4	1.7	1.4 x
Korean-Lipsum	2.5	1.0	2.5 x
Latin-Lipsum	23	13	1.8 x
Russian-Lipsum	2.3	0.7	3.3 x

Importantly, there is no patent involved, and no licensing issue. We are eager for reviews, feedback and so forth.

Note that we have other fast Unicode algorithms that could be implemented in C#, including fast transcoding functions. UTF-8 validation is simply the simplest non-trivial case.

This is joint work with @Nick-Nuon.

Further reading: Validating gigabytes of Unicode strings per second… in C#? (blog post)


dotnet-policy-service · 2024-06-20T20:24:04Z

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.

lemire · 2024-09-21T13:48:43Z

The SimdUnicode C# code has been ported to Mojo (Chris Lattner's new programming language) and it is now part of their standard library... https://github.com/modularml/mojo/blob/nightly/stdlib/src/utils/_utf8_validation.mojo

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 20, 2024

dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jun 20, 2024

GrabYourPitchforks added area-System.Text.Encoding and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 20, 2024

tarekgh added this to the Future milestone Jun 20, 2024

tarekgh removed the untriaged New issue has not been triaged by the area owner label Jun 20, 2024

This was referenced Jun 23, 2024:

Try SimdUnicode for Utf8 validation #103860 (Closed)

Integrate SimdUnicode for AVX-512 #104199 (Open)
