Merge pull request #24 from rusticstuff/v.next

Prepare v0.1.1
rusticstuff · Apr 26, 2021 · 3063e4c
2 parents 84b79cf + acea3c2
commit 3063e4c
Showing 32 changed files with 26,044 additions and 16,448 deletions.
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "simdutf8"
-version = "0.1.0"
+version = "0.1.1"
 authors = ["Hans Kratz <hans@appfour.com>"]
 edition = "2018"
 description = "SIMD-accelerated UTF-8 validation."
@@ -16,8 +16,10 @@ exclude = ["/.github", "/.vscode", "/bench", "/afl", "/fuzz", "/img", "expected-
 [features]
 default = ["std"]
 
+# enable CPU feature detection, on by default, turn off for no-std support
 std = []
 
+# expose SIMD implementations in basic::imp::* and compat::imp::*
 public_imp = []
 
 # use branch hints - requires nightly

diff --git a/README.md b/README.md
@@ -8,15 +8,14 @@ Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, b
 [simdjson](https://github.com/simdjson/simdjson). Originally ported to Rust by the developers of [simd-json.rs](https://simd-json.rs).
 
 ## Disclaimer
-This software should be considered alpha quality and should not (yet) be used in production, though it has been tested
-with sample data as well as a fuzzer and there are no known bugs. It will be tested more rigorously before the first
-production release.
+This software should not (yet) be used in production, though it has been tested with sample data as well as
+fuzzing and there are no known bugs.
 
 ## Features
 * `basic` API for the fastest validation, optimized for valid UTF-8
 * `compat` API as a fully compatible replacement for `std::str::from_utf8()`
-* Up to twenty times faster than the std library on non-ASCII, up to twice as fast on ASCII
-* Up to 28% faster on non-ASCII input compared to the original simdjson implementation
+* Up to 22 times faster than the std library on non-ASCII, up to three times faster on ASCII
+* As fast as or faster than the original simdjson implementation
 * Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64. ARMv7 and ARMv8 neon support is planned
 * Selects the fastest implementation at runtime based on CPU support
 * Written in pure Rust
@@ -28,7 +27,7 @@ production release.
 Add the dependency to your Cargo.toml file:
 ```toml
 [dependencies]
-simdutf8 = { version = "0.1.0" }
+simdutf8 = { version = "0.1.1" }
 ```
 
 Use `simdutf8::basic::from_utf8` as a drop-in replacement for `std::str::from_utf8()`.
@@ -59,7 +58,8 @@ is not valid UTF-8. `simdutf8::basic::Utf8Error` is a zero-sized error struct.
 
 ### Compat flavor
 The `compat` flavor is fully API-compatible with `std::str::from_utf8`. In particular, `simdutf8::compat::from_utf8()`
-returns a `simdutf8::compat::Utf8Error`, which has `valid_up_to()` and `error_len()` methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.
+returns a `simdutf8::compat::Utf8Error`, which has `valid_up_to()` and `error_len()` methods. The first is useful for
+verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.
 
 It also fails early: errors are checked on-the-fly as the string is processed and once
 an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data.
@@ -75,47 +75,56 @@ For no-std support (compiled with `--no-default-features`) the implementation is
 the targeted CPU. Use `RUSTFLAGS="-C target-feature=+avx2"` for the AVX 2 implementation or `RUSTFLAGS="-C target-feature=+sse4.2"`
 for the SSE 4.2 implementation.
 
-If you want to be able to call A SIMD implementation directly, use the `public_imp` feature flag. The validation
+If you want to be able to call a SIMD implementation directly, use the `public_imp` feature flag. The validation
 implementations are then accessible via `simdutf8::(basic|compat)::imp::x86::(avx2|sse42)::validate_utf8()`.
 
 ## When not to use
-If you are only processing short byte sequences (less than 64 bytes), the excellent scalar algorithm in the standard
-library is likely faster. Also, this library uses unsafe code which has not been battle-tested and should not (yet)
-be used in production.
+This library uses unsafe code which has not been battle-tested and should not (yet) be used in production.
 
 ## Minimum Supported Rust Version (MSRV)
 This crate's minimum supported Rust version is 1.38.0.
 
 ## Benchmarks
-
 The benchmarks have been done with [criterion](https://bheisler.github.io/criterion.rs/book/index.html), the tables
 are created with [critcmp](https://github.com/BurntSushi/critcmp). Source code and data are in the
 [bench directory](https://github.com/rusticstuff/simdutf8/tree/main/bench).
 
 The name schema is id-charset/size. _0-empty_ is the empty byte slice, _x-error/66536_ is a 64KiB slice where the very
 first character is invalid UTF-8. All benchmarks were run on a laptop with an Intel Core i7-10750H CPU (Comet Lake) on
-Windows with Rust 1.51.0. Library versions are simdutf8 v0.1.0 and simdjson v0.9.2.
+Windows with Rust 1.51.0 if not otherwise stated. Library versions are simdutf8 v0.1.1 and simdjson v0.9.2. When comparing
+with simdjson, simdutf8 is compiled with `#[inline(never)]`.
 
 ### simdutf8 basic vs std library UTF-8 validation
-![critcmp stimdutf8 basic vs std lib](https://raw.githubusercontent.com/rusticstuff/simdutf8/main/img/basic-vs-std.png)
-simdutf8 performs better except for inputs ≤ 64 bytes.
+![critcmp stimdutf8 v0.1.1 basic vs std lib](https://user-images.githubusercontent.com/3736990/116121179-a8271f80-a6c0-11eb-9b2b-6233c3c824f2.png)
+simdutf8 performs better or as well as the std library.
+
+### simdutf8 basic vs simdjson UTF-8 validation on Intel Comet Lake
+![critcmp stimdutf8 v0.1.1 basic vs simdjson WSL](https://user-images.githubusercontent.com/3736990/116121748-38656480-a6c1-11eb-8cb4-385c7516a46a.png)
+simdutf8 beats simdjson on almost all inputs on this CPU. This benchmark is run on 
+[WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10) 
+since I could not get simdjson to reach maximum performance on Windows with any C++ toolchain (see also simdjson issues 
+[847](https://github.com/simdjson/simdjson/issues/847) and [848](https://github.com/simdjson/simdjson/issues/848)).
+
+### simdutf8 basic vs simdjson UTF-8 validation on AMD Zen 2
+![critcmp stimdutf8 v0.1.1 basic vs simdjson AMD Zen 2](https://user-images.githubusercontent.com/3736990/116122729-731bcc80-a6c2-11eb-82a5-6e297778a1c4.png)
 
-### simdutf8 basic vs simdjson UTF-8 validation
-![critcmp st lib vs stimdutf8 basic](https://raw.githubusercontent.com/rusticstuff/simdutf8/main/img/basic-vs-simdjson.png)
-simdutf8 is faster than simdjson except for some crazy optimization by clang for the pure ASCII
-loop (to be investigated). simdjson is compiled using clang and gcc from MSYS.
+On AMD Zen 2 aligning reads apparently does not matter at all. The extra step for aligning even hurts performance a bit around
+an input size of 4096.
 
 ### simdutf8 basic vs simdutf8 compat UTF-8 validation
-![critcmp st lib vs stimdutf8 basic](https://raw.githubusercontent.com/rusticstuff/simdutf8/main/img/basic-vs-compat.png)
+![image](https://user-images.githubusercontent.com/3736990/116122427-0dc7db80-a6c2-11eb-8434-f9879742d90d.png)
 There is a small performance penalty to continuously checking the error status while processing data, but detecting
 errors early provides a huge benefit for the _x-error/66536_ benchmark.
 
 ## Technical details
-The implementation is similar to the one in simdjson except that it aligns reads to the block size of the
-SIMD extension, which leads to better peak performance compared to the implementation in simdjson. This alignment
-means that an incomplete block needs to be processed before the aligned data is read, which would lead to worse
-performance on short byte sequences. Thus, aligned reads are only used with 2048 bytes of data or more. Incomplete
-reads for the first unaligned and the last incomplete block are done in two aligned 64-byte buffers.
+On X86 for inputs shorter than 64 bytes validation is delegated to `core::str::from_utf8()`.
+
+The SIMD implementation is similar to the one in simdjson except that it aligns reads to the block size of the
+SIMD extension, which leads to better peak performance compared to the implementation in simdjson on some CPUs.
+This alignment means that an incomplete block needs to be processed before the aligned data is read, which
+leads to worse performance on byte sequences shorter than 2048 bytes. Thus, aligned reads are only used with
+2048 bytes of data or more. Incomplete reads for the first unaligned and the last incomplete block are done in
+two aligned 64-byte buffers.
 
 For the compat API we need to check the error buffer on each 64-byte block instead of just aggregating it. If an
 error is found, the last bytes of the previous block are checked for a cross-block continuation and then
@@ -137,5 +146,4 @@ the MIT license and Apache 2.0 license.
 simdjson itself is distributed under the Apache License 2.0.
 
 ## References
-
 John Keiser, Daniel Lemire, [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021
diff --git a/TODO.md b/TODO.md
@@ -9,4 +9,3 @@
 * investigate aarch64 support
 
 # NEXT
-* v0.1.1 benchmarks
diff --git a/bench/BENCHMARKING.md b/bench/BENCHMARKING.md
@@ -46,6 +46,5 @@ Adding `-- --save-baseline some_name` to the bench commandline and then using [c
 * Beware of BD PROCHOT on aged machines, can cause severe throttling
 
 ### Test machines
-* Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz (Sandy bridge)
-* Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (Skylake)
-* Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (Comet Lake)
+* Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz (Comet Lake)
+* AMD Ryzen 7 PRO 3700 8-Core Processor @ 3.60 GHz (Zen 2)
diff --git a/bench/Cargo.toml b/bench/Cargo.toml
@@ -24,6 +24,10 @@ simdjson-utf8 = { version = "*", path = "simdjson-utf8", optional = true }
 name = "throughput_basic"
 harness = false
 
+[[bench]]
+name = "throughput_basic_noinline"
+harness = false
+
 [[bench]]
 name = "throughput_compat"
 harness = false
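
Note on the README's compat section: it says `valid_up_to()` and `error_len()` are useful for replacing invalid byte sequences with a replacement character. The sketch below illustrates that pattern using only `std::str::from_utf8`, so it runs without extra dependencies. Since the README describes `simdutf8::compat::from_utf8()` as fully API-compatible, swapping it in for the `str::from_utf8` call in the loop should work the same way, but that substitution is an assumption and is not tested here. The helper name `decode_lossy` is hypothetical and not part of either library.

```rust
use std::str;

/// Decode `input`, replacing each invalid sequence with U+FFFD.
/// Hypothetical helper for illustration; relies only on
/// `Utf8Error::valid_up_to()` and `Utf8Error::error_len()`.
fn decode_lossy(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match str::from_utf8(input) {
            Ok(valid) => {
                // The remaining input is entirely valid UTF-8.
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before `valid_up_to()` is known to be valid.
                let (valid, rest) = input.split_at(err.valid_up_to());
                out.push_str(str::from_utf8(valid).expect("prefix was validated"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep validating.
                    Some(len) => input = &rest[len..],
                    // Input ended in the middle of a multi-byte sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    // One stray byte in the middle and a truncated sequence at the end.
    let bytes = b"ab\xFFcd\xE2\x82";
    println!("{}", decode_lossy(bytes));
}
```

The prefix is validated a second time to keep the sketch free of `unsafe`; code that cares about throughput would likely avoid that extra pass.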