
Use REP MOVSQ/STOSQ on x86_64 #365

Merged — 10 commits merged into rust-lang:master on Oct 24, 2020

Conversation

@josephlr (Contributor) commented Jul 8, 2020

Addresses part of #339 (see also rust-osdev/cargo-xbuild#77)

The implementations for these functions are quite simple and (on many recent processors) are the fastest way to implement memcpy/memmove/memset. The implementations in MUSL and in the Linux kernel were used for inspiration.

Benchmarks for the memory functions were also added in this PR, and can be invoked by running cargo bench --package=testcrate. The results of running these benchmarks on different hardware show that the qword-based variants are almost always faster than the byte-based variants.

While the implementations for memcmp/bcmp could be made faster through use of SIMD intrinsics, using rep cmpsb/rep cmpsq makes them slower, so they are left as-is in this PR.

Note that #164 added some optimized versions for memcmp on ARM.
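
For readers unfamiliar with the rep-string approach described above, here is a minimal sketch of the quad-word-plus-byte-tail idea. It is illustrative only (the function name, operand layout, and use of the now-stabilized asm! syntax are my own), not the exact code merged in this PR.

```rust
// Sketch of a rep movsq memcpy: copy count/8 quad-words, then the remaining
// count%8 bytes. Assumes DF is clear, which the asm! contract guarantees on
// entry. Not the exact code from this PR.
#[cfg(target_arch = "x86_64")]
pub unsafe fn rep_movsq_memcpy(dest: *mut u8, src: *const u8, count: usize) {
    core::arch::asm!(
        "rep movsq",            // copy count / 8 quad-words
        "mov rcx, {tail}",      // reload rcx with the byte remainder
        "rep movsb",            // copy the remaining count % 8 bytes
        tail = in(reg) count % 8,
        inout("rcx") count / 8 => _,
        inout("rdi") dest => _,
        inout("rsi") src => _,
        options(nostack)
    );
}
```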

@alexcrichton (Member)

Thanks for this!

I forget, but do we already have tests for memset/memcmp/etc? If not, could you add some as part of this PR?

Additionally, do you have some benchmark numbers for how these perform?

@josephlr force-pushed the ermsb branch 2 times, most recently from 97ad0fa to 012085a on July 8, 2020 23:29
@josephlr (Contributor, Author) commented Jul 9, 2020

> Additionally, do you have some benchmark numbers for how these perform?

I updated this PR to add memcpy/memset/memcmp benchmarks to the testcrate crate. It allows comparing the libc functions (via copy_from_slice/slice.cmp/etc...) to the Rust functions provided by this crate.

I ran a bunch of trials; results are below. The main takeaways are:

  • Using rep movsb/stosb makes the Rust memcpy/memset implementations as fast as musl's/glibc's.
  • Using repe cmpsb for memcmp actually makes things worse, so I'll remove it.
    • This is consistent with Intel's optimization guide.
    • The memcmp implementation is very slow; there's room to improve here.

memcpy

| Implementation             | 4 KiB blocks (GiB/sec) | 1 MiB blocks (GiB/sec) |
|----------------------------|------------------------|------------------------|
| Current, simple Rust loop  | 57.7                   | 30.1                   |
| This PR (rep movsb)        | 98.7 (+71%)            | 37.3 (+24%)            |
| x86_64 Linux musl libc     | 94.9 (+64%)            | 38.1 (+27%)            |
| x86_64 Linux GNU libc      | 126.0 (+118%)          | 35.7 (+19%)            |

memset

| Implementation             | 4 KiB blocks (GiB/sec) | 1 MiB blocks (GiB/sec) |
|----------------------------|------------------------|------------------------|
| Current, simple Rust loop  | 68.6                   | 45.3                   |
| This PR (rep stosb)        | 121.7 (+77%)           | 63.6 (+40%)            |
| x86_64 Linux musl libc     | 112.2 (+63%)           | 63.8 (+41%)            |
| x86_64 Linux GNU libc      | 112.7 (+64%)           | 63.7 (+41%)            |

memcmp

| Implementation             | 4 KiB blocks (GiB/sec) | 1 MiB blocks (GiB/sec) |
|----------------------------|------------------------|------------------------|
| Current, simple Rust loop  | 3.6                    | 3.6                    |
| This PR (repe cmpsb)       | 2.2 (-38%)             | 2.2 (-37%)             |
| x86_64 Linux musl libc     | 3.5 (-1%)              | 3.6 (-1%)              |
| x86_64 Linux GNU libc      | 78.8 (+2110%)          | 81.9 (+2182%)          |
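
For context, a bench of roughly this shape (nightly `test` harness, 4 KiB block, `b.bytes` set so throughput is reported) is what produces numbers like those in the tables above; this is a hedged sketch, and the actual harness in the testcrate crate may differ in its details.

```rust
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

const N: usize = 4096; // 4 KiB blocks, matching the first column above

// Benchmarks the memcpy that copy_from_slice lowers to; a second bench
// calling this crate's implementation would be compared against it.
#[bench]
fn memcpy_builtin_4k(b: &mut Bencher) {
    let src = vec![0u8; N];
    let mut dst = vec![0u8; N];
    b.bytes = N as u64; // report GiB/sec-style throughput
    b.iter(|| {
        dst.copy_from_slice(black_box(&src));
        black_box(&mut dst);
    });
}
```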

@alexcrichton (Member)

Nice! Those are some pretty slick wins, and also a nice find that memcmp doesn't speed up all that much. Also, that's pretty crazy how much faster glibc is for memcmp than a simple loop!

@CryZe (Contributor) commented Jul 9, 2020

This claims to close my issue, but isn't this only about x86, while the performance problems seem to be happening across the board? (Especially on WASM it seems rather bad atm)

@alexcrichton (Member)

We can leave it open for other platforms, but FWIW there's not really much else we can do for wasm. The bulk memory proposal fixes this issue, however, because the memory.copy instruction is basically the exact same as a memcpy call, and it's implemented by the engine so it's much faster.

@CryZe (Contributor) commented Jul 9, 2020

If you use the WASI target, it does a loop copying 32-bit values rather than individual bytes, because it then uses the musl implementation in wasilibc. I haven't done any real benchmarking, but I'd expect that to be faster. But yeah, the bulk memory proposal also fixes that issue if you use that target feature.
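
For illustration, a word-at-a-time loop of the kind described would look roughly like the sketch below; the function name is mine and this is not wasilibc's actual code.

```rust
// Copy 4 bytes at a time, then finish the tail byte-wise. Unaligned
// reads/writes keep the sketch correct for arbitrary pointers.
pub unsafe fn wordwise_memcpy(dest: *mut u8, src: *const u8, n: usize) {
    let words = n / 4;
    for i in 0..words {
        let v = (src as *const u32).add(i).read_unaligned();
        (dest as *mut u32).add(i).write_unaligned(v);
    }
    for i in (words * 4)..n {
        *dest.add(i) = *src.add(i);
    }
}
```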

@josephlr force-pushed the ermsb branch 2 times, most recently from 7a52621 to 7321f5c on July 10, 2020 09:45
src/mem/x86_64.rs — outdated review thread (resolved)
@alexcrichton (Member)

Also, to confirm, do we have tests for this in-repo? If not could some be added?

@josephlr (Contributor, Author)

> Also, to confirm, do we have tests for this in-repo? If not could some be added?

I would want better test coverage than what we currently have. I'm planning to add a bunch before this CL is ready for review again.

@@ -0,0 +1,69 @@
use super::c_int;

// On recent Intel processors, "rep movsb" and "rep stosb" have been enhanced to
Review comment (Member):

How does this implementation fare on non-Intel implementations of x86_64?

@josephlr (Contributor, Author) replied:

I've been investigating performance on AMD hardware (the only other x86 platform where anyone cares about performance). This has led me to modify the implementation. When I have some time, I'll post the results and update this comment to clarify the impact on AMD as well.

Reply (Contributor):

On Intel, fast string support is detectable via two flags in CPUID and one enable bit in IA32_MISC_ENABLE, if necessary. AMD may provide the same detection tools.

@josephlr (Contributor, Author) commented Oct 15, 2020

So it looks like virtually all newish AMD and Intel processors support some sort of "REP MOVS enhancement" (i.e. rep movs is somehow better than a normal loop). However, if the ermsb feature flag isn't present (like on all AMD processors) then rep movsq seems better than rep movsb. With ermsb the two variants are about the same speed.

Given this, I just implemented the rep movsq version unconditionally without any CPUID checking. Variants that use rep movsb when Intel's ERMSB/FSRM feature is enabled could be added later, but there doesn't seem to be much of a gain (at least with the benchmarks I'm running here).
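
As an aside, runtime detection of ERMSB (the approach deliberately not taken here) would look roughly like the sketch below. The bit position (CPUID leaf 7, sub-leaf 0, EBX bit 9) follows Intel's SDM; the helper name is mine and this code is not part of this PR.

```rust
// Sketch only: a real check should first confirm that leaf 7 is supported
// (CPUID leaf 0 reports the maximum basic leaf). std also exposes
// is_x86_feature_detected!("ermsb") for this.
#[cfg(target_arch = "x86_64")]
fn has_ermsb() -> bool {
    let leaf7 = unsafe { core::arch::x86_64::__cpuid_count(7, 0) };
    (leaf7.ebx >> 9) & 1 != 0
}
```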

@Soveu commented Oct 13, 2020

These implementations assume that the direction flag is not set, which may not always be the case.

src/mem/x86_64.rs — outdated review thread (resolved)
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
This allows comparing the "normal" implementations to the
implementations provided by this crate.

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Contributor, Author)

> These implementations assume that the direction flag is not set, which may not always be the case.

Per the asm! docs, "On x86, the direction flag (DF in EFLAGS) is clear on entry to an asm block and must be clear on exit", so I think these impls are fine.

The assembly generated seems correct:
    https://rust.godbolt.org/z/GGnec8
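
To make the DF contract concrete: a hypothetical backward copy (e.g. for an overlapping memmove) that does set DF would have to restore it before the asm block ends, roughly as in the sketch below. This is illustrative only, not code from this PR.

```rust
// Copies count bytes from the end toward the start. `std` sets DF so the
// string instruction decrements rsi/rdi; `cld` clears it again before the
// block exits, as the asm! rules require.
#[cfg(target_arch = "x86_64")]
pub unsafe fn copy_backward_bytes(dest: *mut u8, src: *const u8, count: usize) {
    if count == 0 {
        return;
    }
    core::arch::asm!(
        "std",
        "rep movsb",
        "cld",
        inout("rcx") count => _,
        inout("rsi") src.add(count - 1) => _,
        inout("rdi") dest.add(count - 1) => _,
        options(nostack)
    );
}
```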

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Contributor, Author)

For AMD performance, I'm getting some conflicting results about which is better: rep movsb or rep movsq. I think it might be because I'm using a VM. I don't have any real AMD hardware; could anyone in this thread run the benchmarks against dd7b7dc and 2f9f61f and tell me what they find?

@andyhhp commented Oct 17, 2020

> DF

Even C doesn't tolerate DF being set generally. There are two legitimate uses of it which I have encountered. One is memmove(), and one is code dumps in backtraces, where you need to be wary of hitting page/permission boundaries (see https://github.com/xen-project/xen/blob/master/xen/arch/x86/traps.c#L175-L204).

For performance, things are very tricky, and one size does not fit all. Presumably here we're talking about mem*() calls which have survived LLVM's optimisation passes, and are the variations which don't decompose nicely?

If alignment information is available at compile time, then rep stos{l,q} is faster than rep stosb on earlier hardware. Intel have some forthcoming features literally named Fast Zero-Length MOVSB, Fast Short STOSB, Fast short CMPSB/SCASB (https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf Page 120. Not sure if this was intended to be public right now, but it is.) which should give anyone a hint that the current variations aren't great for small %ecx inputs.

Frankly, study a popular libc and follow their lead. A lot of time and effort has gone into optimising them generally across multiple generations of processor. Alternatively, if you do feel like doing feature-based dispatch, that will get better results if you can pick the optimum algorithm for the CPU you're on.

@Soveu commented Oct 17, 2020

Linux encourages using rep movsb/stosb for memcpy/memset:
  • memcpy
  • memset
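
In the same spirit, and only as a sketch with hypothetical naming rather than this PR's exact code, a quad-word memset broadcasts the fill byte into rax and then uses rep stosq plus a byte tail:

```rust
#[cfg(target_arch = "x86_64")]
pub unsafe fn rep_stosq_memset(dest: *mut u8, c: u8, count: usize) {
    // Broadcast the fill byte into all 8 bytes of a quad-word.
    let pattern = (c as u64) * 0x0101_0101_0101_0101;
    core::arch::asm!(
        "rep stosq",            // store count / 8 quad-words from rax
        "mov rcx, {tail}",
        "rep stosb",            // store the remaining count % 8 bytes from al
        tail = in(reg) count % 8,
        inout("rcx") count / 8 => _,
        inout("rdi") dest => _,
        in("rax") pattern,
        options(nostack)
    );
}
```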

@Soveu commented Oct 18, 2020

I wonder why memmove got faster than memcpy lol
Edit: literally changing the commits changes the result of some *_rust functions, probably that weird function alignment problem on AMD

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Contributor, Author)

@alexcrichton the tests have been added, so this is now ready for final review and merging.

This implementation sticks with the rep movsq/rep stosq implementation used by MUSL and Linux (see the links in the PR description). The final assembly looks optimal (memcpy/memset are identical to Linux's implementation).

For performance numbers, see my link in #365 (comment)

Signed-off-by: Joe Richey <joerichey@google.com>
@alexcrichton (Member)

Thanks again for this! This all looks great to me. As one final thing, though, I'm not sure if the asm feature is actually ever enabled on CI, so could you add a line here to test the feature?

Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Contributor, Author)

> Thanks again for this! This all looks great to me. As one final thing, though, I'm not sure if the asm feature is actually ever enabled on CI, so could you add a line here to test the feature?

Done (for tests and builds). Also, the testcrate enables the "asm" feature by default; should that be changed?

@alexcrichton (Member)

Hm, yeah, ideally that would change, but that's probably best left to another PR. Thanks again!

@alexcrichton merged commit 33ad366 into rust-lang:master on Oct 24, 2020
@josephlr deleted the ermsb branch on October 26, 2020 10:05
josephlr added a commit to josephlr/rust that referenced this pull request Oct 26, 2020
This change is needed for compiler-builtins to check for this feature
when implementing memcpy/memset. See:
  rust-lang/compiler-builtins#365

The change just does compile-time detection. I think that runtime
detection will have to come in a follow-up CL to std-detect.

Like all the CPU feature flags, this just references rust-lang#44839

Signed-off-by: Joe Richey <joerichey@google.com>
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Oct 26, 2020
Add compiler support for LLVM's x86_64 ERMSB feature

This change is needed for compiler-builtins to check for this feature
when implementing memcpy/memset. See:
  rust-lang/compiler-builtins#365

Without this change, the following code compiles, but does nothing:
```rust
#[cfg(target_feature = "ermsb")]
pub unsafe fn ermsb_memcpy() { ... }
```

The change just does compile-time detection. I think that runtime
detection will have to come in a follow-up CL to std-detect.

Like all the CPU feature flags, this just references rust-lang#44839

Signed-off-by: Joe Richey <joerichey@google.com>
stlankes added a commit to stlankes/hermit-rs that referenced this pull request Nov 21, 2020
bors bot added a commit to hermit-os/hermit-rs that referenced this pull request Nov 21, 2020
78: using of the asm feature to improve the performance of basic functions r=jbreitbart a=stlankes

-  PR uses rust-lang/compiler-builtins#365 to improve the performance
- fix broken CI and build the bootloader on windows correctly


Co-authored-by: Stefan Lankes <slankes@eonerc.rwth-aachen.de>
AaronKutch pushed a commit to AaronKutch/compiler-builtins that referenced this pull request Nov 28, 2020
* mem: Move mem* functions to separate directory

Signed-off-by: Joe Richey <joerichey@google.com>

* memcpy: Create separate memcpy.rs file

Signed-off-by: Joe Richey <joerichey@google.com>

* benches: Add benchmarks for mem* functions

This allows comparing the "normal" implementations to the
implementations provided by this crate.

Signed-off-by: Joe Richey <joerichey@google.com>

* mem: Add REP MOVSB/STOSB implementations

The assembly generated seems correct:
    https://rust.godbolt.org/z/GGnec8

Signed-off-by: Joe Richey <joerichey@google.com>

* mem: Add documentation for REP string instructions

Signed-off-by: Joe Richey <joerichey@google.com>

* Use quad-word rep string instructions

Signed-off-by: Joe Richey <joerichey@google.com>

* Prevent panic when compiled in debug mode

Signed-off-by: Joe Richey <joerichey@google.com>

* Add tests for mem* functions

Signed-off-by: Joe Richey <joerichey@google.com>

* Add build/test with the "asm" feature

Signed-off-by: Joe Richey <joerichey@google.com>

* Add byte length to Bencher

Signed-off-by: Joe Richey <joerichey@google.com>
dspencer12 added a commit to dspencer12/blog_os that referenced this pull request Feb 23, 2021
The referenced issue in compiler-builtins (rust-lang/compiler-builtins#365) has been merged.
phil-opp pushed a commit to phil-opp/blog_os that referenced this pull request Feb 23, 2021
The referenced issue in compiler-builtins (rust-lang/compiler-builtins#365) has been merged.
9 participants