
Emitted memset and memcpy are really slow on WASM #92436

Open
torokati44 opened this issue Dec 30, 2021 · 3 comments

Labels
C-bug Category: This is a bug. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such O-wasm Target: WASM (WebAssembly), http://webassembly.org/

Comments

@torokati44

Both functions follow the usual double-loop pattern, with one loop meant to process 4 or 8 bytes at a time and the other handling the remaining few bytes. Sadly, the wide loop also performs only 1-byte loads and stores; it just issues more of them in series per iteration.
This does not look right: it could just as well operate on a single 32-bit or 64-bit primitive per iteration.
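For illustration, here is a rough, hand-written sketch (not the actual compiler-builtins code) of what such a word-wide fill loop could look like; the broadcast constant and the unaligned stores are my own choices, and real code would also want to handle alignment:

unsafe fn memset_wordwise(mut dest: *mut u8, value: u8, mut n: usize) {
    const WORD: usize = core::mem::size_of::<usize>();
    // Broadcast the byte into every lane of a machine word: value * 0x01010101...
    let pattern = (value as usize) * (usize::MAX / 255);
    // Main loop: one word-wide store per iteration instead of 4-8 separate byte stores.
    while n >= WORD {
        (dest as *mut usize).write_unaligned(pattern);
        dest = dest.add(WORD);
        n -= WORD;
    }
    // Tail loop: write the remaining few bytes one at a time.
    while n > 0 {
        dest.write(value);
        dest = dest.add(1);
        n -= 1;
    }
}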

This is what rustc emits in --release mode:
memset
memcpy

Running them through wasm-opt -O doesn't do much; it only reorders some locals and, for some reason, changes an "add -1" to a "sub 1":
memset after wasm-opt
memcpy after wasm-opt
(I have also extended these sources to be complete modules for experimentation purposes.)

These functions constitute a double-digit percentage of total runtime in some cases, see: WebAssembly/binaryen#4403 (comment)

@torokati44 added the C-bug label Dec 30, 2021

@torokati44 (Author) commented Dec 30, 2021

To reproduce, compile this as a binary crate with cargo build --release --target wasm32-unknown-unknown (rustc 1.57.0):

// Filling a large array with a runtime value lowers to a call to memset.
pub fn foobar(x: u8) -> [u8; 1024] {
    [x; 1024]
}

pub fn main() {
    // Read the fill value from stdin so the optimizer can't constant-fold it away.
    let mut line = String::new();
    std::io::stdin().read_line(&mut line).unwrap();
    let number: i32 = line.trim().parse().unwrap();

    let fb = foobar(number as u8);
    println!("{}", fb[0]);
}

(The weird IO stuff is to prevent the optimizer from eliding all of the memory operations.)

Then: wasm2wat target/wasm32-unknown-unknown/release/<packagename>.wasm | grep "func .memset" -A 100
and: wasm2wat target/wasm32-unknown-unknown/release/<packagename>.wasm | grep "func .memcpy" -A 100
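
(As an aside, not part of the original report: the snippet above exercises memset mainly through the [x; 1024] fill. For a more isolated test of the copy path, a hypothetical helper like the one below, called from main, should work too, since copying a large array compiles down to a call to the memcpy helper.)

// Hypothetical extra test case, not from the original report: copying a large
// array lowers to a call to the memcpy helper, so it can be inspected the same way.
pub fn copy_it(src: &[u8; 1024]) -> [u8; 1024] {
    let mut dst = [0u8; 1024];  // zero-initialize, then overwrite with a copy of src
    dst.copy_from_slice(src);   // this 1024-byte copy compiles down to memcpy
    dst
}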

@CryZe (Contributor) commented Dec 30, 2021

There's an open issue about this here: rust-lang/compiler-builtins#339

@torokati44 (Author) commented Dec 31, 2021

I see. And these loops are already looking a lot less bad in the current 1.58.0 beta!
While still nowhere near the performance of the bulk-memory intrinsics, it's at least not quite as terrible as the current stable output. (It still only moves an i32 per iteration; could it maybe do better with i64?)
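For comparison, a widened loop would move one i64 (8 bytes) per iteration even though usize is only 32 bits on wasm32; a rough, hypothetical sketch, again not the actual compiler-builtins code:

// Hypothetical sketch of an i64-per-iteration copy loop; not the real implementation.
unsafe fn memcpy_u64(mut dest: *mut u8, mut src: *const u8, mut n: usize) {
    // Main loop: move 8 bytes (one i64) per iteration instead of one i32.
    while n >= 8 {
        (dest as *mut u64).write_unaligned((src as *const u64).read_unaligned());
        dest = dest.add(8);
        src = src.add(8);
        n -= 8;
    }
    // Tail loop: copy the remaining few bytes one at a time.
    while n > 0 {
        dest.write(src.read());
        dest = dest.add(1);
        src = src.add(1);
        n -= 1;
    }
}

Whether that actually pays off on a given engine would need measuring; the bulk-memory memory.fill/memory.copy instructions remain the ideal lowering.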

@jieyouxu added the O-wasm and C-optimization labels and removed the needs-triage-legacy label Feb 28, 2024