Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work around swap_bytes on WebAssembly #222

Open
CryZe opened this issue Mar 12, 2024 · 3 comments
Open

Work around swap_bytes on WebAssembly #222

CryZe opened this issue Mar 12, 2024 · 3 comments

Comments

@CryZe
Copy link

CryZe commented Mar 12, 2024

Unfortunately neither u128 nor swap_bytes are supported directly by WebAssembly. So both implementations of folded_multiply are very slow.

I think an algorithm that takes both u64 values, turns them into a v128 vector and then does a bunch of swizzling and vector multiplications and co. probably would be the much faster solution. Here a Godbolt link with a little sketch:

https://rust.godbolt.org/z/jGGhYjGs8

I don't have enough knowledge about how to verify the quality, so I decided to not directly open a PR and instead first discuss the feasibility.

@tkaitchuck
Copy link
Owner

Do you have a good way to benchmark this? If so open a PR for that, and I'll play with it to make sure that quality is there.

@alexcrichton
Copy link

I ran across this and thought I might be able to offer some small insight, but in the end not a huge amount. I personally use wasmtime for local benchmarking but I'm biased because I work on it too. Using that what I did was to put this in ~/.cargo/config.toml:

[target.wasm32-wasip1]
runner = 'wasmtime --dir .'

then I ran:

$ cargo bench --bench ahash -- fallback --save-baseline native
...
$ cargo bench --bench ahash --target wasm32-wasip1 -- fallback --baseline native

and got:

Output of cargo bench
Gnuplot not found, using plotters backend
Benchmarking fallbackhash/u8
Benchmarking fallbackhash/u8: Warming up for 3.0000 s
Benchmarking fallbackhash/u8: Collecting 100 samples in estimated 5.0000 s (683M iterations)
Benchmarking fallbackhash/u8: Analyzing
fallbackhash/u8         time:   [1.1904 ns 1.1941 ns 1.1982 ns]
                        change: [+170.14% +170.58% +171.08%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/u16
Benchmarking fallbackhash/u16: Warming up for 3.0000 s
Benchmarking fallbackhash/u16: Collecting 100 samples in estimated 5.0000 s (658M iterations)
Benchmarking fallbackhash/u16: Analyzing
fallbackhash/u16        time:   [1.5567 ns 1.5578 ns 1.5590 ns]
                        change: [+209.63% +210.00% +210.36%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/u32
Benchmarking fallbackhash/u32: Warming up for 3.0000 s
Benchmarking fallbackhash/u32: Collecting 100 samples in estimated 5.0000 s (587M iterations)
Benchmarking fallbackhash/u32: Analyzing
fallbackhash/u32        time:   [2.4698 ns 2.4707 ns 2.4720 ns]
                        change: [+389.02% +389.89% +390.66%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/u64
Benchmarking fallbackhash/u64: Warming up for 3.0000 s
Benchmarking fallbackhash/u64: Collecting 100 samples in estimated 5.0000 s (396M iterations)
Benchmarking fallbackhash/u64: Analyzing
fallbackhash/u64        time:   [1.5749 ns 1.5755 ns 1.5763 ns]
                        change: [+210.19% +211.21% +212.21%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/u128
Benchmarking fallbackhash/u128: Warming up for 3.0000 s
Benchmarking fallbackhash/u128: Collecting 100 samples in estimated 5.0001 s (205M iterations)
Benchmarking fallbackhash/u128: Analyzing
fallbackhash/u128       time:   [1.9035 ns 1.9046 ns 1.9060 ns]
                        change: [+206.84% +207.75% +208.60%] (p = 0.00 < 0.05)
                        Performance has regressed.
...
Benchmarking fallbackhash/strings
Benchmarking fallbackhash/strings: Warming up for 3.0000 s
Benchmarking fallbackhash/strings: Collecting 100 samples in estimated 5.0003 s (37M iterations)
Benchmarking fallbackhash/strings: Analyzing
fallbackhash/strings    time:   [134.09 ns 134.11 ns 134.14 ns]
                        change: [+75.955% +76.203% +76.449%] (p = 0.00 < 0.05)
                        Performance has regressed.
...

showing that, as expected, performance isn't as good on wasm likely due to the lack of specialized instructions. I decided to take a look at a hot loop by profiling with perf with Wasmtime and ran:

$ perf record -k mono wasmtime --profile jitdump --dir . target/wasm32-wasip1/release/deps/ahash-c5fdbc98849e7f48.wasm --bench --baseline native 'fallback.*str' --profile-time 10

Using perf to record JIT-generated code is "weird" so the report command looks like:

$ perf inject --jit -i perf.data -o perf.jit-data && perf report -i perf.jit-data 

which showed 99% of runtime spent in this loop

       │441:┌─→mov     %eax,%ecx                                                                                                                                                                                                                                                                                               ▒
       │    │  mov     0x0(%r13,%rcx,1),%rcx                                                                                                                                                                                                                                                                                   ▒
       │    │  lea     0x8(%rax),%r11d                                                                                                                                                                                                                                                                                         ▒
  8.97 │    │  mov     0x0(%r13,%r11,1),%r15                                                                                                                                                                                                                                                                                   ▒
       │    │  mov     $0xfffffff0,%r11d                                                                                                                                                                                                                                                                                       ▒
  0.00 │    │  add     %r11d,%r8d                                                                                                                                                                                                                                                                                              ▒
       │    │  add     $0x10,%eax                                                                                                                                                                                                                                                                                              ▒
  9.19 │    │  xor     criterion::bencher::Bencher<M,%rcx                                                                                                                                                                                                                                                                      ▒
       │    │  mov     %rcx,%r11                                                                                                                                                                                                                                                                                               ▒
 10.43 │    │  bswap   %r11                                                                                                                                                                                                                                                                                                    ▒
       │    │  mov     %r15,%r14                                                                                                                                                                                                                                                                                               ▒
       │    │  xor     criterion::bencher::Bencher<M,%r14                                                                                                                                                                                                                                                                      ▒
  8.76 │    │  imul    %r14,%r11                                                                                                                                                                                                                                                                                               ▒
       │    │  bswap   %r11                                                                                                                                                                                                                                                                                                    ▒
       │    │  movabs  $0x452821e638d01376,%r14                                                                                                                                                                                                                                                                                ▒
  7.50 │    │  add     %r14,%r10                                                                                                                                                                                                                                                                                               ▒
       │    │  xor     %r10,%r11                                                                                                                                                                                                                                                                                               ▒
       │    │  xor     criterion::bencher::Bencher<M,%r15                                                                                                                                                                                                                                                                      ▒
  4.66 │    │  bswap   %r15                                                                                                                                                                                                                                                                                                    ▒
       │    │  imul    %rcx,%r15                                                                                                                                                                                                                                                                                               ▒
       │    │  xor     %r15,%r11                                                                                                                                                                                                                                                                                               ▒
  7.76 │    │  rorx    $0x29,%r11,%r10                                                                                                                                                                                                                                                                                         ▒
       │    ├──cmp     $0x10,%r8d                                                                                                                                                                                                                                                                                              ▒
  8.22 │    └──ja      441                                                                                                                                                                                                                                                                                                     ▒

which, given the three byte swaps, I believe is this function, the one in question here. Now interestingly Wasmtime is generating bswap instructions here despite WebAssembly not having a byte swap instruction. This is because Wasmtime specifically has pattern matching for LLVM-generated sequences of byte swaps to turn them into native bswap instructions. That means that this may not be a great baseline for WebAssembly engines that don't have this optimization.

Nevertheless if I replace folded_multiple with the version from @CryZe's Godbolt and recompile with RUSTFLAGS=-Ctarget-feature=+simd128 and run this in Wasmtime I get:

Gnuplot not found, using plotters backend
Benchmarking fallbackhash/strings
Benchmarking fallbackhash/strings: Warming up for 3.0000 s
Benchmarking fallbackhash/strings: Collecting 100 samples in estimated 5.0000 s (35M iterations)
Benchmarking fallbackhash/strings: Analyzing
fallbackhash/strings    time:   [142.70 ns 142.74 ns 142.79 ns]
                        change: [+87.155% +87.389% +87.577%] (p = 0.00 < 0.05)
                        Performance has regressed.

(it was 70-ish% before)

and the hot loop is now:

       │3eb:┌─→mov      %eax,%edx                                                                                                                                                                                                                                                                                              ▒
       │    │  mov      (%r14,%rdx,1),%rdx                                                                                                                                                                                                                                                                                     ▒
  0.55 │    │  lea      0x8(%rax),%r8d                                                                                                                                                                                                                                                                                         ▒
  5.86 │    │  mov      (%r14,%r8,1),%r12                                                                                                                                                                                                                                                                                      ▒
       │    │  mov      $0xfffffff0,%r8d                                                                                                                                                                                                                                                                                       ▒
       │    │  add      %r8d,%esi                                                                                                                                                                                                                                                                                              ▒
  0.60 │    │  add      $0x10,%eax                                                                                                                                                                                                                                                                                             ▒
  6.26 │    │  xor      criterion::bencher::Bencher<M,%rdx                                                                                                                                                                                                                                                                     ▒
  0.62 │    │  xor      criterion::bencher::Bencher<M,%r12                                                                                                                                                                                                                                                                     ▒
  6.63 │    │  vmovq    %rdx,%xmm1                                                                                                                                                                                                                                                                                             ▒
       │    │  vpinsrq  $0x1,%r12,%xmm1,%xmm1                                                                                                                                                                                                                                                                                  ▒
  0.58 │    │  vpaddusb 0x31a(%rip),%xmm1,%xmm2        # 7c0 <criterion::bencher::Bencher<M>::iter::h5aab6de6fe2515cc+0x740>                                                                                                                                                                                                   ▒
  7.22 │    │  vpshufb  %xmm2,%xmm6,%xmm2                                                                                                                                                                                                                                                                                      ▒
  0.00 │    │  vpsrlq   $0x20,%xmm1,%xmm3                                                                                                                                                                                                                                                                                      ▒
       │    │  vpmuludq %xmm2,%xmm3,%xmm3                                                                                                                                                                                                                                                                                      ▒
  0.58 │    │  vpsrlq   $0x20,%xmm2,%xmm4                                                                                                                                                                                                                                                                                      ▒
  7.36 │    │  vpmuludq %xmm4,%xmm1,%xmm4                                                                                                                                                                                                                                                                                      ▒
       │    │  vpaddq   %xmm4,%xmm3,%xmm3                                                                                                                                                                                                                                                                                      ▒
       │    │  vpsllq   $0x20,%xmm3,%xmm3                                                                                                                                                                                                                                                                                      ▒
  0.61 │    │  vpmuludq %xmm2,%xmm1,%xmm1                                                                                                                                                                                                                                                                                      ▒
  8.34 │    │  vpaddq   %xmm3,%xmm1,%xmm1                                                                                                                                                                                                                                                                                      ▒
  0.00 │    │  vpextrq  $0x1,%xmm1,%rdx                                                                                                                                                                                                                                                                                        ▒
  0.56 │    │  movabs   $0x452821e638d01376,%r8                                                                                                                                                                                                                                                                                ◆
  7.56 │    │  add      %r8,%rcx                                                                                                                                                                                                                                                                                               ▒
       │    │  xor      %rcx,%rdx                                                                                                                                                                                                                                                                                              ▒
  0.57 │    │  vpextrq  $0x0,%xmm1,%rcx                                                                                                                                                                                                                                                                                        ▒
  6.34 │    │  xor      %rcx,%rdx                                                                                                                                                                                                                                                                                              ▒
       │    │  rorx     $0x29,%rdx,%rcx                                                                                                                                                                                                                                                                                        ▒
       │    ├──cmp      $0x10,%esi                                                                                                                                                                                                                                                                                             ▒
  4.45 │    └──ja       3eb                                                                                                                                                                                                                                                                                                    ▒

where this shows that vector instructions are used and geing generated. On x64 though I don't think the lowerings are that great, notably vpaddusb I think is a fallback when a better shuffle otherwise doesn't match.

Sorry if that was a bit more information than desired, but this is at least an example of benchmarking/comparison in a wasm runtime. Benchmarking in a browser-based runtime I think would be more difficult since you'd have to set up a runner/WASI implementation in JS. Those exist, but will require setup.

@alexcrichton
Copy link

Also feel free to file this in the "too much information" category but the compiler explorer gist here uses u8x16_swizzle and that's what vpaddusb comes from and switching to i8x16_relaxed_swizzle (not in stable Rust yet) gets the native-to-wasm slowdown down to +78.206% for this one benchmark, which is much closer to the original version as-is in tree today as +76.203%, so I think simd could still be suitable to use here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants