Add WebAssembly SIMD Support #269

CryZe · 2021-05-26T15:37:08Z

Chrome 91 was released yesterday with stable support for WASM SIMD. Firefox will follow shortly. Rust also intends to stabilize the intrinsics soon. So I decided to go ahead and port SIMD support to a couple of crates, now including hashbrown.

This can't be merged until the WebAssembly intrinsics are stabilized. Technically we could merge it early under the nightly feature flag, but I'd expect the intrinsics to change a bit in the near future before stabilizing, so merging early might not make much sense.

Amanieu · 2021-05-27T06:47:20Z

I'm curious if this actually improves performance on WASM compared to the generic version.

One concern I have is that i8x16_bitmask doesn't lower cleanly to ARM/AArch64 so those architectures will likely have poor performance.
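For readers unfamiliar with the operation under discussion, here is a minimal scalar sketch of what `i8x16_bitmask` computes and how a group lookup would use it. The function names `bitmask_model` and `match_byte_model` are hypothetical and chosen for illustration; this models the semantics only and is not the code from this PR.

```rust
/// Scalar model of the Wasm `i8x16.bitmask` operation:
/// bit `i` of the result is the most significant bit of lane `i`.
fn bitmask_model(lanes: [u8; 16]) -> u16 {
    let mut mask = 0u16;
    for (i, &lane) in lanes.iter().enumerate() {
        // `lane >> 7` is 1 when the top bit of the lane is set, else 0.
        mask |= u16::from(lane >> 7) << i;
    }
    mask
}

/// Hypothetical helper: compare 16 control bytes against `needle`
/// (a lane-wise equality producing 0xFF or 0x00 per lane), then
/// collapse the comparison result into a 16-bit mask.
fn match_byte_model(ctrl: [u8; 16], needle: u8) -> u16 {
    let mut eq = [0u8; 16];
    for (dst, &c) in eq.iter_mut().zip(ctrl.iter()) {
        *dst = if c == needle { 0xFF } else { 0x00 };
    }
    bitmask_model(eq)
}
```

With control bytes that equal the needle only at positions 3 and 9, `match_byte_model` returns a mask with exactly bits 3 and 9 set. On x86 the collapse step maps to a single instruction, while other architectures may need a longer sequence, which is the concern raised here.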

Regarding the stability of the intrinsics, I'm happy to add this under the nightly feature.

CryZe · 2021-05-27T14:45:01Z

Seems like this is what it lowers to: https://gcc.godbolt.org/z/7chfTPsjb

The same lowering can also be seen in V8: arm64/code-generator-arm64.cc

There are some benchmarks here: Wasm SIMD bitmask slides

Though they are mostly comparing against expressing it with other WASM SIMD instructions, whereas in hashbrown we would just avoid SIMD entirely on WASM otherwise.

There's more discussion and some benchmarks in the original PR for adding the instruction to WASM: WebAssembly/simd#201

I can't really benchmark it myself on an AArch64 device (Safari doesn't yet support WASM SIMD, and the iPhone is the only AArch64 device with some sort of WASM support that I have).

However, I also noticed a few things:

- There are a few places where we could stay on the original SIMD mask and evaluate the result on that rather than converting to a bitmask, such as when querying if there's any EMPTY in the group.
- WASM currently uses 32-bit groups rather than the 64-bit groups it could use. I'll do a PR for that in a bit.
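To illustrate the second point, here is a rough sketch of an "any EMPTY in the group" query on a 64-bit integer group, which inspects eight control bytes per step instead of four. It assumes the control-byte encoding used by hashbrown's generic implementation (EMPTY = 0xFF, DELETED = 0x80, full slots with the top bit clear); `repeat` and `any_empty` are illustrative names, not necessarily the crate's exact API.

```rust
/// Assumed control-byte encoding: EMPTY = 0xFF, DELETED = 0x80,
/// and full slots store a 7-bit hash with the top bit clear.
const EMPTY: u8 = 0xFF;
const DELETED: u8 = 0x80;

/// Broadcast one byte into all eight bytes of a u64.
fn repeat(byte: u8) -> u64 {
    u64::from_ne_bytes([byte; 8])
}

/// Returns true if any of the eight control bytes in `group` is EMPTY.
///
/// Among valid control bytes, EMPTY is the only one with its top two
/// bits both set. `group << 1` moves bit 6 of every byte into bit 7 of
/// the same byte, so the AND keeps bit 7 only for EMPTY bytes.
fn any_empty(group: u64) -> bool {
    (group & (group << 1) & repeat(0x80)) != 0
}
```

Because the result is only tested against zero, no per-byte bitmask has to be extracted, which is also the spirit of the first point about staying on the original mask.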

alexcrichton · 2021-05-27T17:55:42Z

I did some comparisons on my local macbook (x86_64) and an arm64 machine I had access to using Wasmtime. I don't think Wasmtime's simd backend has seen a ton of optimization yet, but this may at least be somewhat representative. I compared three configurations: baseline, which is this crate as-is; baseline-simd128, which is this crate as-is but compiled with +simd128; and everything, which is this PR plus compiling with +simd128.

x86_64 - baseline vs baseline-simd128 - mixed bag

 name                         baseline ns/iter  baseline-simd128 ns/iter  diff ns/iter  diff %  speedup 
 clone_from_large             5,101             4,666                             -435  -8.53%   x 1.09 
 clone_from_small             56                54                                  -2  -3.57%   x 1.04 
 clone_large                  4,822             5,302                              480   9.95%   x 0.91 
 clone_small                  82                87                                   5   6.10%   x 0.94 
 grow_insert_ahash_highbits   41,968            42,142                             174   0.41%   x 1.00 
 grow_insert_ahash_random     42,756            40,964                          -1,792  -4.19%   x 1.04 
 grow_insert_ahash_serial     42,612            41,589                          -1,023  -2.40%   x 1.02 
 grow_insert_std_highbits     65,613            63,897                          -1,716  -2.62%   x 1.03 
 grow_insert_std_random       64,218            63,476                            -742  -1.16%   x 1.01 
 grow_insert_std_serial       64,239            63,013                          -1,226  -1.91%   x 1.02 
 insert_ahash_highbits        35,368            35,387                              19   0.05%   x 1.00 
 insert_ahash_random          35,150            35,415                             265   0.75%   x 0.99 
 insert_ahash_serial          34,536            35,677                           1,141   3.30%   x 0.97 
 insert_erase_ahash_highbits  40,948            39,795                          -1,153  -2.82%   x 1.03 
 insert_erase_ahash_random    38,385            38,061                            -324  -0.84%   x 1.01 
 insert_erase_ahash_serial    36,706            36,575                            -131  -0.36%   x 1.00 
 insert_erase_std_highbits    76,954            80,506                           3,552   4.62%   x 0.96 
 insert_erase_std_random      76,566            79,070                           2,504   3.27%   x 0.97 
 insert_erase_std_serial      80,025            78,698                          -1,327  -1.66%   x 1.02 
 insert_std_highbits          49,014            52,069                           3,055   6.23%   x 0.94 
 insert_std_random            49,894            51,618                           1,724   3.46%   x 0.97 
 insert_std_serial            48,969            52,484                           3,515   7.18%   x 0.93 
 iter_ahash_highbits          1,737             2,000                              263  15.14%   x 0.87 
 iter_ahash_random            1,723             1,923                              200  11.61%   x 0.90 
 iter_ahash_serial            1,745             1,938                              193  11.06%   x 0.90 
 iter_std_highbits            1,944             2,202                              258  13.27%   x 0.88 
 iter_std_random              1,983             2,069                               86   4.34%   x 0.96 
 iter_std_serial              1,938             1,987                               49   2.53%   x 0.98 
 lookup_ahash_highbits        6,498             6,473                              -25  -0.38%   x 1.00 
 lookup_ahash_random          5,695             5,702                                7   0.12%   x 1.00 
 lookup_ahash_serial          5,928             6,025                               97   1.64%   x 0.98 
 lookup_fail_ahash_highbits   7,594             7,972                              378   4.98%   x 0.95 
 lookup_fail_ahash_random     6,912             7,491                              579   8.38%   x 0.92 
 lookup_fail_ahash_serial     6,553             6,991                              438   6.68%   x 0.94 
 lookup_fail_std_highbits     27,497            24,971                          -2,526  -9.19%   x 1.10 
 lookup_fail_std_random       26,023            24,171                          -1,852  -7.12%   x 1.08 
 lookup_fail_std_serial       26,538            24,757                          -1,781  -6.71%   x 1.07 
 lookup_std_highbits          26,154            24,469                          -1,685  -6.44%   x 1.07 
 lookup_std_random            25,502            23,360                          -2,142  -8.40%   x 1.09 
 lookup_std_serial            24,842            23,201                          -1,641  -6.61%   x 1.07 
 rehash_in_place              509,767           532,062                         22,295   4.37%   x 0.96

x86_64 - baseline vs everything - clear win

 name                         baseline ns/iter  everything ns/iter  diff ns/iter   diff %  speedup 
 clone_from_large             5,101             4,611                       -490   -9.61%   x 1.11 
 clone_from_small             56                49                            -7  -12.50%   x 1.14 
 clone_large                  4,822             4,569                       -253   -5.25%   x 1.06 
 clone_small                  82                80                            -2   -2.44%   x 1.03 
 grow_insert_ahash_highbits   41,968            33,038                    -8,930  -21.28%   x 1.27 
 grow_insert_ahash_random     42,756            31,465                   -11,291  -26.41%   x 1.36 
 grow_insert_ahash_serial     42,612            30,854                   -11,758  -27.59%   x 1.38 
 grow_insert_std_highbits     65,613            54,778                   -10,835  -16.51%   x 1.20 
 grow_insert_std_random       64,218            54,839                    -9,379  -14.60%   x 1.17 
 grow_insert_std_serial       64,239            53,592                   -10,647  -16.57%   x 1.20 
 insert_ahash_highbits        35,368            31,109                    -4,259  -12.04%   x 1.14 
 insert_ahash_random          35,150            31,250                    -3,900  -11.10%   x 1.12 
 insert_ahash_serial          34,536            31,510                    -3,026   -8.76%   x 1.10 
 insert_erase_ahash_highbits  40,948            30,874                   -10,074  -24.60%   x 1.33 
 insert_erase_ahash_random    38,385            30,963                    -7,422  -19.34%   x 1.24 
 insert_erase_ahash_serial    36,706            31,470                    -5,236  -14.26%   x 1.17 
 insert_erase_std_highbits    76,954            67,220                    -9,734  -12.65%   x 1.14 
 insert_erase_std_random      76,566            70,588                    -5,978   -7.81%   x 1.08 
 insert_erase_std_serial      80,025            66,201                   -13,824  -17.27%   x 1.21 
 insert_std_highbits          49,014            49,247                       233    0.48%   x 1.00 
 insert_std_random            49,894            49,323                      -571   -1.14%   x 1.01 
 insert_std_serial            48,969            48,156                      -813   -1.66%   x 1.02 
 iter_ahash_highbits          1,737             1,891                        154    8.87%   x 0.92 
 iter_ahash_random            1,723             1,790                         67    3.89%   x 0.96 
 iter_ahash_serial            1,745             1,822                         77    4.41%   x 0.96 
 iter_std_highbits            1,944             1,663                       -281  -14.45%   x 1.17 
 iter_std_random              1,983             1,664                       -319  -16.09%   x 1.19 
 iter_std_serial              1,938             1,622                       -316  -16.31%   x 1.19 
 lookup_ahash_highbits        6,498             5,821                       -677  -10.42%   x 1.12 
 lookup_ahash_random          5,695             5,344                       -351   -6.16%   x 1.07 
 lookup_ahash_serial          5,928             4,987                       -941  -15.87%   x 1.19 
 lookup_fail_ahash_highbits   7,594             6,514                     -1,080  -14.22%   x 1.17 
 lookup_fail_ahash_random     6,912             5,504                     -1,408  -20.37%   x 1.26 
 lookup_fail_ahash_serial     6,553             5,240                     -1,313  -20.04%   x 1.25 
 lookup_fail_std_highbits     27,497            24,249                    -3,248  -11.81%   x 1.13 
 lookup_fail_std_random       26,023            24,067                    -1,956   -7.52%   x 1.08 
 lookup_fail_std_serial       26,538            23,525                    -3,013  -11.35%   x 1.13 
 lookup_std_highbits          26,154            24,129                    -2,025   -7.74%   x 1.08 
 lookup_std_random            25,502            23,381                    -2,121   -8.32%   x 1.09 
 lookup_std_serial            24,842            23,269                    -1,573   -6.33%   x 1.07 
 rehash_in_place              509,767           402,089                 -107,678  -21.12%   x 1.27

aarch64 - baseline vs baseline-simd128 - mostly a loss

 name                         baseline ns/iter  baseline-simd128 ns/iter  diff ns/iter  diff %  speedup 
 clone_from_large             14,287            14,546                             259   1.81%   x 0.98 
 clone_from_small             120               135                                 15  12.50%   x 0.89 
 clone_large                  14,426            14,109                            -317  -2.20%   x 1.02 
 clone_small                  168               170                                  2   1.19%   x 0.99 
 grow_insert_ahash_highbits   88,879            89,238                             359   0.40%   x 1.00 
 grow_insert_ahash_random     87,318            88,333                           1,015   1.16%   x 0.99 
 grow_insert_ahash_serial     86,675            87,343                             668   0.77%   x 0.99 
 grow_insert_std_highbits     112,599           114,707                          2,108   1.87%   x 0.98 
 grow_insert_std_random       111,288           111,517                            229   0.21%   x 1.00 
 grow_insert_std_serial       110,858           110,923                             65   0.06%   x 1.00 
 insert_ahash_highbits        74,718            69,227                          -5,491  -7.35%   x 1.08 
 insert_ahash_random          66,334            66,766                             432   0.65%   x 0.99 
 insert_ahash_serial          66,182            66,284                             102   0.15%   x 1.00 
 insert_erase_ahash_highbits  88,475            89,487                           1,012   1.14%   x 0.99 
 insert_erase_ahash_random    85,865            88,123                           2,258   2.63%   x 0.97 
 insert_erase_ahash_serial    84,714            84,628                             -86  -0.10%   x 1.00 
 insert_erase_std_highbits    148,085           143,546                         -4,539  -3.07%   x 1.03 
 insert_erase_std_random      141,257           144,946                          3,689   2.61%   x 0.97 
 insert_erase_std_serial      139,078           136,464                         -2,614  -1.88%   x 1.02 
 insert_std_highbits          86,464            87,785                           1,321   1.53%   x 0.98 
 insert_std_random            81,859            86,434                           4,575   5.59%   x 0.95 
 insert_std_serial            80,537            86,210                           5,673   7.04%   x 0.93 
 iter_ahash_highbits          6,262             6,237                              -25  -0.40%   x 1.00 
 iter_ahash_random            6,215             6,403                              188   3.02%   x 0.97 
 iter_ahash_serial            5,992             6,334                              342   5.71%   x 0.95 
 iter_std_highbits            6,088             6,150                               62   1.02%   x 0.99 
 iter_std_random              6,114             6,144                               30   0.49%   x 1.00 
 iter_std_serial              6,099             6,004                              -95  -1.56%   x 1.02 
 lookup_ahash_highbits        18,621            18,923                             302   1.62%   x 0.98 
 lookup_ahash_random          17,852            18,065                             213   1.19%   x 0.99 
 lookup_ahash_serial          17,050            17,013                             -37  -0.22%   x 1.00 
 lookup_fail_ahash_highbits   18,427            18,555                             128   0.69%   x 0.99 
 lookup_fail_ahash_random     17,151            16,969                            -182  -1.06%   x 1.01 
 lookup_fail_ahash_serial     15,873            16,306                             433   2.73%   x 0.97 
 lookup_fail_std_highbits     41,149            53,365                          12,216  29.69%   x 0.77 
 lookup_fail_std_random       40,532            43,951                           3,419   8.44%   x 0.92 
 lookup_fail_std_serial       40,744            43,874                           3,130   7.68%   x 0.93 
 lookup_std_highbits          40,983            44,475                           3,492   8.52%   x 0.92 
 lookup_std_random            40,889            43,213                           2,324   5.68%   x 0.95 
 lookup_std_serial            41,146            44,067                           2,921   7.10%   x 0.93 
 rehash_in_place              1,031,033         1,029,301                       -1,732  -0.17%   x 1.00

aarch64 - baseline vs everything - huge loss

 name                         baseline ns/iter  everything ns/iter  diff ns/iter   diff %  speedup 
 clone_from_large             14,287            13,932                      -355   -2.48%   x 1.03 
 clone_from_small             120               137                           17   14.17%   x 0.88 
 clone_large                  14,426            14,133                      -293   -2.03%   x 1.02 
 clone_small                  168               202                           34   20.24%   x 0.83 
 grow_insert_ahash_highbits   88,879            136,133                   47,254   53.17%   x 0.65 
 grow_insert_ahash_random     87,318            135,240                   47,922   54.88%   x 0.65 
 grow_insert_ahash_serial     86,675            134,558                   47,883   55.24%   x 0.64 
 grow_insert_std_highbits     112,599           169,150                   56,551   50.22%   x 0.67 
 grow_insert_std_random       111,288           160,915                   49,627   44.59%   x 0.69 
 grow_insert_std_serial       110,858           158,551                   47,693   43.02%   x 0.70 
 insert_ahash_highbits        74,718            101,043                   26,325   35.23%   x 0.74 
 insert_ahash_random          66,334            96,060                    29,726   44.81%   x 0.69 
 insert_ahash_serial          66,182            96,693                    30,511   46.10%   x 0.68 
 insert_erase_ahash_highbits  88,475            147,912                   59,437   67.18%   x 0.60 
 insert_erase_ahash_random    85,865            147,487                   61,622   71.77%   x 0.58 
 insert_erase_ahash_serial    84,714            141,591                   56,877   67.14%   x 0.60 
 insert_erase_std_highbits    148,085           197,079                   48,994   33.09%   x 0.75 
 insert_erase_std_random      141,257           198,782                   57,525   40.72%   x 0.71 
 insert_erase_std_serial      139,078           201,886                   62,808   45.16%   x 0.69 
 insert_std_highbits          86,464            122,619                   36,155   41.82%   x 0.71 
 insert_std_random            81,859            120,006                   38,147   46.60%   x 0.68 
 insert_std_serial            80,537            119,304                   38,767   48.14%   x 0.68 
 iter_ahash_highbits          6,262             5,273                       -989  -15.79%   x 1.19 
 iter_ahash_random            6,215             5,275                       -940  -15.12%   x 1.18 
 iter_ahash_serial            5,992             5,267                       -725  -12.10%   x 1.14 
 iter_std_highbits            6,088             5,155                       -933  -15.33%   x 1.18 
 iter_std_random              6,114             5,186                       -928  -15.18%   x 1.18 
 iter_std_serial              6,099             5,190                       -909  -14.90%   x 1.18 
 lookup_ahash_highbits        18,621            27,667                     9,046   48.58%   x 0.67 
 lookup_ahash_random          17,852            27,243                     9,391   52.60%   x 0.66 
 lookup_ahash_serial          17,050            25,842                     8,792   51.57%   x 0.66 
 lookup_fail_ahash_highbits   18,427            28,029                     9,602   52.11%   x 0.66 
 lookup_fail_ahash_random     17,151            26,913                     9,762   56.92%   x 0.64 
 lookup_fail_ahash_serial     15,873            27,003                    11,130   70.12%   x 0.59 
 lookup_fail_std_highbits     41,149            62,951                    21,802   52.98%   x 0.65 
 lookup_fail_std_random       40,532            62,664                    22,132   54.60%   x 0.65 
 lookup_fail_std_serial       40,744            62,555                    21,811   53.53%   x 0.65 
 lookup_std_highbits          40,983            54,984                    14,001   34.16%   x 0.75 
 lookup_std_random            40,889            54,041                    13,152   32.17%   x 0.76 
 lookup_std_serial            41,146            53,634                    12,488   30.35%   x 0.77 
 rehash_in_place              1,031,033         1,919,128                888,095   86.14%   x 0.54

It was mostly just easiest to collect numbers with Wasmtime. I don't know how to easily collect numbers with V8 and/or SpiderMonkey, which would probably have different results on AArch64.
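For anyone re-deriving the table columns, the following sketch shows how the diff and speedup figures relate to the two timings, as far as can be inferred from the numbers themselves: the diff is candidate minus baseline, the percentage is relative to the baseline, and the speedup is baseline divided by candidate. The function name `compare` is made up for this example.

```rust
/// Returns (diff ns/iter, diff %, speedup) for a baseline timing and a
/// candidate timing, both in nanoseconds per iteration.
fn compare(baseline_ns: u64, candidate_ns: u64) -> (i64, f64, f64) {
    // Negative diff means the candidate is faster than the baseline.
    let diff = candidate_ns as i64 - baseline_ns as i64;
    // Percentage change relative to the baseline.
    let diff_pct = diff as f64 / baseline_ns as f64 * 100.0;
    // Values above 1.0 mean the candidate is faster.
    let speedup = baseline_ns as f64 / candidate_ns as f64;
    (diff, diff_pct, speedup)
}
```

For example, `compare(5101, 4666)` gives a diff of -435, roughly -8.53%, and a speedup of about 1.09, matching the `clone_from_large` row of the first x86_64 table.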

CryZe · 2021-05-27T18:44:56Z

@alexcrichton Thanks for benchmarking this. I also opened #271, which switches the groups from u32 to u64 (not to v128), so that by itself should possibly be a decently large win on both architectures. So if we get some numbers for that PR, maybe we can close this one in favor of that. (Although that one is orthogonal to this one anyway.)

Amanieu · 2021-05-27T19:16:57Z

This is unfortunately a situation where the underlying architecture leaks out: we can either optimize for x86 or arm, but not both. The core of the issue is that the x86 pmovmskb is a single instruction with 1 cycle of latency while the aarch64 code sequence for bitmask is a single dependency chain with a total latency of 24 cycles (calculated based on Cortex-A72 instruction timings).
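To make the dependency chain concrete, here is a scalar sketch of the general shape such a lowering takes on a target without a move-mask instruction: spread the sign bit across each lane, mask each lane with its own bit weight, then combine lanes with a horizontal add. This is an assumption-laden model for illustration only; the exact instructions an engine emits may differ, and `bitmask_multi_step` is a hypothetical name.

```rust
/// Scalar model of a multi-step bitmask computation. Each step consumes
/// the result of the previous one, so the steps cannot overlap.
fn bitmask_multi_step(lanes: [u8; 16]) -> u16 {
    // Step 1: arithmetic shift right by 7 turns every lane into
    // 0xFF (top bit was set) or 0x00 (top bit was clear).
    let spread: [u8; 16] = lanes.map(|l| ((l as i8) >> 7) as u8);

    // Step 2: AND each lane with its bit weight 1, 2, 4, ..., 128,
    // repeated for the low and the high half of the vector.
    let mut weighted = [0u8; 16];
    for i in 0..16 {
        weighted[i] = spread[i] & (1u8 << (i % 8));
    }

    // Step 3: pair lane i of the low half with lane i of the high half
    // as one 16-bit value, then add all eight values together. The set
    // bits never collide, so the sum equals the bitwise OR.
    let mut sum = 0u16;
    for i in 0..8 {
        sum += u16::from(weighted[i]) | (u16::from(weighted[i + 8]) << 8);
    }
    sum
}
```

The result is the same 16-bit mask a single move-mask instruction would produce, but it is reached through several serial steps, which is where the extra latency comes from.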

A few years ago I attempted to implement NEON support for hashbrown (https://github.com/Amanieu/hashbrown/blob/neon/src/raw/neon.rs) but the results were about the same as the generic integer version so I dropped it. I wonder how an equivalent algorithm would perform in wasm for x86 and aarch64.

In the end I'm not sure what the best approach to use here is. We can't optimize for a specific target architecture since wasm specifically hides this from us.

alexcrichton · 2021-05-27T19:52:52Z

If interested, the best thing to do here would probably be to replicate these results on a likely-more-battle-tested-and-production-ready SIMD engine, i.e. V8 or SpiderMonkey. After that, if it's still an issue, opening an issue on https://github.com/WebAssembly/simd would most likely be the way to go.

bors · 2023-05-18T08:19:53Z

☔ The latest upstream changes (presumably #430) made this pull request unmergeable. Please resolve the merge conflicts.

CryZe force-pushed the wasm-simd branch from 3b2bc3a to 96d625f Compare May 26, 2021 15:44

CryZe force-pushed the wasm-simd branch from 96d625f to 2004def Compare May 26, 2021 15:46

michaelwoerister mentioned this pull request Oct 11, 2021

Provide a SIMD implementation of swisstable_group_query suitable for ARM rust-lang/odht#17

Open
