This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Documenting performance tradeoffs #200

Open
zeux opened this issue Feb 27, 2020 · 6 comments

@zeux
Contributor

zeux commented Feb 27, 2020

I've taken a stab at documenting the performance tradeoffs between various instructions and collected the information in this repository:

https://github.com/zeux/wasm-simd

Obviously this is far from a normative reference, and is designed merely to serve as an easy-to-use guide to refer to when writing WASM SIMD programs. It focuses on comparing performance using the same code generator (v8) on the two most prominent architectures (x64/arm64).

The instruction performance comparison table summarizes the codegen for various instructions, which makes it easier to highlight performance cliffs - I think most/all of them are already present on this repository:

https://github.com/zeux/wasm-simd/blob/master/Instructions.md

The shuffle listings summarize the shuffle patterns that are matched by both x64 and arm64 backends; unfortunately during this investigation I learned that some shuffle patterns that are first-class in x64 world aren't supported on arm64 (e.g. blends and half-shuffles), and 32-bit swizzles are pretty slow on arm64 (5 move instructions) which isn't great for some floating-point code.

https://github.com/zeux/wasm-simd/blob/master/Shuffles.md

(there are also surprisingly few shuffle patterns that have predictably high performance across x64/arm64)
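
To make the shuffle categories above concrete, here is a minimal sketch that classifies a 16-byte shuffle mask (16 lane indices in the range 0..31, indexing into the concatenation of two input vectors). It is only an illustration of what "blend" and "32-bit swizzle" mean at the mask level; the function names are hypothetical and this is not how V8's pattern matcher is implemented.

```python
from typing import List, Optional, Sequence


def validate(mask: Sequence[int]) -> None:
    """A byte shuffle mask has 16 indices, each selecting one of 32 input bytes."""
    if len(mask) != 16 or any(not 0 <= m < 32 for m in mask):
        raise ValueError("expected 16 lane indices in the range 0..31")


def is_blend(mask: Sequence[int]) -> bool:
    """True if output byte i always comes from byte i of one of the two inputs."""
    validate(mask)
    return all(m in (i, i + 16) for i, m in enumerate(mask))


def as_32bit_swizzle(mask: Sequence[int]) -> Optional[List[int]]:
    """Return the four 32-bit lane indices if the mask only permutes whole
    32-bit lanes of the first input, otherwise None."""
    validate(mask)
    lanes = []
    for group in range(4):
        first = mask[4 * group]
        # The group must start on a 4-byte boundary inside the first input.
        if first >= 16 or first % 4 != 0:
            return None
        # The remaining three bytes must follow consecutively.
        if any(mask[4 * group + k] != first + k for k in range(1, 4)):
            return None
        lanes.append(first // 4)
    return lanes
```

For example, the mask `[12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3]` reverses the four 32-bit lanes of one vector, so `as_32bit_swizzle` reports lanes `[3, 2, 1, 0]`, while `is_blend` is false for it.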

I'm just filing this issue for visibility; please let me know if I can improve these in any way and if these seem helpful (or harmful).

@nfrechette

Great stuff! I will definitely keep it open in a tab as a reference when porting my code.

It is worth noting that on ARM v7 and ARM 64, it is still common for scalar code to remain faster than SIMD even when the code is hand optimized for optimal code generation. If WASM code is written with SIMD intrinsics to primarily target x64, it will likely require a different version altogether for ARM even if intrinsics map 1:1 to instructions on both platforms.

For example, quaternion multiplication (both with quat/quat and quat/vector3) is faster with SIMD on x64 but slower than scalar on ARM.

The fact that native instructions are not available on some platforms (as you have very well documented) coupled with the fact that optimal code often requires very different code altogether highlights the importance to be able to write WASM code per platform and have the runtime/webserver strip implementations it does not need in a future version of WASM SIMD.

@dtig
Member

dtig commented Feb 28, 2020

Thanks @zeux for documenting this. As it documents the V8 implementation, I'm leaving a disclaimer here that SIMD support in V8 is still experimental and may change as we shift focus to optimization once we have full support.

> Great stuff! I will definitely keep it open in a tab as a reference when porting my code.
>
> It is worth noting that on ARM v7 and ARM 64, it is still common for scalar code to remain faster than SIMD even when the code is hand optimized for optimal code generation. If WASM code is

I would be interested in more specifics here, do you mean ARM64 with/without Neon?

> written with SIMD intrinsics to primarily target x64, it will likely require a different version altogether for ARM even if intrinsics map 1:1 to instructions on both platforms.
>
> For example, quaternion multiplication (both with quat/quat and quat/vector3) is faster with SIMD on x64 but slower than scalar on ARM.

Are there other examples where this happens? It would be an interesting data point, and speaking as an implementer, it would be useful for us to dig into this more and see if we're missing something that would make this better. Also a potentially useful thing to have in our benchmark suite to see how this performs on different hardware.

> The fact that native instructions are not available on some platforms (as you have very well documented) coupled with the fact that optimal code often requires very different code altogether highlights the importance to be able to write WASM code per platform and have the runtime/webserver strip implementations it does not need in a future version of WASM SIMD.

I would disagree with your conclusion here about writing Wasm code per platform. The goal of having a portable specification here is to reduce the overhead of having platform-specific implementations, and for a portable specification performance tradeoffs do unfortunately have to happen. If platform-specific code is the goal, then it seems to be a case where platform-specific XMM/Neon intrinsics should be used, and not a use case for Wasm SIMD. Given that native instructions on some platforms are not available for common operations, the goal of the standards process (with application feedback being an important part of this) is to work as a feedback loop to formalize a proposal that is both useful and portable.

@zeux
Contributor Author

zeux commented Feb 28, 2020

@dtig Absolutely, I will note this in the Instructions.md list - in a lot of cases the codegen is probably close to optimal but there are definitely some fallouts that I'd expect v8 to do a bit better at over time.

@mingqiusun

mingqiusun commented Feb 28, 2020

@nfrechette By WASM per platform, did you mean to increase the richness of WASM SIMD instruction sets, so that different WASM codes could be generated per platform for optimal performance? I like the idea, as this would not break portability. For example, we can have bit_select and byte_select instructions implemented on all platforms with different performance characteristics, and let a compiler generate optimized WASM code per platform.
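
As a rough illustration of how the two selects mentioned above could differ, the sketch below models them on 16-byte values. `bit_select` follows the per-bit behaviour of Wasm SIMD's `v128.bitselect` (`(a & mask) | (b & ~mask)`). `byte_select` is a hypothetical instruction: the semantics used here (the top bit of each mask byte picks the whole byte, in the spirit of x86 `PBLENDVB`) and the operand order are assumptions made for this example only.

```python
def bit_select(a: bytes, b: bytes, mask: bytes) -> bytes:
    """Per-bit select: take each bit from a where the mask bit is 1, else from b."""
    return bytes((x & m) | (y & ~m & 0xFF) for x, y, m in zip(a, b, mask))


def byte_select(a: bytes, b: bytes, mask: bytes) -> bytes:
    """Per-byte select (assumed semantics): take the whole byte from a when the
    top bit of the corresponding mask byte is set, else from b."""
    return bytes(x if m & 0x80 else y for x, y, m in zip(a, b, mask))
```

When every mask byte is either `0x00` or `0xFF` (the usual result of a lane-wise comparison), the two functions return the same value, which is why a compiler could pick whichever form is cheaper on the target. They diverge only for partial masks such as `0x80`.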

@nfrechette

@dtig As the discussion around shuffles has shown, some mappings will have poor or unexpected performance on some platforms and this isn't easily avoided without more platform-specific intrinsics (which may map poorly on other platforms, etc). A lot of the code I write is hand optimized with intrinsics for x64, ARMv7, and ARM64. Adding custom flavors of those for WASM wouldn't be hard as long as I can get reasonable mappings to the native instructions I already use. It becomes harder and more work to maintain but that is the price to pay for optimal performance across a wide range of platforms. Without per-platform SIMD code, we have to settle for a middle-ground approach where some intrinsics map poorly on some platforms and we just have to live with that. Similarly, some SIMD algorithms map poorly on some platforms and require a different implementation (e.g. quaternion multiplication in SIMD vs scalar). If we want to keep WASM SIMD a sort of middle ground without per-platform stripping, that's fair. Documenting how various intrinsics map to which native instructions on various platforms then becomes a must.

For a good example where different flavors make a difference, see the benchmark code in my math library, Realtime Math. Multiplying a quaternion with a vector3 shows a very significant difference.

On my Haswell laptop and my Ryzen 2990x desktop, the SSE2 version is consistently much faster than the reference or scalar implementations.

However, on ARM the picture isn’t as clear.

On my iPad Pro with ARM64:
Ref takes 47.5 ns
Scalar takes 36.0 ns
Neon64 takes 32.7 ns
Neon takes 26.6 ns

On my Pixel 3 with ARM64:
Ref takes 59.1 ns
Scalar takes 44.7 ns
Neon64 takes 51.0 ns
Neon takes 66.0 ns

On my Pixel 3 with ARMv7:
Ref takes 68.8 ns
Scalar takes 56.7 ns
Neon64 takes 84.5 ns
Neon takes 69.3 ns

On my Pixel 3, the scalar implementation is consistently and clearly much faster than the SIMD variants but that isn’t true on iPad. The best middle ground based on the above numbers is the scalar variant.
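
The "best middle ground" claim can be checked against the numbers above with a small calculation. The sketch below assumes one possible criterion that the comment itself does not spell out: for each variant, take its worst slowdown relative to the fastest variant on each device, then pick the variant whose worst case is smallest. The function names are hypothetical; the timings are copied from the lists above.

```python
from typing import Dict

# Timings in nanoseconds, copied from the benchmark results above.
TIMINGS_NS: Dict[str, Dict[str, float]] = {
    "iPad Pro (ARM64)": {"Ref": 47.5, "Scalar": 36.0, "Neon64": 32.7, "Neon": 26.6},
    "Pixel 3 (ARM64)": {"Ref": 59.1, "Scalar": 44.7, "Neon64": 51.0, "Neon": 66.0},
    "Pixel 3 (ARMv7)": {"Ref": 68.8, "Scalar": 56.7, "Neon64": 84.5, "Neon": 69.3},
}


def worst_case_slowdown(timings: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """For each variant, the largest ratio to the per-device best time."""
    worst: Dict[str, float] = {}
    for results in timings.values():
        best = min(results.values())
        for variant, ns in results.items():
            worst[variant] = max(worst.get(variant, 1.0), ns / best)
    return worst


def best_middle_ground(timings: Dict[str, Dict[str, float]]) -> str:
    """Variant with the smallest worst-case slowdown (assumed criterion)."""
    worst = worst_case_slowdown(timings)
    return min(worst, key=worst.get)
```

Under this criterion the scalar variant is never more than about 1.35x slower than the best option on any of the three devices, while both Neon variants are almost 1.5x slower than the best on at least one device, which is consistent with choosing scalar as the compromise.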

The situation is similarly complicated for the quaternion/quaternion multiplication. On my Haswell laptop it seems that using XOR is slower than FMUL to flip the component signs (probably a penalty for switching domains), but that isn't (and shouldn't be) the case on modern processors such as my Ryzen. On ARM64 we can merge the multiplication with FMA, so XOR becomes strictly slower, and on my Pixel 3 using more shuffles instead of FMA is even faster. On ARMv7 FMA remains faster than XOR as well. As such, on x64 we would ideally like to use XOR, while on ARM FMA makes more sense. You could suggest using FMA on x64 as well when available, but it is slower than XOR on my Haswell and Ryzen CPUs (and on most mainstream CPUs out there that support FMA).
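
For readers unfamiliar with the two sign-flipping approaches being compared, the sketch below models them on four 32-bit float lanes: XOR-ing the IEEE 754 sign bit versus multiplying by -1.0 or 1.0. It only demonstrates that both produce the same bits for ordinary values; it says nothing about their relative speed, and the helper names are hypothetical.

```python
import struct
from typing import List, Sequence

SIGN_BIT = 0x80000000  # sign bit of an IEEE 754 binary32 value


def f32_to_bits(x: float) -> int:
    """Reinterpret a value (rounded to binary32) as its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a binary32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def flip_lanes_xor(v: Sequence[float], masks: Sequence[int]) -> List[float]:
    """Flip signs by XOR-ing each lane with 0 or SIGN_BIT (integer domain)."""
    return [bits_to_f32(f32_to_bits(x) ^ m) for x, m in zip(v, masks)]


def flip_lanes_mul(v: Sequence[float], factors: Sequence[float]) -> List[float]:
    """Flip signs by multiplying each lane by 1.0 or -1.0 (float domain)."""
    return [bits_to_f32(f32_to_bits(x * f)) for x, f in zip(v, factors)]
```

With `v = [1.0, -2.5, 0.0, 4.0]`, XOR masks `[SIGN_BIT, 0, SIGN_BIT, 0]` and factors `[-1.0, 1.0, -1.0, 1.0]` give bit-identical results, including the negative zero in the third lane. The multiply form is the one that can be folded into a fused multiply-add, which is the tradeoff discussed above.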

These are just micro-benchmarks though, and I haven't yet managed to test them side by side in my compression library to compare them under more realistic conditions.

I’ll be happy to work with whatever WASM SIMD ends up having but the ability to tailor code per platform (or even per CPU) would be very nice sometime down the road.

@tlively
Member

tlively commented Jun 4, 2020

FWIW @juj has been documenting performance considerations for his emulated SSE intrinsics on top of Wasm SIMD at https://github.com/emscripten-core/emscripten/blob/master/site/source/docs/porting/simd.rst. This documentation seems very similar to the documentation we would have for the Wasm SIMD intrinsics themselves.
