Conversation
I would be interested to hear others' thoughts, but IMO this seems to fall outside the scope of this proposal. It's definitely something we should consider, but I think we should continue the discussion of this as a separate proposal at WebAssembly/design#1364.
I agree, this would make much more sense "upstream".
IMO it makes most sense to add Prefetch instructions as part of the SIMD specification, for two reasons:
I'm sympathetic to those arguments, but prefetch opens up a can of semantic worms. Since it's semantically a nop, it's unclear how the specification can say anything about how it should be ordered with respect to other instructions. We would have to involve the wider CG in a discussion about the proper way to specify that, and I expect that those discussions would delay shipping this proposal by multiple months. It would be much better to split these instructions and all the questions they raise into a follow-up proposal so we can get this one out the door.
This request is usually phrased in terms of a prefetch instruction that the producer knows where to place (through magic, more or less). This makes sense both because the CPUs have prefetch instructions and because the front end can help out with where to place the prefetch. But would an alternative be a "hot load" instruction, that bundles the prefetch with the load (as a prefix now), that the JIT can then try to place meaningfully?
Wouldn't it be a 'cold load' since the data isn't in the cache for that load? Generally speaking, I think it would be best to leave the instruction where it is in the stream. The compiler shouldn't attempt to move it much. Prefetching is often added once many other optimizations have already taken place, and by that point you'll have to carefully measure the benefits (if any) of adding prefetching and where. With out-of-order processors, prefetching shouldn't be casually added just anywhere (unlike with in-order processors, where it was often copy/pasted everywhere). Their usage definitely isn't as common, but where it is used, it can provide significant benefits.

This circles back somewhat to the general pain of writing SIMD in a language that can compile to multiple ASM SIMD flavors. Where a prefetch is best placed on ARM might not be the same on x64 (although it might not make a huge difference then). I wonder what ISPC is doing here, since it faces a somewhat similar problem.

Prefetching getting added alongside previous SIMD instruction sets is a coincidence, IMO. New SSE/NEON standards don't come out very often. It is simply one more way to help better utilize modern processors. Scalar code can benefit from prefetching just as much. It really depends on what the code is doing.
This is potato-potahto, but the instruction is hot (i.e. it should not be expensive), hence the nomenclature. But we move on...
This will tend to inhibit optimizations in a JIT and "much" is not very precise. Currently we (Firefox) pretty much have instructions that are fully moveable and instructions that don't move at all. Can we come up with rules that are better than "treat the prefetch as a reordering barrier"?
And different JITs for the same architecture may also affect the outcome significantly. You test your code on JIT A and place your prefetches carefully, then the web exposes your code to JIT B, which has some hoisting or checking optimization that shrinks the path from prefetch to load and makes the prefetch placement suboptimal / wrong.
It's not even JIT A vs JIT B, it could just as well be JIT A.1 vs JIT A.2. I'm not convinced that we want prefetch instructions to be a barrier any more than it's a barrier for the CPU itself, which as far as I know it's not. I could be convinced otherwise, but I'd like to see hard data before committing to making the instructions barriers. Beyond that, it's perfectly valid to insert a bunch of random unobservable loads/stores between the prefetch and the subsequent load, which from a performance PoV is likely just as bad as moving the prefetch.
I would like to see prefetch in the SIMD specification. I've used it often, but only in SIMD.
@kmiller68, the use of "barrier" was just meant to illustrate that if there's a prefetch instruction and the desire is for it to not move "too much" in the code, then a reasonably precise meaning for that would have to be found, or it's pointless to have it. The most precise I have come up with so far is that it acts as a reordering barrier in the semantics - bytecodes preceding it have to be executed before it, bytecodes succeeding it have to be executed after it, much in the way of a store to memory - not that it is expressed as an actual reordering barrier in the hardware. Improvements on that are obviously welcome, but it has to be something the compiler can relate to.
I would like to highlight that hardware prefetchers keep getting better, to the point that Agner Fog says the following about the Zen prefetcher (architecture.pdf, 20.16):
I am really curious to see whether there are any benefits to prefetch instructions on any recent hardware, and especially for 128-bit SIMD code (the target for current SIMD WASM).
I don't think prefetches should be like barriers; some reordering is fine provided it doesn't end up 100 instructions away or on the other side of an important loop. The key is to have the behavior be predictable. Even if we lose a few cycles doing a sub-optimal prefetch, we can still save hundreds of cycles from the cache miss.

@lemaitre In my code, software prefetching gave a 10-20% boost (even though all my reads are linear and contiguous and hardware-prefetcher friendly) on my Ryzen 2950X and my Pixel 3 phone. Agner is right that with the advent of hardware prefetching and new chips supporting more parallel streams, their usage isn't as important anymore and can hurt performance in some scenarios. But they remain an important tool.

The hardware prefetcher doesn't kick in until you've done 2 cache misses in a pattern it recognizes (and only if it can accommodate the stream at the L1/L2/L3 levels). Depending on your code, that might be significant. In my animation decompression, some streams are very densely packed and might fit in 2-3 cache lines, meaning the hardware prefetcher often doesn't have time to kick in. I also know the memory layout and where TLB misses are likely to happen. Software prefetching allows me to hide that latency as well.

Code that does random access but performs at least 100+ instructions with each access can benefit as well. For example, querying an R-tree or B-tree with multiple children per node: you can easily prefetch the next node to process while processing the current one (see the sketch below). Hardware prefetching won't help you here, and each node is susceptible to TLB misses as well. In code like this, software prefetching will always help, and the hardware is unlikely ever to be able to.
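To make the tree-walking case concrete, here is a minimal C sketch of the pattern, using GCC/Clang's `__builtin_prefetch` as a stand-in for the proposed instruction. The node layout and the amount of per-node work are hypothetical, not from any code in this thread.

```c
#include <stddef.h>

/* Hypothetical node type, sized so each visit does enough work
   to hide the latency of the prefetch issued for the next node. */
struct node {
    struct node *next;
    int payload[32];
};

long walk(const struct node *n) {
    long sum = 0;
    while (n != NULL) {
        /* Start pulling the next node into cache now; by the time the
           loop gets there, the (potential) cache/TLB miss is hidden.
           Arguments: address, rw = 0 (read), locality = 3 (keep in cache).
           Prefetching NULL at the last node is harmless: prefetch never faults. */
        __builtin_prefetch(n->next, 0, 3);
        for (int i = 0; i < 32; i++)   /* stand-in for real per-node work */
            sum += n->payload[i];
        n = n->next;
    }
    return sum;
}
```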
That's a good high-level goal, but we still need to formalize it in such a way that it can be implemented and guaranteed by compilers (both in the engine and in the producing toolchain). That's what @lars-t-hansen was talking about in his previous comment.
IMO prefetch instructions should NOT be considered barriers. Hardware prefetch instructions are not barriers at the architecture level, i.e. the CPU can execute them in a different order than they appear in the native instruction stream. Thus, it would be strange if WebAssembly engines tried to enforce stronger memory ordering than the hardware, and it would be practically impossible on out-of-order processors. I suggest we experiment with implementing Prefetch in V8 and/or SpiderMonkey and evaluate its impact on real-world tasks.
@Maratyszcza, that issue, which was an issue of wording only, has already been addressed (see my latest comment above), and the substantive issue here is not how to "experiment" with a prefetch but how to express its meaning to the compiler.
In my experience, on the Intel ISA prefetch is most useful outside of tight SIMD kernels: the hardware prefetcher works well when you're processing streams of data, and it doesn't work well when the access pattern is unpredictable. Historically, for in-order architectures and/or ones without hardware prefetchers, it could make sense to use it for stream processing, but those days have mostly passed. In fact, on Intel processors it's almost easier to introduce performance regressions by adding prefetch for stream processing than to improve performance... This feels like it's outside of the scope.
I had some code that did a prefetch, but it was at the end of the loop. That's fine in the case where memory is the bottleneck, but if you run the code on data that is already in the L1 cache, the prefetches can slow down the code. So instead of 2 prfm just before the branch, I moved them just after some math, e.g.:
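The original assembly did not survive extraction; below is a hedged C sketch of the same idea, with invented names, strides, and prefetch distances, using `__builtin_prefetch` (which Clang/GCC lower to PRFM on arm64) in place of the original prfm instructions.

```c
#include <stdint.h>

/* Sketch: process one row, with the two prefetches issued right after
   some arithmetic instead of at the bottom of the loop just before the
   branch. The distances (256/320 bytes ahead) are illustrative only. */
void scale_row(const uint8_t *src, uint8_t *dst, int n) {
    for (int i = 0; i + 1 < n; i += 2) {
        uint8_t t = (uint8_t)((src[i] + src[i + 1]) / 2); /* some math */
        __builtin_prefetch(src + i + 256, 0, 0); /* prfm moved here,    */
        __builtin_prefetch(src + i + 320, 0, 0); /* not at the loop end */
        dst[i / 2] = t;                          /* rest of the body    */
    }
}
```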
Running 120 benchmarks at 128x72 resolution, 10,000 times each, the differences are subtle (3 runs of each benchmark, comparing prfm in the middle of the loop, as sketched above, against prfm at the end).
Increasing the resolution to 1280x720 and running 12 tests 1,000 times each (no prfm vs. prfm middle vs. prfm end): prefetching improves this example, scaling with a bilinear filter to 1280x720, by 10%.
Will prefetch instructions need a bounds check on the address?
@ngzhian Prefetch of unallocated memory does not cause SEGFAULT, so no bounds check needed. |
Prototyped on arm64 https://crrev.com/c/2543167 |
As proposed in WebAssembly/simd#352 and using the opcodes used in the V8 prototype: https://chromium-review.googlesource.com/c/v8/v8/+/2543167. These instructions are only usable via intrinsics and clang builtins to make them opt-in while they are being benchmarked.

Differential Revision: https://reviews.llvm.org/D93883
As proposed in WebAssembly/simd#352, using the opcodes used in the LLVM and V8 implementations.
I evaluated the performance impact on end-to-end sparse inference in convolutional neural networks by modifying the SpMM microkernels in the XNNPACK library. I used three sparse neural network models:
Performance results on ARM64 are presented below:
[ARM64 results table]
Performance results on x86-64 systems:
[x86-64 results table]
IMO, consistent performance losses on one architecture are a bit of a disqualifying factor. However, the issues with prefetch run deeper than this: it is not really portable in terms of performance. The effects are usually specific to a particular microarchitecture, not broad x86/Arm, but model/family/etc. There might be a workload where you get a slowdown on Arm and a speedup on x86, or a slowdown on one x86 chip and a speedup on another. As an example, here is a link to a paper describing challenges with prefetch. I am provisionally against this PR.
I agree, the portability is rather disappointing. The better performance on ARM vs. x86 might be due to the prefetch code being ported from the ARM implementation. Likewise, provisionally against this proposal.
FWIW JPEG XL uses prefetch for ANS alias tables and filtering, and did see some modest gains, IIRC also on x86. On balance, though, I agree prefetch is problematic due to lack of performance portability, and we've also seen some perf penalty, so would be OK with leaving it out.
Adding a preliminary vote for the inclusion of prefetch operations to the SIMD proposal below. Please vote with:
- 👍 For including prefetch operations
- 👎 Against including prefetch operations
The community group decided against including these instructions in #429 due to performance portability concerns. |
Removing prefetch operations as per the vote in the github issue: WebAssembly/simd#352

Bug: v8:11168
Change-Id: Ia72684e68ce886f8f26a7d3b5bea601be416dfab
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2771758
Reviewed-by: Jakob Kummerow <jkummerow@chromium.org>
Reviewed-by: Maya Lekova <mslekova@chromium.org>
Reviewed-by: Zhi An Ng <zhin@chromium.org>
Commit-Queue: Deepti Gandluri <gdeepti@chromium.org>
Cr-Commit-Position: refs/heads/master@{#73578}
Introduction
Most modern instruction sets include prefetch instructions. These instructions have no explicit effects, but provide a hint to the processor to pre-load soon-to-be-used data from memory into cache. As these instructions have only side effects, they don't directly affect SIMD registers. However, their usage is closely associated with SIMD processing (e.g. on x86 they were added in SSE, and on ARM in ARMv7, together with NEON), thus I suggest they be part of the specification.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with SSE instruction set
- `prefetch.t(mem)` is lowered to `PREFETCHT0 [mem]`
- `prefetch.nt(mem)` is lowered to `PREFETCHNTA [mem]`
ARM64 processors
- `prefetch.t(mem)` is lowered to `PRFM PLDL1KEEP, [Xmem]`
- `prefetch.nt(mem)` is lowered to `PRFM PLDL1STRM, [Xmem]`
ARMv7 processors
- `prefetch.t(mem)` is lowered to `PLD [Rmem]`
- `prefetch.nt(mem)` is lowered to `PLD [Rmem]`
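For reference (not part of the proposal), the two hint levels line up with the locality argument of GCC/Clang's `__builtin_prefetch`, which native compilers typically lower to the same instructions listed above. A minimal sketch:

```c
/* Sketch relating the two proposed hints to the portable C builtin. */
static inline void prefetch_temporal(const void *p) {
    /* locality 3 (keep in all cache levels): typically lowered to
       PREFETCHT0 on x86 and PRFM PLDL1KEEP on arm64, i.e. the same
       lowering as prefetch.t. */
    __builtin_prefetch(p, 0, 3);
}

static inline void prefetch_non_temporal(const void *p) {
    /* locality 0 (minimize cache pollution): typically lowered to
       PREFETCHNTA on x86 and PRFM PLDL1STRM on arm64, i.e. the same
       lowering as prefetch.nt. */
    __builtin_prefetch(p, 0, 0);
}
```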