Prefetch instructions #352

Maratyszcza · 2020-09-19T01:48:00Z

Introduction

Most modern instruction sets include prefetch instructions. These instructions have no explicit effects, but provide a hint to the processor to pre-load soon-to-be-used data from memory into cache. As these instructions have only side-effects, they don't directly affect SIMD register. However, their usage is closely associated with SIMD processing (e.g. on x86 they were added in SSE, and on ARM -- in ARMv7, together with NEON), thus I suggest they should be part of the specification.

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience, compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with SSE instruction set

prefetch.t
- prefetch.t(mem) is lowered to PREFETCHT0 [mem]
prefetch.nt
- prefetch.nt(mem) is lowered to PREFETCHNTA [mem]

ARM64 processors

prefetch.t
- prefetch.t(mem) is lowered to PRFM PLDL1KEEP, [Xmem]
prefetch.nt
- prefetch.nt(mem) is lowered to PRFM PLDL1STRM, [Xmem]

ARMv7 processors

prefetch.t
- prefetch.t(mem) is lowered to PLD [Rmem]
prefetch.nt
- prefetch.nt(mem) is lowered to PLD [Rmem]

tlively · 2020-09-19T02:04:53Z

I would be interested to hear others' thoughts, but IMO this seems to fall outside the scope of this proposal. It's definitely something we should consider, but I think we should continue the discussion of this as a separate proposal at WebAssembly/design#1364.

penzn · 2020-09-19T05:27:55Z

I agree, this would make much more sense "upstream".

Maratyszcza · 2020-09-19T09:47:22Z

IMO it makes most sense to add Prefetch instructions as part of the SIMD specification, for two reasons:

Like SIMD, Prefetch is a performance feature. Prefetch is typically used in the same codebases that use SIMD instructions (note that examples in the Applications section of the proposal interleave SIMD operations and Prefetch).
Prefetch solves the problem inherent to SIMD computations: the problem of memory subsystem not catching up to data processing. Scalar computations almost never deal with this issue, because compute operations themselves become the bottleneck, and the capabilities of the out-of-order engines in the processors become sufficient to keep the scalar units fed with the smaller amount of in-memory data. Thus, there is little incentive to introduce Prefetch instructions outside of SIMD, as evidenced by both x86 and ARM instruction sets introducing Prefetch and SIMD instructions simultaneously.

tlively · 2020-09-20T00:37:51Z

I'm sympathetic to those arguments, but prefetch opens up a can of semantic worms. Since it's semantically a nop, it's unclear how the specification can say anything about how it should be ordered with respect to other instructions. We would have to involve the wider CG in a discussion about the proper way to specify that, and I expect that those discussions would delay shipping this proposal by multiple months. It would be much better to split these instructions and all the questions they raise into a follow-up proposal so we can get this one out the door.

lars-t-hansen · 2020-09-21T09:15:05Z

This request is usually phrased in terms of a prefetch instruction that the producer knows where to place (through magic, more or less). This makes sense both because the CPUs have prefetch instructions and because the front end can help out with where to place the prefetch. But would an alternative be a "hot load" instruction, that bundles the prefetch with the load (as a prefix now), that the JIT can then try to place meaningfully?

nfrechette · 2020-09-21T13:53:50Z

Wouldn't it be a 'cold load' since the data isn't in the cache for that load?
In practice, it isn't always possible to keep the instruction close to the load and a compiler might not be able to figure out where it should go. For example, my animation decompression works by processing one cache line worth of data at a time. I prefetch the next cache line ahead but many other things happen before we finally start processing it to make sure the latency is fully hidden in case of a TLB miss. Other times, you want to bundle several prefetches/loads that miss together to avoid bubbles in the execution pipelines and allow the processor to prepare as much work as possible in the cache miss's shadow. It is not uncommon for a prefetch to live in one function while the load ends up in a different function.

Generally speaking, I think it would be best to leave the instruction where it is in the stream. The compiler shouldn't attempt to move it much. Prefetching is often added once many other optimizations have taken place first and by that point, you'll have to carefully measure the benefits (if any) of adding prefetching and where. With out-of-order processors, prefetching shouldn't be casually added just anywhere (unlike with in-order processors where it was often copy/pasted everywhere). Their usage definitely isn't as common but where it is used, it can provide significant benefits.

This circles back somewhat to the general pain of writing SIMD in a language that can compile to multiple ASM SIMD flavors. Where a prefetch is best placed with ARM might not be the same with x64 (although it might not make a huge difference then). I wonder what ISPC is doing with this since it somewhat faces a similar problem.

Prefetching getting added with previous SIMD instruction sets is a coincidence IMO. New SSE/NEON standards don't come out very often. It is simply one more way to help better utilize modern processors. Scalar code can benefit from prefetching just as much. It really depends on what the code is doing.

lars-t-hansen · 2020-09-21T17:15:44Z

Wouldn't it be a 'cold load' since the data isn't in the cache for that load?

This is potato-potahto but the instruction is hot (ie should not be expensive), hence the nomenclature. But we move on...

Generally speaking, I think it would be best to leave the instruction where it is in the stream. The compiler shouldn't attempt to move it much.

This will tend to inhibit optimizations in a JIT and "much" is not very precise. Currently we (Firefox) pretty much have instructions that are fully moveable and instructions that don't move at all.

Can we come up with rules that are better than "treat the prefetch as a reordering barrier"?

This circles back somewhat to the general pain of writing SIMD in a language that can compile to multiple ASM SIMD flavors. Where a prefetch is best placed with ARM might not be the same with x64 (although it might not make a huge difference then). I wonder what ISPC is doing with this since it somewhat faces a similar problem.

And different JITs for the same architecture may also affect the outcome significantly. You test your code on JIT A and place your prefetches carefully, then the web exposes your code to JIT B which has some hoisting or checking optimization that shrinks the path from prefetch to load and makes the prefetch placement suboptimal / wrong.

kmiller68 · 2020-09-21T18:08:29Z

And different JITs for the same architecture may also affect the outcome significantly. You test your code on JIT A and place your prefetches carefully, then the web exposes your code to JIT B which has some hoisting or checking optimization that shrinks the path from prefetch to load and makes the prefetch placement suboptimal / wrong.

It's not even JIT A vs JIT B, it could just as well be JIT A.1 vs JIT A.2. I'm not convinced that we want prefetch instructions to be a barrier any more than it's a barrier for the CPU itself, which as far as I know it's not. I could be convince otherwise but I'd like to see hard data before committing to making the instructions barriers. Beyond that, It's perfectly valid to insert a bunch of random unobservable loads/stores between the prefetch and the subsequent load, which from a performance PoV is likely just as bad as moving the prefetch.

fbarchard · 2020-09-22T06:03:48Z

I would like to see prefetch in the SIMD specification. I've used it often, but only in SIMD.
The placement isnt critical, but it can be co-issued for free after a math instruction, with the same characteristics as a load.

lars-t-hansen · 2020-09-22T07:29:28Z

@kmiller68, the use of "barrier" was just meant to illustrate that if there's a prefetch instruction and the desire is for it to not move "too much" in the code, then a reasonably precise meaning for that would have to be found, or it's pointless to have it. The most precise I have come up with so far is that it acts as a reordering barrier in the semantics - bytecodes preceding it have to be executed before it, bytecodes succeeding it have to be executed after it, much in the way of a store to memory - not that it is expressed as an actual reordering barrier in the hardware. Improvements on that are obviously welcome, but it has to be something the compiler can relate to.

lemaitre · 2020-09-22T11:49:11Z

I would like to highlight that hardware prefetchers are better and better, to the point that Agner Fog says the following about Zen prefetcher (architecture.pdf, 20.16):

Automatic hardware prefetching is more efficient than explicit software prefetching in most cases.

I am really curious to see if there is any benefits of prefetching instructions on any recent hardware, and especially for 128-bit SIMD code (the target for current SIMD WASM).

nfrechette · 2020-09-22T12:46:02Z

I don't think prefetches should be like barriers, some re-ordering is fine providing it doesn't end up 100 instructions away or the other side of a an important loop. Key is to have the behavior be predictable. Even if we lose a few cycles doing a sub-optimal prefetch, we can still save hundreds of cycles from the cache miss.

@lemaitre In my code, software prefetching gave a 10-20% boost (even though all my reads are linear and contiguous and hardware prefetching friendly) on my Ryzen 2950X and my Pixel 3 phone. Agner is right in that with the advent of hardware prefetching and new chips supporting more parallel streams, their usage isn't as important anymore and can hurt performance in some scenarios. But they remain an important tool. The hardware prefetcher doesn't kick in until you've done 2 cache misses in a pattern it recognizes (and only if it can accommodate the stream at the L1/L2/L3 levels). Depending on your code, that might be significant. In my animation decompression, some streams are very densely packed and might fit in 2-3 cache lines meaning the hardware prefetcher often doesn't have time to kick in. I also know the memory layout and where TLB misses are likely to happen. Software prefetching allows me to hide that latency as well. Code that does random access but performs at least 100+ instructions with each access can benefit as well. For example, querying an R-tree or B-tree with multiple children per node. You can easily prefetch the next node to process while processing the current one. Hardware prefetching won't help you here and each node is susceptible to TLB misses as well. In code like this, software prefetching will always help and the hardware is unlikely to ever be able to help.

tlively · 2020-09-22T18:07:10Z

I don't think prefetches should be like barriers, some re-ordering is fine providing it doesn't end up 100 instructions away or the other side of a an important loop. Key is to have the behavior be predictable.

That's a good high-level goal, but we still need to formalize it in such a way that it can be implemented and guaranteed by compilers (both in the engine and in the producing toolchain). That's what @lars-t-hansen was talking about in his previous comment.

Maratyszcza · 2020-09-29T18:52:12Z

IMO prefetch instructions should NOT be considered barriers. Hardware prefetch instructions are not barriers at the architecture level, i.e. the CPU can execute them in different order than they appear in native instruction stream. Thus, it would be strange if the WebAssembly engines try to enforce stronger memory ordering than hardware, and practically impossible on out-of-order processors.

I suggest we experiment with implementing Prefetch in V8 and/or SpiderMonkey and evaluate its impact on real-world tasks.

lars-t-hansen · 2020-09-30T06:10:43Z

@Maratyszcza, that issue, which was an issue of wording only, has already been adressed (see my latest comment above) and the substantive issue here is not how to "experiment" with a prefetch but how to express its meaning to the compiler.

zeux · 2020-10-01T05:35:46Z

In my experience, on Intel ISA prefetch is most useful outside of tight SIMD kernels - hardware prefetcher works well when you're processing streams of data, and it doesn't work well when the access pattern is unpredictable. Historically for in-order architectures and/or ones without hardware prefetchers it could make sense to use it for stream processing but these times mostly passed. In fact on Intel processors it's almost easier to introduce perf. regressions by adding prefetch for stream processing vs improving performance... This feels like it's outside of the scope.

fbarchard · 2020-10-01T21:46:09Z

I had some code that did a prefetch but it was at the end of loop. Its fine in the case where memory is the bottleneck, but if you run the code on data that is already in L1 cache, the prefetches can slow down the code. So instead of 2 prfm just before the branch, i moved them just after some math. e.g.

ld1        {v0.16b, v1.16b}, [%0], #32  // load row 1 and post inc
ld1        {v2.16b, v3.16b}, [%1], #32  // load row 2 and post inc
subs       %w3, %w3, #16                // 16 processed per loop
uaddlp     v0.8h, v0.16b                // row 1 add adjacent
prfm       pldl1keep, [%0, 448]         // prefetch 7 lines ahead
uaddlp     v1.8h, v1.16b              
prfm       pldl1keep, [%1, 448]       
uadalp     v0.8h, v2.16b                // += row 2 add adjacent
uadalp     v1.8h, v3.16b              
rshrn      v0.8b, v0.8h, #2             // round and pack
rshrn2     v0.16b, v1.8h, #2          
st1        {v0.16b}, [%2], #16        
b.gt       1b

Running 120 benchmarks at 128x72 resolution, 10000 times each the differences are subtle. 3 runs of each benchmark
no prfm
(82448 ms total)
(82524 ms total)
(82123 ms total)

prfm middle (code above)
(82320 ms total)
(81095 ms total)
(81827 ms total)

prfm end
(84484 ms total)
(82876 ms total)
(83001 ms total)

fbarchard · 2020-10-01T22:16:56Z

Increasing the resolution to 1280x720 and running 12 tests 1000 each:

no prfm
(90115 ms total)
(90661 ms total)
(90812 ms total)

prfm middle
(81816 ms total)
(82394 ms total)
(82509 ms total)

prfm end
(82424 ms total)
(82733 ms total)
(83089 ms total)

Prefetching improves this example - scaling with bilinear filter to 1280x720, by 10%
When prefetch is effective, the location of the prefetch instruction doesnt matter much.
When prefetch is ineffective (data is in L1 cache), scheduling the instruction to be free makes a small difference. (1.4%)

ngzhian · 2020-10-16T20:28:22Z

Will PREFETCHT0 [mem] be considered a memory access? We need to do bounds check for prefetch as well, won't we?

Maratyszcza · 2020-10-16T20:34:15Z

@ngzhian Prefetch of unallocated memory does not cause SEGFAULT, so no bounds check needed.

ngzhian · 2020-11-24T08:49:55Z

Prototyped on arm64 https://crrev.com/c/2543167

As proposed in WebAssembly/simd#352 and using the opcodes used in the V8 prototype: https://chromium-review.googlesource.com/c/v8/v8/+/2543167. These instructions are only usable via intrinsics and clang builtins to make them opt-in while they are being benchmarked. Differential Revision: https://reviews.llvm.org/D93883

As proposed in WebAssembly/simd#352, using the opcodes used in the LLVM and V8 implementations.

Maratyszcza · 2021-01-20T21:16:04Z

I evaluated the performance impact on end-to-end sparse inference in convolution neural networks by modifying the SpMM microkernels in XNNPACK library. I use three sparse neural network models:

Publicly released Sparse Cache Aware MobileNetV2 1.0 model
Hand Tracking model similar to the one demoed on Chrome Dev Summit
Segmentation model similar to the one used in Google Meet

Performance results on ARM64 are presented below:

Processor (Device)	Speedup on MobileNet v2	Speedup on Hand Tracking	Speedup on Segmentation
Qualcomm Snapdragon 670 (Pixel 3a)	2%	5%	-1%
Samsung Exynos 8895 (Galaxy S8)	3%	8%	11%

Maratyszcza · 2021-01-22T06:09:02Z

Performance results on x86-64 systems:

Processor	Speedup on MobileNet v2	Speedup on Hand Tracking	Speedup on Segmentation
AMD PRO A10-8700B	-5%	-6%	-3%
AMD A4-7210	-3%	-6%	-3%
Intel Xeon W-2135	-2%	-2%	-1%
Intel Celeron N3060	-6%	-8%	-5%

penzn · 2021-01-22T18:21:09Z

IMO, consistent performance losses on one architecture is a bit of a disqualifying factor. However, the issues with prefetch run deeper than this - it is not really portable in terms of performance. The effects usually are specific to particular architecture, not broad x86/Arm, but model/family/etc. There might be a workload where you would get negative speedups on Arm, and positive on x86, or negative on one x86 chip and positive on the other.

As an a example, here is a link to a paper describing challenges with prefetch.

I am provisionally against this PR.

Maratyszcza · 2021-01-22T18:29:21Z

However, the issues with prefetch run deeper than this - it is not really portable in terms of performance.

I agree, portability is rather disappointing. Better performance on ARM vs x86 might be due to prefetch code being ported from ARM implementation.

Likewise, provisionally against this proposal.

jan-wassenberg · 2021-01-25T07:56:51Z

FWIW JPEG XL uses prefetch for ANS alias tables and filtering, and did see some modest gains, IIRC also on x86.

On balance though, I agree prefetch is problematic due to lack of performance portability, and we've also seen some perf penalty, so would be OK with leaving it out.

dtig · 2021-01-25T19:33:21Z

Adding a preliminary vote for the inclusion of prefetch operations to the SIMD proposal below. Please vote with -

👍 For including prefetch operations
👎 Against including prefetch operations

Maratyszcza · 2021-02-02T20:24:42Z

The community group decided against including these instructions in #429 due to performance portability concerns.

Removing prefetch operations as per the vote in the github issue: WebAssembly/simd#352 Bug:v8:11168 Change-Id: Ia72684e68ce886f8f26a7d3b5bea601be416dfab Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2771758 Reviewed-by: Jakob Kummerow <jkummerow@chromium.org> Reviewed-by: Maya Lekova <mslekova@chromium.org> Reviewed-by: Zhi An Ng <zhin@chromium.org> Commit-Queue: Deepti Gandluri <gdeepti@chromium.org> Cr-Commit-Position: refs/heads/master@{#73578}

Maratyszcza mentioned this pull request Sep 21, 2020

Finalizing the instruction set #343

Closed

Maratyszcza force-pushed the prefetch branch from 270202c to c24b3b8 Compare September 29, 2020 18:42

tlively added a commit to tlively/binaryen that referenced this pull request Jan 6, 2021

Prototype prefetch instructions

fd2a006

As proposed in WebAssembly/simd#352, using the opcodes used in the LLVM and V8 implementations.

tlively mentioned this pull request Jan 6, 2021

Prototype prefetch instructions WebAssembly/binaryen#3467

Merged

tlively added a commit to WebAssembly/binaryen that referenced this pull request Jan 6, 2021

Prototype prefetch instructions (#3467)

3d41465

As proposed in WebAssembly/simd#352, using the opcodes used in the LLVM and V8 implementations.

tlively mentioned this pull request Jan 8, 2021

Agenda for sync meeting 1/8/2021 #410

Closed

Prefetch instructions

6344234

Maratyszcza force-pushed the prefetch branch from c24b3b8 to 6344234 Compare January 20, 2021 21:17

Maratyszcza mentioned this pull request Jan 22, 2021

Agenda for sync meeting 1/22/21 #419

Closed

tlively mentioned this pull request Jan 23, 2021

Agenda for sync meeting 1/29/21 #429

Closed

tlively added the post SIMD MVP label Feb 2, 2021

Maratyszcza closed this Feb 4, 2021

Prefetch instructions #352

Prefetch instructions #352

Uh oh!

Conversation

Maratyszcza commented Sep 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Introduction

Applications

Mapping to Common Instruction Sets

x86/x86-64 processors with SSE instruction set

ARM64 processors

ARMv7 processors

Uh oh!

tlively commented Sep 19, 2020

Uh oh!

penzn commented Sep 19, 2020

Uh oh!

Maratyszcza commented Sep 19, 2020

Uh oh!

tlively commented Sep 20, 2020

Uh oh!

lars-t-hansen commented Sep 21, 2020

Uh oh!

nfrechette commented Sep 21, 2020

Uh oh!

lars-t-hansen commented Sep 21, 2020

Uh oh!

kmiller68 commented Sep 21, 2020

Uh oh!

fbarchard commented Sep 22, 2020

Uh oh!

lars-t-hansen commented Sep 22, 2020

Uh oh!

lemaitre commented Sep 22, 2020

Uh oh!

nfrechette commented Sep 22, 2020

Uh oh!

tlively commented Sep 22, 2020

Uh oh!

Maratyszcza commented Sep 29, 2020

Uh oh!

lars-t-hansen commented Sep 30, 2020

Uh oh!

zeux commented Oct 1, 2020

Uh oh!

fbarchard commented Oct 1, 2020

Uh oh!

fbarchard commented Oct 1, 2020

Uh oh!

ngzhian commented Oct 16, 2020

Uh oh!

Maratyszcza commented Oct 16, 2020

Uh oh!

ngzhian commented Nov 24, 2020

Uh oh!

Maratyszcza commented Jan 20, 2021

Uh oh!

Maratyszcza commented Jan 22, 2021

Uh oh!

penzn commented Jan 22, 2021

Uh oh!

Maratyszcza commented Jan 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jan-wassenberg commented Jan 25, 2021

Uh oh!

dtig commented Jan 25, 2021

Uh oh!

Maratyszcza commented Feb 2, 2021

Uh oh!

Uh oh!

Maratyszcza commented Sep 19, 2020 •

edited

Loading

Maratyszcza commented Jan 22, 2021 •

edited

Loading