v128.move32_zero and v128.move64_zero instruction extensions to #237 #373
Comments
Memory operations in WebAssembly can only do either of the following:
Is this a labeling issue? I've updated this to use the word 'move' instead of 'load'. I didn't see any other terminology in there that matched this specific case. If there is, please let me know and I'll update promptly.
Setting a lane is done using the replace lane operations. Please try to understand the spec before proposing changes to it, and if you have questions, please ask; we'll be happy to answer.
@penzn This isn't the same as replace_lane. replace_lane replaces one lane and returns the updated vector; this initializes a vector from a scalar or from the low bits of another vector, zeroing the upper lanes. For example, given v = [1, 2, 3, 4] and scalar 9, i32x4.replace_lane 0 yields [9, 2, 3, 4], while move32_zero yields [9, 0, 0, 0].
Looks reasonable. I suggest @omnisip open a Pull Request with the proposed changes to the specification, because it is more actionable for V8/SpiderMonkey/LLVM devs.
Do we expect this operation to be on the critical path, and if so, what kind of gains would we get by going from two wasm instructions (neither of which touches memory) to one?
@Maratyszcza Thanks for the feedback. If anyone can help with the ARMv7 with Neon intrinsics, I'll generate the PR today.

@penzn Your question has two parts and is very interesting. One thing that may not be obvious is that the logic @Maratyszcza and I are proposing is, technologically speaking, much older than any shuffle, insert (replace_lane), or extract (extract_lane). MOVD and MOVQ are the original instructions for initializing a vector on x86, and their support goes all the way back to MMX. Their use is evident in just about every application that has ever had to load a vector on this architecture. Prior to AVX and AVX2, the only ways to initialize a vector were XORs, moves, compares, and loads.

As for your second question, what the benefit is over two ops: let's compare against what we're replacing. If I'm thinking about this correctly (please correct me if I'm wrong), to get to your two-op solution we'd have to zero the vector and then use an insert. In an ideal configuration, that's 3 uops with a throughput of 2.33 on Skylake, whereas MOVQ between vector registers is 1 uop with a throughput of 0.33 on the same architecture. That's a significant difference in performance, and it also doesn't require a shuffle port. A sketch of the two lowerings follows below.

The question that hasn't been asked, but is also relevant to the topic, is whether we should consider adding the two corresponding conversion ops going from XMM/YMM registers back to scalar registers. I think the answer is no, provided that implementations optimize the 32-bit and 64-bit extract_lane 0 cases to use this under the hood.
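For concreteness, here is a minimal C sketch of the two lowerings being compared; the wrapper names are mine, not part of the proposal:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: PXOR, MOVD */
#include <smmintrin.h>  /* SSE4.1: PINSRD */

/* Proposed single-instruction lowering: MOVD zero-extends the scalar
   into the vector, so no separate zeroing step is needed. */
static __m128i move32_zero_r(uint32_t x) {
    return _mm_cvtsi32_si128((int32_t)x);  /* MOVD xmm, r32 */
}

/* The two-wasm-op emulation available today (zero vector + replace
   lane), naively lowered as PXOR + PINSRD. */
static __m128i move32_zero_r_emulated(uint32_t x) {
    __m128i zero = _mm_setzero_si128();            /* PXOR xmm, xmm      */
    return _mm_insert_epi32(zero, (int32_t)x, 0);  /* PINSRD xmm, r32, 0 */
}
```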
On Tue, Oct 6, 2020, Marat Dukhan wrote:
VMOVD xmm, xmm and VMOVQ xmm, xmm forms don't exist. You could use
[V]MOVSS and [V]MOVSD, but they don't set non-copied lanes to zero.
@Maratyszcza Check these two out. See how vmovq and pand both use port 015? Think they're synonyms for the same op?
This proposal was originally put together for completeness, alongside load32_zero/load64_zero. Its functionality is equivalent to
Introduction
@Maratyszcza has done a wonderful job describing the use cases and functionality of load64_zero and load32_zero in #237. This proposal seeks to extend load64_zero and load32_zero so that they are functionally complete with the underlying architectures, by adding their sister variants with identical implementations. These variants initialize a vector from 32-bit and 64-bit scalar registers, and from the low 32 or 64 bits of another vector, zeroing the remaining lanes. The proposed instructions are move32_zero_r, move64_zero_r, move32_zero_v, and move64_zero_v, respectively. Since these are sister instructions, the applications, use cases, and lowerings are identical to those in the original proposal. This ticket will serve as a placeholder for the upcoming PR and will be updated in tandem.
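As a rough illustration of the intended semantics (the v128_t type and function names here are illustrative, not spec text):

```c
#include <stdint.h>

/* Illustrative v128 with four 32-bit lanes. */
typedef struct { uint32_t u32[4]; } v128_t;

/* move32_zero_r: copy a 32-bit scalar into lane 0, zero lanes 1-3. */
static v128_t move32_zero_r(uint32_t r) {
    v128_t v = {{ r, 0, 0, 0 }};
    return v;
}

/* move32_zero_v: keep the low 32 bits of src, zero the rest.
   move64_zero_r and move64_zero_v behave the same way with a
   64-bit low lane. */
static v128_t move32_zero_v(v128_t src) {
    v128_t v = {{ src.u32[0], 0, 0, 0 }};
    return v;
}
```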
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns. (A brief intrinsics sketch covering the x86 forms follows the SSE2 table below.)
x86/x86-64 processors with AVX instruction set
v128.move32_zero_r
v = v128.move32_zero_r(r32) is lowered to VMOVD xmm_v, r32
v128.move64_zero_r
v = v128.move64_zero_r(r64) is lowered to VMOVQ xmm_v, r64
v128.move32_zero_v
v = v128.move32_zero_v(v128) is lowered to VMOVD xmm_v, xmm
v128.move64_zero_v
v = v128.move64_zero_v(v128) is lowered to VMOVQ xmm_v, xmm
x86/x86-64 processors with SSE2 instruction set
v128.move32_zero_r
v = v128.move32_zero_r(r32) is lowered to MOVD xmm_v, r32
v128.move64_zero_r
v = v128.move64_zero_r(r64) is lowered to MOVQ xmm_v, r64
v128.move32_zero_v
v = v128.move32_zero_v(v128) is lowered to MOVD xmm_v, xmm
v128.move64_zero_v
v = v128.move64_zero_v(v128) is lowered to MOVQ xmm_v, xmm
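As referenced above, here is a hedged sketch of how the x86 mappings surface as standard intrinsics. The same C compiles to the legacy (SSE2) or VEX-encoded (AVX) forms depending on target flags, and the wrapper names are mine:

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* v128.move32_zero_r -> MOVD xmm, r32 (VMOVD under -mavx) */
static __m128i move32_zero_r(uint32_t r) {
    return _mm_cvtsi32_si128((int32_t)r);
}

/* v128.move64_zero_r -> MOVQ xmm, r64 (VMOVQ); x86-64 only */
static __m128i move64_zero_r(uint64_t r) {
    return _mm_cvtsi64_si128((int64_t)r);
}

/* v128.move64_zero_v -> MOVQ xmm, xmm: copies the low 64 bits and
   zeroes the upper 64. There is no equally direct SSE2 intrinsic for
   move32_zero_v; see the MOVD discussion in the comments above. */
static __m128i move64_zero_v(__m128i v) {
    return _mm_move_epi64(v);
}
```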
ARM64 processors
v128.move32_zero_r
v = v128.move32_zero_r(r32) is lowered to fmov s0, w0
v128.move64_zero_r
v = v128.move64_zero_r(r64) is lowered to fmov d0, x0
v128.move32_zero_v
v = v128.move32_zero_v(v128) is lowered to mov s0, v1.s[0] OR fmov s0, s1
v128.move64_zero_v
v = v128.move64_zero_v(v128) is lowered to mov d0, v1.d[0] OR fmov d0, d1
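On AArch64 these can be expressed with standard Neon intrinsics. A sketch, under the assumption that the compiler folds the zero-and-insert pattern into the single instructions listed above (wrapper names are mine):

```c
#include <stdint.h>
#include <arm_neon.h>

/* v128.move32_zero_r: insert the scalar into lane 0 of a zeroed
   vector. Writing an S register with fmov s0, w0 zeroes the rest of
   the V register, so this can fold to a single instruction. */
static uint32x4_t move32_zero_r(uint32_t r) {
    return vsetq_lane_u32(r, vdupq_n_u32(0), 0);
}

/* v128.move64_zero_v: keep lane 0, zero lane 1. */
static uint64x2_t move64_zero_v(uint64x2_t v) {
    return vsetq_lane_u64(vgetq_lane_u64(v, 0), vdupq_n_u64(0), 0);
}
```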
ARMv7 processors with Neon
The lowerings in this section are still to be determined (see the request above for help with ARMv7 Neon intrinsics); a possible starting point is sketched below.
v128.move32_zero_r
v = v128.move32_zero_r(r32)
v128.move64_zero_r
v = v128.move64_zero_r(r64)
v128.move32_zero_v
v = v128.move32_zero_v(v128)
v128.move64_zero_v
v = v128.move64_zero_v(v128)
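Since the ARMv7 lowerings are the open question here, one possible starting point, sketched with Neon intrinsics and not vetted against any engine, might be:

```c
#include <stdint.h>
#include <arm_neon.h>

/* Possible v128.move32_zero_r: zero a q register (vmov.i32 q0, #0),
   then insert the scalar into lane 0 (vmov.32 d0[0], r0). */
static uint32x4_t move32_zero_r(uint32_t r) {
    return vsetq_lane_u32(r, vdupq_n_u32(0), 0);
}

/* Possible v128.move64_zero_v: pair the low d register of the source
   with a zeroed d register for the high half. */
static uint64x2_t move64_zero_v(uint64x2_t v) {
    return vcombine_u64(vget_low_u64(v), vdup_n_u64(0));
}
```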