x64: Lower shuffle and swizzle in ISLE #4772

elliottt · 2022-08-24T17:32:22Z

Lower shuffle and swizzle in ISLE.

This PR surfaced a bug with the lowering of shuffle when avx512vl and avx512vbmi are enabled: we use vpermi2b as the implementation, but panic if the immediate shuffle mask contains any out-of-bounds values. The behavior when the avx512 extensions are not present is that out-of-bounds values are turned into 0 in the result.

I've resolved this by detecting when the shuffle immediate has out-of-bounds indices in the avx512-enabled lowering, and generating an additional mask to zero out the lanes where those indices occur. This brings the avx512 case into line with the semantics of the shuffle op:

wasmtime/cranelift/codegen/meta/src/shared/instructions.rs

Lines 1495 to 1498 in 94bcbe8

    Shuffle two vectors using the given immediate bytes. For each of the 16 bytes of the
    immediate, a value i of 0-15 selects the i-th element of the first vector and a value i of
    16-31 selects the (i-16)th element of the second vector. Immediate values outside of the
    0-31 range place a 0 in the resulting vector lane.
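To make the quoted semantics concrete, here is a minimal reference model in Rust. It is only a sketch of the documented behavior, not code from Cranelift; the function name `shuffle_reference` is made up for illustration.

```rust
/// Reference model of the CLIF-level `shuffle` semantics quoted above.
/// `shuffle_reference` is a hypothetical name, not a Cranelift function.
fn shuffle_reference(a: [u8; 16], b: [u8; 16], imm: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for (lane, &i) in imm.iter().enumerate() {
        out[lane] = match i {
            // 0-15 selects the i-th byte of the first vector.
            0..=15 => a[i as usize],
            // 16-31 selects the (i-16)th byte of the second vector.
            16..=31 => b[(i - 16) as usize],
            // Anything else places a 0 in the lane.
            _ => 0,
        };
    }
    out
}
```

Feeding it the inputs of the `%shuffle_zeros` runtest added in this PR (immediate `[3 0 32 255 4 6 12 11 23 13 24 4 2 97 17 5]`) reproduces the expected vector from that test, with zeros in the lanes whose indices are 32, 255, and 97.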

github-actions · 2022-08-24T18:17:57Z

Subscribe to Label Action

cc @cfallin, @fitzgen

This issue or pull request has been labeled: "cranelift", "cranelift:area:machinst", "cranelift:area:x64", "isle"

Thus the following users have been cc'd because of the following labels:

cfallin: isle
fitzgen: isle

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


elliottt · 2022-08-24T18:45:08Z

cranelift/filetests/filetests/isa/x64/shuffle-avx512.clif

+;   movdqa  %xmm0, %xmm9
+;   load_const VCodeConstant(0), %xmm0
+;   vpermi2b %xmm1, %xmm0, %xmm9
+;   movq    %rbp, %rsp
+;   popq    %rbp
+;   ret


Here's the case where the mask wasn't necessary, thus no andps instruction was generated.

elliottt · 2022-08-24T18:45:28Z

cranelift/filetests/filetests/isa/x64/shuffle-avx512.clif

+;   movdqa  %xmm0, %xmm12
+;   load_const VCodeConstant(1), %xmm0
+;   load_const VCodeConstant(0), %xmm7
+;   vpermi2b %xmm1, %xmm7, %xmm12
+;   andps   %xmm0, %xmm7, %xmm0
+;   movq    %rbp, %rsp
+;   popq    %rbp
+;   ret


Here's a case where the permutation contained out-of-bounds values, so the andps on line 38 is necessary.
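The reason the extra mask is needed can be sketched with a small software model. This is an illustration under stated assumptions, not the actual lowering: it assumes the 128-bit form of `vpermi2b` looks only at the low 5 bits of each index byte (bit 4 picks the table, bits 3:0 pick the byte), and all function names here are hypothetical.

```rust
/// Software model of a 128-bit two-table byte permute in the style of
/// `vpermi2b`. Assumption: only the low 5 bits of each index are used,
/// so an out-of-bounds index wraps around instead of producing 0.
fn permute2_low5(a: [u8; 16], b: [u8; 16], idx: [u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for lane in 0..16 {
        let i = idx[lane] & 0x1f;
        out[lane] = if i < 16 { a[i as usize] } else { b[(i - 16) as usize] };
    }
    out
}

/// Constant that keeps lanes with an in-range index (0xff) and clears
/// lanes whose index is outside 0-31 (0x00).
fn zero_mask(imm: [u8; 16]) -> [u8; 16] {
    imm.map(|i| if i < 32 { 0xff } else { 0x00 })
}

/// Permute first, then AND with the mask to recover the `shuffle` semantics.
fn shuffle_via_permute(a: [u8; 16], b: [u8; 16], imm: [u8; 16]) -> [u8; 16] {
    let permuted = permute2_low5(a, b, imm);
    let mask = zero_mask(imm);
    std::array::from_fn(|lane| permuted[lane] & mask[lane])
}
```

In this model the indices 32, 255, and 97 wrap to 0, 31, and 1, so the unmasked permute leaks real bytes into those lanes; the AND step is what zeroes them again.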

elliottt · 2022-08-24T18:46:26Z

cranelift/filetests/filetests/runtests/simd-shuffle.clif

+    v2 = shuffle v0, v1, [3 0 32 255 4 6 12 11 23 13 24 4 2 97 17 5]
+    return v2
+}
+; run: %shuffle_zeros([1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16], [17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32]) == [4 1 0 0 5 7 13 12 24 14 25 5 3 0 18 6]


This test already passed for the non-avx512 lowering, and this PR allows it to pass when avx512 is available as well.

cfallin · 2022-08-24T20:36:59Z

Excellent -- it sounds like you've tested on avx512 hardware?

(As an aside, it might be nice to think about ways to use qemu to get us AVX512 testing in CI, if the GitHub Actions host doesn't natively have it. @abrown any thoughts on that?)

elliottt

I have -- my laptop has those extensions :D

elliottt · 2022-08-24T19:31:32Z

cranelift/codegen/src/isa/x64/inst.isle

+            (_ Unit (emit (gen_move $I8X16 dst src3)))
+            (_ Unit (emit (MInst.XmmRmREvex (Avx512Opcode.Vpermi2b)
+                                            src1
+                                            src2
+                                            dst))))


I'm a little unhappy about this, but we don't have an encoding for xmm instructions that have three arguments currently.

cfallin · 2022-08-24T20:28:59Z

Yeah, we can definitely add such a thing later, and should I think; we'll get to this as part of our "no more mod operands" cleanup on regalloc operands, if not before.

cfallin

Thanks! This all looks great overall.

cfallin · 2022-08-24T20:30:29Z

cranelift/codegen/src/isa/x64/inst.isle

+;; Produce a permutation suitable for use with `vpermi2b`, for permuting two
+;; I8X16 vectors simultaneously. NOTE: this will not avoid out-of-bounds values,
+;; and the internal lane value masking of vpermi2b will come into play. If you
+;; need the out-of-bounds behavior of shuffle, you'll need to also mask the


Could we say what that behavior is and/or mention "CLIF-level shuffle" here? Otherwise it's a bit unclear if one doesn't already have the context, I think.

cfallin · 2022-08-24T20:34:28Z

cranelift/codegen/src/isa/x64/lower.isle

+;; However, if the shuffle mask contains no out-of-bounds values, we can use
+;; `vpermi2b` without any masking.
+(rule (lower (has_type (and (avx512vl_enabled) (avx512vbmi_enabled))
+                       (shuffle a b (vec_mask_from_immediate mask))))


Here we're relying on implicit firing-order heuristics (the above rule before this one, specifically perm_from_mask... extractor before mask variable binding); I think tests below should ensure this works properly but just wanted to call it out to be sure.

Yep, we should have enough test coverage to catch problems with these two: there are precise-output tests for both rules.
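The choice between the two avx512 rules can be pictured as a plain decision function. This is a hedged sketch of the selection logic only; the enum and function names are invented for illustration and do not correspond to the ISLE rules or extractors in the PR.

```rust
/// Hypothetical summary of which avx512 lowering applies to a shuffle mask.
#[derive(Debug, PartialEq)]
enum Avx512ShuffleLowering {
    /// Every index is in 0-31: the permute alone is enough.
    PermuteOnly,
    /// Some index is out of bounds: permute, then AND with `zero_mask`.
    PermuteThenMask { zero_mask: [u8; 16] },
}

fn select_lowering(imm: [u8; 16]) -> Avx512ShuffleLowering {
    // The more specific case (out-of-bounds indices present) is checked
    // first, mirroring the need for the masking rule to take precedence
    // over the general rule.
    if imm.iter().any(|&i| i > 31) {
        let zero_mask = imm.map(|i| if i < 32 { 0xff } else { 0x00 });
        Avx512ShuffleLowering::PermuteThenMask { zero_mask }
    } else {
        Avx512ShuffleLowering::PermuteOnly
    }
}
```

If the general case were tried first it would also match masks with out-of-bounds indices and silently skip the masking step, which is why the firing order matters and why precise-output tests for both paths are useful.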

cfallin · 2022-08-24T20:35:09Z

cranelift/codegen/src/isa/x64/lower.isle

+        (x64_pshufb a (x64_xmm_load_const $I8X16 (shuffle_0_15_mask mask)))
+        (x64_pshufb b (x64_xmm_load_const $I8X16 (shuffle_16_31_mask mask)))))
+
+;; Rules for `shuffle` ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


s/shuffle/swizzle/ ?

Thanks for catching that!
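For context on the two `x64_pshufb` operands quoted in this thread, here is a rough model of how a pair of `pshufb` instructions can implement `shuffle` without avx512. Several things are assumptions: the diff excerpt is truncated, so combining the two results with an OR is inferred rather than shown; the two mask functions below are guesses at what `shuffle_0_15_mask` and `shuffle_16_31_mask` compute; and all Rust names are hypothetical.

```rust
/// Model of `pshufb`: a control byte with its top bit set yields 0,
/// otherwise the low 4 bits select a byte of `src`.
fn pshufb(src: [u8; 16], ctrl: [u8; 16]) -> [u8; 16] {
    ctrl.map(|c| if c & 0x80 != 0 { 0 } else { src[(c & 0x0f) as usize] })
}

/// Assumed behavior: keep indices 0-15, force every other lane to zero.
fn mask_0_15(imm: [u8; 16]) -> [u8; 16] {
    imm.map(|i| if i < 16 { i } else { 0x80 })
}

/// Assumed behavior: rebase indices 16-31 to 0-15, force every other lane to zero.
fn mask_16_31(imm: [u8; 16]) -> [u8; 16] {
    imm.map(|i| if (16..32).contains(&i) { i - 16 } else { 0x80 })
}

/// Shuffle each input separately, then OR the halves together (assumed).
fn shuffle_via_pshufb(a: [u8; 16], b: [u8; 16], imm: [u8; 16]) -> [u8; 16] {
    let lo = pshufb(a, mask_0_15(imm));
    let hi = pshufb(b, mask_16_31(imm));
    std::array::from_fn(|lane| lo[lane] | hi[lane])
}
```

Under these assumptions, an out-of-bounds index sets the top bit in both control vectors, so both halves contribute 0 and the lane ends up 0, matching the behavior the avx512 path has to reproduce with its extra mask.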

elliottt added 4 commits August 23, 2022 21:55

Lower shuffle in ISLE

b5b9ef6

Add a shuffle test that produces zeros

ea60c63

Add a TODO for the use of vpermi2b

b80e37f

Lower swizzle in ISLE

31fd919

elliottt marked this pull request as ready for review August 24, 2022 17:38

github-actions bot added the cranelift, cranelift:area:machinst, cranelift:area:x64, and isle labels Aug 24, 2022

Mask the result of vpermi2b when the permutation contains oob indices

d52bd6d

elliottt force-pushed the trevor/x64-shuffle branch from 28fa27d to d52bd6d Compare August 24, 2022 18:44

elliottt commented Aug 24, 2022

cfallin approved these changes Aug 24, 2022

elliottt added 2 commits August 24, 2022 13:59

Fix a comment typo

ff13b34

Clarify the comment in perm_from_mask

bf13c84

elliottt enabled auto-merge (squash) August 24, 2022 21:30

elliottt merged commit b8b6f27 into bytecodealliance:main Aug 24, 2022
