Add zero-masking vs merge-masking to IR and handle in scatter expansion #5488

derekbruening · 2022-05-09T18:45:32Z

Today the AVX-512 EVEX.z bit ({z} in assembler syntax when set to 1) controls whether zero elements in the mask are zeroed or not in the output. This is not represented in DR's IR, nor is zero-masking handled in scatter-gather expansion. This issue covers addressing both.

Generally, we want the IR to be an abstraction, mapping ISA encoding details into general concepts. The various prefixes in x86 map into operand size differences or opcode differences today. This zero-masking prefix bit doesn't affect just one operand: it affects the operation. The precedent there is to split the opcode. We have separate opcodes for OP_rep_movs vs OP_movs. This is similar to the "sub-opcode" numeric values indicating behavior in ARM (see the discussion at #4386 (comment) regarding opcode philosophy of separate opcodes for separate semantics; see also #4388).

We do expose some x86 encoding prefixes in instr_t.prefixes today but they generally do not change the semantics enough that most tools, including taint-tracking tools, can completely ignore them. Part of me wants to get rid of instr_t.prefixes entirely (other than predication), so that tools don't have to worry about this separate set of flags controlling behavior, and split up opcodes where behavior differs. But if we did that for the zero-masking we may have a huge number of split opcodes and compatibility issues since the existing opcodes have been public for a while. So maybe embracing the prefixes is an ok solution for these behavior changes that apply to many different opcodes. I would name this something like PREFIX_MASK_ZERO though to try to be cross-platform if another ISA had the same concept, and not put anything about "EVEX" in there.

The text was updated successfully, but these errors were encountered:

abhinav92003 · 2022-05-10T15:56:01Z

I see that x86 prefixes are somewhat similar to OP_sys on AArch64. But there are two key differences that I think justify a different design (instead of splitting the opcodes):

OP_sys had a small number of sub-opcodes. It was easy to define each of them as a separate opcode. Whereas the x86 prefixes affect all instructions that can have prefixes (e.g. for the z bit, we'd have to define a new variant for many relevant avx512 opcodes. To make it worse, multiple prefix bits can potentially be combined, creating even more combinations, and more opcodes.)
Some OP_sys sub-opcodes were fundamentally different (e.g. DC ZVA had to be marked as a store), so separating its opcode made sense. However, the x86 prefixes don't change the opcode behavior that much, so creating separate opcodes would just add cruft.

I vote to expose the z bit in the prefix, and naming it something generic like you suggested.

Tools should really remember to look at the prefixes if they care. The x86 ISA also has prefixes, so the concept is not specific to DR.

* i5483: Added support for AVX512 BF16 instructions **AVX512 bfloat16 instructions** These are the three bfloat16 instructions. VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data ``` EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst Op/En Tuple Operand 1 Operand 2 Operand 3 A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) ``` VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data ``` EVEX.128.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, xmm2/m128/m32bcst EVEX.256.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, ymm2/m256/m32bcst EVEX.512.F3.0F38.W0 72 /r VCVTNEPS2BF16 ymm1{k1}{z}, zmm2/m512/m32bcst Op/En Tuple Operand 1 Operand 2 A Full ModRM:reg (w) ModRM:r/m (r) ``` VDPBF16PS—Dot Product of BF16 Pairs Accumulated into Packed Single Precision ``` EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst Op/En Tuple Operand 1 Operand 2 Operand 3 A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) ``` **List of places to update** From https://github.com/DynamoRIO/dynamorio/blob/master/core/ir/x86/opcode_api.h#L53 ``` * When adding new instructions, be sure to update all of these places: * 1) decode_table op_instr array * 2) decode_table decoding table entries * 3) OP_ enum (here) via x86opnums.pl * 4) update OP_LAST at end of enum here * 5) decode_fast tables if necessary (they are conservative) * 6) instr_create macros * 7) suite/tests/api/ir* tests * 8) add binutils tests in third_party/binutils/test_decenc ``` **Step 1: update `op_instr` array** Added entries to `op_instr`. These point directly to `evex_Wb_extensions` since these instructions only have `evex` encoding. **Step 2: add decode_table entries** - updated `third_byte_38` table to point to `prefix_extensions` since these instructions have common opcodes and differ in prefix. - added entries in `prefix_extensions` to point to appropriate vex/evex entries - added entries in `evex_Wb_extensions` The instructions `VCVTNEPS2BF16` and `VCVTNE2PS2BF16` have three byte opcodes starting with `0f 38` so the decoder looks at `third_byte_38[third_byte_38_index[opcode]]`. Since these instructions have the same opcode (`72`) and differ only in the prefix (`f2/f3`), we need to point the `third_byte_38` to `prefix_extensions` which in turn points to the appropriate `EVEX_Wb` entries. The instruction `VDPBF16PS` has the same opcode (52) as the VNNI instruction `vpdpwsd` and they differ only in the prefix (`F3/66`). We need to update that entry to point to `prefix_extensions` instead of `e_vex_extensions`. This causes the `e_vex_extensions` entry ( `e_vex ext 151`) to be orphaned - do we remove this entry? Updated opcodes for invalid entries in e_vex ext 151 and 152 for consistency. **Step 3: add OP_ enums** Done **Step 4: update OP_LAST** Not needed since OP_LAST already points to the last enum. **Step 5: decode_fast tables if necessary** Not done **Step 6: instr_create macros** Added `1dst_3src` macros for `VCVTNE2PS2BF16` and `VDPBF16PS` since they write to operand 1 and read from mask register, operand 2, and operand 3. Added `1dst_2src` macro for `VCVTNEPS2BF16` since it writes to operand 1 and reads from mask register and operand 2. **Step 7: suite/tests/api/ir tests** Added tests in ir_x86_3args_avx512_evex_mask.h and ir_x86_4args_avx512_evex_mask_C.h. Added manual tests in ir_x86.c (hard-coded zeroing prefix flag pending #5488) **Step 8: binutils tests** Added binutils tests. These represent the assembly instructions using DR API and match against the encoded bytes.

* i5483: Added support for AVX512 BF16 instructions **AVX512 bfloat16 instructions** These are the three bfloat16 instructions. VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data ``` EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst Op/En Tuple Operand 1 Operand 2 Operand 3 A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) ``` VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data ``` EVEX.128.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, xmm2/m128/m32bcst EVEX.256.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, ymm2/m256/m32bcst EVEX.512.F3.0F38.W0 72 /r VCVTNEPS2BF16 ymm1{k1}{z}, zmm2/m512/m32bcst Op/En Tuple Operand 1 Operand 2 A Full ModRM:reg (w) ModRM:r/m (r) ``` VDPBF16PS—Dot Product of BF16 Pairs Accumulated into Packed Single Precision ``` EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst Op/En Tuple Operand 1 Operand 2 Operand 3 A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) ``` **List of places to update** From https://github.com/DynamoRIO/dynamorio/blob/master/core/ir/x86/opcode_api.h#L53 ``` * When adding new instructions, be sure to update all of these places: * 1) decode_table op_instr array * 2) decode_table decoding table entries * 3) OP_ enum (here) via x86opnums.pl * 4) update OP_LAST at end of enum here * 5) decode_fast tables if necessary (they are conservative) * 6) instr_create macros * 7) suite/tests/api/ir* tests * 8) add binutils tests in third_party/binutils/test_decenc ``` **Step 1: update `op_instr` array** Added entries to `op_instr`. These point directly to `evex_Wb_extensions` since these instructions only have `evex` encoding. **Step 2: add decode_table entries** - updated `third_byte_38` table to point to `prefix_extensions` since these instructions have common opcodes and differ in prefix. - added entries in `prefix_extensions` to point to appropriate vex/evex entries - added entries in `evex_Wb_extensions` The instructions `VCVTNEPS2BF16` and `VCVTNE2PS2BF16` have three byte opcodes starting with `0f 38` so the decoder looks at `third_byte_38[third_byte_38_index[opcode]]`. Since these instructions have the same opcode (`72`) and differ only in the prefix (`f2/f3`), we need to point the `third_byte_38` to `prefix_extensions` which in turn points to the appropriate `EVEX_Wb` entries. The instruction `VDPBF16PS` has the same opcode (52) as the VNNI instruction `vpdpwsd` and they differ only in the prefix (`F3/66`). We need to update that entry to point to `prefix_extensions` instead of `e_vex_extensions`. This causes the `e_vex_extensions` entry ( `e_vex ext 151`) to be orphaned - do we remove this entry? Updated opcodes for invalid entries in e_vex ext 151 and 152 for consistency. **Step 3: add OP_ enums** Done **Step 4: update OP_LAST** Not needed since OP_LAST already points to the last enum. **Step 5: decode_fast tables if necessary** Not done **Step 6: instr_create macros** Added `1dst_3src` macros for `VCVTNE2PS2BF16` and `VDPBF16PS` since they write to operand 1 and read from mask register, operand 2, and operand 3. Added `1dst_2src` macro for `VCVTNEPS2BF16` since it writes to operand 1 and reads from mask register and operand 2. **Step 7: suite/tests/api/ir tests** Added tests in ir_x86_3args_avx512_evex_mask.h and ir_x86_4args_avx512_evex_mask_C.h. Added manual tests in ir_x86.c (hard-coded zeroing prefix flag pending DynamoRIO#5488) **Step 8: binutils tests** Added binutils tests. These represent the assembly instructions using DR API and match against the encoded bytes.

abhinav92003 · 2022-11-09T16:04:17Z

nor is zero-masking handled in scatter-gather expansion.

Looking more closely for the AVX512 gather (https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq), looks like it doesn't support zeroing-masking, and performs merging-masking always. So the scatter/gather expansion doesn't need to worry about this.

derekbruening · 2022-11-09T16:35:08Z

In discussions about how to represent the similar zero vs merging in AArch64 SVE in PR #5718 we decided that putting a flag on the predicate/mask register operand is the way to go, rather than the whole-instruction prefix flag proposed here.
The predicate/mask register is already controlling the operation by selecting the sources, so adding the zero vs merge semantics on just that operand seems reasonable.

derekbruening · 2023-08-18T15:58:51Z

For AArch64 SVE we now have DR_OPND_IS_MERGE_PREDICATE and DR_OPND_IS_ZERO_PREDICATE so I assume we would use those on x86 too?

abhinav92003 · 2023-08-22T15:29:04Z

Makes sense to use the same ones on x86. Any reason we shouldn't?

Interesting to note that x86 gathers are always merging but AArch64 gathers are zeroing.

derekbruening added the Component-IR label May 9, 2022

derekbruening mentioned this issue May 9, 2022

Add support for AVX512 BF16 instructions #5483

Closed

abhinav92003 mentioned this issue Nov 8, 2022

i#3044 AArch64 SVE codec: Add support for predicate modes in the IR #5718

Merged

derekbruening mentioned this issue Aug 18, 2023

where can I find zero-masking in AVX-512 instruction? #6249

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add zero-masking vs merge-masking to IR and handle in scatter expansion #5488

Add zero-masking vs merge-masking to IR and handle in scatter expansion #5488

derekbruening commented May 9, 2022 •

edited

Loading

abhinav92003 commented May 10, 2022

abhinav92003 commented Nov 9, 2022

derekbruening commented Nov 9, 2022

derekbruening commented Aug 18, 2023

abhinav92003 commented Aug 22, 2023

Add zero-masking vs merge-masking to IR and handle in scatter expansion #5488

Add zero-masking vs merge-masking to IR and handle in scatter expansion #5488

Comments

derekbruening commented May 9, 2022 • edited Loading

abhinav92003 commented May 10, 2022

abhinav92003 commented Nov 9, 2022

derekbruening commented Nov 9, 2022

derekbruening commented Aug 18, 2023

abhinav92003 commented Aug 22, 2023

derekbruening commented May 9, 2022 •

edited

Loading