Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zero-masking vs merge-masking to IR and handle in scatter expansion #5488

Open
derekbruening opened this issue May 9, 2022 · 5 comments

Comments

@derekbruening
Copy link
Contributor

derekbruening commented May 9, 2022

Today the AVX-512 EVEX.z bit ({z} in assembler syntax when set to 1) controls whether zero elements in the mask are zeroed or not in the output. This is not represented in DR's IR, nor is zero-masking handled in scatter-gather expansion. This issue covers addressing both.

Generally, we want the IR to be an abstraction, mapping ISA encoding details into general concepts. The various prefixes in x86 map into operand size differences or opcode differences today. This zero-masking prefix bit doesn't affect just one operand: it affects the operation. The precedent there is to split the opcode. We have separate opcodes for OP_rep_movs vs OP_movs. This is similar to the "sub-opcode" numeric values indicating behavior in ARM (see the discussion at #4386 (comment) regarding opcode philosophy of separate opcodes for separate semantics; see also #4388).

We do expose some x86 encoding prefixes in instr_t.prefixes today but they generally do not change the semantics enough that most tools, including taint-tracking tools, can completely ignore them. Part of me wants to get rid of instr_t.prefixes entirely (other than predication), so that tools don't have to worry about this separate set of flags controlling behavior, and split up opcodes where behavior differs. But if we did that for the zero-masking we may have a huge number of split opcodes and compatibility issues since the existing opcodes have been public for a while. So maybe embracing the prefixes is an ok solution for these behavior changes that apply to many different opcodes. I would name this something like PREFIX_MASK_ZERO though to try to be cross-platform if another ISA had the same concept, and not put anything about "EVEX" in there.

@abhinav92003
Copy link
Contributor

I see that x86 prefixes are somewhat similar to OP_sys on AArch64. But there are two key differences that I think justify a different design (instead of splitting the opcodes):

  1. OP_sys had a small number of sub-opcodes. It was easy to define each of them as a separate opcode. Whereas the x86 prefixes affect all instructions that can have prefixes (e.g. for the z bit, we'd have to define a new variant for many relevant avx512 opcodes. To make it worse, multiple prefix bits can potentially be combined, creating even more combinations, and more opcodes.)
  2. Some OP_sys sub-opcodes were fundamentally different (e.g. DC ZVA had to be marked as a store), so separating its opcode made sense. However, the x86 prefixes don't change the opcode behavior that much, so creating separate opcodes would just add cruft.

I vote to expose the z bit in the prefix, and naming it something generic like you suggested.

Tools should really remember to look at the prefixes if they care. The x86 ISA also has prefixes, so the concept is not specific to DR.

prasun3 added a commit that referenced this issue May 17, 2022
* i5483: Added support for AVX512 BF16 instructions

**AVX512 bfloat16 instructions**

These are the three bfloat16 instructions.

VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data
```
EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2       Operand 3
A       Full    ModRM:reg (w)   EVEX.vvvv (r)   ModRM:r/m (r)
```
VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data
```
EVEX.128.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, xmm2/m128/m32bcst
EVEX.256.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, ymm2/m256/m32bcst
EVEX.512.F3.0F38.W0 72 /r VCVTNEPS2BF16 ymm1{k1}{z}, zmm2/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2
A       Full    ModRM:reg (w)   ModRM:r/m (r)
```

VDPBF16PS—Dot Product of BF16 Pairs Accumulated into Packed Single Precision
```
EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2       Operand 3
A       Full    ModRM:reg (w)   EVEX.vvvv (r)   ModRM:r/m (r)
```

**List of places to update**

From https://github.com/DynamoRIO/dynamorio/blob/master/core/ir/x86/opcode_api.h#L53

```
 * When adding new instructions, be sure to update all of these places:
 *   1) decode_table op_instr array
 *   2) decode_table decoding table entries
 *   3) OP_ enum (here) via x86opnums.pl
 *   4) update OP_LAST at end of enum here
 *   5) decode_fast tables if necessary (they are conservative)
 *   6) instr_create macros
 *   7) suite/tests/api/ir* tests
 *   8) add binutils tests in third_party/binutils/test_decenc

```

**Step 1: update `op_instr` array**

Added entries to `op_instr`. These point directly to `evex_Wb_extensions` since these instructions only have `evex` encoding.

**Step 2: add decode_table entries**

   - updated `third_byte_38` table to point to `prefix_extensions` since these instructions have common opcodes and differ in prefix.
   - added entries in `prefix_extensions` to point to appropriate vex/evex entries
   - added entries in `evex_Wb_extensions`

The instructions `VCVTNEPS2BF16` and `VCVTNE2PS2BF16` have three byte opcodes starting with `0f 38` so the decoder looks at `third_byte_38[third_byte_38_index[opcode]]`. Since these instructions have the same opcode (`72`) and differ only in the prefix (`f2/f3`), we need to point the `third_byte_38` to `prefix_extensions` which in turn points to the appropriate `EVEX_Wb` entries.

The instruction `VDPBF16PS` has the same opcode (52) as the VNNI instruction `vpdpwsd` and they differ only in the prefix (`F3/66`). We need to update that entry to point to `prefix_extensions` instead of `e_vex_extensions`. This causes the `e_vex_extensions` entry ( `e_vex ext 151`) to be orphaned - do we remove this entry?

Updated opcodes for invalid entries in e_vex ext 151 and 152 for consistency.

**Step 3: add OP_ enums**

Done

**Step 4: update OP_LAST**

Not needed since OP_LAST already points to the last enum.

**Step 5: decode_fast tables if necessary**

Not done

**Step 6: instr_create macros**

Added `1dst_3src` macros for `VCVTNE2PS2BF16` and `VDPBF16PS` since they write to operand 1 and read from mask register, operand 2, and operand 3.

Added `1dst_2src` macro for `VCVTNEPS2BF16` since it writes to operand 1 and reads from mask register and operand 2.

**Step 7: suite/tests/api/ir tests**

Added tests in ir_x86_3args_avx512_evex_mask.h and ir_x86_4args_avx512_evex_mask_C.h.

Added manual tests in ir_x86.c (hard-coded zeroing prefix flag pending #5488)

**Step 8: binutils tests**

Added binutils tests. These represent the assembly instructions using DR API and match against the encoded bytes.
prasun3 added a commit to prasun3/dynamorio that referenced this issue Jun 6, 2022
* i5483: Added support for AVX512 BF16 instructions

**AVX512 bfloat16 instructions**

These are the three bfloat16 instructions.

VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data
```
EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2       Operand 3
A       Full    ModRM:reg (w)   EVEX.vvvv (r)   ModRM:r/m (r)
```
VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data
```
EVEX.128.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, xmm2/m128/m32bcst
EVEX.256.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, ymm2/m256/m32bcst
EVEX.512.F3.0F38.W0 72 /r VCVTNEPS2BF16 ymm1{k1}{z}, zmm2/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2
A       Full    ModRM:reg (w)   ModRM:r/m (r)
```

VDPBF16PS—Dot Product of BF16 Pairs Accumulated into Packed Single Precision
```
EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst
EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst
EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst

Op/En   Tuple   Operand 1       Operand 2       Operand 3
A       Full    ModRM:reg (w)   EVEX.vvvv (r)   ModRM:r/m (r)
```

**List of places to update**

From https://github.com/DynamoRIO/dynamorio/blob/master/core/ir/x86/opcode_api.h#L53

```
 * When adding new instructions, be sure to update all of these places:
 *   1) decode_table op_instr array
 *   2) decode_table decoding table entries
 *   3) OP_ enum (here) via x86opnums.pl
 *   4) update OP_LAST at end of enum here
 *   5) decode_fast tables if necessary (they are conservative)
 *   6) instr_create macros
 *   7) suite/tests/api/ir* tests
 *   8) add binutils tests in third_party/binutils/test_decenc

```

**Step 1: update `op_instr` array**

Added entries to `op_instr`. These point directly to `evex_Wb_extensions` since these instructions only have `evex` encoding.

**Step 2: add decode_table entries**

   - updated `third_byte_38` table to point to `prefix_extensions` since these instructions have common opcodes and differ in prefix.
   - added entries in `prefix_extensions` to point to appropriate vex/evex entries
   - added entries in `evex_Wb_extensions`

The instructions `VCVTNEPS2BF16` and `VCVTNE2PS2BF16` have three byte opcodes starting with `0f 38` so the decoder looks at `third_byte_38[third_byte_38_index[opcode]]`. Since these instructions have the same opcode (`72`) and differ only in the prefix (`f2/f3`), we need to point the `third_byte_38` to `prefix_extensions` which in turn points to the appropriate `EVEX_Wb` entries.

The instruction `VDPBF16PS` has the same opcode (52) as the VNNI instruction `vpdpwsd` and they differ only in the prefix (`F3/66`). We need to update that entry to point to `prefix_extensions` instead of `e_vex_extensions`. This causes the `e_vex_extensions` entry ( `e_vex ext 151`) to be orphaned - do we remove this entry?

Updated opcodes for invalid entries in e_vex ext 151 and 152 for consistency.

**Step 3: add OP_ enums**

Done

**Step 4: update OP_LAST**

Not needed since OP_LAST already points to the last enum.

**Step 5: decode_fast tables if necessary**

Not done

**Step 6: instr_create macros**

Added `1dst_3src` macros for `VCVTNE2PS2BF16` and `VDPBF16PS` since they write to operand 1 and read from mask register, operand 2, and operand 3.

Added `1dst_2src` macro for `VCVTNEPS2BF16` since it writes to operand 1 and reads from mask register and operand 2.

**Step 7: suite/tests/api/ir tests**

Added tests in ir_x86_3args_avx512_evex_mask.h and ir_x86_4args_avx512_evex_mask_C.h.

Added manual tests in ir_x86.c (hard-coded zeroing prefix flag pending DynamoRIO#5488)

**Step 8: binutils tests**

Added binutils tests. These represent the assembly instructions using DR API and match against the encoded bytes.
@abhinav92003
Copy link
Contributor

nor is zero-masking handled in scatter-gather expansion.

Looking more closely for the AVX512 gather (https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq), looks like it doesn't support zeroing-masking, and performs merging-masking always. So the scatter/gather expansion doesn't need to worry about this.

@derekbruening
Copy link
Contributor Author

In discussions about how to represent the similar zero vs merging in AArch64 SVE in PR #5718 we decided that putting a flag on the predicate/mask register operand is the way to go, rather than the whole-instruction prefix flag proposed here.
The predicate/mask register is already controlling the operation by selecting the sources, so adding the zero vs merge semantics on just that operand seems reasonable.

@derekbruening
Copy link
Contributor Author

For AArch64 SVE we now have DR_OPND_IS_MERGE_PREDICATE and DR_OPND_IS_ZERO_PREDICATE so I assume we would use those on x86 too?

@abhinav92003
Copy link
Contributor

Makes sense to use the same ones on x86. Any reason we shouldn't?

Interesting to note that x86 gathers are always merging but AArch64 gathers are zeroing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants