-
Notifications
You must be signed in to change notification settings - Fork 562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add zero-masking vs merge-masking to IR and handle in scatter expansion #5488
Comments
I see that x86 prefixes are somewhat similar to OP_sys on AArch64. But there are two key differences that I think justify a different design (instead of splitting the opcodes):
I vote to expose the Tools should really remember to look at the prefixes if they care. The x86 ISA also has prefixes, so the concept is not specific to DR. |
* i5483: Added support for AVX512 BF16 instructions **AVX512 bfloat16 instructions** These are the three bfloat16 instructions. VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data ``` EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst Op/En Tuple Operand 1 Operand 2 Operand 3 A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) ``` VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data ``` EVEX.128.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, xmm2/m128/m32bcst EVEX.256.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, ymm2/m256/m32bcst EVEX.512.F3.0F38.W0 72 /r VCVTNEPS2BF16 ymm1{k1}{z}, zmm2/m512/m32bcst Op/En Tuple Operand 1 Operand 2 A Full ModRM:reg (w) ModRM:r/m (r) ``` VDPBF16PS—Dot Product of BF16 Pairs Accumulated into Packed Single Precision ``` EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst Op/En Tuple Operand 1 Operand 2 Operand 3 A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) ``` **List of places to update** From https://github.com/DynamoRIO/dynamorio/blob/master/core/ir/x86/opcode_api.h#L53 ``` * When adding new instructions, be sure to update all of these places: * 1) decode_table op_instr array * 2) decode_table decoding table entries * 3) OP_ enum (here) via x86opnums.pl * 4) update OP_LAST at end of enum here * 5) decode_fast tables if necessary (they are conservative) * 6) instr_create macros * 7) suite/tests/api/ir* tests * 8) add binutils tests in third_party/binutils/test_decenc ``` **Step 1: update `op_instr` array** Added entries to `op_instr`. These point directly to `evex_Wb_extensions` since these instructions only have `evex` encoding. **Step 2: add decode_table entries** - updated `third_byte_38` table to point to `prefix_extensions` since these instructions have common opcodes and differ in prefix. - added entries in `prefix_extensions` to point to appropriate vex/evex entries - added entries in `evex_Wb_extensions` The instructions `VCVTNEPS2BF16` and `VCVTNE2PS2BF16` have three byte opcodes starting with `0f 38` so the decoder looks at `third_byte_38[third_byte_38_index[opcode]]`. Since these instructions have the same opcode (`72`) and differ only in the prefix (`f2/f3`), we need to point the `third_byte_38` to `prefix_extensions` which in turn points to the appropriate `EVEX_Wb` entries. The instruction `VDPBF16PS` has the same opcode (52) as the VNNI instruction `vpdpwsd` and they differ only in the prefix (`F3/66`). We need to update that entry to point to `prefix_extensions` instead of `e_vex_extensions`. This causes the `e_vex_extensions` entry ( `e_vex ext 151`) to be orphaned - do we remove this entry? Updated opcodes for invalid entries in e_vex ext 151 and 152 for consistency. **Step 3: add OP_ enums** Done **Step 4: update OP_LAST** Not needed since OP_LAST already points to the last enum. **Step 5: decode_fast tables if necessary** Not done **Step 6: instr_create macros** Added `1dst_3src` macros for `VCVTNE2PS2BF16` and `VDPBF16PS` since they write to operand 1 and read from mask register, operand 2, and operand 3. Added `1dst_2src` macro for `VCVTNEPS2BF16` since it writes to operand 1 and reads from mask register and operand 2. **Step 7: suite/tests/api/ir tests** Added tests in ir_x86_3args_avx512_evex_mask.h and ir_x86_4args_avx512_evex_mask_C.h. Added manual tests in ir_x86.c (hard-coded zeroing prefix flag pending #5488) **Step 8: binutils tests** Added binutils tests. These represent the assembly instructions using DR API and match against the encoded bytes.
* i5483: Added support for AVX512 BF16 instructions **AVX512 bfloat16 instructions** These are the three bfloat16 instructions. VCVTNE2PS2BF16—Convert Two Packed Single Data to One Packed BF16 Data ``` EVEX.128.F2.0F38.W0 72 /r VCVTNE2PS2BF16 xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.256.F2.0F38.W0 72 /r VCVTNE2PS2BF16 ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.512.F2.0F38.W0 72 /r VCVTNE2PS2BF16 zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst Op/En Tuple Operand 1 Operand 2 Operand 3 A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) ``` VCVTNEPS2BF16—Convert Packed Single Data to Packed BF16 Data ``` EVEX.128.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, xmm2/m128/m32bcst EVEX.256.F3.0F38.W0 72 /r VCVTNEPS2BF16 xmm1{k1}{z}, ymm2/m256/m32bcst EVEX.512.F3.0F38.W0 72 /r VCVTNEPS2BF16 ymm1{k1}{z}, zmm2/m512/m32bcst Op/En Tuple Operand 1 Operand 2 A Full ModRM:reg (w) ModRM:r/m (r) ``` VDPBF16PS—Dot Product of BF16 Pairs Accumulated into Packed Single Precision ``` EVEX.128.F3.0F38.W0 52 /r VDPBF16PS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst EVEX.256.F3.0F38.W0 52 /r VDPBF16PS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst EVEX.512.F3.0F38.W0 52 /r VDPBF16PS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst Op/En Tuple Operand 1 Operand 2 Operand 3 A Full ModRM:reg (w) EVEX.vvvv (r) ModRM:r/m (r) ``` **List of places to update** From https://github.com/DynamoRIO/dynamorio/blob/master/core/ir/x86/opcode_api.h#L53 ``` * When adding new instructions, be sure to update all of these places: * 1) decode_table op_instr array * 2) decode_table decoding table entries * 3) OP_ enum (here) via x86opnums.pl * 4) update OP_LAST at end of enum here * 5) decode_fast tables if necessary (they are conservative) * 6) instr_create macros * 7) suite/tests/api/ir* tests * 8) add binutils tests in third_party/binutils/test_decenc ``` **Step 1: update `op_instr` array** Added entries to `op_instr`. These point directly to `evex_Wb_extensions` since these instructions only have `evex` encoding. **Step 2: add decode_table entries** - updated `third_byte_38` table to point to `prefix_extensions` since these instructions have common opcodes and differ in prefix. - added entries in `prefix_extensions` to point to appropriate vex/evex entries - added entries in `evex_Wb_extensions` The instructions `VCVTNEPS2BF16` and `VCVTNE2PS2BF16` have three byte opcodes starting with `0f 38` so the decoder looks at `third_byte_38[third_byte_38_index[opcode]]`. Since these instructions have the same opcode (`72`) and differ only in the prefix (`f2/f3`), we need to point the `third_byte_38` to `prefix_extensions` which in turn points to the appropriate `EVEX_Wb` entries. The instruction `VDPBF16PS` has the same opcode (52) as the VNNI instruction `vpdpwsd` and they differ only in the prefix (`F3/66`). We need to update that entry to point to `prefix_extensions` instead of `e_vex_extensions`. This causes the `e_vex_extensions` entry ( `e_vex ext 151`) to be orphaned - do we remove this entry? Updated opcodes for invalid entries in e_vex ext 151 and 152 for consistency. **Step 3: add OP_ enums** Done **Step 4: update OP_LAST** Not needed since OP_LAST already points to the last enum. **Step 5: decode_fast tables if necessary** Not done **Step 6: instr_create macros** Added `1dst_3src` macros for `VCVTNE2PS2BF16` and `VDPBF16PS` since they write to operand 1 and read from mask register, operand 2, and operand 3. Added `1dst_2src` macro for `VCVTNEPS2BF16` since it writes to operand 1 and reads from mask register and operand 2. **Step 7: suite/tests/api/ir tests** Added tests in ir_x86_3args_avx512_evex_mask.h and ir_x86_4args_avx512_evex_mask_C.h. Added manual tests in ir_x86.c (hard-coded zeroing prefix flag pending DynamoRIO#5488) **Step 8: binutils tests** Added binutils tests. These represent the assembly instructions using DR API and match against the encoded bytes.
Looking more closely for the AVX512 gather (https://www.felixcloutier.com/x86/vpgatherdd:vpgatherdq), looks like it doesn't support zeroing-masking, and performs merging-masking always. So the scatter/gather expansion doesn't need to worry about this. |
In discussions about how to represent the similar zero vs merging in AArch64 SVE in PR #5718 we decided that putting a flag on the predicate/mask register operand is the way to go, rather than the whole-instruction prefix flag proposed here. |
For AArch64 SVE we now have |
Makes sense to use the same ones on x86. Any reason we shouldn't? Interesting to note that x86 gathers are always merging but AArch64 gathers are zeroing. |
Today the AVX-512 EVEX.z bit (
{z}
in assembler syntax when set to 1) controls whether zero elements in the mask are zeroed or not in the output. This is not represented in DR's IR, nor is zero-masking handled in scatter-gather expansion. This issue covers addressing both.Generally, we want the IR to be an abstraction, mapping ISA encoding details into general concepts. The various prefixes in x86 map into operand size differences or opcode differences today. This zero-masking prefix bit doesn't affect just one operand: it affects the operation. The precedent there is to split the opcode. We have separate opcodes for
OP_rep_movs
vsOP_movs
. This is similar to the "sub-opcode" numeric values indicating behavior in ARM (see the discussion at #4386 (comment) regarding opcode philosophy of separate opcodes for separate semantics; see also #4388).We do expose some x86 encoding prefixes in
instr_t.prefixes
today but they generally do not change the semantics enough that most tools, including taint-tracking tools, can completely ignore them. Part of me wants to get rid ofinstr_t.prefixes
entirely (other than predication), so that tools don't have to worry about this separate set of flags controlling behavior, and split up opcodes where behavior differs. But if we did that for the zero-masking we may have a huge number of split opcodes and compatibility issues since the existing opcodes have been public for a while. So maybe embracing the prefixes is an ok solution for these behavior changes that apply to many different opcodes. I would name this something likePREFIX_MASK_ZERO
though to try to be cross-platform if another ISA had the same concept, and not put anything about "EVEX" in there.The text was updated successfully, but these errors were encountered: