Second PR Description
SIMD instructions have been implemented in files under plugins/arm/semantics with the aarch64-simd- prefix, in the aarch64 package. This is because bap only looks in the top level of the semantics folder, so files placed in a simd subdirectory would not be recognised.
For FP instructions soon to be implemented, this approach (with prefix aarch64-fp) will also be used.
For instructions in the CASP and LDn families, LLVM gives BAP register groups like X0_X1 or Q0_Q1_Q2, which used to require a large switch statement like the following to extract the actual register:
(defun first-reg-in-group (r-pair)
  (case (symbol r-pair)
    'X0_X1 X0
    'X1_X2 X1
    ;; ...
    'X30_X31 X30))
A new Primus Lisp primitive, (nth-reg-in-group sym n), returns the nth register (counting from 0) in a register group passed in as a symbol, sym.
For example, (nth-reg-in-group 'D0_D1_D2_D3 2) returns D2.
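To illustrate the lookup that nth-reg-in-group performs, here is a minimal Python sketch. It assumes group names are register names joined by underscores, as in the examples above. The function nth_reg_in_group is a hypothetical Python stand-in, not the primitive's actual implementation.

```python
def nth_reg_in_group(group: str, n: int) -> str:
    """Return the n-th (0-based) register name of an LLVM-style register group.

    Assumption: the group name is the member register names joined by '_'.
    """
    regs = group.split("_")
    if not 0 <= n < len(regs):
        raise IndexError(f"{group} has no register at index {n}")
    return regs[n]


print(nth_reg_in_group("D0_D1_D2_D3", 2))  # prints D2
```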
(:warning: there is a slight problem with the implementation, please see the notes on CASP below)
There are a lot of instructions implemented in this PR; these are sufficient to fully lift the cntlm binary (cross-compiled for aarch64) except for two FMOV variants. This was tested using the --print-missing option to bap disassemble (#1410).
Instructions added are listed here, with some containing the BIL code hidden under a collapsible menu.
ADDS*ri, ADDS*rs, ADD*rx, ADDXrx64
Instruction: adds x0, x1, x2
Opcode: 20 00 02 ab
{
#3 := R2 + R1
NF := 63:63[#3]
VF := 63:63[R1] & 63:63[R2] & ~63:63[#3] | ~63:63[R1] & ~63:63[R2] &
63:63[#3]
ZF := #3 = 0
CF := 63:63[R1] & 63:63[R2] | 63:63[R2] & ~63:63[#3] | 63:63[R1] &
~63:63[#3]
R0 := #3
}
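The flag expressions in the BIL above can be cross-checked with a small Python model. This is only a sketch for 64-bit operands; adds64 and its return layout are hypothetical names chosen for illustration, and the formulas are transcribed from the BIL output rather than from the Lisp source.

```python
MASK64 = (1 << 64) - 1


def msb(x: int) -> int:
    """Bit 63 of a 64-bit value."""
    return (x >> 63) & 1


def adds64(r1: int, r2: int):
    """Model of `adds xd, xn, xm`: returns (result, flags)."""
    res = (r1 + r2) & MASK64
    a, b, s = msb(r1), msb(r2), msb(res)
    flags = {
        "N": s,
        "Z": int(res == 0),
        # Carry out of bit 63, written in terms of the sign bits as in the BIL.
        "C": (a & b) | (b & (s ^ 1)) | (a & (s ^ 1)),
        # Signed overflow: operands share a sign that the result does not.
        "V": (a & b & (s ^ 1)) | ((a ^ 1) & (b ^ 1) & s),
    }
    return res, flags
```

For example, adding 1 to 0xFFFFFFFFFFFFFFFF wraps to 0 and sets Z and C, while adding 1 to 0x7FFFFFFFFFFFFFFF sets N and V.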
SUBXr*, SUBSXrx, SUBSXrx64
Similar to ADDS.
UMADDLrr, SMADDLrr, UMSUBLrr, SMSUBLrr
Instruction: umaddl x0, w1, w2, x3
Opcode: 20 0c a2 9b
{
R0 := R3 + extend:64[31:0[R1] * 31:0[R2]]
}
The rest are similar.
UMULHrr
Instruction: umulh x0, x1, x2
Opcode: 20 7c c2 9b
{
R0 := high:64[pad:128[R1] * pad:128[R2]]
}
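As a sanity check on the UMULH semantics, the following Python sketch computes the high 64 bits of the unsigned 128-bit product, mirroring the pad/multiply/high steps in the BIL. The function name umulh is only an illustrative stand-in.

```python
MASK64 = (1 << 64) - 1


def umulh(r1: int, r2: int) -> int:
    """High 64 bits of the unsigned 128-bit product of two 64-bit values."""
    product = (r1 & MASK64) * (r2 & MASK64)  # Python ints do not overflow
    return product >> 64


print(umulh(1 << 63, 4))  # prints 2
```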
ADR
Instruction: adr x0, 0xABCD
Opcode: 60 5e 05 30
{
mem := mem with [R0, el]:u64 <- 0xABCD
}
CASP family ⚠️
This uses the load-acquire and store-release intrinsics as described in #1458.
Instruction: caspal x0, x1, x2, x3, [x4]
Opcode: 82 fc 60 48
{
#0 := mem[R4, el]:u128
#1 := low:64[#0]
#2 := high:64[#0]
call(intrinsic:load-acquire)
#4 := #0 = (R1.R0)
if (#4) {
call(intrinsic:store-release)
mem := mem with [R4, el]:u128 <- R3.R2
}
R0 := #1
R1 := #2
}
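The data flow of the BIL above can be summarised with a short Python model. It is a sketch only: it assumes little-endian memory (as the `el` annotations indicate), represents memory as a bytearray, and ignores the ordering effects of the load-acquire and store-release intrinsics. The name casp and its argument order are hypothetical.

```python
MASK64 = (1 << 64) - 1


def casp(mem: bytearray, addr: int, r0: int, r1: int, r2: int, r3: int):
    """Compare-and-swap pair: returns the new values of (r0, r1).

    The 128-bit value at addr is compared with r1.r0 (r1 in the high half);
    on a match, r3.r2 is stored. The old memory value is always returned.
    """
    old = int.from_bytes(mem[addr:addr + 16], "little")
    expected = ((r1 & MASK64) << 64) | (r0 & MASK64)
    if old == expected:
        new = ((r3 & MASK64) << 64) | (r2 & MASK64)
        mem[addr:addr + 16] = new.to_bytes(16, "little")
    return old & MASK64, old >> 64
```

The returned pair corresponds to the final assignments `R0 := #1` and `R1 := #2`, which happen whether or not the store took place.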
(:warning:) The nth-reg-in-group primitive is also used to extract the registers in the xa_xb pairs.
However, its implementation prevents the following expression from reifying correctly:
(concat
  (nth-reg-in-group 'X0_X1 0)
  (nth-reg-in-group 'X0_X1 1))
The expected result is X0.X1, but printing out the result with msg gives 0x30000000000000004.
As a temporary workaround, a helper function (register-pair-concat r-pair) has been defined, containing a large switch statement with cases for each 'Xa_Xb, but this is not ideal.
Some advice on how to resolve this would be much appreciated.
BIL code has not been provided for most instructions in this category because there are many of them and the differences between them are minute.
Loads:
- LDR*ro*, LDR*pre, LDR*post, LDR*ui
- LDRBBro*, LDRBBpre, LDRBBpost
- LDRHHro*, LDRHHpre, LDRHHpost, LDRHHui
- rest of LDP*pre and LDP*post, LDP*i
- LDRSWui, LDRSWro*
- LDURBBi, LDURHHi
- LDURSB*i, LDURSH*i, LDURSWi, LDUR*i
Stores:
- STR*ro*, STR*pre, STR*post, STRHHui
- STRBBro*, STRBBpre, STRBBpost
- STP*pre, STP*post, STP*i
- STURHHi, STURBBi
Other:
EXTR*rri
Instruction: extr x0, x1, x2, #5
Opcode: 20 14 c2 93
{
R0 := 68:5[R1.R2]
}
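The extract-after-concat form of EXTR can be modelled in a few lines of Python. This sketch covers the 64-bit variant only; extr64 is an illustrative name, not part of the PR.

```python
MASK64 = (1 << 64) - 1


def extr64(rn: int, rm: int, lsb: int) -> int:
    """64 bits of the concatenation rn.rm starting at bit lsb (0 <= lsb < 64)."""
    concat = ((rn & MASK64) << 64) | (rm & MASK64)
    return (concat >> lsb) & MASK64


# With both sources equal, EXTR behaves as a rotate right.
print(hex(extr64(0x0123456789ABCDEF, 0x0123456789ABCDEF, 8)))  # prints 0xef0123456789abcd
```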
ANDS*ri, ANDS*rs
Instruction: ands x0, x1, x2, LSL #2
Opcode: 20 08 02 ea
{
#3 := R2 << 2
#4 := R1 & #3
NF := 63:63[#4]
ZF := #4 = 0
CF := 0
VF := 0
R0 := #4
}
BIC*r, BICS*rs
Instruction: bic x0, x1, x2, asr #4
Opcode: 20 10 a2 8a
{
#3 := R2 ~>> 4
#4 := ~#3
R0 := R1 & #4
}
REV*r, REV16*r, REV32Xr
Note that REV16*r reverses the bytes within each 16-bit container, and REV32Xr likewise within each 32-bit container.
Instruction: rev x0, x1
Opcode: 20 0c c0 da
{
R0 :=
7:0[R1].15:8[R1].23:16[R1].31:24[R1].39:32[R1].47:40[R1].55:48[R1].63:56[R1]
}
Instruction: rev16 w0, w1
Opcode: 20 04 c0 5a
{
R0 := high:32[R0].23:16[R1].31:24[R1].7:0[R1].15:8[R1]
}
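The difference between REV and REV16 is easy to see in a small Python model. This is a sketch with hypothetical function names; it works on plain integers and does not model how the result is written back to the destination register.

```python
def rev(x: int, width: int = 64) -> int:
    """Reverse all bytes of a width-bit value."""
    x &= (1 << width) - 1
    return int.from_bytes(x.to_bytes(width // 8, "little"), "big")


def rev16(x: int, width: int = 32) -> int:
    """Swap the two bytes inside every 16-bit container of a width-bit value."""
    out = 0
    for shift in range(0, width, 16):
        half = (x >> shift) & 0xFFFF
        swapped = ((half & 0xFF) << 8) | (half >> 8)
        out |= swapped << shift
    return out


print(hex(rev(0x0102030405060708)))  # prints 0x807060504030201
print(hex(rev16(0x11223344)))        # prints 0x22114433
```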
ASRV*r, LSRV*r, LSLV*r, RORV*r
Nothing special about these.
RBIT*r
Instruction: rbit w0, w1
Opcode: 20 00 c0 5a
{
R0 :=
0:0[R1].1:1[R1].2:2[R1].3:3[R1].4:4[R1].5:5[R1].6:6[R1].7:7[R1].8:8[R1].9:9[R1].10:10[R1].11:11[R1].12:12[R1].13:13[R1].14:14[R1].15:15[R1].16:16[R1].17:17[R1].18:18[R1].19:19[R1].20:20[R1].21:21[R1].22:22[R1].23:23[R1].24:24[R1].25:25[R1].26:26[R1].27:27[R1].28:28[R1].29:29[R1].30:30[R1].31:31[R1]
}
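The long concatenation above is simply a bit reversal. A compact Python model, given here as a sketch with the hypothetical name rbit, makes that explicit:

```python
def rbit(x: int, width: int = 32) -> int:
    """Reverse the order of the low `width` bits of x."""
    out = 0
    for i in range(width):
        # Bit i of the input becomes bit (width - 1 - i) of the output.
        out = (out << 1) | ((x >> i) & 1)
    return out


print(hex(rbit(0x0000000F)))  # prints 0xf0000000
```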
BRK
This passes the label argument to a software-breakpoint intrinsic.
Instruction: brk 0xABCD
Opcode: a0 79 35 d4
{
intrinsic:x0 := 0xABCD
call(intrinsic:software-breakpoint)
}
We use . to indicate one of B, H, S, D, Q instead of * to avoid name conflicts with existing non-SIMD macros.
Here, * is reused to indicate a number for the element count or element size.
ADDv*i*, SUBv*i*, MULv*i*
Note: + has a higher precedence in the textual representation than ., so although the spacing in the BIL output below is misleading, the output is correct.
Instruction: add v0.8h, v1.8h, v2.8h
Opcode: 20 84 62 4e
{
V0 := 127:112[V1] + 127:112[V2].111:96[V1] + 111:96[V2].95:80[V1] +
95:80[V2].79:64[V1] + 79:64[V2].63:48[V1] + 63:48[V2].47:32[V1] +
47:32[V2].31:16[V1] + 31:16[V2].15:0[V1] + 15:0[V2]
}
The rest are similar and only differ in the binary operation.
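The lane-wise structure of these operations can be sketched in Python as follows. The helper vec_binop and its parameters are hypothetical and exist only to illustrate that each lane is computed independently and wraps at the element size.

```python
import operator


def vec_binop(op, v1: int, v2: int, esize: int, lanes: int) -> int:
    """Apply op to each esize-bit lane of v1 and v2, wrapping per lane."""
    mask = (1 << esize) - 1
    out = 0
    for i in range(lanes):
        a = (v1 >> (i * esize)) & mask
        b = (v2 >> (i * esize)) & mask
        out |= (op(a, b) & mask) << (i * esize)
    return out


# add v0.8h, v1.8h, v2.8h: a carry out of lane 1 does not leak into lane 2.
print(hex(vec_binop(operator.add, 0xFFFF0001, 0x00010001, 16, 8)))  # prints 0x2
```

Passing operator.sub or operator.mul gives the corresponding model for the SUB and MUL variants.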
This PR implements all of the SIMD load instructions; see the PR diff for a full list. Instructions with interesting BIL output are listed below.
LDNP.i
As an instruction with non-temporal properties, LDNP relaxes the order of its memory accesses. This is represented as a call to a 'non-temporal-hint' intrinsic where the address is passed as a parameter.
Instruction: ldnp s0, s1, [x3, 4]
Opcode: 60 84 40 2c
{
intrinsic:x0 := R3 + 4
call(intrinsic:non-temporal-hint)
V0 := high:96[V0].mem[R3 + 4, el]:u32
intrinsic:x0 := R3 + 8
call(intrinsic:non-temporal-hint)
V1 := high:96[V1].mem[R3 + 8, el]:u32
}
LD..v._POST (e.g. ld2 {v0.4s, v1.4s}, [x2], x3)
Like CASP, this instruction family receives register groups from LLVM.
The BIL code performs each memory access separately to accurately model the interleaving done by the processor.
This may not be ideal for generated code size; advice on making this level of detail toggleable would be appreciated.
Instruction: ld2 {v0.4s, v1.4s}, [x2], x3
Opcode: 40 88 c3 4c
{
#1 := mem[R2, el]:u32
#3 := mem[R2 + 4, el]:u32
#5 := #1.mem[R2 + 8, el]:u32
#7 := #3.mem[R2 + 0xC, el]:u32
#9 := #5.mem[R2 + 0x10, el]:u32
#11 := #7.mem[R2 + 0x14, el]:u32
#13 := #9.mem[R2 + 0x18, el]:u32
#15 := #11.mem[R2 + 0x1C, el]:u32
V0 := #13
V1 := #15
R2 := R2 + R3
}
Similar expansions apply to the rest of the LDn family.
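The address pattern of the interleaved loads can be sketched in Python. This model only shows which memory element goes to which register, returning a list of elements per register in memory order; it does not model how the elements are packed into the 128-bit register value. The function name and parameters are hypothetical.

```python
def ldn_deinterleave(mem: bytes, n_regs: int, lanes: int, esize_bytes: int):
    """Split an interleaved little-endian buffer into per-register element lists."""
    regs = [[] for _ in range(n_regs)]
    for e in range(lanes):
        for r in range(n_regs):
            # Consecutive memory elements go to consecutive registers.
            off = (e * n_regs + r) * esize_bytes
            regs[r].append(int.from_bytes(mem[off:off + esize_bytes], "little"))
    return regs


# Eight 32-bit values 0..7, as read by ld2 {v0.4s, v1.4s}.
buf = b"".join(i.to_bytes(4, "little") for i in range(8))
print(ldn_deinterleave(buf, 2, 4, 4))  # prints [[0, 2, 4, 6], [1, 3, 5, 7]]
```

The first list corresponds to the accesses at offsets 0, 8, 0x10 and 0x18 in the BIL above, and the second to offsets 4, 0xC, 0x14 and 0x1C.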
ANDv*i*, EORv*i*, NOTv*i*, ORRv*i*, ORNv*i*
These are done on the whole register Vn.
Instruction: not v0.16b, v1.16b
Opcode: 20 58 20 6e
{
V0 := ~V1
}
INSvi32gpr, INSvi32lane
The implementation uses bitmasks and bit shifts to insert the vector elements, but could equivalently use extract and concat. Please advise if this is preferred.
Instruction: ins v0.s[1], v1.s[1]
Opcode: 20 24 0c 6e
{
#1 := 63:32[V1]
#5 := V0 & 0xFFFFFFFFFFFFFFFF00000000FFFFFFFF
#6 := #5 | 0xFFFFFFFF00000000 & pad:128[#1] << 0x20
V0 := #6
}
Instruction: ins v0.s[1], w1
Opcode: 20 1c 0c 4e
{
#3 := V0 & 0xFFFFFFFFFFFFFFFF00000000FFFFFFFF
#4 := #3 | 0xFFFFFFFF00000000 & pad:128[31:0[R1]] << 0x20
V0 := #4
}
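To make the claimed equivalence concrete, the following Python sketch implements the insertion both ways, with masks and shifts and with extract and concat, so they can be compared. Both function names are hypothetical, and the model assumes a 128-bit destination with 32-bit elements by default.

```python
MASK128 = (1 << 128) - 1


def ins_mask(vd: int, elem: int, index: int, esize: int = 32) -> int:
    """Insert elem into lane `index` using a bitmask and a shift."""
    lane_mask = ((1 << esize) - 1) << (index * esize)
    cleared = vd & MASK128 & ~lane_mask
    return cleared | ((elem << (index * esize)) & lane_mask)


def ins_concat(vd: int, elem: int, index: int, esize: int = 32) -> int:
    """Insert elem into lane `index` by extracting and re-concatenating."""
    lo_bits = index * esize
    hi = (vd & MASK128) >> (lo_bits + esize)  # bits above the lane
    lo = vd & ((1 << lo_bits) - 1)            # bits below the lane
    elem &= (1 << esize) - 1
    return (hi << (lo_bits + esize)) | (elem << lo_bits) | lo


v = 0x00112233445566778899AABBCCDDEEFF
print(hex(ins_mask(v, 0xDEADBEEF, 1)))  # prints 0x112233445566778deadbeefccddeeff
```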
MOVIv*i*, MOVIv*b_ns
Instruction: movi v0.4h, 0xAB
Opcode: 60 85 05 0f
{
V0 := 0xAB00AB00AB00AB
}
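The constant in the BIL above is the immediate replicated into every lane. A minimal Python sketch of that replication follows; movi_replicate is a hypothetical name, and shifted or byte-mask immediate forms are not modelled.

```python
def movi_replicate(imm: int, esize: int, lanes: int) -> int:
    """Copy imm (truncated to esize bits) into each of `lanes` lanes."""
    elem = imm & ((1 << esize) - 1)
    out = 0
    for i in range(lanes):
        out |= elem << (i * esize)
    return out


# movi v0.4h, 0xAB
print(hex(movi_replicate(0xAB, 16, 4)))  # prints 0xab00ab00ab00ab
```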
EXTv*i*
This is implemented literally as described in the ISA with extract after concat.
Instruction: ext v0.16b, v1.16b, v2.16b, 3
Opcode: 20 18 02 6e
{
V0 := 151:24[V2.V1]
}
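The same extract-after-concat can be written in Python to check the byte selection. This sketch models the 128-bit (16b) form only, and ext128 is an illustrative name.

```python
MASK128 = (1 << 128) - 1


def ext128(vn: int, vm: int, index: int) -> int:
    """128 bits of vm.vn starting at byte `index` (0 <= index < 16)."""
    concat = ((vm & MASK128) << 128) | (vn & MASK128)
    return (concat >> (8 * index)) & MASK128


# With vn holding bytes 0..15 and vm bytes 16..31, index 3 selects bytes 3..18.
vn = int.from_bytes(bytes(range(16)), "little")
vm = int.from_bytes(bytes(range(16, 32)), "little")
print(list(ext128(vn, vm, 3).to_bytes(16, "little")))
```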
Most of these have nearly identical implementations to the non-SIMD STP variants.
STR.ro*, STR.pre, STR.post, STR.ui
For STR.ro*:
Instruction: str q0, [x1, x2]
Opcode: 20 68 a2 3c
{
mem := mem with [R1 + R2, el]:u128 <- V0
}
For STR.post (pre is similar):
Instruction: str q0, [x1], 123
Opcode: 20 b4 87 3c
{
mem := mem with [R1, el]:u128 <- V0
R1 := R1 + 0x7B
}
For STR.ui:
Instruction: str q0, [x1, 0xAB]
Opcode: 20 b0 8a 3c
{
mem := mem with [R1 + 0xAB, el]:u128 <- V0
}
STP.pre, STP.post, STP.i
Instruction: stp q0, q1, [x2], #16
Opcode: 40 84 80 ac
{
#3 := R2
mem := mem with [#3, el]:u128 <- V0
mem := mem with [#3 + 0x10, el]:u128 <- V1
R2 := #3 + 0x10
}
STUR.i
Instruction: stur q0, [x1, 0xAB]
Opcode: 20 b0 8a 3c
{
mem := mem with [R1 + 0xAB, el]:u128 <- V0
}