Second PR Description

DukMastaaa edited this page Aug 3, 2022 · 10 revisions

File structure

SIMD instructions are implemented in files under plugins/arm/semantics with the aarch64-simd- prefix, in the aarch64 package. A file-name prefix is used instead of a simd subdirectory because bap only looks at the top level of the semantics folder, so files in a subdirectory would not be recognised. The FP instructions that are still to be implemented will follow the same approach, with the prefix aarch64-fp.

nth-reg-in-group primitive

For instructions in the CASP and LDn families, LLVM gives BAP register groups such as X0_X1 or Q0_Q1_Q2. Extracting an individual register from such a group previously required a large switch statement like the following:

(defun first-reg-in-group (r-pair)
  (case (symbol r-pair)
    'X0_X1 X0
    'X1_X2 X1
    ;; ...
    'X30_X31 X30))

A new Primus Lisp primitive, (nth-reg-in-group sym n), returns the nth register (counting from 0) in a register group passed in as a symbol, sym. For example, (nth-reg-in-group 'D0_D1_D2_D3 2) returns D2. (⚠️ There is a slight problem with the implementation; please see the notes on CASP below.)
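
The lookup can be modelled in a few lines of Python. This is only an illustrative sketch, not the primitive's actual implementation: it assumes that a group name is the member register names joined by underscores, as in the examples above, and the function name merely mirrors the primitive.

```python
def nth_reg_in_group(group: str, n: int) -> str:
    """Return the n-th (0-indexed) register name in an LLVM-style group name.

    Assumption: the group name is the register names joined by '_'.
    """
    regs = group.split("_")
    if not 0 <= n < len(regs):
        raise IndexError(f"{group} has no register at index {n}")
    return regs[n]


if __name__ == "__main__":
    print(nth_reg_in_group("D0_D1_D2_D3", 2))
    print(nth_reg_in_group("X0_X1", 0))
```

A string split like this replaces one switch case per register group, which is the motivation given for the primitive.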

Non-SIMD Instructions

This PR implements a large number of instructions. Together they are sufficient to fully lift the cntlm binary (cross-compiled for aarch64), with the exception of two FMOV variants. This was tested using the --print-missing option to bap disassemble (#1410).

The added instructions are listed below; for some of them, the BIL output is included.

Arithmetic

ADDS*ri, ADDS*rs, ADD*rx, ADDXrx64
Instruction: adds x0, x1, x2
Opcode: 20 00 02 ab
{
  #3 := R2 + R1
  NF := 63:63[#3]
  VF := 63:63[R1] & 63:63[R2] & ~63:63[#3] | ~63:63[R1] & ~63:63[R2] &
    63:63[#3]
  ZF := #3 = 0
  CF := 63:63[R1] & 63:63[R2] | 63:63[R2] & ~63:63[#3] | 63:63[R1] &
    ~63:63[#3]
  R0 := #3
}
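
The flag expressions in the BIL above can be cross-checked against the arithmetic definitions of carry and signed overflow. The following Python sketch is a standalone model written for this purpose, not code from the PR; the function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def msb(x: int) -> int:
    """Bit 63 of a 64-bit value (the 63:63[...] extract in the BIL)."""
    return (x >> 63) & 1


def adds_flags(r1: int, r2: int):
    """Return (result, N, Z, C, V) for a 64-bit ADDS, using the BIL's formulas."""
    res = (r1 + r2) & MASK64
    a, b, r = msb(r1), msb(r2), msb(res)
    n = r
    z = int(res == 0)
    # Overflow: both operands have the same sign and the result's sign differs.
    v = (a & b & (1 - r)) | ((1 - a) & (1 - b) & r)
    # Carry out of bit 63, expressed purely in terms of the three sign bits.
    c = (a & b) | (b & (1 - r)) | (a & (1 - r))
    return res, n, z, c, v


if __name__ == "__main__":
    print(adds_flags(MASK64, 1))
    print(adds_flags((1 << 63) - 1, 1))
```
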
SUBXr*, SUBSXrx, SUBSXrx64

Similar to ADDS.

UMADDLrr, SMADDLrr, UMSUBLrr, SMSUBLrr
Instruction: umaddl x0, w1, w2, x3
Opcode: 20 0c a2 9b
{
  R0 := R3 + extend:64[31:0[R1] * 31:0[R2]]
}

The rest are similar.

UMULHrr
Instruction: umulh x0, x1, x2
Opcode: 20 7c c2 9b
{
  R0 := high:64[pad:128[R1] * pad:128[R2]]
}
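
As a sanity check on the UMULH expression, the same computation (zero-extend both operands to 128 bits, multiply, keep the high 64 bits) can be written as a small Python model. This is a sketch for illustration only; the function name is not from the PR.

```python
MASK64 = (1 << 64) - 1


def umulh(x: int, y: int) -> int:
    """High 64 bits of the unsigned 128-bit product of two 64-bit values."""
    product = (x & MASK64) * (y & MASK64)  # Python ints do not overflow
    return (product >> 64) & MASK64


if __name__ == "__main__":
    print(hex(umulh(MASK64, MASK64)))
```
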
ADR
Instruction: adr x0, 0xABCD
Opcode: 60 5e 05 30
{
  mem := mem with [R0, el]:u64 <- 0xABCD
}

Atomic

CASP family ⚠️

This uses the load-acquire and store-release intrinsics as described in #1458.

Instruction: caspal x0, x1, x2, x3, [x4]
Opcode: 82 fc 60 48
{
  #0 := mem[R4, el]:u128
  #1 := low:64[#0]
  #2 := high:64[#0]
  call(intrinsic:load-acquire)
  #4 := #0 = (R1.R0)
  if (#4) {
    call(intrinsic:store-release)
    mem := mem with [R4, el]:u128 <- R3.R2
  }
  R0 := #1
  R1 := #2
}

⚠️ The nth-reg-in-group primitive is also used to extract the registers in the xa_xb pairs. However, its implementation prevents the following expression from reifying correctly:

(concat
  (nth-reg-in-group 'X0_X1 0)
  (nth-reg-in-group 'X0_X1 1))

The expected result is X0.X1, but printing out the result with msg gives 0x30000000000000004. As a temporary workaround, a helper function (register-pair-concat r-pair) has been defined, containing a large switch statement with cases for each 'Xa_Xb, but this is not ideal. Some advice on how to resolve this would be much appreciated.
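
For reference, the value a 64-bit concat should produce, and what the observed constant decomposes into, can be checked with plain integer arithmetic. The Python sketch below is only an arithmetic observation: it shows that 0x30000000000000004 splits into the 64-bit halves 3 and 4, but it makes no claim about where those two values come from inside the primitive.

```python
MASK64 = (1 << 64) - 1


def concat64(hi: int, lo: int) -> int:
    """Model of a 64-bit . 64-bit concat: hi occupies bits 127:64, lo bits 63:0."""
    return ((hi & MASK64) << 64) | (lo & MASK64)


def split128(value: int):
    """Inverse of concat64: return the (high, low) 64-bit halves."""
    return (value >> 64) & MASK64, value & MASK64


if __name__ == "__main__":
    observed = 0x30000000000000004  # the value printed by msg
    hi, lo = split128(observed)
    print(hex(hi), hex(lo))
```
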

Data movement

BIL code is not shown for most instructions in this category, because there are so many of them and they differ only in minor details.

Loads:

  • LDR*ro*, LDR*pre, LDR*post, LDR*ui
  • LDRBBro*, LDRBBpre, LDRBBpost
  • LDRHHro*, LDRHHpre, LDRHHpost, LDRHHui
  • rest of LDP*pre and LDP*post, LDP*i
  • LDRSWui, LDRSWro*
  • LDURBBi, LDURHHi
  • LDURSB*i, LDURSH*i, LDURSWi
  • LDUR*i

Stores:

  • STR*ro*, STR*pre, STR*post
  • STRHHui
  • STRBBro*, STRBBpre, STRBBpost
  • STP*pre, STP*post, STP*i
  • STURHHi, STURBBi

Other:

EXTR*rri
Instruction: extr x0, x1, x2, #5
Opcode: 20 14 c2 93
{
  R0 := 68:5[R1.R2]
}
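
The extract-after-concat form of EXTR can be modelled directly. The following Python sketch mirrors the BIL expression (bits lsb+63 down to lsb of Rn.Rm); the function name is illustrative, not from the PR.

```python
MASK64 = (1 << 64) - 1


def extr64(rn: int, rm: int, lsb: int) -> int:
    """64-bit EXTR: take bits [lsb+63:lsb] of the 128-bit value rn.rm."""
    if not 0 <= lsb < 64:
        raise ValueError("lsb must be in the range 0..63")
    pair = ((rn & MASK64) << 64) | (rm & MASK64)
    return (pair >> lsb) & MASK64


if __name__ == "__main__":
    print(hex(extr64(0x1, 0x0, 5)))
```

When both source registers are the same, this expression is a rotate right by lsb.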

Logical

ANDS*ri, ANDS*rs
Instruction: ands x0, x1, x2, LSL #2
Opcode: 20 08 02 ea
{
  #3 := R2 << 2
  #4 := R1 & #3
  NF := 63:63[#4]
  ZF := #4 = 0
  CF := 0
  VF := 0
  R0 := #4
}
BIC*r, BICS*rs
Instruction: bic x0, x1, x2, asr #4
Opcode: 20 10 a2 8a
{
  #3 := R2 ~>> 4
  #4 := ~#3
  R0 := R1 & #4
}
REV*r, REV16*r, REV32Xr

Note that REV16*r reverses the bytes within each 16-bit container, and REV32Xr does the same within each 32-bit container.

Instruction: rev x0, x1
Opcode: 20 0c c0 da
{
  R0 :=
    7:0[R1].15:8[R1].23:16[R1].31:24[R1].39:32[R1].47:40[R1].55:48[R1].63:56[R1]
}
Instruction: rev16 w0, w1
Opcode: 20 04 c0 5a
{
  R0 := high:32[R0].23:16[R1].31:24[R1].7:0[R1].15:8[R1]
}
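
The byte orderings in the two BIL listings above can be compared against a small Python model. This is an illustrative sketch only: rev16_w models just the low 32 bits that rev16 w0, w1 computes and says nothing about the upper half of the destination register.

```python
MASK64 = (1 << 64) - 1


def rev64(x: int) -> int:
    """REV on a 64-bit value: reverse all eight bytes."""
    return int.from_bytes((x & MASK64).to_bytes(8, "little"), "big")


def rev16_w(x: int) -> int:
    """REV16 on a 32-bit value: swap the two bytes inside each 16-bit half."""
    x &= 0xFFFFFFFF
    return ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)


if __name__ == "__main__":
    print(hex(rev64(0x0102030405060708)))
    print(hex(rev16_w(0x11223344)))
```
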
ASRV*r, LSRV*r, LSLV*r, RORV*r

Nothing special about these.

RBIT*r
Instruction: rbit w0, w1
Opcode: 20 00 c0 5a
{
  R0 :=
    0:0[R1].1:1[R1].2:2[R1].3:3[R1].4:4[R1].5:5[R1].6:6[R1].7:7[R1].8:8[R1].9:9[R1].10:10[R1].11:11[R1].12:12[R1].13:13[R1].14:14[R1].15:15[R1].16:16[R1].17:17[R1].18:18[R1].19:19[R1].20:20[R1].21:21[R1].22:22[R1].23:23[R1].24:24[R1].25:25[R1].26:26[R1].27:27[R1].28:28[R1].29:29[R1].30:30[R1].31:31[R1]
}
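
The long concat above places bit 0 of the source in the most significant position and bit 31 in the least significant one. A loop-based Python model of the same 32-bit reversal is sketched below; it is for illustration and is not code from the PR.

```python
def rbit32(x: int) -> int:
    """Reverse the bit order of a 32-bit value (bit i moves to bit 31 - i)."""
    out = 0
    for i in range(32):
        out |= ((x >> i) & 1) << (31 - i)
    return out


if __name__ == "__main__":
    print(hex(rbit32(0x0000000F)))
```
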

Special

BRK

This passes the label argument to a software-breakpoint intrinsic.

Instruction: brk 0xABCD
Opcode: a0 79 35 d4
{
  intrinsic:x0 := 0xABCD
  call(intrinsic:software-breakpoint)
}

SIMD Instructions

We use . to indicate one of B, H, S, D, Q instead of * to avoid name conflicts with existing non-SIMD macros.

Arithmetic

Here, * is additionally reused to stand for a number giving the element count or element size.

ADDv*i*, SUBv*i*, MULv*i*

Note: + has a higher precedence in the textual representation than ., so although the spacing in the BIL output below is misleading, the output is correct.

Instruction: add v0.8h, v1.8h, v2.8h
Opcode: 20 84 62 4e
{
  V0 := 127:112[V1] + 127:112[V2].111:96[V1] + 111:96[V2].95:80[V1] +
    95:80[V2].79:64[V1] + 79:64[V2].63:48[V1] + 63:48[V2].47:32[V1] +
    47:32[V2].31:16[V1] + 31:16[V2].15:0[V1] + 15:0[V2]
}

The rest are similar and only differ in the binary operation.
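
The lane-wise structure of these instructions (apply the operation to each pair of elements, truncate to the element size, then concatenate the results) can be sketched in Python as follows. The helper name and its parameters are hypothetical and chosen for illustration; they do not correspond to functions in the PR.

```python
import operator

def simd_binop(a: int, b: int, esize: int, op, width: int = 128) -> int:
    """Apply op to each esize-bit lane of two width-bit vectors held as ints."""
    mask = (1 << esize) - 1
    out = 0
    for shift in range(0, width, esize):
        # Extract one lane from each operand, combine, and truncate to esize.
        lane = op((a >> shift) & mask, (b >> shift) & mask) & mask
        out |= lane << shift
    return out


if __name__ == "__main__":
    # Lane 0 wraps around (0xFFFF + 1) without carrying into lane 1.
    print(hex(simd_binop(0x0001FFFF, 0x00020001, 16, operator.add)))
```

Because each lane is masked before being placed into the result, a carry or borrow in one lane never leaks into its neighbour.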

Loads

This PR implements all of the SIMD load instructions; see the PR diff for a full list. Instructions with interesting BIL output are listed below.

LDNP.i

As an instruction with non-temporal properties, LDNP relaxes the order of its memory accesses. This is represented as a call to a 'non-temporal-hint' intrinsic where the address is passed as a parameter.

Instruction: ldnp s0, s1, [x3, 4]
Opcode: 60 84 40 2c
{
  intrinsic:x0 := R3 + 4
  call(intrinsic:non-temporal-hint)
  V0 := high:96[V0].mem[R3 + 4, el]:u32
  intrinsic:x0 := R3 + 8
  call(intrinsic:non-temporal-hint)
  V1 := high:96[V1].mem[R3 + 8, el]:u32
}
LD..v._POST (e.g. ld2 {v0.4s, v1.4s}, [x2], x3)

Like CASP, this instruction family receives register groups from LLVM. The BIL code performs each memory access separately to accurately model the interleaving done by the processor. This may not be ideal for generated code size -- advice on making such levels of detail toggleable would be appreciated.

Instruction: ld2 {v0.4s, v1.4s}, [x2], x3
Opcode: 40 88 c3 4c
{
  #1 := mem[R2, el]:u32
  #3 := mem[R2 + 4, el]:u32
  #5 := #1.mem[R2 + 8, el]:u32
  #7 := #3.mem[R2 + 0xC, el]:u32
  #9 := #5.mem[R2 + 0x10, el]:u32
  #11 := #7.mem[R2 + 0x14, el]:u32
  #13 := #9.mem[R2 + 0x18, el]:u32
  #15 := #11.mem[R2 + 0x1C, el]:u32
  V0 := #13
  V1 := #15
  R2 := R2 + R3
}

Similar expansions apply to the rest of the LDn family.
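
The access pattern in the BIL above, where consecutive memory elements go to alternating registers, generalises to the other LDn variants. The Python sketch below computes only which byte offsets feed which register; it deliberately does not model how the loaded elements are packed into lanes. The function name and parameters are hypothetical.

```python
def ldn_offsets(nregs: int, lanes: int, elem_bytes: int):
    """For each destination register, list the byte offsets it is loaded from.

    Memory holds `lanes` structures of `nregs` elements each, so element
    `reg` of structure `lane` lives at (lane * nregs + reg) * elem_bytes.
    """
    return [
        [(lane * nregs + reg) * elem_bytes for lane in range(lanes)]
        for reg in range(nregs)
    ]


if __name__ == "__main__":
    # ld2 {v0.4s, v1.4s}: two registers, four 4-byte lanes each.
    for reg, offsets in enumerate(ldn_offsets(2, 4, 4)):
        print(f"V{reg}:", [hex(o) for o in offsets])
```

For the ld2 example, the first register reads offsets 0, 8, 0x10 and 0x18, and the second reads 4, 0xC, 0x14 and 0x1C, matching the addresses in the listing.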

Logical

ANDv*i*, EORv*i*, NOTv*i*, ORRv*i*, ORNv*i*

These are done on the whole register Vn.

Instruction: not v0.16b, v1.16b
Opcode: 20 58 20 6e
{
  V0 := ~V1
}

Misc. movement

INSvi32gpr, INSvi32lane

The implementation uses bitmasks and bit shifts to insert the vector elements, but could equivalently use extract and concat. Please advise if this is preferred.

Instruction: ins v0.s[1], v1.s[1]
Opcode: 20 24 0c 6e
{
  #1 := 63:32[V1]
  #5 := V0 & 0xFFFFFFFFFFFFFFFF00000000FFFFFFFF
  #6 := #5 | 0xFFFFFFFF00000000 & pad:128[#1] << 0x20
  V0 := #6
}
Instruction: ins v0.s[1], w1
Opcode: 20 1c 0c 4e
{
  #3 := V0 & 0xFFFFFFFFFFFFFFFF00000000FFFFFFFF
  #4 := #3 | 0xFFFFFFFF00000000 & pad:128[31:0[R1]] << 0x20
  V0 := #4
}
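
To illustrate the two options mentioned above, the following Python sketch implements the element insert both ways, with masks and shifts and with extract and concat, so that they can be compared. Both functions are standalone models with hypothetical names; neither is the PR's Primus Lisp code.

```python
MASK128 = (1 << 128) - 1


def ins_mask_shift(vd: int, elem: int, index: int, esize: int = 32) -> int:
    """Clear the target lane with a mask, then OR in the shifted element."""
    lane_mask = ((1 << esize) - 1) << (index * esize)
    return (vd & ~lane_mask & MASK128) | ((elem << (index * esize)) & lane_mask)


def ins_extract_concat(vd: int, elem: int, index: int, esize: int = 32) -> int:
    """Rebuild the vector as high-part . element . low-part."""
    lo_bits = index * esize
    low = vd & ((1 << lo_bits) - 1)            # bits below the lane
    high = (vd & MASK128) >> (lo_bits + esize)  # bits above the lane
    elem &= (1 << esize) - 1
    return (high << (lo_bits + esize)) | (elem << lo_bits) | low


if __name__ == "__main__":
    vd = 0x0123456789ABCDEFFEDCBA9876543210
    print(hex(ins_mask_shift(vd, 0xDEADBEEF, 1)))
    print(hex(ins_extract_concat(vd, 0xDEADBEEF, 1)))
```
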
MOVIv*i*, MOVIv*b_ns
Instruction: movi v0.4h, 0xAB
Opcode: 60 85 05 0f
{
  V0 := 0xAB00AB00AB00AB
}
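
The constant in the MOVI output above is the 8-bit immediate replicated into every lane. A minimal Python sketch of that replication follows; it covers only the unshifted case shown in the example, and the function name is illustrative.

```python
def movi_replicate(imm8: int, esize: int, lanes: int) -> int:
    """Place an 8-bit immediate in the low byte of each esize-bit lane."""
    out = 0
    for lane in range(lanes):
        out |= (imm8 & 0xFF) << (lane * esize)
    return out


if __name__ == "__main__":
    # movi v0.4h, 0xAB: four 16-bit lanes.
    print(hex(movi_replicate(0xAB, 16, 4)))
```
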
EXTv*i*

This is implemented exactly as the ISA describes it: the two source registers are concatenated, and the result is extracted from the concatenation.

Instruction: ext v0.16b, v1.16b, v2.16b, 3
Opcode: 20 18 02 6e
{
  V0 := 151:24[V2.V1]
}
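
The extract-after-concat approach for EXT can be checked with a byte-level Python model. This sketch covers the 16b form from the example; the function name is hypothetical.

```python
MASK128 = (1 << 128) - 1


def ext_16b(vn: int, vm: int, index: int) -> int:
    """EXT Vd.16B, Vn.16B, Vm.16B, #index: 16 bytes of vm.vn starting at byte index."""
    if not 0 <= index < 16:
        raise ValueError("index must be in the range 0..15")
    pair = ((vm & MASK128) << 128) | (vn & MASK128)
    return (pair >> (8 * index)) & MASK128


if __name__ == "__main__":
    vn = int.from_bytes(bytes(range(0, 16)), "little")   # bytes 0x00..0x0F
    vm = int.from_bytes(bytes(range(16, 32)), "little")  # bytes 0x10..0x1F
    print(ext_16b(vn, vm, 3).to_bytes(16, "little").hex())
```

With index 3, the shift is 24 bits and the result spans bits 151:24 of the concatenation, as in the BIL above.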

Store

Most of these have nearly identical implementations to the non-SIMD STP variants.

STR.ro*, STR.pre, STR.post, STR.ui

For STR.ro*:

Instruction: str q0, [x1, x2]
Opcode: 20 68 a2 3c
{
  mem := mem with [R1 + R2, el]:u128 <- V0
}

For STR.post (pre is similar):

Instruction: str q0, [x1], 123
Opcode: 20 b4 87 3c
{
  mem := mem with [R1, el]:u128 <- V0
  R1 := R1 + 0x7B
}

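For STR.ui (note that 0xAB is not a multiple of 16, the scaling that the unsigned-immediate form requires for a Q register, so the opcode below is the STUR encoding and matches the STUR.i example):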

Instruction: str q0, [x1, 0xAB]
Opcode: 20 b0 8a 3c
{
  mem := mem with [R1 + 0xAB, el]:u128 <- V0
}
STP.pre, STP.post, STP.i
Instruction: stp q0, q1, [x2], #16
Opcode: 40 84 80 ac
{
  #3 := R2
  mem := mem with [#3, el]:u128 <- V0
  mem := mem with [#3 + 0x10, el]:u128 <- V1
  R2 := #3 + 0x10
}
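
The post-indexed pair store above (capture the base, write both registers little-endian, then update the base) can be sketched with a byte-addressed dictionary as memory. This Python model is illustrative only: the register files are plain dictionaries, the names are hypothetical, and 64-bit address wrap-around is ignored.

```python
def store128(mem: dict, addr: int, value: int) -> None:
    """Write a 128-bit value to byte-addressed memory in little-endian order."""
    for i, byte in enumerate(value.to_bytes(16, "little")):
        mem[addr + i] = byte


def stp_q_post(mem: dict, x: dict, v: dict, rt1: int, rt2: int, rn: int, imm: int) -> None:
    """stp q<rt1>, q<rt2>, [x<rn>], #imm: store at the old base, then bump it."""
    base = x[rn]                 # the base is read once, before any update
    store128(mem, base, v[rt1])
    store128(mem, base + 16, v[rt2])
    x[rn] = base + imm           # post-index writeback


if __name__ == "__main__":
    mem, x, v = {}, {2: 0x1000}, {0: 0x11, 1: 0x22}
    stp_q_post(mem, x, v, rt1=0, rt2=1, rn=2, imm=16)
    print(hex(x[2]), len(mem))
```
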
STUR.i
Instruction: stur q0, [x1, 0xAB]
Opcode: 20 b0 8a 3c
{
  mem := mem with [R1 + 0xAB, el]:u128 <- V0
}