We follow the model of the A32/A64 backends of a simple expansion of Cwerg instruction to x86-64 instruction.
Since x86-64 is essentially a two address ISA some legalization work is necessary to convert from Cwerg's three address code.
Limitations:
- avoid 3dnow, mmx or x87 instructions
- floating point handled via SSE xmmX registers
- expect certain instruction set extensions: SSE, lzcnt, tzcnt, popcnt, ...
Eventually, the intention is to target CPUs Haswell or later.
In terms of gcc compiler flags this correspods to: -march=haswell
Code generation goes through the following phases which massage the IR until it becomes almost trivial to generate X64 code from it.
This phase rewrites the IR so the all remaining IR instructions are covered by the instruction selector.
The major differences to the A32 and A64 backends are:
- no operand widening is performed as the code selector supports all widths
- X86-64 is resembling 2 address code rather than Cwerg's 3 addresss code requiring a special processing described further down.
- spilling of regs does not result in explicit IR rewrites. The instruction selector is aware of whether a regs is spilled or not.
Legalization encompasses the following steps:
- replacement of
push
/pop
instruction withmov
s to and from CPU registers as required by the calling convention. - eliminate addressing modes not supported by ISA, e.g.
- convert
ld.stk
/st.stk
/lea.stk
with reg offset tolea.stk
+ld
/st
/lea
- convert
ld.mem
/st.mem
to
lea.mem
+ld
/st
- convert
- rewriting of opcodes not directly supported by ISA or the instruction selector, e.g.
copysign
,cmpXX
- other special rewriting for
div
,mod
and shifts. Note that this may introduce uses of CPU regsrcx
andrdx
. - deal with out of range immediates
- 3-address code to 2-address code conversion
- TODO: move parameters that cannot be handled by calling convention onto stk
Most x86 instructions for binary ops come in the following variants:
-
ins-x86
mem_or_regA
regB
(MR)(3 op equivalent:
ins-x86
mem_or_regA
mem_or_regA
regB
) -
ins-x86
regA
mem_or_regB
(RM)(3 op equivalent
ins-x86
regA
regA
mem_or_regB
) -
ins-x86
mem_or_regA
imm
(MI)(3 op equivalent
ins-x86
mem_or_regA
mem_or_regA
imm
)
Because memory operands can be directly encoded, spilled registers do not
always need to be explicitly rewritten before instruction selection.
However, we will need to reserve registers for some of the rewrites - currently rax
and xmm0
.
However we do need to rewrite the three address Cwerg instructions to match one of the two x86 patterns from below:
regA
regA
imm
regA
regA
regB
(orregA
regA
regA
)
This is accomplished by first rewriting instructions so that there is at most one immediate which must be in the last slot and then applying the following expansions:
-
<ins>
regA
regB
regC
=> rewritemov
regA
regB
;<ins>
regA
regA
regC
-
<ins>
regA
regB
imm
=> rewritemov
regA
regB
;<ins>
regA
regA
imm
-
<ins>
regA
regB
regA
=> rewrite (with tmp reg)mov
tmp
regB
;<ins>
tmp
tmp
regA
;mov
regA
tmp
(For commutative ops a simpler rewrite is to just swap the last two operands.)
-
<ins>
regA
regB
regB
=> rewritemov
regA
regB
;<ins>
regA
regB
This phase assigns CPU registers to all global IR registers, so that afterwards all global IR registers have either been assigned a CPU register or have been spilled.
Unlike the A32/A64 backends, the spilling is done implicitly by using StackSlot
objects
in lieu of CPU registers. The reason for this is that most X64-64 instruction support one memory operand which can be uses to access the stack.
This implies that the code selector distinguishes between spilled and regular
registers.
This phase will assign CPU register to all the remaining (local) registers
- compute offsets of stack objects
- add
nop1
for those case where we likely need a scratch register - Allocate registers for local registers.
- Eliminate
mov
s where CPU registers of thesrc
anddst
operands are identical
Just like with Global Register Allocation, the registers are spilled by assigning them a `StackSlot.
After register allocation a Cwerg register, regX
, will either be assigned
to an x86-64 register, in which case we keep referring to it as regX
,
or it will be spilled, in which case we refer to it as spillX
.
The patterns below show how to translate Cwerg three address instruction to x86-64 instruction variants.
-
<ins>
regA
regA
imm
=> direct expansion (MI) -
<ins>
spillA
spillA
imm
=> direct expansion (MI) -
<ins>
regA
regA
regA
=> direct expansion (RM) or (MR) -
<ins>
regA
regA
regB
=> direct expansion (RM) or (MR) -
<ins>
regA
regA
spillB
=> direct expansion (RM) -
<ins>
spillA
spillA
regB
=> direct expansion (MR) -
<ins>
spillA
spillA
spillB
=> complex expansion (with tmp reg)mov-x86
tmp
spillB
;ins-x86
spillA
tmp
(MR)
This is all straight forward except for the case where all operands have been spilled. To obtain the tmp scratch register we may need to reserve one general and one floating point register.
Note, the situation where two operands are spilled also occurs for mov
and unary
instructions.
For floating point operations only the RM variant is available. So we get the following expansions (omitting the cases involving immediates):
-
<ins>
regA
regA
regA
=> direct expansion (RM) -
<ins>
regA
regA
regB
=> direct expansion (RM) -
<ins>
regA
regA
spillB
=> direct expansion (RM) -
<ins>
spillA
spillA
regB
=> complex expansion (with tmp reg)mov-x86
tmp
spillA
;ins-x86
tmp
regB
;mov-x86
spillA
tmp
(RM) -
<ins>
spillA
spillA
spillB
=> complex expansion (with tmp reg)mov-x86
tmp
spillA
;ins-x86
tmp
spillB
;mov-x86
spillA
tmp
(RM)
scratch: rdi
rsi
rdx
rcx
r8
r9
r10
r11
rax
xmm0-xmm7
Callee save: rbx
rbp
r12-15
rsp
xmm8-xmm15
params-in: rdi
rsi
rdx
rcx
r8
r9
params-out: rax
rdx
syscalls: rdi
rsi
rdx
r10
r8
r9
-> rax
(note: r10
takes the place of rcx
)
-
instruction selection is somewhat simplified as most immediates can be encoded directly without size restrictions in particular memory offsets can be up to 32 bit.
-
Once we have rewritten instructions to AAB form we have to be careful with subsequent optimizations
-
some instructions (div/mul/shl/shr/asr) use implicit registers we handle these by adding move instructions with fixed CpuRegs during legalization, e.g.
div a b c
becomes
mov RAX@rax b div RDX@rdx RAX@rax c # hack: should really be div RAX@rax RAX@rax c mov a RAX
the hack is working in conjunction with instruction selection.
-
Spilled registers are handled directly by instruction selection so no rewrite as in the a64/a64 backends is necessary
-
Instruction selection is handling all data kinds (DK) so no widening step as in the a32/a64 backends is necessary. On the flip-side, the instruction selection tables are rather large.