Description
Loop unrolling is a tecnique intended to speed up loops. It's supported by other mature compilers such as Clang.
This proposal consists of possible implementation ideas: general loop unrolling rules and how they can apply to Golang compiler, and some simple benchmarks reflecting this optimization performance on simple constant range loops.
It's easy to begin with a simple constant range for loops such as:
for i := 0 ; i < 1000 ; i++ {
a[i] += 2;
}
And then add more features inside unroll package which will represent a loop unrolling optimization pass.
Unroll package implementation ideas
The following approach could already be easily integrated inside Golang optimization pipeline right after the inlining stage.
// Inlining
base.Timer.Start("fe", "inlining")
if base.Flag.LowerL != 0 {
inline.InlinePackage()
}
noder.MakeWrappers(typecheck.Target) // must happen after inlining
// Unrolling
unroll.UnrollPackage()
UnrollPackage() will traverse each function to find for loops and check if it's appropriate for unrolling, then perform unrolling by calling Unroll() function if so.
// Unroll function takes 2 parameters:
// forstmt - an appropriate for loop,
// unroll - an unrolling factor (the amount of times the body will be repeated).
// It unrolls it in-place and returns a tail which should be placed right after the loop.
// The tail is generated if for range isn't divisible by the unrolling factor.
func Unroll(forstmt *ir.ForStmt, unroll uint32) (tail ir.Nodes) {
....
}
It's important to calculate the unrolling factor correctly. If it's too big we can run into a problem when a for loop body exceeds the instruction cache. A possible idea of picking the factor is by reusing the part of the inlining stage, that is, hairyVisitor since there was already a lot of work done for choosing the weights of the nodes.
if A = maximum weight of the for loop which is short enough to be kept in cache, and B = the weight of a for loop body calculated by the extended version of hairyVisitor, then the unrolling factor = A / B. If it's greater than 1 then loop unrolling is beneficial for that loop.
Unroll function implementation ideas
When unroll variable is picked a for loop can be unrolled in 4 steps. Once again, we're dealing with constant values here:
- Align condition so the loop does not go out of the boundaries (i < 1000 -> i < 1000 / unroll * unroll)
if forstmt.Cond != nil {
cmp := forstmt.Cond.(*ir.BinaryExpr)
val := ir.ConstValue(cmp.Y).(int64)
alignedval = val / int64(unroll) * int64(unroll)
newval := ir.NewConstExpr(constant.MakeInt64(alignedval), cmp.Y)
cmp.Y = newval
}
- Modify post expression so the induction variable goes unroll steps at a time (i++ -> i += unroll)
// Unroll post
if forstmt.Post != nil {
post := forstmt.Post.(*ir.AssignOpStmt)
inc := ir.ConstValue(post.Y).(int64)
unrolledinc := inc * int64(unroll)
newinc := ir.NewConstExpr(constant.MakeInt64(unrolledinc), post.Y)
post.Y = newinc
}
- Modify body.
This step is a little bit more complex since to only copy the body isn't enough. Suppose unroll is 4 and the body of the loop is:
for i := 0 ; i < 100 ; i++ {
sum += i
}
Just coping the body 4 times isn't enough since it gives us:
for i := 0 ; i < 100 / 4 * 4 ; i+=4 {
sum += i
sum += i
sum += i
sum += i
}
The correct version is:
for i := 0 ; i < 100 / 4 * 4 ; i+=4 {
sum += i
sum += i + 1
sum += i + 2
sum += i + 3
}
Firstly, we must find all induction variables and then after coping the body, apply shifting operation each time.
Keeping that in mind, body unrolling can be implemented in the following way:
// Firstly, copy original version of a body
bodycopy := ir.DeepCopyList(base.Pos, forstmt.Body)
// Copy the body unroll - 1 times, apply shifting and insert it in the body
for unr := uint64(1); unr < unroll; unr++ {
appendbody := ir.DeepCopyList(base.Pos, bodycopy)
// i is a loop induction variable
// Note: there could be multiple induction variables for a loop.
shiftNodes(appendbody, i, uint64(inc)*unr)
forstmt.Body.Append(appendbody...)
}
// shiftNodes function takes 3 parameters:
// nodes - a list of nodes,
// orig - an original node that's going to be shifted,
// shift - a shift constant.
// It generates a new node which represents an expression orig + shift and
// changes every orig reference to the new expression.
func shiftNodes(nodes ir.Nodes, orig ir.Node, shift uint64) {
idx := ir.NewConstExpr(constant.MakeUint64(shift), orig)
idx.SetType(types.Types[types.TUINTPTR])
idx.SetTypecheck(1)
shifted := ir.NewBinaryExpr(base.Pos, ir.OADD, orig, idx)
shifted.SetType(orig.Type())
shifted.SetTypecheck(orig.Typecheck())
var edit func(ir.Node) ir.Node
edit = func(x ir.Node) ir.Node {
ir.EditChildren(x, edit)
if x == orig {
return shifted
}
return x
}
for _, node := range nodes {
edit(node)
}
}
Suppose we have a loop:
for i := 0 ; i < 101 ; i++ {
sum += i
}
And the unroll is 3. Than it should be unrolled to this:
for i := 0 ; i < 101 / 3 * 3 ; i+=3 {
sum += i
sum += i + 1
sum += i + 2
}
Since 101 isn't divisible by 3 there are 101 % 3 operations that hasn't been performed yet:
sum += 99
sum += 100
This should also be generated and placed after the for loop.
for idx := val / unroll * unroll; idx < val; idx++ {
appendtail := ir.DeepCopyList(base.Pos, bodycopy)
placeConst(appendtail, i, idx)
tail.Append(appendtail...)
}
// placeConst function takes 3 parameters:
// nodes - a list of nodes,
// orig - an original node that's going to be replaced by a constant,
// con - a constant itself.
// It generates a new constant and changes every orig reference to the new contant node.
func placeConst(nodes ir.Nodes, orig ir.Node, con int64) {
i := ir.NewConstExpr(constant.MakeInt64(con), orig)
i.SetType(types.Types[types.TINT64])
i.SetTypecheck(1)
var edit func(ir.Node) ir.Node
edit = func(x ir.Node) ir.Node {
ir.EditChildren(x, edit)
if x == orig {
return i
}
return x
}
for _, node := range nodes {
edit(node)
}
}
Results
The following lines of code:
var a [100]int
for i := 0 ; i < 100 - 1; i++ {
a[i] += 2
}
currently are compiled to:
1053eb3: 31 c0 xor eax,eax
1053eb5: eb 09 jmp 1053ec0 <_main.main+0x60>
1053eb7: 48 83 44 c4 18 02 add QWORD PTR [rsp+rax*8+0x18],0x2
1053ebd: 48 ff c0 inc rax
1053ec0: 48 83 f8 63 cmp rax,0x63
1053ec4: 7c f1 jl 1053eb7 <_main.main+0x57>
After applying a very basic version of loop unrolling with the above approach those are compiled to:
1053eb3: 31 c0 xor eax,eax
1053eb5: eb 1c jmp 1053ed3 <_main.main+0x73>
1053eb7: 48 83 44 c4 18 02 add QWORD PTR [rsp+rax*8+0x18],0x2
1053ebd: 48 83 44 c4 20 02 add QWORD PTR [rsp+rax*8+0x20],0x2
1053ec3: 48 83 44 c4 28 02 add QWORD PTR [rsp+rax*8+0x28],0x2
1053ec9: 48 83 44 c4 30 02 add QWORD PTR [rsp+rax*8+0x30],0x2
1053ecf: 48 83 c0 04 add rax,0x4
1053ed3: 48 83 f8 60 cmp rax,0x60
1053ed7: 7c de jl 1053eb7 <_main.main+0x57>
1053ed9: 48 83 84 24 18 03 00 add QWORD PTR [rsp+0x318],0x2
1053ee0: 00 02
1053ee2: 48 83 84 24 20 03 00 add QWORD PTR [rsp+0x320],0x2
1053ee9: 00 02
1053eeb: 48 83 84 24 28 03 00 add QWORD PTR [rsp+0x328],0x2
Body is repeated 4 times with the shifted indices. Tail is placed after the loop that handles 96th, 97th and 98th indices.
Benchmarks
func BenchmarkUnrolling(b *testing.B) {
var a [100]int
for j := 0 ; j < b.N ; j++ {
for i := 0 ; i < 100 - 1; i++ {
a[i] += 2
}
}
}
name old time/op new time/op delta
Unrolling-12 51.7ns ± 0% 23.9ns ± 3% -53.71% (p=0.016 n=4+5)