Description
Loop unrolling is a technique intended to speed up loops. It is supported by other mature compilers such as Clang.
This proposal presents possible implementation ideas: general loop unrolling rules, how they could apply to the Go compiler, and some simple benchmarks showing the effect of the optimization on simple constant-range loops.
It is easiest to begin with simple constant-range for loops such as:
for i := 0; i < 1000; i++ {
	a[i] += 2
}
More features can then be added to the unroll package, which would implement the loop unrolling optimization pass.
Unroll package implementation ideas
The following approach could readily be integrated into the Go optimization pipeline, right after the inlining stage.
// Inlining
base.Timer.Start("fe", "inlining")
if base.Flag.LowerL != 0 {
	inline.InlinePackage()
}
noder.MakeWrappers(typecheck.Target) // must happen after inlining
// Unrolling
unroll.UnrollPackage()
UnrollPackage() traverses each function to find for loops, checks whether each loop is suitable for unrolling, and if so unrolls it by calling the Unroll() function.
// Unroll takes 2 parameters:
// forstmt - a suitable for loop,
// unroll - the unrolling factor (the number of times the body will be repeated).
// It unrolls the loop in place and returns a tail which should be placed right after the loop.
// The tail is generated if the loop's iteration count isn't divisible by the unrolling factor.
func Unroll(forstmt *ir.ForStmt, unroll uint32) (tail ir.Nodes) {
	....
}
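As an illustration of the "find for loops and check suitability" part of UnrollPackage(), here is a minimal sketch written against the public go/ast package rather than the compiler's internal ir package. The helper names (isCandidate, countCandidates) are hypothetical, and the check only recognizes the shape `for i := <int literal>; i < <int literal>; i++`. It does not verify that the body leaves the induction variable alone, which a real pass would need to do.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// isIntLit reports whether e is an integer literal such as 1000.
func isIntLit(e ast.Expr) bool {
	lit, ok := e.(*ast.BasicLit)
	return ok && lit.Kind == token.INT
}

// isIdent reports whether e is the identifier called name.
func isIdent(e ast.Expr, name string) bool {
	id, ok := e.(*ast.Ident)
	return ok && id.Name == name
}

// isCandidate (hypothetical helper) accepts only loops of the form
// `for i := <int literal>; i < <int literal>; i++`.
func isCandidate(loop *ast.ForStmt) bool {
	init, ok := loop.Init.(*ast.AssignStmt)
	if !ok || init.Tok != token.DEFINE || len(init.Lhs) != 1 || len(init.Rhs) != 1 {
		return false
	}
	iv, ok := init.Lhs[0].(*ast.Ident)
	if !ok || !isIntLit(init.Rhs[0]) {
		return false
	}
	cond, ok := loop.Cond.(*ast.BinaryExpr)
	if !ok || cond.Op != token.LSS || !isIdent(cond.X, iv.Name) || !isIntLit(cond.Y) {
		return false
	}
	post, ok := loop.Post.(*ast.IncDecStmt)
	return ok && post.Tok == token.INC && isIdent(post.X, iv.Name)
}

// countCandidates parses src and counts the loops accepted by isCandidate.
func countCandidates(src string) (int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		return 0, err
	}
	n := 0
	ast.Inspect(file, func(node ast.Node) bool {
		if loop, ok := node.(*ast.ForStmt); ok && isCandidate(loop) {
			n++
		}
		return true
	})
	return n, nil
}

// example contains one constant-range loop and one loop bounded by a variable.
const example = `package p

func f(a *[1000]int, n int) {
	for i := 0; i < 1000; i++ {
		a[i] += 2
	}
	for i := 0; i < n; i++ {
		a[i]++
	}
}
`

func main() {
	n, err := countCandidates(example)
	fmt.Println(n, err) // prints: 1 <nil>
}
```

Only the first loop in the example is reported, because the second one is bounded by a variable rather than a constant.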
It's important to calculate the unrolling factor correctly. If it's too big, the unrolled loop body may no longer fit in the instruction cache. One way to pick the factor is to reuse part of the inlining stage, namely hairyVisitor, since a lot of work has already gone into choosing node weights there.
If A is the maximum weight of a for loop that is still short enough to stay in cache, and B is the weight of the loop body as calculated by an extended version of hairyVisitor, then the unrolling factor is A / B. If it's greater than 1, loop unrolling is beneficial for that loop.
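The A / B rule can be sketched as a small function. The name unrollFactor and the weights used in main are made up for illustration. Real weights would come from the extended hairyVisitor, and the value of A would have to be tuned.

```go
package main

import "fmt"

// unrollFactor (hypothetical helper) returns maxWeight / bodyWeight, i.e. A / B.
// A result of 1 means "do not unroll".
func unrollFactor(maxWeight, bodyWeight int) int {
	if maxWeight <= 0 || bodyWeight <= 0 {
		return 1 // nothing sensible to compute
	}
	if f := maxWeight / bodyWeight; f > 1 {
		return f
	}
	return 1
}

func main() {
	fmt.Println(unrollFactor(80, 20)) // 4: four copies of the body still fit
	fmt.Println(unrollFactor(80, 50)) // 1: a second copy would exceed the budget
}
```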
Unroll function implementation ideas
Once the unroll factor is picked, a for loop can be unrolled in 4 steps. Once again, we're dealing with constant values here:
- Align the condition so the loop does not go out of bounds (i < 1000 -> i < 1000 / unroll * unroll)
if forstmt.Cond != nil {
	cmp := forstmt.Cond.(*ir.BinaryExpr)
	val := ir.ConstValue(cmp.Y).(int64)
	// Round the bound down to a multiple of the unrolling factor.
	alignedval := val / int64(unroll) * int64(unroll)
	newval := ir.NewConstExpr(constant.MakeInt64(alignedval), cmp.Y)
	cmp.Y = newval
}
- Modify the post statement so the induction variable advances unroll steps at a time (i++ -> i += unroll)
// Unroll post
if forstmt.Post != nil {
	post := forstmt.Post.(*ir.AssignOpStmt)
	inc := ir.ConstValue(post.Y).(int64)
	unrolledinc := inc * int64(unroll)
	newinc := ir.NewConstExpr(constant.MakeInt64(unrolledinc), post.Y)
	post.Y = newinc
}
- Modify the body.
This step is a little more complex, since just copying the body isn't enough. Suppose unroll is 4 and the loop is:
for i := 0 ; i < 100 ; i++ {
	sum += i
}
Just copying the body 4 times isn't enough, since it gives us:
for i := 0 ; i < 100 / 4 * 4 ; i+=4 {
	sum += i
	sum += i
	sum += i
	sum += i
}
The correct version is:
for i := 0 ; i < 100 / 4 * 4 ; i+=4 {
	sum += i
	sum += i + 1
	sum += i + 2
	sum += i + 3
}
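The difference between the two versions above can be checked directly at the source level. This is only a hand-written comparison (the function names are illustrative), not output of the proposed pass.

```go
package main

import "fmt"

// rolled is the original loop.
func rolled() int {
	sum := 0
	for i := 0; i < 100; i++ {
		sum += i
	}
	return sum
}

// naive copies the body 4 times without shifting the induction variable.
func naive() int {
	sum := 0
	for i := 0; i < 100/4*4; i += 4 {
		sum += i
		sum += i
		sum += i
		sum += i
	}
	return sum
}

// shifted copies the body 4 times and shifts i in each copy.
func shifted() int {
	sum := 0
	for i := 0; i < 100/4*4; i += 4 {
		sum += i
		sum += i + 1
		sum += i + 2
		sum += i + 3
	}
	return sum
}

func main() {
	fmt.Println(rolled(), naive(), shifted()) // prints: 4950 4800 4950
}
```

The naive copy adds only multiples of 4, each four times, so it ends up at 4800 instead of 4950.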
First, we must find all induction variables; then, after copying the body, shift the references to them in each copy by the appropriate offset.
Keeping that in mind, body unrolling can be implemented in the following way:
// First, keep a copy of the original body
bodycopy := ir.DeepCopyList(base.Pos, forstmt.Body)
// Copy the body unroll - 1 times, apply shifting and append each copy to the body
for unr := uint64(1); unr < uint64(unroll); unr++ {
	appendbody := ir.DeepCopyList(base.Pos, bodycopy)
	// i is a loop induction variable
	// Note: there could be multiple induction variables for a loop.
	shiftNodes(appendbody, i, uint64(inc)*unr)
	forstmt.Body.Append(appendbody...)
}
// shiftNodes function takes 3 parameters:
// nodes - a list of nodes,
// orig - an original node that's going to be shifted,
// shift - a shift constant.
// It generates a new node which represents an expression orig + shift and
// changes every orig reference to the new expression.
func shiftNodes(nodes ir.Nodes, orig ir.Node, shift uint64) {
	idx := ir.NewConstExpr(constant.MakeUint64(shift), orig)
	idx.SetType(types.Types[types.TUINTPTR])
	idx.SetTypecheck(1)
	shifted := ir.NewBinaryExpr(base.Pos, ir.OADD, orig, idx)
	shifted.SetType(orig.Type())
	shifted.SetTypecheck(orig.Typecheck())
	var edit func(ir.Node) ir.Node
	edit = func(x ir.Node) ir.Node {
		ir.EditChildren(x, edit)
		if x == orig {
			return shifted
		}
		return x
	}
	for _, node := range nodes {
		edit(node)
	}
}
- Generate the tail.
Suppose we have a loop:
for i := 0 ; i < 101 ; i++ {
	sum += i
}
And the unroll is 3. Then it should be unrolled to this:
for i := 0 ; i < 101 / 3 * 3 ; i+=3 {
	sum += i
	sum += i + 1
	sum += i + 2
}
Since 101 isn't divisible by 3, there are 101 % 3 = 2 operations that haven't been performed yet:
sum += 99
sum += 100
These statements should also be generated and placed after the for loop.
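// val is the original loop bound; the tail covers the iterations past the aligned bound.
for idx := val / int64(unroll) * int64(unroll); idx < val; idx++ {
	appendtail := ir.DeepCopyList(base.Pos, bodycopy)
	placeConst(appendtail, i, idx)
	tail.Append(appendtail...)
}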
// placeConst function takes 3 parameters:
// nodes - a list of nodes,
// orig - an original node that's going to be replaced by a constant,
// con - a constant itself.
// It generates a new constant and changes every orig reference to the new constant node.
func placeConst(nodes ir.Nodes, orig ir.Node, con int64) {
	i := ir.NewConstExpr(constant.MakeInt64(con), orig)
	i.SetType(types.Types[types.TINT64])
	i.SetTypecheck(1)
	var edit func(ir.Node) ir.Node
	edit = func(x ir.Node) ir.Node {
		ir.EditChildren(x, edit)
		if x == orig {
			return i
		}
		return x
	}
	for _, node := range nodes {
		edit(node)
	}
}
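Taken together, the four steps can be modelled at the source level to check that the unrolled loop plus its tail visits exactly the same indices as the original loop. The sketch below is not compiler code: runUnrolled is a hypothetical helper, its inner loop stands in for the repeated body copies, and its second loop stands in for the straight-line tail that the pass would emit.

```go
package main

import "fmt"

// runUnrolled (hypothetical helper) calls body(i) for i = 0..n-1 in the
// order an unrolled loop with the given factor would.
func runUnrolled(n, unroll int, body func(i int)) {
	if unroll < 1 {
		unroll = 1
	}
	aligned := n / unroll * unroll // step 1: aligned bound
	for i := 0; i < aligned; i += unroll { // step 2: i += unroll
		for k := 0; k < unroll; k++ { // step 3: body copies shifted by k
			body(i + k)
		}
	}
	for i := aligned; i < n; i++ { // step 4: tail with the remaining n % unroll indices
		body(i)
	}
}

func main() {
	sum, calls := 0, 0
	runUnrolled(101, 3, func(i int) {
		sum += i
		calls++
	})
	fmt.Println(sum, calls) // prints: 5050 101
}
```

With n = 101 and unroll = 3 the aligned bound is 99, so the main loop covers indices 0 to 98 and the tail covers 99 and 100, matching the example above.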
Results
The following lines of code:
var a [100]int
for i := 0 ; i < 100 - 1; i++ {
	a[i] += 2
}
are currently compiled to:
 1053eb3:	31 c0                	xor    eax,eax
 1053eb5:	eb 09                	jmp    1053ec0 <_main.main+0x60>
 1053eb7:	48 83 44 c4 18 02    	add    QWORD PTR [rsp+rax*8+0x18],0x2
 1053ebd:	48 ff c0             	inc    rax
 1053ec0:	48 83 f8 63          	cmp    rax,0x63
 1053ec4:	7c f1                	jl     1053eb7 <_main.main+0x57>
After applying a very basic version of loop unrolling with the above approach, they are compiled to:
 1053eb3:	31 c0                	xor    eax,eax
 1053eb5:	eb 1c                	jmp    1053ed3 <_main.main+0x73>
 1053eb7:	48 83 44 c4 18 02    	add    QWORD PTR [rsp+rax*8+0x18],0x2
 1053ebd:	48 83 44 c4 20 02    	add    QWORD PTR [rsp+rax*8+0x20],0x2
 1053ec3:	48 83 44 c4 28 02    	add    QWORD PTR [rsp+rax*8+0x28],0x2
 1053ec9:	48 83 44 c4 30 02    	add    QWORD PTR [rsp+rax*8+0x30],0x2
 1053ecf:	48 83 c0 04          	add    rax,0x4
 1053ed3:	48 83 f8 60          	cmp    rax,0x60
 1053ed7:	7c de                	jl     1053eb7 <_main.main+0x57>
 1053ed9:	48 83 84 24 18 03 00 	add    QWORD PTR [rsp+0x318],0x2
 1053ee0:	00 02 
 1053ee2:	48 83 84 24 20 03 00 	add    QWORD PTR [rsp+0x320],0x2
 1053ee9:	00 02 
 1053eeb:	48 83 84 24 28 03 00 	add    QWORD PTR [rsp+0x328],0x2
The body is repeated 4 times with shifted indices. The tail, which handles indices 96, 97 and 98, is placed after the loop.
Benchmarks
func BenchmarkUnrolling(b *testing.B) {
	var a [100]int
	for j := 0 ; j < b.N ; j++ {
		for i := 0 ; i < 100 - 1; i++ {
			a[i] += 2
		}
	}
}
name          old time/op  new time/op  delta
Unrolling-12  51.7ns ± 0%  23.9ns ± 3%  -53.71%  (p=0.016 n=4+5)
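For readers without a patched compiler, the shape of the transformation can be approximated by unrolling the benchmark loop by hand. This is only a sketch: the function names are illustrative, it is not how the numbers above were produced, and timings from it will vary with the machine and with whatever other optimizations the compiler applies.

```go
package main

import (
	"fmt"
	"testing"
)

// addRolled is the loop from the benchmark above.
func addRolled(a *[100]int) {
	for i := 0; i < 100-1; i++ {
		a[i] += 2
	}
}

// addUnrolled is the same loop unrolled by hand with factor 4:
// the aligned bound is 99 / 4 * 4 = 96 and the tail covers 96, 97 and 98.
func addUnrolled(a *[100]int) {
	for i := 0; i < 96; i += 4 {
		a[i] += 2
		a[i+1] += 2
		a[i+2] += 2
		a[i+3] += 2
	}
	a[96] += 2
	a[97] += 2
	a[98] += 2
}

func main() {
	testing.Init() // needed when calling testing.Benchmark outside "go test"
	var a [100]int
	rolled := testing.Benchmark(func(b *testing.B) {
		for j := 0; j < b.N; j++ {
			addRolled(&a)
		}
	})
	unrolled := testing.Benchmark(func(b *testing.B) {
		for j := 0; j < b.N; j++ {
			addUnrolled(&a)
		}
	})
	fmt.Println("rolled:  ", rolled)
	fmt.Println("unrolled:", unrolled)
}
```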