Description
Extending a register load or store currently generates suboptimal code with the SSA backend. In some cases, this is a regression from the old backend. For example:
func load8(i uint8) uint64 { return uint64(i) }
Generates:
"".load8 t=1 size=16 args=0x10 locals=0x0
0x0000 00000 (extend.go:3) TEXT "".load8(SB), $0-16
0x0000 00000 (extend.go:3) FUNCDATA $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
0x0000 00000 (extend.go:3) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (extend.go:3) MOVBLZX "".i+8(FP), AX
0x0005 00005 (extend.go:3) MOVBQZX AL, AX
0x0008 00008 (extend.go:3) MOVQ AX, "".~r1+16(FP)
0x000d 00013 (extend.go:3) RET
The old backend generates:
"".load8 t=1 size=16 args=0x10 locals=0x0
0x0000 00000 (extend.go:3) TEXT "".load8(SB), $0-16
0x0000 00000 (extend.go:3) FUNCDATA $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
0x0000 00000 (extend.go:3) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (extend.go:3) MOVBQZX "".i+8(FP), BX
0x0005 00005 (extend.go:3) MOVQ BX, "".~r1+16(FP)
0x000a 00010 (extend.go:3) RET
I tried fixing this in CL 21838, but the fix was partial and probably not in the right place.
It's hard to do this as part of the arch-specific rewrite rules, because (a) you lose the type-extension work that the SSA conversion did for you and have to recreate it later, and (b) you don't know where all the register loads and stores will be, because regalloc hasn't run yet.
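For reference, the memory half of the problem can in principle be expressed as a lowered rewrite rule that folds the extension into the load; roughly, in the style of the AMD64 .rules files (hypothetical sketch, not an actual rule from the tree):

```
(MOVBQZX x:(MOVBload [off] {sym} ptr mem)) && x.Uses == 1 -> @x.Block (MOVBQZXload [off] {sym} ptr mem)
```

But a rule like this can't touch the register-register extensions that only appear once regalloc has inserted spills and reloads, which is point (b) above.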
It's also hard to do this as part of converting final SSA to instructions (genvalue), since that code is geared to handle one value at a time, in isolation.
Teaching regalloc to combine these MOVs seems arch-specific and would further complicate already complicated machinery. Maybe the thing to do is to add an arch-specific rewrite pass after regalloc ("peep"?), using hand-written rewrite rules. Input requested.
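To make the "peep" proposal concrete, here is a minimal sketch of such a post-regalloc pass, over a toy instruction type rather than the compiler's real Prog/Value machinery (all names here are made up for illustration):

```go
package main

import "fmt"

// Inst is a toy model of a lowered instruction; a real pass would
// work on obj.Prog (or ssa.Value) instead.
type Inst struct {
	Op       string // e.g. "MOVBLZX", "MOVBQZX"
	Src, Dst string
}

// zeroExtWidth reports from how many low bits an op's result is known
// to be zero-extended (0 if unknown).
func zeroExtWidth(op string) int {
	switch op {
	case "MOVBLZX", "MOVBQZX":
		return 8
	case "MOVWLZX", "MOVWQZX":
		return 16
	case "MOVL": // 32-bit moves zero the upper 32 bits on amd64
		return 32
	}
	return 0
}

// reg64 maps a sub-register name to its full-width parent, for the
// few registers this example needs.
func reg64(r string) string {
	switch r {
	case "AL", "EAX":
		return "AX"
	}
	return r
}

// peep drops a zero-extension whose input was just produced by an op
// that already zero-extends from at least as few bits.
func peep(prog []Inst) []Inst {
	var out []Inst
	for _, in := range prog {
		if n := len(out); n > 0 {
			prev := out[n-1]
			w, pw := zeroExtWidth(in.Op), zeroExtWidth(prev.Op)
			if w > 0 && pw > 0 && pw <= w &&
				reg64(in.Src) == reg64(prev.Dst) && in.Dst == prev.Dst {
				continue // redundant extension: drop it
			}
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []Inst{
		{"MOVBLZX", `"".i+8(FP)`, "AX"},
		{"MOVBQZX", "AL", "AX"}, // redundant: AX is already zero-extended
		{"MOVQ", "AX", `"".~r1+16(FP)`},
	}
	for _, in := range peep(prog) {
		fmt.Printf("%s\t%s, %s\n", in.Op, in.Src, in.Dst)
	}
}
```

Run on the load8 sequence above, this drops the MOVBQZX and leaves the two instructions the old backend emits. The real pass would of course need full sub-register tracking and clobber analysis; this only looks at the immediately preceding write.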
Relatedly, for those extension MOVs that remain, we should test whether CBW and friends are desirable: they are shorter, but they are register-restricted, and the internet disagrees about whether they are as fast.
Here are some test cases:
package x
func load8(i uint8) uint64 { return uint64(i) }
func load32(i uint32) uint64 { return uint64(i) }
func store8(i uint64) uint64 { return uint64(uint8(i)) }
func store32(i uint64) uint64 { return uint64(uint32(i)) }
var p *int
func load8spill(i uint8) uint64 {
	i++            // use i
	print(i)       // spill i
	j := uint64(i) // use and extend i
	return j
}
func load32spill(i uint32) uint64 {
	i++            // use i
	print(i)       // spill i
	j := uint64(i) // use and extend i
	return j
}
func store8spill(i uint64) uint64 {
	j := uint8(i)    // convert
	print(j)         // spill
	return uint64(j) // use
}
func store32spill(i uint64) uint64 {
	j := uint32(i)   // convert
	print(j)         // spill
	return uint64(j) // use
}
cc @randall77