Description
Extending a register load or store currently generates suboptimal code with the SSA backend. In some cases, this is a regression from the old backend. For example:
func load8(i uint8) uint64 { return uint64(i) }
Generates:
"".load8 t=1 size=16 args=0x10 locals=0x0
0x0000 00000 (extend.go:3) TEXT "".load8(SB), $0-16
0x0000 00000 (extend.go:3) FUNCDATA $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
0x0000 00000 (extend.go:3) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (extend.go:3) MOVBLZX "".i+8(FP), AX
0x0005 00005 (extend.go:3) MOVBQZX AL, AX
0x0008 00008 (extend.go:3) MOVQ AX, "".~r1+16(FP)
0x000d 00013 (extend.go:3) RET
The old backend generates:
"".load8 t=1 size=16 args=0x10 locals=0x0
0x0000 00000 (extend.go:3) TEXT "".load8(SB), $0-16
0x0000 00000 (extend.go:3) FUNCDATA $0, gclocals·23e8278e2b69a3a75fa59b23c49ed6ad(SB)
0x0000 00000 (extend.go:3) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (extend.go:3) MOVBQZX "".i+8(FP), BX
0x0005 00005 (extend.go:3) MOVQ BX, "".~r1+16(FP)
0x000a 00010 (extend.go:3) RET
I tried fixing this in CL 21838, but the fix was partial and probably not in the right place.
It's hard to do this as part of the arch-specific rewrite rules, because (a) you lose the type-extension work that the SSA conversion did for you and have to recreate it later, and (b) you don't know where all the register loads and stores will be, because regalloc hasn't run yet.
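For reference, the memory half of the problem can in principle be expressed as a lowered rewrite rule that folds the extension into the load; roughly, in the style of the AMD64 .rules files (hypothetical sketch, not an actual rule from the tree):

```
(MOVBQZX x:(MOVBload [off] {sym} ptr mem)) && x.Uses == 1 -> @x.Block (MOVBQZXload [off] {sym} ptr mem)
```

But a rule like this can't touch the register-register extensions that only appear once regalloc has inserted spills and reloads, which is point (b) above.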
It's also hard to do this as part of converting final SSA to instructions (genvalue), since that code is geared to handle one value at a time, in isolation.
Teaching regalloc to combine these MOVs seems arch-specific and would further complicate already complicated machinery. Maybe the thing to do is to add an arch-specific rewrite pass after regalloc ("peep"?), using hand-written rewrite rules. Input requested.
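To make the "peep" proposal concrete, here is a minimal sketch of such a post-regalloc pass, over a toy instruction type rather than the compiler's real Prog/Value machinery (all names here are made up for illustration):

```go
package main

import "fmt"

// Inst is a toy model of a lowered instruction; a real pass would
// work on obj.Prog (or ssa.Value) instead.
type Inst struct {
	Op       string // e.g. "MOVBLZX", "MOVBQZX"
	Src, Dst string
}

// zeroExtWidth reports from how many low bits an op's result is known
// to be zero-extended (0 if unknown).
func zeroExtWidth(op string) int {
	switch op {
	case "MOVBLZX", "MOVBQZX":
		return 8
	case "MOVWLZX", "MOVWQZX":
		return 16
	case "MOVL": // 32-bit moves zero the upper 32 bits on amd64
		return 32
	}
	return 0
}

// reg64 maps a sub-register name to its full-width parent, for the
// few registers this example needs.
func reg64(r string) string {
	switch r {
	case "AL", "EAX":
		return "AX"
	}
	return r
}

// peep drops a zero-extension whose input was just produced by an op
// that already zero-extends from at least as few bits.
func peep(prog []Inst) []Inst {
	var out []Inst
	for _, in := range prog {
		if n := len(out); n > 0 {
			prev := out[n-1]
			w, pw := zeroExtWidth(in.Op), zeroExtWidth(prev.Op)
			if w > 0 && pw > 0 && pw <= w &&
				reg64(in.Src) == reg64(prev.Dst) && in.Dst == prev.Dst {
				continue // redundant extension: drop it
			}
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []Inst{
		{"MOVBLZX", `"".i+8(FP)`, "AX"},
		{"MOVBQZX", "AL", "AX"}, // redundant: AX is already zero-extended
		{"MOVQ", "AX", `"".~r1+16(FP)`},
	}
	for _, in := range peep(prog) {
		fmt.Printf("%s\t%s, %s\n", in.Op, in.Src, in.Dst)
	}
}
```

Run on the load8 sequence above, this drops the MOVBQZX and leaves the two instructions the old backend emits. The real pass would of course need full sub-register tracking and clobber analysis; this only looks at the immediately preceding write.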
Relatedly, for those extension MOVs that remain, we should test whether CBW and friends are desirable: they are shorter, but they are register-restricted, and the internet disagrees about whether they are as fast.
Here are some test cases:
package x
func load8(i uint8) uint64 { return uint64(i) }
func load32(i uint32) uint64 { return uint64(i) }
func store8(i uint64) uint64 { return uint64(uint8(i)) }
func store32(i uint64) uint64 { return uint64(uint32(i)) }
var p *int
func load8spill(i uint8) uint64 {
	i++            // use i
	print(i)       // spill i
	j := uint64(i) // use and extend i
	return j
}
func load32spill(i uint32) uint64 {
	i++            // use i
	print(i)       // spill i
	j := uint64(i) // use and extend i
	return j
}
func store8spill(i uint64) uint64 {
	j := uint8(i)    // convert
	print(j)         // spill
	return uint64(j) // use
}
func store32spill(i uint64) uint64 {
	j := uint32(i)   // convert
	print(j)         // spill
	return uint64(j) // use
}
cc @randall77