Skip to content

cmd/compile: structs with more than the hardcoded 4 words limits will always be spilled onto the stack even when passed in and out through registers by regabi #72897

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mcy opened this issue Mar 17, 2025 · 8 comments
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime.

Comments

@mcy
Copy link

mcy commented Mar 17, 2025

Go version

gotip version devel go1.25-38d146d5 Sun Mar 16 15:46:25 2025 -0700 linux/amd64

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE='on'
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/mcyoung/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/mcyoung/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build4068624110=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/dev/null'
GOMODCACHE='/home/mcyoung/projects/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/mcyoung/projects/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/mcyoung/sdk/gotip'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/mcyoung/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/mcyoung/sdk/gotip/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.25-38d146d5 Sun Mar 16 15:46:25 2025 -0700'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

I wrote the following package and generated assembly from it.

package x

type state struct {
  x0 func(state) state
  x1 uint64
  x2 uint64
  x3 uint64
  x4 uint64
  x5 uint64
  x6 uint64
  x7 uint64
  x8 uint64
}

//go:nosplit
func X(s state) state {
    return s.x0(s)
}

This is an example of threaded code, where all operations are replaced with function calls that consume the whole state and return it by value, so that the state can remain in registers. This is a technique for writing highly optimized parsers, such as the one in UPB and Protobuf C++.

What did you see happen?

The output was the following assembly.

TEXT	command-line-arguments.X(SB), NOSPLIT|ABIInternal, $224-72
  PUSHQ	BP
  MOVQ	SP, BP
  SUBQ	$216, SP

  ; Spill state into the caller-provided spill region.
  MOVQ	AX, 232(SP)
  MOVQ	BX, 240(SP)
  MOVQ	CX, 248(SP)
  MOVQ	DI, 256(SP)
  MOVQ	SI, 264(SP)
  MOVQ	R8, 272(SP)
  MOVQ	R9, 280(SP)
  MOVQ	R10, 288(SP)
  MOVQ	R11, 296(SP)

  ; Dead(?) stores to the stack.
  MOVUPS	X15, 72(SP)
  MOVUPS	X15, 80(SP)
  MOVUPS	X15, 96(SP)
  MOVUPS	X15, 112(SP)
  MOVUPS	X15, 128(SP)

  ; Load rdx with the funcval, and r12 with funcval.pc
  MOVQ	232(SP), DX
  MOVQ	(DX), R12

  ; Re-hydrate state. These loads are dead.
  MOVQ	240(SP), BX
  MOVQ	248(SP), CX
  MOVQ	256(SP), DI
  MOVQ	264(SP), SI
  MOVQ	272(SP), R8
  MOVQ	280(SP), R9
  MOVQ	288(SP), R10
  MOVQ	296(SP), R11
  MOVQ	DX, AX ; This register move is also dead.

  CALL	R12  ; Do the call.
  
  ; Spill the result of the caller-provided spill region.
  MOVQ	AX, 144(SP)
  MOVQ	BX, 152(SP)
  MOVQ	CX, 160(SP)
  MOVQ	DI, 168(SP)
  MOVQ	SI, 176(SP)
  MOVQ	R8, 184(SP)
  MOVQ	R9, 192(SP)
  MOVQ	R10, 200(SP)
  MOVQ	R11, 208(SP)
	
  ; Copy the above into our stack frame.
  ; Stores are dead.
  MOVQ	144(SP), R12
  MOVQ	R12, 72(SP)
  MOVUPS	152(SP), X0
  MOVUPS	X0, 80(SP)
  MOVUPS	168(SP), X0
  MOVUPS	X0, 96(SP)
  MOVUPS	184(SP), X0
  MOVUPS	X0, 112(SP)
  MOVUPS	200(SP), X0
  MOVUPS	X0, 128(SP)
   
  ; Re-hydrate the return value. These loads are dead.
  MOVQ	72(SP), AX
  MOVQ	80(SP), BX
  MOVQ	88(SP), CX
  MOVQ	96(SP), DI
  MOVQ	104(SP), SI
  MOVQ	112(SP), R8
  MOVQ	120(SP), R9
  MOVQ	128(SP), R10
  MOVQ	136(SP), R11
  
  ; Epilogue.
  ADDQ	$216, SP
  POPQ	BP
  RET

There are several problems with this code:

  1. s is spilled in two separate places, even though the ABI requires the caller to guarantee spill space
    for arguments. state's shape is pointer-free except for the x0 *funcval, so this can't be aiding stack scanning.
  2. Virtually all of the loads from the stack are dead: they are essentially *x = y; y = *x.

What did you expect to see?

I expected to see approximately the following code.

TEXT	command-line-arguments.X(SB), NOSPLIT|ABIInternal, $224-72
  MOVQ	AX, DX
  MOVQ	(AX), R12
  CALL	R12
  RET

Of course, this breaks symbolization of arguments/returns in backtraces (and, presumably debuggers). To my knowledge, there is no way to instruct gc to not perform such spills.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 17, 2025
@mcy
Copy link
Author

mcy commented Mar 17, 2025

Now, I expected the above to be the end of the story. But I tried unrolling the arguments, such that X takes a dozen uintptrs as arguments and then returns them. And... the code is exactly what I want?!

func X2(
  x0 F,
  x1 uint64, x2 uint64, x3 uint64, x4 uint64,
  x5 uint64, x6 uint64, x7 uint64, x8 uint64,
) (
  F,
  uint64, uint64, uint64, uint64,
  uint64, uint64, uint64, uint64,
) {
  return x0(x0, x1, x2, x3, x4, x5, x6, x7, x8)
}
TEXT    command-line-arguments.X2(SB), NOSPLIT|ABIInternal, $80-72
        PUSHQ   BP
        MOVQ    SP, BP
        SUBQ    $72, SP
        MOVQ    (AX), R12
        MOVQ    AX, DX
        CALL    R12
        ADDQ    $72, SP
        POPQ    BP
        RET

It is somewhat obnoxious that this is forced to waste stack space for the callee, which will be nosplit as well, but that's not something I can reasonably expect to work.

So, this is a bug with struct arguments specifically: struct arguments have the same ABI as individual values, but they result in terrible register allocation outcomes in the body.

On the bright side, at least I can write threaded interpreters in Go, it's just going to be excruciating until this bug is fixed XD

@mcy
Copy link
Author

mcy commented Mar 17, 2025

As an addendum, here's a godbolt with all the code, and some other functions. https://godbolt.org/z/bs8xvMsM7

Of interest is this variant:

//go:nosplit
func Y(s state) state {
  s.x1++
  return s
}

Not only does this spill its guts EXACTLY like X() does, but it has the temerity to lower s.x1++ to INCQ 96(SP), instead of INCQ BX. This is a smoking gun for me: maybe there's a missing SROA pass somewhere, but despite aggressive explosion in the ABI, codegen really wants to treat structs as being passed on the stack.

@mcy
Copy link
Author

mcy commented Mar 17, 2025

And a microbenchmark, for good measure:

$ gotip test threaded_test.go -bench .
goos: linux
goarch: amd64
cpu: AMD Ryzen Threadripper 3960X 24-Core Processor 
BenchmarkX/slow-48         	52043440	       22.63 ns/op
BenchmarkX/fast-48         	434398324	        2.459 ns/op
PASS
ok  	command-line-arguments	2.252s

I am actually shocked at it being 10x slower. I expected a 2x slowdown at most!

Harness:

package x

import "testing"

type state struct {
  x0 func(state) state
  x1 uint64
  x2 uint64
  x3 uint64
  x4 uint64
  x5 uint64
  x6 uint64
  x7 uint64
  x8 uint64
}

//go:nosplit
func X(s state) state {
    return s.x0(s)
}

//go:nosplit
func Y(s state) state {
  s.x1++
  return s
}

type F func(
  x0 F,
  x1 uint64, x2 uint64, x3 uint64, x4 uint64,
  x5 uint64, x6 uint64, x7 uint64, x8 uint64,
) (
  F,
  uint64, uint64, uint64, uint64,
  uint64, uint64, uint64, uint64,
)

//go:nosplit
func X2(
  x0 F,
  x1 uint64, x2 uint64, x3 uint64, x4 uint64,
  x5 uint64, x6 uint64, x7 uint64, x8 uint64,
) (
  F,
  uint64, uint64, uint64, uint64,
  uint64, uint64, uint64, uint64,
) {
  return x0(x0, x1, x2, x3, x4, x5, x6, x7, x8)
}

//go:nosplit
func Y2(
  x0 F,
  x1 uint64, x2 uint64, x3 uint64, x4 uint64,
  x5 uint64, x6 uint64, x7 uint64, x8 uint64,
) (
  F,
  uint64, uint64, uint64, uint64,
  uint64, uint64, uint64, uint64,
) {
  x1++
  return x0, x1, x2, x3, x4, x5, x6, x7, x8
}

func BenchmarkX(b *testing.B) {
  b.Run("slow", func(b *testing.B) {
    for b.Loop() {
      X(state{x0: Y})
    }
  })
  b.Run("fast", func(b *testing.B) {
    for b.Loop() {
      X2(Y2, 0, 0, 0, 0, 0, 0, 0, 0)
    }
  })
}

@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Mar 17, 2025
@Jorropo

This comment has been minimized.

@mcy

This comment has been minimized.

@Jorropo
Copy link
Member

Jorropo commented Mar 17, 2025

My bad it looks like I can't read assembly at 2am.

I have a fix for this here https://go-review.googlesource.com/c/go/+/561695 but it makes compilation scaling worst than pseudo-linear.

Maybe a dup of #65495

That very relevant here and would make 561695 acceptable to submit imo:

If we have support for variable based ssaification of structs I think it would make sense to vary the limit based on available registers.

#65495 (comment) & #65495 (comment)

@Jorropo Jorropo changed the title cmd/compile: gc produces excessive spills for "threaded code" style functions cmd/compile: structs with more than the hardcoded 4 words limits will always be spilled onto the stack even when passed in and out through registers by regabi Mar 17, 2025
@Jorropo
Copy link
Member

Jorropo commented Mar 17, 2025

The test I've wrote for CL 561695 test the exact same behaviors (altho it's not as complete):

// asmcheck

// Copyright 2024 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package a

type desc struct{ t, x, y, z, p int }

func enc() byte {
	// amd64:-"MOV.+autotmp"
	return byte(process().z)
}

//go:noinline
func process() desc {
	return desc{}
}

I do think this is a dup.

@Jorropo Jorropo closed this as completed Mar 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime.
Projects
None yet
Development

No branches or pull requests

4 participants