runtime: SIGPROF walking stack-in-motion causes missed write barrier panic #12932

DanielMorsing · 2015-10-14T14:50:37Z

This looks a lot like #11863

Stack trace:

found next stack barrier at 0xc857445a00; expected [*0xc8427059f0=0x869015 *0xc842705e38=0x61b1de]
fatal error: missed stack barrier

goroutine 0 [idle]:
runtime.throw(0xcb8fc0, 0x14)
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:527 +0x90
runtime.gentraceback(0x413929, 0x7fea597f9b48, 0x0, 0xc8206ca300, 0x0, 0xc8207a7738, 0x40, 0x0, 0x0, 0x6, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:261 +0x1053
runtime.sigprof(0x413929, 0x7fea597f9b48, 0x0, 0xc8206ca300, 0xc82079c000)
    /root/.gvm/gos/go1.5.1/src/runtime/proc1.go:2585 +0x50b
runtime.sighandler(0xc80000001b, 0xc8207a7bb0, 0xc8207a7a80, 0xc8206ca300)
    /root/.gvm/gos/go1.5.1/src/runtime/signal_amd64x.go:47 +0xbe
runtime.sigtrampgo(0x1b, 0xc8207a7bb0, 0xc8207a7a80)
    /root/.gvm/gos/go1.5.1/src/runtime/signal_linux.go:94 +0x95
runtime.sigtramp(0x1, 0x0, 0xc8207a0000, 0x0, 0x7fa0, 0x4000000, 0xc859cb1c80, 0x0, 0x0, 0x3, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/sys_linux_amd64.s:234 +0x1b
runtime.sigreturn(0x0, 0xc8207a0000, 0x0, 0x7fa0, 0x4000000, 0xc859cb1c80, 0x0, 0x0, 0x3, 0xd817e3, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/sys_linux_amd64.s:238

goroutine 870933 [copystack]:
runtime.callers(0x4, 0xc842705298, 0x20, 0x20, 0xd)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:562 fp=0xc857445268 sp=0xc857445260
runtime.mProf_Malloc(0xc82e414580, 0x40)
    /root/.gvm/gos/go1.5.1/src/runtime/mprof.go:235 +0x7f fp=0xc8574453e8 sp=0xc857445268
runtime.profilealloc(0xc82079c000, 0xc82e414580, 0x40)
    /root/.gvm/gos/go1.5.1/src/runtime/malloc.go:811 +0x98 fp=0xc857445410 sp=0xc8574453e8
runtime.mallocgc(0x40, 0x0, 0x3, 0x412bd5)
    /root/.gvm/gos/go1.5.1/src/runtime/malloc.go:699 +0x5d3 fp=0xc8574454e0 sp=0xc857445410
runtime.rawstring(0x31, 0x0, 0x0, 0x0, 0x0, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/string.go:264 +0x70 fp=0xc857445528 sp=0xc8574454e0
runtime.rawstringtmp(0x0, 0x31, 0x0, 0x0, 0x0, 0x0, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/string.go:107 +0xb7 fp=0xc857445560 sp=0xc857445528
runtime.concatstrings(0x0, 0xc8427056e0, 0x3, 0x3, 0x0, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/string.go:48 +0x1c2 fp=0xc857445698 sp=0xc857445560
runtime.concatstring3(0x0, 0xc82e412420, 0x28, 0xc2d0e8, 0x4, 0xc841aaf3d0, 0x5, 0x0, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/string.go:62 +0x6a fp=0xc8574456e8 sp=0xc857445698
found next stack barrier at 0xc857445a00; expected [*0xc8427059f0=0x869015 *0xc842705e38=0x61b1de]
fatal error: missed stack barrier
panic during panic

goroutine 0 [idle]:
runtime.startpanic_m()
    /root/.gvm/gos/go1.5.1/src/runtime/panic1.go:67 +0x141
runtime.systemstack(0xd862e0)
    /root/.gvm/gos/go1.5.1/src/runtime/asm_amd64.s:278 +0xab
runtime.startpanic()
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:505 +0x14
runtime.throw(0xcb8fc0, 0x14)
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:526 +0x83
runtime.gentraceback(0x452670, 0xc857445260, 0x0, 0xc859cb1c80, 0x0, 0x0, 0x64, 0x0, 0x0, 0x0, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:261 +0x1053
runtime.traceback1(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc859cb1c80, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:552 +0xc8
runtime.traceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc859cb1c80)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:529 +0x48
runtime.tracebackothers(0xc8206ca180)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:666 +0xda
runtime.dopanic_m(0xc8206ca180, 0x42e9a0, 0xc8207a7588)
    /root/.gvm/gos/go1.5.1/src/runtime/panic1.go:104 +0x1f9
runtime.dopanic.func1()
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:514 +0x32
runtime.systemstack(0xc8207a7560)
    /root/.gvm/gos/go1.5.1/src/runtime/asm_amd64.s:278 +0xab
runtime.dopanic(0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:515 +0x61
runtime.throw(0xcb8fc0, 0x14)
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:527 +0x90
runtime.gentraceback(0x413929, 0x7fea597f9b48, 0x0, 0xc8206ca300, 0x0, 0xc8207a7738, 0x40, 0x0, 0x0, 0x6, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:261 +0x1053
runtime.sigprof(0x413929, 0x7fea597f9b48, 0x0, 0xc8206ca300, 0xc82079c000)
    /root/.gvm/gos/go1.5.1/src/runtime/proc1.go:2585 +0x50b
runtime.sighandler(0xc80000001b, 0xc8207a7bb0, 0xc8207a7a80, 0xc8206ca300)
    /root/.gvm/gos/go1.5.1/src/runtime/signal_amd64x.go:47 +0xbe
runtime.sigtrampgo(0x1b, 0xc8207a7bb0, 0xc8207a7a80)
    /root/.gvm/gos/go1.5.1/src/runtime/signal_linux.go:94 +0x95
runtime.sigtramp(0x1, 0x0, 0xc8207a0000, 0x0, 0x7fa0, 0x4000000, 0xc859cb1c80, 0x0, 0x0, 0x3, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/sys_linux_amd64.s:234 +0x1b
runtime.sigreturn(0x0, 0xc8207a0000, 0x0, 0x7fa0, 0x4000000, 0xc859cb1c80, 0x0, 0x0, 0x3, 0xd817e3, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/sys_linux_amd64.s:238

The text was updated successfully, but these errors were encountered:

DanielMorsing · 2015-10-14T17:00:56Z

newstack() has a call to scanstack that isn't protected by the cas lock. Thinking this is the culprit.

aclements · 2015-10-15T18:53:48Z

Hi Daniel. I don't think it's the scanstack in newstack that's the problem. Based on the "copystack" state of goroutine 870933, the huge distance between where we found the stack barrier and where we were expecting to find it, and the location of the recursive "missed stack barrier" panic while backtracing for the first panic, I think goroutine 870933 is in copystack trying to grow its stack, it has adjusted the stack barriers for its new stack, but not yet switched to the new stack, and a SIGPROF came in on its thread. The SIGPROF is scanning the old stack, but expecting the adjusted stack barrier locations.

I'm still thinking about how to fix this. Unfortunately we can't just throw the cas lock around copystack because the SIGPROF will self-deadlock. We may just have to ignore a SIGPROF that comes in while the stack is inconsistent (we can probably narrow the window of time during which that's the case).

How easily can you reproduce this?

DanielMorsing · 2015-10-15T19:00:17Z

We've been running a stress test on a program over a couple of hours with CPU profiling enabled. We can reproduce, but it's not exactly easy.

aclements · 2015-10-15T19:02:16Z

Okay. I just wanted to make sure there's a way to test the fix once I figure out what is it. :)

otoolep · 2015-10-15T19:04:26Z

As @DanielMorsing said, it takes hours for us to hit this, but we usually have 1 failure per 2-3 days of stress. We're disabling profiling now, during our stress runs until this is fixed (and the fix ideally in a released version of Go), but I think we would feel confident it was fixed if we didn't see it for a week or so.

aclements · 2015-10-15T19:06:03Z

Okay. If you have/get any more stack traces, could you paste them in the issue? That will help me confirm that I understand what's going on.

otoolep · 2015-10-15T19:32:40Z

This is the first ticket we filed on our side.

influxdata/influxdb#4328

We can add more traces if and when we get them.

aclements · 2015-10-16T19:00:07Z

@DanielMorsing or @otoolep, can you test https://go-review.googlesource.com/#/c/15995?

gopherbot · 2015-10-16T19:02:09Z

CL https://golang.org/cl/15995 mentions this issue.

gopherbot · 2015-10-16T20:02:04Z

CL https://golang.org/cl/15996 mentions this issue.

aclements · 2015-10-16T21:22:22Z

@DanielMorsing, if you still have the binary that gave the original stack trace, can you run

echo 0x413929 | go tool addr2line <binary>

This is the PC where the signal happened, so it should tell us if it falls in copystack or anything related. The SP is 0x7fea597f9b48, which is certainly a system stack.

I assume what you pasted is the entire panic, so there aren't other goroutines?

DanielMorsing · 2015-10-16T21:32:12Z

There are certainly other goroutines present, but I think they're not shown because the panic causes a traceback, which then causes another panic when it hits the bad stack barrier and terminates the stack printing.

Unfortunately, we don't have the binary any more.

aclements · 2015-10-16T22:16:44Z

That's too bad.

Is it possible to set GOTRACEBACK=2 so we can get a more complete traceback the next time it fails?

otoolep · 2015-10-28T16:57:36Z

Our long running testing of InfluxDB, without profiling enabled, shows us that this only happens when profiling is enabled. I think we all knew this, but I wanted to confirm.

gopherbot · 2015-11-11T21:02:16Z

CL https://golang.org/cl/16819 mentions this issue.

otoolep · 2015-11-11T23:51:44Z

We are reverting to Go 1.4.2, partly because of this issue.

gopherbot · 2015-11-16T23:01:07Z

CL https://golang.org/cl/16985 mentions this issue.

…ng copied sigprof tracebacks the stack across systemstack switches to make profile tracebacks more complete. However, it does this even if the user stack is currently being copied, which means it may be in an inconsistent state that will cause the traceback to panic. One specific way this can happen is during stack shrinking. Some goroutine blocks for STW, then enters gchelper, which then assists with root marking. If that root marking happens to pick the original goroutine and its stack needs to be shrunk, it will begin to copy that stack. During this copy, the stack is generally inconsistent and, in particular, the actual locations of the stack barriers and their recorded locations are temporarily out of sync. If a SIGPROF happens during this inconsistency, it will walk the stack all the way back to the blocked goroutine and panic when it fails to unwind the stack barriers. Fix this by disallowing jumping to the user stack during SIGPROF if that user stack is in the process of being copied. Fixes #12932. Change-Id: I9ef694c2c01e3653e292ce22612418dd3daff1b4 Reviewed-on: https://go-review.googlesource.com/16819 Reviewed-by: Daniel Morsing <daniel.morsing@gmail.com> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-on: https://go-review.googlesource.com/16985 Reviewed-by: Ian Lance Taylor <iant@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>

gopherbot · 2015-11-23T21:02:31Z

CL https://golang.org/cl/17192 mentions this issue.

gopherbot · 2015-11-23T21:02:44Z

CL https://golang.org/cl/17191 mentions this issue.

Commit bbd1a1c prevented SIGPROF from scanning stacks that were being copied, but it didn't prevent a stack copy (specifically a stack shrink) from happening while SIGPROF is scanning the stack. As a result, a stack copy may adjust stack barriers while SIGPROF is in the middle of scanning a stack, causing SIGPROF to panic when it detects an inconsistent stack barrier. Fix this by taking the stack barrier lock while adjusting the stack. In addition to preventing SIGPROF from scanning this stack, this will block until any in-progress SIGPROF is done scanning the stack. For 1.5.2. Fixes #13362. Updates #12932. Change-Id: I422219c363054410dfa56381f7b917e04690e5dd Reviewed-on: https://go-review.googlesource.com/17191 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

gopherbot · 2015-11-24T02:00:28Z

CL https://golang.org/cl/17194 mentions this issue.

Commit bbd1a1c prevented SIGPROF from scanning stacks that were being copied, but it didn't prevent a stack copy (specifically a stack shrink) from happening while SIGPROF is scanning the stack. As a result, a stack copy may adjust stack barriers while SIGPROF is in the middle of scanning a stack, causing SIGPROF to panic when it detects an inconsistent stack barrier. Fix this by taking the stack barrier lock while adjusting the stack. In addition to preventing SIGPROF from scanning this stack, this will block until any in-progress SIGPROF is done scanning the stack. For 1.5.2. Fixes #13362. Updates #12932. Change-Id: I422219c363054410dfa56381f7b917e04690e5dd Reviewed-on: https://go-review.googlesource.com/17191 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-on: https://go-review.googlesource.com/17194

The addition of stack barrier locking to copystack subsumes the partial fix from commit bbd1a1c for SIGPROF during copystack. With the stack barrier locking, this commit simplifies the rule in sigprof to: the user stack can be traced only if sigprof can acquire the stack barrier lock. Updates #12932, #13362. Change-Id: I1c1f80015053d0ac7761e9e0c7437c2aba26663f Reviewed-on: https://go-review.googlesource.com/17192 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>

DanielMorsing assigned aclements Oct 14, 2015

ianlancetaylor added this to the Go1.5.2 milestone Oct 14, 2015

daviesalex mentioned this issue Oct 28, 2015

Move to Go 1.5.2 when released (was: 0.9.5-nightly-827c513 panic: objectstart: bad pointer in unexpected span) influxdata/influxdb#4554

Closed

aclements closed this as completed in bbd1a1c Nov 12, 2015

rsc changed the title ~~runtime: Missed stack barrier in go 1.5.1~~ runtime: SIGPROF walking stack-in-motion causes missed write barrier panic Nov 13, 2015

rsc mentioned this issue Nov 17, 2015

runtime: stack barrier inconsistency during profiling (dup #12932?) #12528

Closed

aclements mentioned this issue Nov 23, 2015

runtime: TestStackBarrierProfiling missed stack barrier #13362

Closed

golang locked and limited conversation to collaborators Nov 27, 2016

gopherbot added the FrozenDueToAge label Nov 27, 2016

rsc unassigned aclements Jun 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: SIGPROF walking stack-in-motion causes missed write barrier panic #12932

runtime: SIGPROF walking stack-in-motion causes missed write barrier panic #12932

DanielMorsing commented Oct 14, 2015

DanielMorsing commented Oct 14, 2015

aclements commented Oct 15, 2015

DanielMorsing commented Oct 15, 2015

aclements commented Oct 15, 2015

otoolep commented Oct 15, 2015

aclements commented Oct 15, 2015

otoolep commented Oct 15, 2015

aclements commented Oct 16, 2015

gopherbot commented Oct 16, 2015

gopherbot commented Oct 16, 2015

aclements commented Oct 16, 2015

DanielMorsing commented Oct 16, 2015

aclements commented Oct 16, 2015

otoolep commented Oct 28, 2015

gopherbot commented Nov 11, 2015

otoolep commented Nov 11, 2015

gopherbot commented Nov 16, 2015

gopherbot commented Nov 23, 2015

gopherbot commented Nov 23, 2015

gopherbot commented Nov 24, 2015

runtime: SIGPROF walking stack-in-motion causes missed write barrier panic #12932

runtime: SIGPROF walking stack-in-motion causes missed write barrier panic #12932

Comments

DanielMorsing commented Oct 14, 2015

DanielMorsing commented Oct 14, 2015

aclements commented Oct 15, 2015

DanielMorsing commented Oct 15, 2015

aclements commented Oct 15, 2015

otoolep commented Oct 15, 2015

aclements commented Oct 15, 2015

otoolep commented Oct 15, 2015

aclements commented Oct 16, 2015

gopherbot commented Oct 16, 2015

gopherbot commented Oct 16, 2015

aclements commented Oct 16, 2015

DanielMorsing commented Oct 16, 2015

aclements commented Oct 16, 2015

otoolep commented Oct 28, 2015

gopherbot commented Nov 11, 2015

otoolep commented Nov 11, 2015

gopherbot commented Nov 16, 2015

gopherbot commented Nov 23, 2015

gopherbot commented Nov 23, 2015

gopherbot commented Nov 24, 2015