Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: SIGPROF walking stack-in-motion causes missed write barrier panic #12932

Closed
DanielMorsing opened this issue Oct 14, 2015 · 20 comments
Closed

Comments

@DanielMorsing
Copy link
Contributor

This looks a lot like #11863

Stack trace:

found next stack barrier at 0xc857445a00; expected [*0xc8427059f0=0x869015 *0xc842705e38=0x61b1de]
fatal error: missed stack barrier

goroutine 0 [idle]:
runtime.throw(0xcb8fc0, 0x14)
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:527 +0x90
runtime.gentraceback(0x413929, 0x7fea597f9b48, 0x0, 0xc8206ca300, 0x0, 0xc8207a7738, 0x40, 0x0, 0x0, 0x6, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:261 +0x1053
runtime.sigprof(0x413929, 0x7fea597f9b48, 0x0, 0xc8206ca300, 0xc82079c000)
    /root/.gvm/gos/go1.5.1/src/runtime/proc1.go:2585 +0x50b
runtime.sighandler(0xc80000001b, 0xc8207a7bb0, 0xc8207a7a80, 0xc8206ca300)
    /root/.gvm/gos/go1.5.1/src/runtime/signal_amd64x.go:47 +0xbe
runtime.sigtrampgo(0x1b, 0xc8207a7bb0, 0xc8207a7a80)
    /root/.gvm/gos/go1.5.1/src/runtime/signal_linux.go:94 +0x95
runtime.sigtramp(0x1, 0x0, 0xc8207a0000, 0x0, 0x7fa0, 0x4000000, 0xc859cb1c80, 0x0, 0x0, 0x3, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/sys_linux_amd64.s:234 +0x1b
runtime.sigreturn(0x0, 0xc8207a0000, 0x0, 0x7fa0, 0x4000000, 0xc859cb1c80, 0x0, 0x0, 0x3, 0xd817e3, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/sys_linux_amd64.s:238

goroutine 870933 [copystack]:
runtime.callers(0x4, 0xc842705298, 0x20, 0x20, 0xd)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:562 fp=0xc857445268 sp=0xc857445260
runtime.mProf_Malloc(0xc82e414580, 0x40)
    /root/.gvm/gos/go1.5.1/src/runtime/mprof.go:235 +0x7f fp=0xc8574453e8 sp=0xc857445268
runtime.profilealloc(0xc82079c000, 0xc82e414580, 0x40)
    /root/.gvm/gos/go1.5.1/src/runtime/malloc.go:811 +0x98 fp=0xc857445410 sp=0xc8574453e8
runtime.mallocgc(0x40, 0x0, 0x3, 0x412bd5)
    /root/.gvm/gos/go1.5.1/src/runtime/malloc.go:699 +0x5d3 fp=0xc8574454e0 sp=0xc857445410
runtime.rawstring(0x31, 0x0, 0x0, 0x0, 0x0, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/string.go:264 +0x70 fp=0xc857445528 sp=0xc8574454e0
runtime.rawstringtmp(0x0, 0x31, 0x0, 0x0, 0x0, 0x0, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/string.go:107 +0xb7 fp=0xc857445560 sp=0xc857445528
runtime.concatstrings(0x0, 0xc8427056e0, 0x3, 0x3, 0x0, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/string.go:48 +0x1c2 fp=0xc857445698 sp=0xc857445560
runtime.concatstring3(0x0, 0xc82e412420, 0x28, 0xc2d0e8, 0x4, 0xc841aaf3d0, 0x5, 0x0, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/string.go:62 +0x6a fp=0xc8574456e8 sp=0xc857445698
found next stack barrier at 0xc857445a00; expected [*0xc8427059f0=0x869015 *0xc842705e38=0x61b1de]
fatal error: missed stack barrier
panic during panic

goroutine 0 [idle]:
runtime.startpanic_m()
    /root/.gvm/gos/go1.5.1/src/runtime/panic1.go:67 +0x141
runtime.systemstack(0xd862e0)
    /root/.gvm/gos/go1.5.1/src/runtime/asm_amd64.s:278 +0xab
runtime.startpanic()
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:505 +0x14
runtime.throw(0xcb8fc0, 0x14)
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:526 +0x83
runtime.gentraceback(0x452670, 0xc857445260, 0x0, 0xc859cb1c80, 0x0, 0x0, 0x64, 0x0, 0x0, 0x0, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:261 +0x1053
runtime.traceback1(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc859cb1c80, 0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:552 +0xc8
runtime.traceback(0xffffffffffffffff, 0xffffffffffffffff, 0x0, 0xc859cb1c80)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:529 +0x48
runtime.tracebackothers(0xc8206ca180)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:666 +0xda
runtime.dopanic_m(0xc8206ca180, 0x42e9a0, 0xc8207a7588)
    /root/.gvm/gos/go1.5.1/src/runtime/panic1.go:104 +0x1f9
runtime.dopanic.func1()
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:514 +0x32
runtime.systemstack(0xc8207a7560)
    /root/.gvm/gos/go1.5.1/src/runtime/asm_amd64.s:278 +0xab
runtime.dopanic(0x0)
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:515 +0x61
runtime.throw(0xcb8fc0, 0x14)
    /root/.gvm/gos/go1.5.1/src/runtime/panic.go:527 +0x90
runtime.gentraceback(0x413929, 0x7fea597f9b48, 0x0, 0xc8206ca300, 0x0, 0xc8207a7738, 0x40, 0x0, 0x0, 0x6, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/traceback.go:261 +0x1053
runtime.sigprof(0x413929, 0x7fea597f9b48, 0x0, 0xc8206ca300, 0xc82079c000)
    /root/.gvm/gos/go1.5.1/src/runtime/proc1.go:2585 +0x50b
runtime.sighandler(0xc80000001b, 0xc8207a7bb0, 0xc8207a7a80, 0xc8206ca300)
    /root/.gvm/gos/go1.5.1/src/runtime/signal_amd64x.go:47 +0xbe
runtime.sigtrampgo(0x1b, 0xc8207a7bb0, 0xc8207a7a80)
    /root/.gvm/gos/go1.5.1/src/runtime/signal_linux.go:94 +0x95
runtime.sigtramp(0x1, 0x0, 0xc8207a0000, 0x0, 0x7fa0, 0x4000000, 0xc859cb1c80, 0x0, 0x0, 0x3, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/sys_linux_amd64.s:234 +0x1b
runtime.sigreturn(0x0, 0xc8207a0000, 0x0, 0x7fa0, 0x4000000, 0xc859cb1c80, 0x0, 0x0, 0x3, 0xd817e3, ...)
    /root/.gvm/gos/go1.5.1/src/runtime/sys_linux_amd64.s:238
@DanielMorsing
Copy link
Contributor Author

newstack() has a call to scanstack that isn't protected by the cas lock. Thinking this is the culprit.

@aclements
Copy link
Member

Hi Daniel. I don't think it's the scanstack in newstack that's the problem. Based on the "copystack" state of goroutine 870933, the huge distance between where we found the stack barrier and where we were expecting to find it, and the location of the recursive "missed stack barrier" panic while backtracing for the first panic, I think goroutine 870933 is in copystack trying to grow its stack, it has adjusted the stack barriers for its new stack, but not yet switched to the new stack, and a SIGPROF came in on its thread. The SIGPROF is scanning the old stack, but expecting the adjusted stack barrier locations.

I'm still thinking about how to fix this. Unfortunately we can't just throw the cas lock around copystack because the SIGPROF will self-deadlock. We may just have to ignore a SIGPROF that comes in while the stack is inconsistent (we can probably narrow the window of time during which that's the case).

How easily can you reproduce this?

@DanielMorsing
Copy link
Contributor Author

We've been running a stress test on a program over a couple of hours with CPU profiling enabled. We can reproduce, but it's not exactly easy.

@aclements
Copy link
Member

Okay. I just wanted to make sure there's a way to test the fix once I figure out what is it. :)

@otoolep
Copy link

otoolep commented Oct 15, 2015

As @DanielMorsing said, it takes hours for us to hit this, but we usually have 1 failure per 2-3 days of stress. We're disabling profiling now, during our stress runs until this is fixed (and the fix ideally in a released version of Go), but I think we would feel confident it was fixed if we didn't see it for a week or so.

@aclements
Copy link
Member

Okay. If you have/get any more stack traces, could you paste them in the issue? That will help me confirm that I understand what's going on.

@otoolep
Copy link

otoolep commented Oct 15, 2015

This is the first ticket we filed on our side.

influxdata/influxdb#4328

We can add more traces if and when we get them.

@aclements
Copy link
Member

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/15995 mentions this issue.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/15996 mentions this issue.

@aclements
Copy link
Member

@DanielMorsing, if you still have the binary that gave the original stack trace, can you run

echo 0x413929 | go tool addr2line <binary>

This is the PC where the signal happened, so it should tell us if it falls in copystack or anything related. The SP is 0x7fea597f9b48, which is certainly a system stack.

I assume what you pasted is the entire panic, so there aren't other goroutines?

@DanielMorsing
Copy link
Contributor Author

There are certainly other goroutines present, but I think they're not shown because the panic causes a traceback, which then causes another panic when it hits the bad stack barrier and terminates the stack printing.

Unfortunately, we don't have the binary any more.

@aclements
Copy link
Member

That's too bad.

Is it possible to set GOTRACEBACK=2 so we can get a more complete traceback the next time it fails?

@otoolep
Copy link

otoolep commented Oct 28, 2015

Our long running testing of InfluxDB, without profiling enabled, shows us that this only happens when profiling is enabled. I think we all knew this, but I wanted to confirm.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16819 mentions this issue.

@otoolep
Copy link

otoolep commented Nov 11, 2015

We are reverting to Go 1.4.2, partly because of this issue.

@rsc rsc changed the title runtime: Missed stack barrier in go 1.5.1 runtime: SIGPROF walking stack-in-motion causes missed write barrier panic Nov 13, 2015
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16985 mentions this issue.

aclements added a commit that referenced this issue Nov 17, 2015
…ng copied

sigprof tracebacks the stack across systemstack switches to make
profile tracebacks more complete. However, it does this even if the
user stack is currently being copied, which means it may be in an
inconsistent state that will cause the traceback to panic.

One specific way this can happen is during stack shrinking. Some
goroutine blocks for STW, then enters gchelper, which then assists
with root marking. If that root marking happens to pick the original
goroutine and its stack needs to be shrunk, it will begin to copy that
stack. During this copy, the stack is generally inconsistent and, in
particular, the actual locations of the stack barriers and their
recorded locations are temporarily out of sync. If a SIGPROF happens
during this inconsistency, it will walk the stack all the way back to
the blocked goroutine and panic when it fails to unwind the stack
barriers.

Fix this by disallowing jumping to the user stack during SIGPROF if
that user stack is in the process of being copied.

Fixes #12932.

Change-Id: I9ef694c2c01e3653e292ce22612418dd3daff1b4
Reviewed-on: https://go-review.googlesource.com/16819
Reviewed-by: Daniel Morsing <daniel.morsing@gmail.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-on: https://go-review.googlesource.com/16985
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/17192 mentions this issue.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/17191 mentions this issue.

aclements added a commit that referenced this issue Nov 24, 2015
Commit bbd1a1c prevented SIGPROF from scanning stacks that were being
copied, but it didn't prevent a stack copy (specifically a stack
shrink) from happening while SIGPROF is scanning the stack. As a
result, a stack copy may adjust stack barriers while SIGPROF is in the
middle of scanning a stack, causing SIGPROF to panic when it detects
an inconsistent stack barrier.

Fix this by taking the stack barrier lock while adjusting the stack.
In addition to preventing SIGPROF from scanning this stack, this will
block until any in-progress SIGPROF is done scanning the stack.

For 1.5.2.

Fixes #13362.
Updates #12932.

Change-Id: I422219c363054410dfa56381f7b917e04690e5dd
Reviewed-on: https://go-review.googlesource.com/17191
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/17194 mentions this issue.

aclements added a commit that referenced this issue Nov 24, 2015
Commit bbd1a1c prevented SIGPROF from scanning stacks that were being
copied, but it didn't prevent a stack copy (specifically a stack
shrink) from happening while SIGPROF is scanning the stack. As a
result, a stack copy may adjust stack barriers while SIGPROF is in the
middle of scanning a stack, causing SIGPROF to panic when it detects
an inconsistent stack barrier.

Fix this by taking the stack barrier lock while adjusting the stack.
In addition to preventing SIGPROF from scanning this stack, this will
block until any in-progress SIGPROF is done scanning the stack.

For 1.5.2.

Fixes #13362.
Updates #12932.

Change-Id: I422219c363054410dfa56381f7b917e04690e5dd
Reviewed-on: https://go-review.googlesource.com/17191
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-on: https://go-review.googlesource.com/17194
aclements added a commit that referenced this issue Dec 15, 2015
The addition of stack barrier locking to copystack subsumes the
partial fix from commit bbd1a1c for SIGPROF during copystack. With the
stack barrier locking, this commit simplifies the rule in sigprof to:
the user stack can be traced only if sigprof can acquire the stack
barrier lock.

Updates #12932, #13362.

Change-Id: I1c1f80015053d0ac7761e9e0c7437c2aba26663f
Reviewed-on: https://go-review.googlesource.com/17192
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
@golang golang locked and limited conversation to collaborators Nov 27, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants