-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: SIGPROF walking stack-in-motion causes missed write barrier panic #12932
Comments
newstack() has a call to scanstack that isn't protected by the cas lock. Thinking this is the culprit. |
Hi Daniel. I don't think it's the scanstack in newstack that's the problem. Based on the "copystack" state of goroutine 870933, the huge distance between where we found the stack barrier and where we were expecting to find it, and the location of the recursive "missed stack barrier" panic while backtracing for the first panic, I think goroutine 870933 is in copystack trying to grow its stack, it has adjusted the stack barriers for its new stack, but not yet switched to the new stack, and a SIGPROF came in on its thread. The SIGPROF is scanning the old stack, but expecting the adjusted stack barrier locations. I'm still thinking about how to fix this. Unfortunately we can't just throw the cas lock around copystack because the SIGPROF will self-deadlock. We may just have to ignore a SIGPROF that comes in while the stack is inconsistent (we can probably narrow the window of time during which that's the case). How easily can you reproduce this? |
We've been running a stress test on a program over a couple of hours with CPU profiling enabled. We can reproduce, but it's not exactly easy. |
Okay. I just wanted to make sure there's a way to test the fix once I figure out what is it. :) |
As @DanielMorsing said, it takes hours for us to hit this, but we usually have 1 failure per 2-3 days of stress. We're disabling profiling now, during our stress runs until this is fixed (and the fix ideally in a released version of Go), but I think we would feel confident it was fixed if we didn't see it for a week or so. |
Okay. If you have/get any more stack traces, could you paste them in the issue? That will help me confirm that I understand what's going on. |
This is the first ticket we filed on our side. We can add more traces if and when we get them. |
@DanielMorsing or @otoolep, can you test https://go-review.googlesource.com/#/c/15995? |
CL https://golang.org/cl/15995 mentions this issue. |
CL https://golang.org/cl/15996 mentions this issue. |
@DanielMorsing, if you still have the binary that gave the original stack trace, can you run
This is the PC where the signal happened, so it should tell us if it falls in copystack or anything related. The SP is 0x7fea597f9b48, which is certainly a system stack. I assume what you pasted is the entire panic, so there aren't other goroutines? |
There are certainly other goroutines present, but I think they're not shown because the panic causes a traceback, which then causes another panic when it hits the bad stack barrier and terminates the stack printing. Unfortunately, we don't have the binary any more. |
That's too bad. Is it possible to set GOTRACEBACK=2 so we can get a more complete traceback the next time it fails? |
Our long running testing of InfluxDB, without profiling enabled, shows us that this only happens when profiling is enabled. I think we all knew this, but I wanted to confirm. |
CL https://golang.org/cl/16819 mentions this issue. |
We are reverting to Go 1.4.2, partly because of this issue. |
CL https://golang.org/cl/16985 mentions this issue. |
…ng copied sigprof tracebacks the stack across systemstack switches to make profile tracebacks more complete. However, it does this even if the user stack is currently being copied, which means it may be in an inconsistent state that will cause the traceback to panic. One specific way this can happen is during stack shrinking. Some goroutine blocks for STW, then enters gchelper, which then assists with root marking. If that root marking happens to pick the original goroutine and its stack needs to be shrunk, it will begin to copy that stack. During this copy, the stack is generally inconsistent and, in particular, the actual locations of the stack barriers and their recorded locations are temporarily out of sync. If a SIGPROF happens during this inconsistency, it will walk the stack all the way back to the blocked goroutine and panic when it fails to unwind the stack barriers. Fix this by disallowing jumping to the user stack during SIGPROF if that user stack is in the process of being copied. Fixes #12932. Change-Id: I9ef694c2c01e3653e292ce22612418dd3daff1b4 Reviewed-on: https://go-review.googlesource.com/16819 Reviewed-by: Daniel Morsing <daniel.morsing@gmail.com> Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-on: https://go-review.googlesource.com/16985 Reviewed-by: Ian Lance Taylor <iant@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>
CL https://golang.org/cl/17192 mentions this issue. |
CL https://golang.org/cl/17191 mentions this issue. |
Commit bbd1a1c prevented SIGPROF from scanning stacks that were being copied, but it didn't prevent a stack copy (specifically a stack shrink) from happening while SIGPROF is scanning the stack. As a result, a stack copy may adjust stack barriers while SIGPROF is in the middle of scanning a stack, causing SIGPROF to panic when it detects an inconsistent stack barrier. Fix this by taking the stack barrier lock while adjusting the stack. In addition to preventing SIGPROF from scanning this stack, this will block until any in-progress SIGPROF is done scanning the stack. For 1.5.2. Fixes #13362. Updates #12932. Change-Id: I422219c363054410dfa56381f7b917e04690e5dd Reviewed-on: https://go-review.googlesource.com/17191 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
CL https://golang.org/cl/17194 mentions this issue. |
Commit bbd1a1c prevented SIGPROF from scanning stacks that were being copied, but it didn't prevent a stack copy (specifically a stack shrink) from happening while SIGPROF is scanning the stack. As a result, a stack copy may adjust stack barriers while SIGPROF is in the middle of scanning a stack, causing SIGPROF to panic when it detects an inconsistent stack barrier. Fix this by taking the stack barrier lock while adjusting the stack. In addition to preventing SIGPROF from scanning this stack, this will block until any in-progress SIGPROF is done scanning the stack. For 1.5.2. Fixes #13362. Updates #12932. Change-Id: I422219c363054410dfa56381f7b917e04690e5dd Reviewed-on: https://go-review.googlesource.com/17191 Run-TryBot: Austin Clements <austin@google.com> Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-on: https://go-review.googlesource.com/17194
The addition of stack barrier locking to copystack subsumes the partial fix from commit bbd1a1c for SIGPROF during copystack. With the stack barrier locking, this commit simplifies the rule in sigprof to: the user stack can be traced only if sigprof can acquire the stack barrier lock. Updates #12932, #13362. Change-Id: I1c1f80015053d0ac7761e9e0c7437c2aba26663f Reviewed-on: https://go-review.googlesource.com/17192 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Russ Cox <rsc@golang.org>
This looks a lot like #11863
Stack trace:
The text was updated successfully, but these errors were encountered: