runtime: panic on system stack during cgo callback #12238
/cc @rsc @aclements @dvyukov |
@aclements This is crashing in this code in traceback.go, which I believe you wrote:
As you can see from the stack trace, cgo code has called into Go code, the Go code is allocating memory, the profiler has decided to jump in and get a stack trace, and getting the stack trace crashes. |
I suspect this has something to do with the funny tricks cgocallback_gofunc plays with the frame. I'm trying to put together a reproducer, but probably won't have something until tomorrow.

@noxiouz, do you happen to know if the call from C to go_stat_callback happens on a thread that was created by C or by Go? (I suspect it doesn't matter, but it might help me narrow down the reproducer.)

In the meantime, there are a few things you can do to work around this. The simplest is probably to set GODEBUG=gcstackbarrieroff=1. You can also disable memory profiling.
BTW, anything you can tell us about the caller of go_stat_callback would be useful. Based on the name, I assume this was a function pointer passed in to C? I tried writing the obvious reproducer where I set stack barriers to be installed at every frame, and made a call from Go to C and back to Go, then invoked runtime.Callers() while stack barriers were installed, but that wasn't enough to trigger this problem. |
@aclements yes, this callback is called from C thread. |
@aclements BTW, GODEBUG=gcstackbarrieroff=1 makes the program work without panic. |
Thanks.
It most likely is related to the size of the frame, but it may just be bad luck (stack barriers are inserted at exponentially-spaced points in the stack, so it's hard to predict where they'll fall, but they tend to fall in the same place). What do you mean by "stat callback receives megabytes of data"? It's all on the heap, not on the stack, presumably? Though, a large GoBytes allocation could be triggering the garbage collector, which could be part of why you're repeatably seeing the failure at this point.
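To make "exponentially-spaced" concrete, the sketch below lists where barriers would land if the first one sits a fixed distance into the stack and each later one sits at double the previous offset. The starting offset of 1024 bytes, the doubling factor, and the function name `barrierOffsets` are assumptions chosen for illustration, not values taken from this thread.

```go
package main

import "fmt"

// barrierOffsets returns illustrative byte offsets into a stack of the
// given size, starting at first and doubling each time. Both parameters
// are assumptions for this sketch, not runtime constants quoted above.
func barrierOffsets(first, stackSize uintptr) []uintptr {
	if first == 0 {
		return nil // doubling zero would never advance
	}
	var offs []uintptr
	for off := first; off < stackSize; off *= 2 {
		offs = append(offs, off)
	}
	return offs
}

func main() {
	// With a 64 KiB stack, only a handful of frames end up with a barrier.
	fmt.Println(barrierOffsets(1024, 64<<10))
}
```

Because the spacing depends only on stack depth, a program whose call chain is the same on every run tends to get barriers in the same frames each time, which fits the repeatable failure described here.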
No, thanks, though if this is happening in an open source application or tests, it would be great if you could paste commands to reproduce it. |
I believe I'm able to produce a similar crash involving C->Go callbacks and stack barriers:
This crash happens repeatably within 10s of starting this benchmark, on amd64. The crash does not happen in Go 1.4.2, or in Go 1.5.0 with GODEBUG=gcstackbarrieroff=1. To reproduce, Note that the cgo code in my fork of go-sqlite3 is my first attempt at doing C->Go callbacks, so it's possible that I just messed that up and am corrupting memory. However, the fact that it doesn't crash in 1.4 and that this bug talks about bugs triggered by C->Go->C transitions makes me suspicious, as this benchmark is doing hundreds of thousands of those transitions. |
Hi Dave! I tried reproducing the failure with your gipam benchmark, but I get an immediate segfault.
It looks like it's making a cgo call to a NULL function pointer, but I can't get a real backtrace out of either Go or GDB. |
@danderson, any ideas why I can't run your benchmark? If not, I can work on debugging it, but I'd rather avoid debugging something in order to debug something. :) |
Sorry for not answering, @aclements. My case is not easy to reproduce: it's part of a storage system, and installing it takes a lot of work. I'll try to find a way to reproduce it anyway.
@aclements I've seen that segfault as well now. I'm not sure what's causing it. AFAICT, it is necessary to have a Go->C->Go->C transition for it to happen, but that's all I know right now. I'm trying to narrow it down to a small repro case that doesn't involve the entire sqlite library, I'll update if I manage to get one. Unfortunately this isn't my day job, so it may be slow going :( |
To be clear: in my current codebase, the crash only happens with the new code I added to go-sqlite, which introduces C->Go calls to go-sqlite. Without a C->Go transition, I cannot trigger the crash. I've walked through the call chain on a reduced test case that still involves go-sqlite, and afaict the code is well-formed and never calls any NULL C functions. I'll get back to you with the smallest repro case I can produce - hopefully one that has no sqlite in it at all, but if not, then I can at least remove the intermediate layers in my benchmark to narrow things down. |
@noxiouz, @danderson, I have a fix for what I believe is causing your crashes. Can you try applying https://go-review.googlesource.com/13944 and let me know if it fixes the crashes? |
CL https://golang.org/cl/13944 mentions this issue. |
CL https://golang.org/cl/13947 mentions this issue. |
CL https://golang.org/cl/13948 mentions this issue. |
Checking now, I'll report back after some stress-testing. |
LGTM++, I'm unable to cause a crash in any of the cases that trivially blow up with an unpatched 1.5. The change looks great. |
Hmm, I have a situation which is probably a Go -> C -> Go -> C case: all or almost all calls to panic() from inside the Go callback function seemingly stop execution, without printing stack traces or exiting. I'll start by trying the patch. Edit: This seems to have solved my problem. I added panic calls in 40 random places in my code where it did not work before, plus a simple random trigger. My app is currently being executed in a bash loop and has successfully panicked several hundred or thousand times by now.
I'm checking now.
@aclements my case has been working for more than 11 hours already. Before the patch it broke down every several minutes. Seems it works. |
@danderson, @noxiouz, thanks for testing! |
For posterity, the program I used to reproduce this is below. The failure mode is somewhat different from the other reports in this issue, and it's timing-dependent, though it tries hard not to be too sensitive. It should be run with GODEBUG=gcstackbarrierall=1.
Currently enabling the debugging mode where stack barriers are installed at every frame requires recompiling the runtime. However, this is potentially useful for field debugging and for runtime tests, so make this mode a GODEBUG. Updates #12238. Change-Id: I6fb128f598b19568ae723a612e099c0ed96917f5 Reviewed-on: https://go-review.googlesource.com/13947 Reviewed-by: Russ Cox <rsc@golang.org>
Currently the stack barrier stub blindly unwinds the next stack barrier from the G's stack barrier array without checking that it's the right stack barrier. If through some bug the stack barrier array position gets out of sync with where we actually are on the stack, this could return to the wrong PC, which would lead to difficult to debug crashes. To address this, this commit adds a check to the amd64 stack barrier stub that it's unwinding the correct stack barrier. Updates #12238. Change-Id: If824d95191d07e2512dc5dba0d9978cfd9f54e02 Reviewed-on: https://go-review.googlesource.com/13948 Reviewed-by: Russ Cox <rsc@golang.org>
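The check described in this commit message can be modeled in plain Go. This is a toy sketch, not the amd64 assembly from the CL: the types `stkbar` and `toyG`, the method `unwindBarrier`, and all addresses are hypothetical stand-ins. The point it shows is that the next entry in the barrier array is consumed only if it records the same stack slot the return actually went through.

```go
package main

import (
	"errors"
	"fmt"
)

// stkbar is a toy record of one installed barrier: the stack slot whose
// return PC was overwritten, and the original return PC it held.
type stkbar struct {
	savedLRPtr uintptr
	savedLRVal uintptr
}

// toyG is a hypothetical stand-in for the per-goroutine barrier state.
type toyG struct {
	stkbar    []stkbar
	stkbarPos int
}

var errOutOfSync = errors.New("stack barrier array out of sync with stack")

// unwindBarrier models the stub: slot is the address of the return PC
// slot that was just returned through. The next array entry is used
// only if it refers to that same slot; otherwise the model fails
// loudly instead of returning to a wrong PC.
func (g *toyG) unwindBarrier(slot uintptr) (uintptr, error) {
	if g.stkbarPos >= len(g.stkbar) {
		return 0, errOutOfSync
	}
	b := g.stkbar[g.stkbarPos]
	if b.savedLRPtr != slot {
		return 0, errOutOfSync
	}
	g.stkbarPos++
	return b.savedLRVal, nil
}

func main() {
	g := &toyG{stkbar: []stkbar{{0x7000, 0x401000}, {0x7800, 0x402000}}}

	pc, err := g.unwindBarrier(0x7000) // matches the first entry
	fmt.Printf("pc=%#x err=%v\n", pc, err)

	pc, err = g.unwindBarrier(0x9000) // does not match the second entry
	fmt.Printf("pc=%#x err=%v\n", pc, err)
}
```

Failing at the point of mismatch turns a later, hard-to-trace crash at a wrong PC into an immediate and identifiable error.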
CL https://golang.org/cl/14240 mentions this issue. |
CL https://golang.org/cl/14241 mentions this issue. |
CL https://golang.org/cl/14229 mentions this issue. |
…ry frame Currently enabling the debugging mode where stack barriers are installed at every frame requires recompiling the runtime. However, this is potentially useful for field debugging and for runtime tests, so make this mode a GODEBUG. Updates #12238. Change-Id: I6fb128f598b19568ae723a612e099c0ed96917f5 Reviewed-on: https://go-review.googlesource.com/13947 Reviewed-by: Russ Cox <rsc@golang.org> Reviewed-on: https://go-review.googlesource.com/14240 Reviewed-by: Austin Clements <austin@google.com>
…allback_gofunc's frame

Currently the runtime can install stack barriers in any frame. However, the frame of cgocallback_gofunc is special: it's the one function that switches from a regular G stack to the system stack on return. Hence, the return PC slot in its frame on the G stack is actually used to save getg().sched.pc (so tracebacks appear to unwind to the last Go function running on that G), and not as an actual return PC for cgocallback_gofunc.

Because of this, if we install a stack barrier in cgocallback_gofunc's return PC slot, when cgocallback_gofunc does return, it will move the stack barrier stub PC in to getg().sched.pc and switch back to the system stack. The rest of the runtime doesn't know how to deal with a stack barrier stub in sched.pc: nothing knows how to match it up with the G's stack barrier array and, when the runtime removes stack barriers, it doesn't know to undo the one in sched.pc. Hence, if the C code later returns back in to Go code, it will attempt to return through the stack barrier saved in sched.pc, which may no longer have correct unwinding information.

Fix this by blacklisting cgocallback_gofunc's frame so the runtime won't install a stack barrier in its return PC slot.

Fixes #12238.

Change-Id: I46aa2155df2fd050dd50de3434b62987dc4947b8
Reviewed-on: https://go-review.googlesource.com/13944
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-on: https://go-review.googlesource.com/14229
Reviewed-by: Austin Clements <austin@google.com>
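The shape of this fix, skipping one special frame when installing barriers, can be sketched with a toy model. Nothing here is the runtime's code: the `frame` type, `stubPC`, `installBarriers`, the frame list, and the addresses are all hypothetical. The sketch installs a barrier in every frame (as the debugging mode described earlier would) except frames owned by a function on the skip list.

```go
package main

import "fmt"

// frame is a toy stack frame: the function that owns it and the value
// currently held in its return PC slot.
type frame struct {
	fn    string
	retPC uintptr
}

// stubPC stands in for the address of the stack barrier stub
// (a made-up value for this sketch).
const stubPC uintptr = 0xdead

// installBarriers overwrites the return PC slot of every frame except
// those owned by a function in skip, and returns the indexes of the
// frames it modified.
func installBarriers(frames []frame, skip map[string]bool) []int {
	var installed []int
	for i := range frames {
		if skip[frames[i].fn] {
			continue // this frame's return PC slot must be left alone
		}
		frames[i].retPC = stubPC
		installed = append(installed, i)
	}
	return installed
}

func main() {
	// Hypothetical frames and addresses.
	frames := []frame{
		{"main.goCallback", 0x1000},
		{"runtime.cgocallback_gofunc", 0x2000},
		{"main.main", 0x3000},
	}
	skip := map[string]bool{"runtime.cgocallback_gofunc": true}

	fmt.Println("barriers installed at frames:", installBarriers(frames, skip))
	fmt.Printf("cgocallback_gofunc slot still holds %#x\n", frames[1].retPC)
}
```

In this model the skipped slot keeps its original contents, which mirrors the intent of the commit: the value that ends up in sched.pc is never a stub address.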
… sync Currently the stack barrier stub blindly unwinds the next stack barrier from the G's stack barrier array without checking that it's the right stack barrier. If through some bug the stack barrier array position gets out of sync with where we actually are on the stack, this could return to the wrong PC, which would lead to difficult to debug crashes. To address this, this commit adds a check to the amd64 stack barrier stub that it's unwinding the correct stack barrier. Updates #12238. Change-Id: If824d95191d07e2512dc5dba0d9978cfd9f54e02 Reviewed-on: https://go-review.googlesource.com/13948 Reviewed-by: Russ Cox <rsc@golang.org> Reviewed-on: https://go-review.googlesource.com/14241 Reviewed-by: Austin Clements <austin@google.com>
A similar issue : #12582 |
I upgraded from 1.4 to 1.5 to have #11907 fixed. And every several minutes I get "panic during panic". That's all that I can see. It always occurs on cgo callback.
Please, let me know which information could be useful, I'll provide it.