runtime: infinite loop in lockextra on linux/arm #34391
Comments
We have been seeing some strange failures on the linux/ppc64 builder. The problems are not consistently reproducible, but I was able to find that the test TestStress in os/signal hangs consistently on my ppc64 system. It does not always hang when running all.bash, but it does consistently hang if I run that test by itself. I am still trying to see if I can make the same thing happen on ppc64le. This started happening with commit 904f046, which was the fix for the issue mentioned above, #34030. I'm guessing it is the same problem identified in this issue, but if not I can open a new issue. There are many failures on the ppc64 build dashboard since this commit, mostly appearing as 20m timeouts. In one case I was able to reproduce it just like the builder log and found that it was hung during a compile. If you look through the ppc64 failures, the hang is not always on the same test but is usually a 20m timeout.
/cc @aclements @rsc @randall77 who own runtime. I don't see why this should livelock, so this could be a bug.
The commit mentioned above is incorrect for ppc64/ppc64le. R30 is used for g in Go and will never be clobbered by the VDSO code, since R30 is considered nonvolatile in the non-Go world. Removing ppc64 and ppc64le from the case in sigFetchG resolves the issues I was seeing.
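For context, here is a toy, self-contained model of the guard being discussed (hypothetical names and shapes, not the runtime source): on arm/arm64 the g register may be clobbered while VDSO code runs, so the fetch reports no g when the signal PC falls in the VDSO page, while ppc64/ppc64le keep g in the nonvolatile R30 and can always trust it.

```go
package main

import "fmt"

// Toy model of the arch guard discussed above (simplified, hypothetical
// shapes; not the runtime source). The point: the "g may be clobbered by
// the VDSO" path applies to arm/arm64, while ppc64/ppc64le keep g in the
// nonvolatile R30 and can always trust it.
type gType struct{ id int }

var currentG = &gType{id: 1}

// inVDSOPage is a stand-in check for "did the signal land in VDSO code?".
func inVDSOPage(pc uintptr) bool { return pc >= 0x7fff0000 }

func sigFetchG(goarch string, sigpc uintptr) *gType {
	switch goarch {
	case "arm", "arm64":
		// The g register may hold garbage if the signal arrived while
		// VDSO code was running, so report "no g".
		if inVDSOPage(sigpc) {
			return nil
		}
	}
	// ppc64/ppc64le fall through: R30 is nonvolatile in the ELF ABI,
	// so the g register is still valid even inside the VDSO.
	return currentG
}

func main() {
	fmt.Println(sigFetchG("arm64", 0x7fff1234)) // <nil>: don't trust g
	fmt.Println(sigFetchG("ppc64", 0x7fff1234)) // &{1}: g is still valid
}
```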
Change https://golang.org/cl/196658 mentions this issue:
/cc @ianlancetaylor who worked with me to make the commit mentioned above.
When TestStress in os/signal hangs on ppc64, I can attach with gdb and see this stack:
It didn't go down this path before because I don't think 'g' could be nil. So it looks like this is uncovering another problem: when badsignal is called, it doesn't return, which causes the hang.
Thanks @nyuichi and @laboger for the backtraces. I think the problem is: when sigFetchG returns nil, we call badsignal, which calls needm, which wants an "extram", but for a non-cgo program we never created one, so it never succeeds. Maybe we have to create an extra M on these platforms for non-cgo programs?
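To make the resulting hang concrete, here is a runnable toy model of the spin (my own simplification with hypothetical names such as extraM and lockedSentinel; it is not the runtime's lockextra): needm waits for an extra M that a non-cgo program never publishes, so the loop never makes progress.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const lockedSentinel = 1

// extraM models the slot that cgo fills with extra Ms. In a pure Go
// (non-cgo) program it stays zero forever.
var extraM uintptr

// lockextra models the spin: it loops until the slot holds something
// other than 0 or the lock sentinel.
func lockextra() uintptr {
	for {
		old := atomic.LoadUintptr(&extraM)
		if old == lockedSentinel {
			time.Sleep(time.Microsecond) // yield, as the real code does
			continue
		}
		if old == 0 {
			// Nothing to grab and nobody will ever publish one:
			// this is the branch that spins forever in a non-cgo program.
			time.Sleep(time.Microsecond)
			continue
		}
		if atomic.CompareAndSwapUintptr(&extraM, old, lockedSentinel) {
			return old
		}
	}
}

func main() {
	done := make(chan uintptr, 1)
	go func() { done <- lockextra() }()
	select {
	case m := <-done:
		fmt.Println("acquired extra M:", m)
	case <-time.After(100 * time.Millisecond):
		fmt.Println("still spinning after 100ms: no extra M was ever registered")
	}
}
```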
I think the current … Perhaps the code in …
I just hit a compile-time hang when doing a Go build on one of our POWER8 ppc64le machines. Here is what the trace looks like:
This must be the reason for all the 20-minute hangs we have been seeing since the original sigFetchG change went in. I think sig 17 is SIGCHLD, which seems like a signal that should be ignored.
This fixes a regression introduced with CL 192937. That change was intended to fix a problem on arm and arm64, but it also changed the behavior on ppc64 and ppc64le even though the error never occurred there. The change to function sigFetchG assumes that the register holding 'g' could be clobbered by VDSO code, when in fact 'g' is in R30, which is nonvolatile in the 64-bit PowerPC ELF ABI and therefore is not clobbered by VDSO code. If that path is somehow taken, it is incorrect, falling through to a call to badsignal, which doesn't seem right.

This regression caused intermittent hangs on the builder dashboard for ppc64 and can be reproduced consistently by running os/signal TestStress on some ppc64 systems. I mentioned this problem in issue #34391 because I thought it was related to another problem described there.

Change-Id: I2ee3606de302bafe509d300077ce3b44b88571a1
Reviewed-on: https://go-review.googlesource.com/c/go/+/196658
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
I started looking into this problem. So @ianlancetaylor, do you mean we should have a thread-local bitmask recording signals received during a VDSO call? I was not able to figure out how to create thread-local storage in a Go program not linked against libc. It seems …
We don't need a thread-local bitmask. Signals are delivered to the process as a whole. Go doesn't support delivering signals to a specific thread. An atomically accessed process-wide bitmask would suffice.
Makes sense.
Can I also assume that Go doesn't support real-time signals? For them, POSIX guarantees the order and number of deliveries.
Real-time signals are an interesting case, but I think we don't need to worry about them. In a pure Go program they can only be seen by code that calls …

I think we really only need the suggested atomic bitmask for signals that we aren't going to forward and that we aren't going to ignore. In other words, signals for which we are going to crash. So I don't think we need to be super careful about how we handle them.
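As a rough user-space sketch of this process-wide atomic bitmask idea (hypothetical names pendingSignals, recordSignal, and flushSignals; the real runtime code is different and far more constrained in what it may call from the signal path), the recording side performs only an atomic update, and the flushing side swaps the mask to zero and re-raises each recorded signal to the current process:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"syscall"
)

// pendingSignals is a process-wide bitmask of signals seen while normal
// handling is unavailable (hypothetical name; not a runtime field).
var pendingSignals uint64

// recordSignal marks sig as pending using only an atomic update, the kind
// of operation that stays safe in a very restricted context.
func recordSignal(sig syscall.Signal) {
	if sig < 1 || sig > 63 {
		return
	}
	for {
		old := atomic.LoadUint64(&pendingSignals)
		if atomic.CompareAndSwapUint64(&pendingSignals, old, old|1<<uint(sig)) {
			return
		}
	}
}

// flushSignals clears the mask and re-raises every recorded signal to the
// current process so that the normal handler sees it.
func flushSignals() {
	mask := atomic.SwapUint64(&pendingSignals, 0)
	for sig := syscall.Signal(1); sig < 64; sig++ {
		if mask&(1<<uint(sig)) != 0 {
			fmt.Println("re-raising signal", sig)
			syscall.Kill(syscall.Getpid(), sig)
		}
	}
}

func main() {
	// SIGCHLD and SIGWINCH are ignored by default, so re-raising them
	// will not kill this demo.
	recordSignal(syscall.SIGCHLD)
	recordSignal(syscall.SIGWINCH)
	flushSignals()
}
```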
Sorry for the late reply. I don't quite understand some points:
This commit fixes issue golang#34391, which is due to an incorrect patch merged in golang#34030. sigtrampgo is modified to record incoming signals in a globally shared atomic bitmask while the G register is clobbered. When execution exits from the VDSO, it checks whether there is a pending signal and, if so, re-raises it to its own process.
Sorry, I'm not sure how to map what you call the first, second, and third cases into what I wrote.
@randall77 I see you reopened #22047. We have seen similar hangs recently, and I suspect they could be related to the issue with signals described here. I was unable to add any comments to issue #22047 because it was locked. I have been able to reproduce some hangs on my systems by using GOMAXPROCS=2, as is done in builder testing, and in those cases I see the stack that I provided above in the signal handling code. (Maybe there should be a different issue for the ppc64{le} hangs? I don't like to put comments in multiple issues, but in this case the headline is misleading.)
I unlocked #22047.
This commit fixes issue golang#34391, which is due to an incorrect patch merged in golang#34030. sigtrampgo is modified to record incoming signals in a globally shared atomic bitmask when the G register is clobbered. When execution exits from the VDSO, it checks whether there is a pending signal and, if so, re-raises it to its own process.
Change https://golang.org/cl/201899 mentions this issue:
This commit fixes issue golang#34391, which is due to an incorrect patch merged in CL 192937. sigtrampgo is modified to record incoming signals in a globally shared atomic bitmask when the G register is clobbered. When execution exits from the VDSO, it checks whether there are pending signals and, if so, re-raises them to its own process.
Change https://golang.org/cl/202759 mentions this issue:
Change https://golang.org/cl/206959 mentions this issue:
When we receive a signal, if G is nil we call badsignal, which calls needm. When cgo is not used, there is no extra M, so needm will just hang. In this situation, even GOTRACEBACK=crash cannot get a stack trace, as we're in the signal handler and cannot receive another signal (SIGQUIT). Instead, just crash.

For #35554.
Updates #34391.

Change-Id: I061ac43fc0ac480435c050083096d126b149d21f
Reviewed-on: https://go-review.googlesource.com/c/go/+/206959
Run-TryBot: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
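A toy illustration of the fail-fast behavior this change describes (hypothetical names, not the runtime source): when no extra M exists, report the problem and exit instead of spinning in needm/lockextra.

```go
package main

import (
	"fmt"
	"os"
)

// hasExtraM models whether cgo has registered an extra M for callbacks
// from non-Go threads. Hypothetical toy, not the runtime source.
var hasExtraM = false

// handleBadSignal models the fail-fast path described above: with no
// extra M available, print a fatal message and exit instead of spinning
// forever waiting for an M that will never appear.
func handleBadSignal(sig int) {
	if !hasExtraM {
		fmt.Fprintf(os.Stderr, "fatal: signal %d received with no g and no extra M\n", sig)
		os.Exit(2)
	}
	// Otherwise: acquire the extra M and deliver the signal (omitted).
}

func main() {
	handleBadSignal(17) // 17 is SIGCHLD on linux
}
```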
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

I only reproduced this with HEAD due to another issue, #34030.

What operating system and processor architecture are you using (go env)?

What did you do?
My environment is a Raspberry Pi 3 running Raspbian Buster.
What did you expect to see?
done is printed.
What did you see instead?
The execution got stuck in an infinite loop (maybe a livelock).
It appears the issue is caused in the same situation as #34030, namely when a signal arrives while VDSO code is running.
The patch I attached in #34030 verified that incoming SIGPROF signals no longer cause a segfault, but I hadn't fully tested other signals.
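The reporter's program is collapsed in the issue; purely as an illustration (assumed, not the original code), a stress program in the spirit of os/signal's TestStress exercises the same scenario by hammering time.Now, which goes through the VDSO on linux, while flooding the process with SIGUSR1, and prints "done" on completion:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// Receive SIGUSR1 via the runtime's signal handler.
	sigs := make(chan os.Signal, 128)
	signal.Notify(sigs, syscall.SIGUSR1)

	stop := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(2)

	// Goroutine 1: VDSO pressure via time.Now.
	go func() {
		defer wg.Done()
		for {
			select {
			case <-stop:
				return
			default:
				_ = time.Now()
			}
		}
	}()

	// Goroutine 2: signal pressure.
	go func() {
		defer wg.Done()
		for {
			select {
			case <-stop:
				return
			default:
				syscall.Kill(syscall.Getpid(), syscall.SIGUSR1)
			}
		}
	}()

	time.Sleep(2 * time.Second)
	close(stop)
	wg.Wait()
	signal.Stop(sigs)
	fmt.Println("done")
}
```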