runtime: apparent deadlock in image/gif test on linux-ppc64-buildlet #32613
I looked through the dashboard failures: the timeouts in image/gif and runtime have happened on both ppc64 and ppc64le power8. We had not seen these timeouts on the power9 builder before today. (I see a new power9 failure from today, but it is not the same as these.) The power9 builder uses Debian 9, and I believe Brad once told me that the power8 ppc64le builder used Debian 7; I am not aware that it has been upgraded since. Can we verify that it's not the distro or kernel before spending too much time on these failures? It could be a kernel bug that has since been fixed. Brad suggested using gomote to figure out the distro, but it looks like that needs a token, which I don't have.
One more today ( @dmitshur @toothrot: could you help with either a gomote token or distro information?
I don't think the distro is at fault here anymore. Based on some comments I found, these builders are using Debian 8; my concern was that it might be Debian 7. Can we bump the timeout for ppc64 up to 10m or so, just to distinguish a deadlock from something that merely takes a long time?
Change https://golang.org/cl/197237 mentions this issue:
I was not aware the builder machines only had 2 processors each. I will see whether that helps to reproduce the problem (ours have at least 16, some many more). That means the default test parallelism should be 2, but I've seen failure logs with many more than 2 goroutine stacks running tests. I guess that should be OK, but I did not expect it.
I have not been able to reproduce this one with GOMAXPROCS=2. It would help to know what is on the stack that is unavailable. Is there any way to get that information?
@aclements, @ianlancetaylor: any tips on coaxing the runtime into providing more stacks?
You can probably get more stacks by running with
Change https://golang.org/cl/203886 mentions this issue:
Collaboration with @tiborvass at Docker who got Docker running on big-endian PPC64.

Go for ppc64 doesn't support cgo or external linking, so runc doesn't work, but a new OCI-compliant runc implementation written in C (https://github.com/containers/crun) means we can run Docker after all. See NOTES & build-*.sh

Then add a Dockerfile & associated cleanup in buildlet & stage0 to use rundockerbuildlet.

Once done, might help with golang/go#35188, golang/go#32613, etc.

Fixes golang/go#34830
Updates golang/go#21260

Change-Id: I43d7afa1d58bbdfa16e3c57670bc41f1d1932d80
Reviewed-on: https://go-review.googlesource.com/c/build/+/203886
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
I don't think we've seen this since the move to Docker with a more recent kernel.
Thanks. Closing on the theory that this was a kernel bug.
There was a timeout in the image/gif test, but from the symptoms it looks more like a runtime bug to me: one of the threads is idle on runtime.futex via runtime.mcall, and the other one says "goroutine running on other thread; stack unavailable". That combination of symptoms is similar to #32327, although the path from runtime.mcall to runtime.futex differs.

https://build.golang.org/log/e08f0037958f84cf1b1fe6b9f80c8208d332104c
CC @laboger @aclements @mknyszek @randall77