runtime: go stuck in a dead lock between threads epoll_pwait/futex(..., FUTEX_WAKE_PRIVATE, 1) on arm64 #55120
Comments
I don't know what is happening here, but the pattern you are showing is normal. You have to pass the I'm not saying that there isn't a problem, but we'll need more information to diagnose it, especially since other people on linux-arm64 are not reporting similar problems. Thanks.
@ianlancetaylor thanks for looking into this. I did as you suggested with the following example
and attaching just for clarity I picked podman as an example but this behavior can also be seen with any other Thanks
CC @golang/runtime
Hi, Thanks,
When the program is stuck, could you attach with gdb and see where all of the threads are?
Hi,
In a working environment this code takes up to 5 seconds to run. On the aforementioned NXP platform it had been running for 1 minute when I saw that it was stuck (one thread taking 100% CPU time and 4 threads waiting on futexes or in nanosleep). I attached gdb to it and got the following thread backtrace:
As you can see, threads 1 and 3 are waiting on a mutex, and thread 4 has already finished calculating the Fibonacci(45) value.
As you can see, nothing has changed. As Marcus stated previously, the behaviour is non-deterministic: playing with SIGINT and "continue" a few times in gdb unblocked the program and it finished successfully. The number of SIGINT/continue cycles varies; sometimes one is enough, sometimes it takes 4 to break and continue the process. As already mentioned, creating a new ssh connection to the board always unblocks the stuck threads and the program finishes successfully. Important: whenever the program got stuck, the thread responsible for displaying the "simplified progress notification" in the spinner routine kept printing its values properly the whole time.
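For reference, a minimal sketch of the kind of program described in this comment: a spinner goroutine prints progress while the main goroutine computes a recursive Fibonacci number. This is an assumption about its shape, not the original test code; the names fib and spinner, the 100 ms tick, and the default n of 30 are illustrative only.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// fib is a deliberately slow recursive Fibonacci, used as a CPU-bound load.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

// spinner prints one progress character per tick until done is closed.
func spinner(done <-chan struct{}, delay time.Duration) {
	ticker := time.NewTicker(delay)
	defer ticker.Stop()
	const frames = `-\|/`
	for i := 0; ; i++ {
		select {
		case <-done:
			return
		case <-ticker.C:
			fmt.Fprintf(os.Stderr, "\r%c", frames[i%len(frames)])
		}
	}
}

func main() {
	// Illustrative default; the run described above used 45.
	n := 30
	if len(os.Args) > 1 {
		if v, err := strconv.Atoi(os.Args[1]); err == nil {
			n = v
		}
	}

	done := make(chan struct{})
	go spinner(done, 100*time.Millisecond)

	result := fib(n)
	close(done) // stop the spinner once the computation has finished
	fmt.Printf("\rFibonacci(%d) = %d\n", n, result)
}
```

Running it with 45 as the first argument gives a load comparable to the one discussed here. The spinner keeps ticking for as long as the process is scheduled, which makes it easy to see whether only the computing thread is stalled.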
@ianlancetaylor @prattmic Another point: I just compiled this little Fibonacci program with Go 1.13.15, the latest release before 1.14, and observed that while it takes ~1 minute instead of the expected 5-7 seconds, it is reliably and deterministically slow, without getting stuck. This is a trace from it, so as you see no
Does it ring a bell?
1.14 added asynchronous preemption. Could you try setting the
@prattmic thanks for the hints. We have tested it on the board and the issue persists with and without asynchronous preemption. After further digging into this we believe this is not a Go issue. The board used here has 4 cores: 2 real CPUs exposing two cores each. If we run applications limited to the cores of one CPU (taskset --cpu-list 0,1 ...), the deadlock issue does not appear and the behavior becomes deterministic. Thus I think the issue lives on the kernel/hardware side of things.
The reason we thought Go could be the problem is that Go uses threads a lot and therefore was constantly running into the deadlock, while all other applications worked as expected. But none of those uses more than two threads. So we were heading in the wrong direction. Really sorry for that, but all your tips helped to debug the problem.
There is still no solution for us, but I wanted to share our findings here so that you know. If you don't mind, I'd like to keep this open for a few more days, even though I don't expect any fundamental changes with regard to the cause of the issue. Again, thanks for your help.
Thanks for following up. I'm going to close this issue, but please feel free to comment if it does seem to be a Go problem after all.
Does this issue reproduce with the latest release?
Yes, I tried Go 1.18 and Go 1.19.
What operating system and processor architecture are you using (go env)?
What did you do?
I'm working with Go and Go-compiled binaries on an NXP S32G development board and observed a deadlock of the
programs that leaves them stuck forever. The problem seems to be related to how Go applications synchronize
their threads and mutex locks. As a simple reproducer I wrote the following code:
main.go
mprint.go
The actual program code is immaterial as the issue appears when compiling with
It should actually be done in a second, but the compilation is stuck forever. The process list shows me
Checking what's going on with strace shows the following:
So it's in epoll_pwait() forever. Neither the descriptor nor the associated SIGURG signals seem
to get it out of this polling. I couldn't make sense of this, and after some time I decided to write
this report here.
Unfortunately this seems to be non-deterministic behavior, because I cannot reproduce it to this
extent on other arm boards (e.g. Raspberry) or on other archs (e.g. x86_64). However, I was able
to get into the deadlock on x86_64 if I run Go-compiled code with ltrace, e.g.
ltrace podman --version
So I think something is causing a race condition and it might also be related to the board
hardware such that I can see it on this board in any case but only occasionally elsewhere. I know this makes
it a bad issue for you but I really hope you can give me some ideas to try, patches for testing
or just abuse me as tester of ideas on the board.
I did several tests with other programming languages, e.g. C (simple mutex testing) or Rust (which also has
a thread-safe programming model), and they did not expose this sort of issue. Thus I believe it's not
generally broken board hardware but some sort of unfortunate circumstances.
Another fun fact: if the processes are stuck, they can be made unstuck if you ssh into the machine.
Sounds strange, yes, but my assumption is that this behavior could be related
to ssh, which avoids orphaned processes by using TCP out-of-band data, which also generates a SIGURG.
In Go it seems SIGURG is also used for thread preemption. It feels like it has race conditions which
are triggered all the time on my arm board but only occasionally on other hardware.
So if I ssh in, a stuck Go program can be made to continue until the next deadlock condition ;)
What did you expect to see?
I expected the Go compiler not to hang in a deadlock. I was also surprised that compiled Go programs
run into deadlocks with the same behavior as soon as threads are used.
What did you see instead?
Go itself and Go-compiled binaries are stuck in a deadlock as described.
I'm running out of ideas about what else I could do and kindly ask for help from the experts.
If desired, I can provide temporary ssh access to the arm board I'm using, which allows
reproducing the problem as described.
Thanks a ton
Cheers,
Marcus