-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: preemption in startTemplateThread may cause infinite hang #38931
Comments
@gopherbot backport please 1.13 1.14 |
Backport issue(s) opened: #38932 (for 1.13), #38933 (for 1.14). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
Doh, it turns out reproducer programs don't work well if you build them with the fix in the runtime. Without the fix, the above repro hangs in ~1/50 runs, right where expected:
|
Change https://golang.org/cl/232978 mentions this issue: |
N.B. the test in https://golang.org/cl/232978 reproduces this issue in 1.14. It does not reproduce on 1.13, but I believe the issue should still exist there, the repro timing may just be a bit off. |
Change https://golang.org/cl/234885 mentions this issue: |
Change https://golang.org/cl/234888 mentions this issue: |
…hread When a locked M wants to start a new M, it hands off to the template thread to actually call clone and start the thread. The template thread is lazily created the first time a thread is locked (or if cgo is in use). stoplockedm will release the P (_Pidle), then call handoffp to give the P to another M. In the case of a pending STW, one of two things can happen: 1. handoffp starts an M, which does acquirep followed by schedule, which will finally enter _Pgcstop. 2. handoffp immediately enters _Pgcstop. This only occurs if the P has no local work, GC work, and no spinning M is required. If handoffp starts an M, and must create a new M to do so, then newm will simply queue the M on newmHandoff for the template thread to do the clone. When a stop-the-world is required, stopTheWorldWithSema will start the stop and then wait for all Ps to enter _Pgcstop. If the template thread is not fully created because startTemplateThread gets stopped, then another stoplockedm may queue an M that will never get created, and the handoff P will never leave _Pidle. Thus stopTheWorldWithSema will wait forever. A sequence to trigger this hang when STW occurs can be visualized with two threads: T1 T2 ------------------------------- ----------------------------- LockOSThread LockOSThread haveTemplateThread == 0 startTemplateThread haveTemplateThread = 1 newm haveTemplateThread == 1 preempt -> schedule g.m.lockedExt++ gcstopm -> _Pgcstop g.m.lockedg = ... park g.lockedm = ... return ... (any code) preempt -> schedule stoplockedm releasep -> _Pidle handoffp startm (first 3 handoffp cases) newm g.m.lockedExt != 0 Add to newmHandoff, return park Note that the P in T2 is stuck sitting in _Pidle. Since the template thread isn't running, the new M will not be started complete the transition to _Pgcstop. To resolve this, we disable preemption around the assignment of haveTemplateThread and the creation of the template thread in order to guarantee that if handTemplateThread is set then the template thread will eventually exist, in the presence of stops. For #38931 Fixes #38933 Change-Id: I50535fbbe2f328f47b18e24d9030136719274191 Reviewed-on: https://go-review.googlesource.com/c/go/+/232978 Run-TryBot: Michael Pratt <mpratt@google.com> Reviewed-by: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> (cherry picked from commit 11b3730) Reviewed-on: https://go-review.googlesource.com/c/go/+/234885 Reviewed-by: Cherry Zhang <cherryyz@google.com>
…hread When a locked M wants to start a new M, it hands off to the template thread to actually call clone and start the thread. The template thread is lazily created the first time a thread is locked (or if cgo is in use). stoplockedm will release the P (_Pidle), then call handoffp to give the P to another M. In the case of a pending STW, one of two things can happen: 1. handoffp starts an M, which does acquirep followed by schedule, which will finally enter _Pgcstop. 2. handoffp immediately enters _Pgcstop. This only occurs if the P has no local work, GC work, and no spinning M is required. If handoffp starts an M, and must create a new M to do so, then newm will simply queue the M on newmHandoff for the template thread to do the clone. When a stop-the-world is required, stopTheWorldWithSema will start the stop and then wait for all Ps to enter _Pgcstop. If the template thread is not fully created because startTemplateThread gets stopped, then another stoplockedm may queue an M that will never get created, and the handoff P will never leave _Pidle. Thus stopTheWorldWithSema will wait forever. A sequence to trigger this hang when STW occurs can be visualized with two threads: T1 T2 ------------------------------- ----------------------------- LockOSThread LockOSThread haveTemplateThread == 0 startTemplateThread haveTemplateThread = 1 newm haveTemplateThread == 1 preempt -> schedule g.m.lockedExt++ gcstopm -> _Pgcstop g.m.lockedg = ... park g.lockedm = ... return ... (any code) preempt -> schedule stoplockedm releasep -> _Pidle handoffp startm (first 3 handoffp cases) newm g.m.lockedExt != 0 Add to newmHandoff, return park Note that the P in T2 is stuck sitting in _Pidle. Since the template thread isn't running, the new M will not be started complete the transition to _Pgcstop. To resolve this, we disable preemption around the assignment of haveTemplateThread and the creation of the template thread in order to guarantee that if handTemplateThread is set then the template thread will eventually exist, in the presence of stops. For #38931 Fixes #38932 Change-Id: I50535fbbe2f328f47b18e24d9030136719274191 Reviewed-on: https://go-review.googlesource.com/c/go/+/232978 Run-TryBot: Michael Pratt <mpratt@google.com> Reviewed-by: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> (cherry picked from commit 11b3730) Reviewed-on: https://go-review.googlesource.com/c/go/+/234888 Reviewed-by: Cherry Zhang <cherryyz@google.com>
This fixes the following issue in Go runtime, which could result in program hang - golang/go#38931
This fixes the following issue in Go runtime, which could result in program hang - golang/go#38931
Summary:
If a call to
startTemplateThread
is preempted by a gcstop before creating the new thread, then newmHandoff looks initialized even though it is not. When other locked Ms stop they may handoff their (now _Pidle) P to a new M. Creation of that new M will never occur because the template thread is not running, and stopTheWorldWithSema will hang forever waiting for the P to enter _Pgcstop.Details:
When a locked M wants to start a new M, it hands off to the template thread to actually call clone and start the thread. The template thread is lazily created the first time a thread is locked (or if cgo is in use).
stoplockedm will release the P (_Pidle), then call handoffp to give the P to another M. In the case of a pending STW, one of two things can happen:
handoffp starts an M, which does acquirep followed by schedule, which will finally enter _Pgcstop.
handoffp immediately enters _Pgcstop. This only occurs if the P has no local work, GC work, and no spinning M is required.
If handoffp starts an M, and must create a new M to do so, then newm simply queue the M on newmHandoff for the template thread to do the clone.
If the template thread was never fully created because startTemplateThread is stopped, then that thread will never get created, and the handoff P will never enter _Pgcstop.
To visualize, consider two threads running when a STW occurs:
Note that T2's P is sitting in _Pidle. It expects the template thread to clone the new M, which would then acquirep, schedule, and gcstopm. Since the template thread doesn't exist yet, this never happens, and stopTheWorldWithSema waits forever for the last P to enter stop.
I believe the best fix for this is to disable preemption in startTemplateThread (acquirem) so we can guarantee that the template thread will eventually exist after haveTemplateThread = 1.
I've reproduced this issue with an internal program, and the suggested acquirem fixes that case. This should be reproducible with something like the below program, but I haven't quite managed to get that to trigger the bug:
cc @aclements
The text was updated successfully, but these errors were encountered: