Skip to content

Commit 1f4394a

Browse files
rscgopherbot
authored andcommitted
runtime: work around Apple libc bugs to make exec stop hanging
For a while now, we've had intermittent reports about problems with os/exec on macOS, but no clear way to reproduce them. Recent changes in the os/exec package test seem to have aligned the stars just right, at least on my two x86 and ARM MacBook Pro laptops, to make the package test hang with roughly 50% probability. When it does hang, the stacks I see in the hung process match the ones reported for the Go-based hangs in #33565. (They do not match the ones reported in the so-called C reproducer in that issue, but I think that reproducer is actually reproducing a different race, between fork and exit.) The stacks obtained from the hung child processes are in libSystem_atfork_child, which is supposed to reinitialize various parts of the C library in the new process. One common stack dies in _notify_fork_child calling _notify_globals (inlined) calling _os_alloc_once, because _os_alloc_once detects that the once lock is held by the parent process and then calls _os_once_gate_corruption_abort. The allocation is setting up the globals for the notification subsystem. See the source code at [1]. To work around this, we can allocate the globals earlier in the Go program's lifetime, before any execs are involved, by calling any notify routine that is exported, calls _notify_globals, and doesn't do anything too expensive otherwise. notify_is_valid_token(0) fits the bill. The other common stack dies in xpc_atfork_child calling _objc_msgSend_uncached which ends up in WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize. Of course, whatever thread the child is waiting for is in the parent process and is not going to finish anything in the child process. There is no public source code for these routines, so it is unclear exactly what the problem is. However, xpc_atfork_child turns out to be exported (for use by libSystem_atfork_child, which is in a different library, so xpc_atfork_child is unlikely to be unexported any time soon). It also stands to reason that since xpc_atfork_child is called at the start of any forked child process, it can't be too harmful to call at the start of an ordinary Go process. And whatever caches it needs for a non-deadlocking fast path during exec empirically do get initialized by calling it at startup. This CL introduces a function osinit_hack, called at osinit time, which calls notify_is_valid_token(0) and xpc_atfork_child(). Doing so makes the os/exec test pass reliably on both my laptops - I can run it successfully hundreds of times in a row when my previous record was twice in a row. Fixes #33565. Fixes #56784. [1] https://opensource.apple.com/source/Libnotify/Libnotify-241/notify_client.c.auto.html Change-Id: I16a14a800893c40244678203532a3e8d6214b6bd Reviewed-on: https://go-review.googlesource.com/c/go/+/451735 Run-TryBot: Russ Cox <rsc@golang.org> Auto-Submit: Russ Cox <rsc@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
1 parent 6e0e492 commit 1f4394a

File tree

4 files changed

+65
-0
lines changed

4 files changed

+65
-0
lines changed

Diff for: src/runtime/os_darwin.go

+2
Original file line numberDiff line numberDiff line change
@@ -136,6 +136,8 @@ func osinit() {
136136

137137
ncpu = getncpu()
138138
physPageSize = getPageSize()
139+
140+
osinit_hack()
139141
}
140142

141143
func sysctlbynameInt32(name []byte) (int32, int32) {

Diff for: src/runtime/sys_darwin.go

+48
Original file line numberDiff line numberDiff line change
@@ -179,6 +179,51 @@ func pthread_kill(t pthread, sig uint32) {
179179
}
180180
func pthread_kill_trampoline()
181181

182+
// osinit_hack is a clumsy hack to work around Apple libc bugs
183+
// causing fork+exec to hang in the child process intermittently.
184+
// See go.dev/issue/33565 and go.dev/issue/56784 for a few reports.
185+
//
186+
// The stacks obtained from the hung child processes are in
187+
// libSystem_atfork_child, which is supposed to reinitialize various
188+
// parts of the C library in the new process.
189+
//
190+
// One common stack dies in _notify_fork_child calling _notify_globals
191+
// (inlined) calling _os_alloc_once, because _os_alloc_once detects that
192+
// the once lock is held by the parent process and then calls
193+
// _os_once_gate_corruption_abort. The allocation is setting up the
194+
// globals for the notification subsystem. See the source code at [1].
195+
// To work around this, we can allocate the globals earlier in the Go
196+
// program's lifetime, before any execs are involved, by calling any
197+
// notify routine that is exported, calls _notify_globals, and doesn't do
198+
// anything too expensive otherwise. notify_is_valid_token(0) fits the bill.
199+
//
200+
// The other common stack dies in xpc_atfork_child calling
201+
// _objc_msgSend_uncached which ends up in
202+
// WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize. Of course,
203+
// whatever thread the child is waiting for is in the parent process and
204+
// is not going to finish anything in the child process. There is no
205+
// public source code for these routines, so it is unclear exactly what
206+
// the problem is. However, xpc_atfork_child turns out to be exported
207+
// (for use by libSystem_atfork_child, which is in a different library,
208+
// so xpc_atfork_child is unlikely to be unexported any time soon).
209+
// It also stands to reason that since xpc_atfork_child is called at the
210+
// start of any forked child process, it can't be too harmful to call at
211+
// the start of an ordinary Go process. And whatever caches it needs for
212+
// a non-deadlocking fast path during exec empirically do get initialized
213+
// by calling it at startup.
214+
//
215+
// So osinit_hack_trampoline (in sys_darwin_$GOARCH.s) calls
216+
// notify_is_valid_token(0) and xpc_atfork_child(), which makes the
217+
// fork+exec hangs stop happening. If Apple fixes the libc bug in
218+
// some future version of macOS, then we can remove this awful code.
219+
//
220+
//go:nosplit
221+
func osinit_hack() {
222+
libcCall(unsafe.Pointer(abi.FuncPCABI0(osinit_hack_trampoline)), nil)
223+
return
224+
}
225+
func osinit_hack_trampoline()
226+
182227
// mmap is used to do low-level memory allocation via mmap. Don't allow stack
183228
// splits, since this function (used by sysAlloc) is called in a lot of low-level
184229
// parts of the runtime and callers often assume it won't acquire any locks.
@@ -548,3 +593,6 @@ func setNonblock(fd int32) {
548593
//go:cgo_import_dynamic libc_pthread_cond_wait pthread_cond_wait "/usr/lib/libSystem.B.dylib"
549594
//go:cgo_import_dynamic libc_pthread_cond_timedwait_relative_np pthread_cond_timedwait_relative_np "/usr/lib/libSystem.B.dylib"
550595
//go:cgo_import_dynamic libc_pthread_cond_signal pthread_cond_signal "/usr/lib/libSystem.B.dylib"
596+
597+
//go:cgo_import_dynamic libc_notify_is_valid_token notify_is_valid_token "/usr/lib/libSystem.B.dylib"
598+
//go:cgo_import_dynamic libc_xpc_atfork_child xpc_atfork_child "/usr/lib/libSystem.B.dylib"

Diff for: src/runtime/sys_darwin_amd64.s

+9
Original file line numberDiff line numberDiff line change
@@ -597,6 +597,15 @@ TEXT runtime·pthread_kill_trampoline(SB),NOSPLIT,$0
597597
POPQ BP
598598
RET
599599

600+
TEXT runtime·osinit_hack_trampoline(SB),NOSPLIT,$0
601+
PUSHQ BP
602+
MOVQ SP, BP
603+
MOVQ $0, DI // arg 1 val
604+
CALL libc_notify_is_valid_token(SB)
605+
CALL libc_xpc_atfork_child(SB)
606+
POPQ BP
607+
RET
608+
600609
// syscall calls a function in libc on behalf of the syscall package.
601610
// syscall takes a pointer to a struct like:
602611
// struct {

Diff for: src/runtime/sys_darwin_arm64.s

+6
Original file line numberDiff line numberDiff line change
@@ -458,6 +458,12 @@ TEXT runtime·pthread_setspecific_trampoline(SB),NOSPLIT,$0
458458
BL libc_pthread_setspecific(SB)
459459
RET
460460

461+
TEXT runtime·osinit_hack_trampoline(SB),NOSPLIT,$0
462+
MOVD $0, R0 // arg 1 val
463+
BL libc_notify_is_valid_token(SB)
464+
BL libc_xpc_atfork_child(SB)
465+
RET
466+
461467
// syscall calls a function in libc on behalf of the syscall package.
462468
// syscall takes a pointer to a struct like:
463469
// struct {

0 commit comments

Comments
 (0)