Foreign threads: GC runs on cancelled thread, causes segfault #47590

Closed
maleadt opened this issue Nov 16, 2022 · 0 comments · Fixed by #48223
maleadt commented Nov 16, 2022

I'm experimenting with the new foreign thread support, and encountered a case where GC seems to run on a pthread after cancellation. I realize that cancelling threads is Tricky Business, but I hope we can make our scheduler resilient to it (or improve my code to safely do so). Even if actively cancelling threads is rare, threads exiting after their work is done is much more common, and both are pretty much related AFAIK.

Anyway, a MWE:

const pthread_t = Culong

if Sys.isapple()
    const PTHREAD_CANCEL_ENABLE = 1
    const PTHREAD_CANCEL_DISABLE = 0
elseif Sys.islinux()
    const PTHREAD_CANCEL_ENABLE = 0
    const PTHREAD_CANCEL_DISABLE = 1
end

function pthread_setcancelstate(enable::Bool)
    status = ccall(:pthread_setcancelstate, Cint, (Cint, Ptr{Cint}),
                   enable ? PTHREAD_CANCEL_ENABLE : PTHREAD_CANCEL_DISABLE, C_NULL)
    @assert status == 0
    return
end

# PTHREAD_CANCEL_ENABLE and PTHREAD_CANCEL_DISABLE are already defined above
if Sys.isapple()
    const PTHREAD_CANCEL_DEFERRED = 2
    const PTHREAD_CANCEL_ASYNCHRONOUS = 0
elseif Sys.islinux()
    const PTHREAD_CANCEL_DEFERRED = 0
    const PTHREAD_CANCEL_ASYNCHRONOUS = 1
end

function pthread_setcanceltype(typ)
    status = ccall(:pthread_setcanceltype, Cint, (Cint, Ptr{Cint}),
                   typ, C_NULL)
    @assert status == 0
    return
end

pthread_testcancel() = ccall(:pthread_testcancel, Cvoid, ())

function pthread_worker(data::Ptr{Nothing})
    # only cancel at safe points
    pthread_setcancelstate(false)
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED)

    # this print (or something like it) is required in order to have GC run on this thread
    println("waiting for cancellation on thread $(Threads.threadid())...")

    while true
        pthread_setcancelstate(true)
        #GC.enable(false)
        pthread_testcancel()    # <-- this is cancellation point where we'll die
        #GC.enable(true)
        pthread_setcancelstate(false)
    end
    
    return
end
pthread_worker_cb = @cfunction(pthread_worker, Cvoid, (Ptr{Cvoid},))

# XXX: executing this code at top level results in a failed assertion
function main()
    # create a thread
    thread = Ref{pthread_t}()
    status = ccall(:pthread_create, Cint,
                    (Ptr{pthread_t}, Ptr{Nothing}, Ptr{Nothing}, Ptr{Nothing}),
                    thread, C_NULL, pthread_worker_cb, C_NULL)
    @assert status == 0
    thread = thread[]

    # wait for a bit so that the thread has disabled cancellation
    sleep(0.1)

    # submit a cancellation request
    status = ccall(:pthread_cancel, Cint, (pthread_t,), thread)
    @assert status == 0

    state = ccall(:jl_gc_safe_enter, Int8, ())
    status = ccall(:pthread_join, Cint,
                    (pthread_t, Ptr{Ptr{Nothing}}),
                    thread, C_NULL)
    state = ccall(:jl_gc_safe_leave, Cvoid, (Int8,), state)
    @assert status == 0

    GC.gc(true)
end
main()

It's a bit of code, so summarizing the steps:

  • create a thread that runs a worker function
  • that function disables (asynchronous) cancellation, instead actively testing for cancellation requests in order to exit safely (otherwise we can die at random places, e.g. after having taken a lock during codegen or so)
  • we give the thread a bit of time to set up, and then cancel it
  • finally, we call pthread_join in order to clean up resources related to the thread

After these steps, if the GC runs, we get a segfault:

waiting for cancellation on thread 2...

[28824] signal (11.2): Segmentation fault: 11
in expression starting at /Users/tim/Julia/pkg/pthreads/wip.jl:87
gc_read_stack at /Users/tim/Julia/src/julia/src/gc.c:1785 [inlined]
gc_mark_loop at /Users/tim/Julia/src/julia/src/gc.c:2863
_jl_gc_collect at /Users/tim/Julia/src/julia/src/gc.c:3275
ijl_gc_collect at /Users/tim/Julia/src/julia/src/gc.c:3566
gc at ./gcutils.jl:98 [inlined]
main at /Users/tim/Julia/pkg/pthreads/wip.jl:85
unknown function (ip: 0x100ce0323)
_jl_invoke at /Users/tim/Julia/src/julia/src/gf.c:2450
ijl_apply_generic at /Users/tim/Julia/src/julia/src/gf.c:2632
jl_apply at /Users/tim/Julia/src/julia/src/julia.h:1868 [inlined]
do_call at /Users/tim/Julia/src/julia/src/interpreter.c:126
eval_body at /Users/tim/Julia/src/julia/src/interpreter.c:0
jl_interpret_toplevel_thunk at /Users/tim/Julia/src/julia/src/interpreter.c:762
jl_toplevel_eval_flex at /Users/tim/Julia/src/julia/src/toplevel.c:912
jl_toplevel_eval_flex at /Users/tim/Julia/src/julia/src/toplevel.c:856
ijl_toplevel_eval at /Users/tim/Julia/src/julia/src/toplevel.c:921 [inlined]
ijl_toplevel_eval_in at /Users/tim/Julia/src/julia/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1522
_jl_invoke at /Users/tim/Julia/src/julia/src/gf.c:2431
ijl_apply_generic at /Users/tim/Julia/src/julia/src/gf.c:2632
_include at ./loading.jl:1582
include at ./Base.jl:450
jfptr_include_30533 at /Users/tim/Julia/src/julia/build/dev/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/tim/Julia/src/julia/src/gf.c:2431
ijl_apply_generic at /Users/tim/Julia/src/julia/src/gf.c:2632
exec_options at ./client.jl:307
_start at ./client.jl:522
jfptr__start_28810 at /Users/tim/Julia/src/julia/build/dev/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/tim/Julia/src/julia/src/gf.c:2431
ijl_apply_generic at /Users/tim/Julia/src/julia/src/gf.c:2632
jl_apply at /Users/tim/Julia/src/julia/src/julia.h:1868 [inlined]
true_main at /Users/tim/Julia/src/julia/src/jlapi.c:573
jl_repl_entrypoint at /Users/tim/Julia/src/julia/src/jlapi.c:717
Allocations: 46319 (Pool: 46289; Big: 30); GC: 0
fatal: error thrown and no exception handler available.
ErrorException("`body` expression must terminate in `return`. Use `block` instead.")
ijl_error at /Users/tim/Julia/src/julia/src/rtutils.c:41
eval_body at /Users/tim/Julia/src/julia/src/interpreter.c:442
jl_interpret_toplevel_thunk at /Users/tim/Julia/src/julia/src/interpreter.c:762
Assertion failed: (i < jl_array_len(a)), function jl_array_ptr_ref, file julia.h, line 1016.

[28824] signal (6): Abort trap: 6
in expression starting at /Users/tim/Julia/pkg/pthreads/wip.jl:87
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 46319 (Pool: 46289; Big: 30); GC: 1

Note that this seems to indicate that the segfault happened during a GC run on thread 2, which is the pthread we just canceled!

Running this code from top level results in a different crash:

waiting for cancellation on thread 2...
Assertion failed: (jl_atomic_load_relaxed(&ptls->gc_state) == 0), function jl_gc_pool_alloc_inner, file gc.c, line 1315.

[28843] signal (6): Abort trap: 6
in expression starting at /Users/tim/Julia/pkg/pthreads/wip.jl:78
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 2969 (Pool: 2958; Big: 11); GC: 0

On Linux, the crash is reported as originating from [8905] signal (6.-6): Aborted; assuming the -6 should be a valid thread ID, this does seem like corruption of scheduler state.

The workaround for these crashes is to disable the GC around the call to pthread_testcancel. The issue looks related to #47185, but calling jl_gc_safe_enter/jl_gc_safe_leave around pthread_cancel doesn't seem to help.
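
For reference, the workaround amounts to uncommenting the two GC.enable lines in the MWE. A minimal sketch, assuming a Linux or macOS process where pthread_testcancel is resolvable by ccall (the helper name testcancel_without_gc is hypothetical and only used for illustration):

```julia
# Hypothetical helper: hit the cancellation point with the GC disabled.
function testcancel_without_gc()
    was_enabled = GC.enable(false)        # GC.enable returns the previous GC state
    ccall(:pthread_testcancel, Cvoid, ()) # the thread exits here if a cancel is pending
    GC.enable(was_enabled)                # only reached when no cancel was pending
    return
end
```

Note that when a cancellation request is actually pending, the thread terminates inside pthread_testcancel, so the line that restores the GC state never runs on that thread.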

@maleadt maleadt added multithreading Base.Threads and related functionality GC Garbage collector labels Nov 16, 2022
vtjnash added a commit that referenced this issue Jan 11, 2023
Closes #47590 (pthread_cancel still forbidden though, since async mode
will corrupt the process, and synchronously tested is just a slow
implementation of a boolean)

Refs #47201 (only deals with thread exit, not other case where this is
an issue, like cfunction exit and gc-safe-leave)

May help #46537, by blocking jl_wake_libuv before uv_library_shutdown,
and other tweaks to GC mode. For example:

[4011824] signal (6.-6): Aborted
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv__async_send at /workspace/srcdir/libuv/src/unix/async.c:198
uv_async_send at /workspace/srcdir/libuv/src/unix/async.c:73
jl_wake_libuv at /data/vtjnash/julia1/src/jl_uv.c:44 [inlined]
JL_UV_LOCK at /data/vtjnash/julia1/src/jl_uv.c:64 [inlined]
ijl_iolock_begin at /data/vtjnash/julia1/src/jl_uv.c:72
iolock_begin at ./libuv.jl:48 [inlined]
_trywait at ./asyncevent.jl:140
wait at ./asyncevent.jl:155 [inlined]
profile_printing_listener at /data/vtjnash/julia1/usr/share/julia/stdlib/v1.10/Profile/src/Profile.jl:39
jfptr_YY.3_58617 at /data/vtjnash/julia1/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /data/vtjnash/julia1/src/gf.c:2665 [inlined]
ijl_apply_generic at /data/vtjnash/julia1/src/gf.c:2866
jl_apply at /data/vtjnash/julia1/src/julia.h:1870 [inlined]
start_task at /data/vtjnash/julia1/src/task.c:1093
Aborted

Fixes #37400
vtjnash added a commit that referenced this issue Jan 13, 2023
Closes #47590 (pthread_cancel still forbidden though, since async mode
will corrupt or deadlock the process, and synchronously tested with
cancelation disabled whenever this is a lock is just a slow
implementation of a boolean)

Refs #47201 (only deals with thread exit, not other case where this is
an issue, like cfunction exit and gc-safe-leave)

May help #46537, by blocking jl_wake_libuv before uv_library_shutdown,
and other tweaks to GC mode. For example, avoiding:

[4011824] signal (6.-6): Aborted
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
uv__async_send at /workspace/srcdir/libuv/src/unix/async.c:198
uv_async_send at /workspace/srcdir/libuv/src/unix/async.c:73
jl_wake_libuv at /data/vtjnash/julia1/src/jl_uv.c:44 [inlined]
JL_UV_LOCK at /data/vtjnash/julia1/src/jl_uv.c:64 [inlined]
ijl_iolock_begin at /data/vtjnash/julia1/src/jl_uv.c:72
iolock_begin at ./libuv.jl:48 [inlined]
_trywait at ./asyncevent.jl:140
wait at ./asyncevent.jl:155 [inlined]
profile_printing_listener at /data/vtjnash/julia1/usr/share/julia/stdlib/v1.10/Profile/src/Profile.jl:39
jfptr_YY.3_58617 at /data/vtjnash/julia1/usr/lib/julia/sys.so (unknown line)
_jl_invoke at /data/vtjnash/julia1/src/gf.c:2665 [inlined]
ijl_apply_generic at /data/vtjnash/julia1/src/gf.c:2866
jl_apply at /data/vtjnash/julia1/src/julia.h:1870 [inlined]
start_task at /data/vtjnash/julia1/src/task.c:1093
Aborted

Fixes #37400
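
To illustrate the "implementation of a boolean" remark in the commit message: a hedged sketch of the same worker using a plain atomic flag instead of pthread_cancel, so the thread returns normally. This assumes a Julia build with foreign thread support, Linux or macOS, and pthread_t = Culong as in the MWE; the names stop_requested, flag_worker, and run_flag_worker are hypothetical.

```julia
const pthread_t = Culong   # assumption carried over from the MWE

# Shared flag polled by the worker; it stands in for pthread_cancel/pthread_testcancel.
const stop_requested = Threads.Atomic{Bool}(false)

function flag_worker(data::Ptr{Cvoid})::Ptr{Cvoid}
    while !stop_requested[]
        GC.safepoint()     # let a GC requested by another thread proceed while we spin
    end
    return C_NULL          # returning normally, no cancellation involved
end

function run_flag_worker()
    cb = @cfunction(flag_worker, Ptr{Cvoid}, (Ptr{Cvoid},))
    thread = Ref{pthread_t}()
    status = ccall(:pthread_create, Cint,
                   (Ptr{pthread_t}, Ptr{Cvoid}, Ptr{Cvoid}, Ptr{Cvoid}),
                   thread, C_NULL, cb, C_NULL)
    @assert status == 0

    sleep(0.1)                 # give the worker time to start
    stop_requested[] = true    # request shutdown instead of calling pthread_cancel

    # pthread_join blocks, so mark this thread GC-safe while waiting (as the MWE does)
    state = ccall(:jl_gc_safe_enter, Int8, ())
    status = ccall(:pthread_join, Cint, (pthread_t, Ptr{Ptr{Cvoid}}), thread[], C_NULL)
    ccall(:jl_gc_safe_leave, Cvoid, (Int8,), state)
    @assert status == 0
    return
end
```

Calling run_flag_worker() starts the thread, requests shutdown through the flag, and joins it; whether thread exit is then handled cleanly by the runtime depends on the Julia version in use.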
vtjnash added a commit that referenced this issue Jan 14, 2023