add an ability to suspend/resume a thread in a GC-safe way #51489

Merged: 4 commits merged on Oct 6, 2023

vtjnash (Sponsor Member) commented Sep 28, 2023

This exposes the GC "stop the world" API to the user, so that a thread can be made to stop executing Julia code quickly. It adds two APIs (which will need to be exported and documented later):

julia> @ccall jl_safepoint_suspend_thread(#=tid=#1::Cint, #=magicnumber=#2::Cint)::Cint # roughly tkill(1, SIGSTOP)

julia> @ccall jl_safepoint_resume_thread(#=tid=#1::Cint)::Cint # roughly tkill(1, SIGCONT)

You can even suspend yourself, provided another task is scheduled to resume you (here, 10 seconds later):

julia> ccall(:jl_enter_threaded_region, Cvoid, ())

julia> t = @task let; Libc.systemsleep(10); print("\nhello from $(Threads.threadid())\n"); @ccall jl_safepoint_resume_thread(0::Cint)::Cint; end; ccall(:jl_set_task_tid, Cint, (Any, Cint), t, 1); schedule(t);

julia> @time @ccall jl_safepoint_suspend_thread(0::Cint, 2::Cint)::Cint

hello from 2
  10 seconds (6 allocations: 264 bytes)
1

The magic number selects the kind of stop that you want:

// n.b. suspended threads may still run in the GC or GC safe regions
// but shouldn't be observable, depending on which enum the user picks (only 1 and 2 are typically recommended here)
// waitstate = 0 : do not wait for suspend to finish
// waitstate = 1 : wait for gc_state != 0 (JL_GC_STATE_WAITING or JL_GC_STATE_SAFE)
// waitstate = 2 : wait for gc_state != 0 (JL_GC_STATE_WAITING or JL_GC_STATE_SAFE) and that GC is not running on that thread
// waitstate = 3 : wait for full suspend (gc_state == JL_GC_STATE_WAITING) -- this may never happen if thread is sleeping currently
// if another thread comes along and calls jl_safepoint_resume, we also return early
// return new suspend count on success, 0 on failure
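
To make the four waitstates easier to compare, here is a toy C model of the condition each one waits for. It is only a sketch of the comment above, not the code in this PR: the `thread_snapshot` struct, the `suspend_wait_satisfied` function, and the numeric values of the states are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration. The comment above only tells us that
 * gc_state == 0 means the thread is still running Julia code. */
enum gc_state {
    GC_STATE_UNSAFE  = 0, /* running Julia code */
    GC_STATE_WAITING = 1, /* parked at a safepoint */
    GC_STATE_SAFE    = 2  /* in a GC-safe region (e.g. blocking C call) */
};

/* Hypothetical snapshot of the thread being suspended. */
struct thread_snapshot {
    enum gc_state gc_state;
    bool gc_running; /* is the GC currently running on this thread? */
};

/* Returns true once a caller that passed `waitstate` may stop waiting. */
bool suspend_wait_satisfied(int waitstate, struct thread_snapshot t)
{
    assert(waitstate >= 0 && waitstate <= 3);
    switch (waitstate) {
    case 0: /* do not wait at all */
        return true;
    case 1: /* thread has left Julia code */
        return t.gc_state != GC_STATE_UNSAFE;
    case 2: /* as 1, and the GC is not running on that thread */
        return t.gc_state != GC_STATE_UNSAFE && !t.gc_running;
    default: /* 3: fully parked; may never happen for a sleeping thread */
        return t.gc_state == GC_STATE_WAITING;
    }
}
```

In this model a thread sitting in a GC-safe region with the GC active satisfies waitstate 1 but not waitstate 2, and never satisfies waitstate 3 until it actually reaches a safepoint. The early return when another thread resumes the target is not modeled here.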

Only magic number 2 is currently meaningful to the user, though. The difference between waitstate 1 and 2 matters only in C code that calls this from JL_GC_STATE_SAFE; otherwise it is known a priori that the GC isn't running, because the calling thread would be running the GC too. The distinction between those states might become useful if we get a concurrent collector.

vtjnash added the multithreading (Base.Threads and related functionality) label Sep 28, 2023
vtjnash merged commit 3f23533 into master Oct 6, 2023
5 of 7 checks passed
vtjnash deleted the jn/SIGSTOP-SIGCONT branch October 6, 2023 13:33

vchuravy (Member) commented Oct 6, 2023

It would be good to document this somewhere, maybe in the dev docs.

vtjnash (Sponsor Member, Author) commented Oct 6, 2023

I didn't want to spend too much more time on it just at the moment, since it is not currently important for anyone, but may become useful later.

topolarity added a commit to topolarity/julia that referenced this pull request Nov 2, 2023
The timing system does not currently support nesting task suspensions,
so this `JL_TIMING_SUSPEND_TASK` added in JuliaLang#51489 is not permitted since
it is called from within the GC suspension.

This was causing Tracy to crash upon recording with "zone ended twice"

KristofferC pushed a commit that referenced this pull request Nov 3, 2023
The timing system does not currently support nesting task suspensions,
so this `JL_TIMING_SUSPEND_TASK` added in #51489 is not permitted since
it is called from within the GC suspension.

This was causing Tracy to crash upon recording with "zone ended twice"