Make profiling more robust with many tasks #42978
Definite no to this, but it has appeared to me that we need to make
Yeah, I closed the issue since I realized I might be breaking the invariant that the lock is meant to protect. Can a signal get lost? It explains the behavior I see, but, if so, we can't call
Somewhat aside, and I'm no expert in signal programming, but aren't there too many synchronizations in the signal handler? Why not fill the profiling buffer in a lock-free manner directly in the signal handler of each worker? Maybe a naive thought, but it seems to simplify a lot of tricky communication.
A signal cannot get lost by the kernel, but some libc APIs may discard them, and ptrace especially may end up with them being accidentally lost. Yes, we would likely switch to
That would be generally valid also, but less flexible.
This makes it difficult for `usr2_handler` to observe null ptls.
It makes sure the current task has a valid ptls.
I think it'd be nice to separate the commits for the edge case handling (use
        lastt->ptls = NULL;
    }

    // set up global state for new task and clear global state for old task
    t->ptls = ptls;
    jl_atomic_store_relaxed(&ptls->current_task, t);
    JL_GC_PROMISE_ROOTED(t);
    jl_signal_fence();
    jl_set_pgcstack(&t->gcstack);
    jl_signal_fence();
    lastt->ptls = NULL;
#ifdef MIGRATE_TASKS
    ptls->previous_task = lastt;
#endif
    jl_set_pgcstack(&t->gcstack);
In a debugging session today, Jameson figured out that the request was not delivered because we are hitting the `if (ptls == NULL)` branch in `usr2_handler`:
Lines 469 to 471 in 569d56f
    jl_ptls_t ptls = ct->ptls;
    if (ptls == NULL)
        return;
It then would "look like" the signal is lost (but actually we were just ignoring it).
The fix was to ensure setting `lastt->ptls = NULL` after `jl_set_pgcstack(&t->gcstack)`. Since `lastt` is still the current task until `jl_set_pgcstack(&t->gcstack)` takes effect, we were previously observing `jl_get_current_task()->ptls == NULL`. This is fixed by the above patch.
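The ordering argument above can be checked with a small single-threaded model. This is only a sketch under stated assumptions, not the runtime's code: `tls_t`, `task_t`, `current_task`, `probe`, and the two `count_ignored_*` functions are hypothetical names, and `probe()` stands in for a `usr2_handler` invocation landing between two steps of the task switch. Updating `current_task` stands in for `jl_set_pgcstack(&t->gcstack)`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the thread-local state and the task object. */
typedef struct { int tid; } tls_t;
typedef struct { tls_t *ptls; } task_t;

static task_t *current_task; /* models what jl_get_current_task() returns */
static int ignored;          /* number of probes that hit the NULL branch */

/* Models the early return in the handler: the request is ignored
   whenever the current task has no ptls. */
static void probe(void)
{
    assert(current_task != NULL);
    if (current_task->ptls == NULL)
        ignored++;
}

/* Start every run with lastt as the current task that owns the ptls. */
static void reset(task_t *lastt, task_t *t, tls_t *ptls)
{
    lastt->ptls = ptls;
    t->ptls = NULL;
    current_task = lastt;
    ignored = 0;
}

/* Old order (simplified): lastt->ptls is cleared while lastt is still
   the current task, so a handler running in that window sees NULL. */
int count_ignored_old_order(void)
{
    tls_t tls = {1};
    task_t lastt, t;
    reset(&lastt, &t, &tls);
    lastt.ptls = NULL;  probe();
    t.ptls = &tls;      probe();
    current_task = &t;  probe();
    return ignored;
}

/* New order: publish the new task first, clear the old task last. */
int count_ignored_new_order(void)
{
    tls_t tls = {1};
    task_t lastt, t;
    reset(&lastt, &t, &tls);
    t.ptls = &tls;      probe();
    current_task = &t;  probe();
    lastt.ptls = NULL;  probe();
    return ignored;
}
```

In this model the old order lets two of the three probes observe a null `ptls`, while the new order lets none through. The real code additionally relies on `jl_signal_fence()` to keep the compiler from reordering these stores, which the model does not capture.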
test/threads.jl
@@ -147,3 +147,39 @@ end

# We don't need the watchdog anymore
close(proc.in)

# https://github.com/JuliaLang/julia/pull/42973
@testset "spawn and wait *a lot* of tasks in @profile" begin
- @testset "spawn and wait *a lot* of tasks in @profile" begin
+ Sys.islinux() && @testset "spawn and wait *a lot* of tasks in @profile" begin
Perhaps limit this to Linux, then merge, and open a new PR to remove the conditional, and then we can work on any issues found on other platforms?
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
This patch includes two sets of changes. (1) `jl_thread_suspend_and_get_state` uses `pthread_cond_timedwait` to recover from the case where the request is not received by the signal handler. This is required because `usr2_handler` contains some paths for the case where it is not possible to obtain `ptls`. (2) `ctx_switch` now makes sure to null out `ptls` of the last task (`lastt->ptls = NULL`) after changing the current task by updating pgcstack (`jl_set_pgcstack(&t->gcstack)`). This closes the gap in which `usr2_handler` can observe the null `ptls`. Co-authored-by: Jameson Nash <vtjnash@gmail.com> (cherry picked from commit 8131580)
As I mentioned in #42973, there seems to be a deadlock problem in the profiler. A bit of `rr`ing points to a problem in the lock ordering of `jl_lock_profile` (`threadsafe`) and `jl_thread_suspend_and_get_state`.

Here's an illustrative `rr replay` session of the execution recorded by `OPENBLAS_NUM_THREADS=1 rr record --num-cores=8 julia-debug script.jl` and terminated by `SIGHUP` once the process is idle.

As you can see, thread 1 is executing `jl_profile_atomic`:
julia/src/debuginfo.cpp
Lines 393 to 406 in 564ddfe
and thread 2 is executing `jl_thread_suspend_and_get_state`:
julia/src/signals-unix.c
Lines 753 to 760 in 564ddfe

My patch 71041f7 moves `jl_lock_profile` after `jl_thread_suspend_and_get_state` to avoid suspending the worker thread while it's waiting for the profiler `threadsafe` lock.

close #42975