threading: fixup scheduler statepoints for GC #32238
Conversation
@yuyichao is this right/sufficient?
Which one is the fix? Lock-wise, there seem to be fewer safepoints in the loop?

There's now a statepoint (well, two) in every get_task_next call, instead of just one at the top of the loop. The missing one, per the issue comment, is the one just before the blocking uv_run call. The other part of the fix is ensuring that we don't run the Julia code callback from inside the region marked gc-safe.
src/partr.c
Outdated

```diff
@@ -481,16 +484,18 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *getsticky)
    }
}
// the other threads will just wait for on signal to resume
int8_t gc_state = jl_gc_safe_enter(ptls);
uv_mutex_lock(&ptls->sleep_lock);
while (jl_atomic_load(&sleep_check_state) == sleeping) {
    task = get_next_task(getsticky);
```
This call also can't happen while holding the sleep lock (because it contains GC allocations and/or safepoints), or we would be unable to wake up all the other threads to get them to help with the GC effort.
OK, in this case, assuming get_next_task can call Julia code or trigger GC, then the uv_mutex_lock(&ptls->sleep_lock); must be done in a GC-safe region. Alternatively, all holders of the lock must not trigger any GC.
It makes sense to me that we ought to be able to check for available tasks without triggering GC. Of course, that's hard to guarantee from julia code.
We basically did not have a GC safepoint in some paths that could lead to a thread blocking. This should fix the issue noticed in #32217 (comment). Commits can be squashed.