
Clean up atomic access to sleep_check_state #36785

Merged
merged 1 commit into from
Jul 24, 2020
Conversation

@Keno (Member) commented Jul 24, 2020

Fixes #36699, I believe. Here's my best understanding of what's happening.
The relevant operations are:

Thread 1:

```
push!(W, task)  # assign the task to the waitqueue; implicitly a non-atomic
                # assign to .head of the queue, protected by a lock
# Some time later
if (atomic_load(sleep_state) == sleeping) {
    if (atomic_exchange(...)) {
        wake
    }
}
```

Thread 2:

```
if isempty(W) {
    atomic_store(sleep_state)
    !isempty(W) && continue()
    sleep()
}
```

Now, x86 has a very strong memory model, but one reordering it does allow is moving a load before an earlier store to a different memory location. In particular, it is legal for the atomic_load() of the sleep state to be speculated before the addition of the task to W. As a result, the following interleaving is possible:

```
T1:
r = atomic_load(sleep_state)  # speculated early; r == not_sleeping

T2:
atomic_store(sleep_state)     # sleep_state is now sleeping
isempty(W)                    # true
sleep()

T1:
push!(W, task)  # the task is now in the waitqueue, but too late
# Some time later
if (r == sleeping #= false: r holds the stale value =#) {
    # the wake is skipped
}
```

Thus T1 never wakes up T2, and the observed deadlock results.

This commit fixes that by dropping the atomic_load and using an atomic exchange instead, which may not be reordered in the same way. A different fix would be to place an explicit read/write barrier before attempting to wake up the threads.
That said, I think we need to take a close look at this code in general with respect to atomic correctness. I suspect there are more demons.

Fixes #36699 I believe. I think I know why this fixes it, but
I need to validate that I got it correct before writing it up.
Nevertheless, since I think this fixes the issue and the issue
is release blocking, here's the fix ahead of the writeup.
@Keno added the backport 1.5, bugfix (This change fixes an existing bug), and multithreading (Base.Threads and related functionality) labels Jul 24, 2020
@tkf (Member) left a comment:

Thanks! This fixes the problem for me. (I tried a 10× longer version of #36699 a few times; no deadlock so far.)

@vtjnash (Member) commented Jul 24, 2020

So the issue is just that may_sleep is getting hoisted (by the compiler, I assume)?

@c42f (Member) left a comment:

I spent some time staring at this in confusion :-)

At a surface level, I can see that there are various worrying-looking patterns going on here: threads loading from sleep_check_state without using atomic operations (not even relaxed loads), and multiple atomics being used to act on that state non-atomically.

I noticed that there are some loads of sleep_check_state in this file that this patch doesn't cover: lines 436 and 441 — they should be relaxed loads at least? Also, if sleep_check_state is used to protect other state, do we need to be using acquire loads?

For the state transition not_sleeping <--> sleeping: this is done in various places, and as a pair of atomics rather than a CAS, which might be fine but seems non-obvious. I think it would be clearer to factor the transition into a function in the spirit of may_sleep.

As to the exact cause of the deadlock I sure didn't see it by just eyeballing this diff! I look forward to the explanation.

```diff
@@ -495,7 +494,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
     if (!_threadedregion && active && ptls->tid == 0) {
         // thread 0 is the only thread permitted to run the event loop
         // so it needs to stay alive
-        if (ptls->sleep_check_state != not_sleeping)
+        if (jl_atomic_load_relaxed(&ptls->sleep_check_state) != not_sleeping)
```
Reviewer (Member) suggested:

```diff
-        if (jl_atomic_load_relaxed(&ptls->sleep_check_state) != not_sleeping)
+        if (jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping)
```

```diff
@@ -476,8 +476,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
         JL_UV_UNLOCK();
         // optimization: check again first if we may have work to do
         if (!may_sleep(ptls)) {
-            if (ptls->sleep_check_state != not_sleeping)
-                jl_atomic_store(&ptls->sleep_check_state, not_sleeping); // let other threads know they don't need to wake us
+            assert(ptls->sleep_check_state == not_sleeping);
```
Reviewer (Member):
Should we just make every access of sleep_check_state atomic as a matter of course?

Suggested change:

```diff
-        assert(ptls->sleep_check_state == not_sleeping);
+        assert(jl_atomic_load_relaxed(&ptls->sleep_check_state) == not_sleeping);
```

Reviewer (Member):

I agree. The change to use assert also makes sense (since that's currently the implementation of may_sleep), but assumes that we don't alter the implementation of may_sleep to possibly have disjoint conditions that we want to check (i.e. the original partr code had a global sleep_check_state, so that bulk operations were also attempted).

```c
    // sleep_check_state is only transitioned from not_sleeping to sleeping
    // by the thread itself. As a result, if this returns false, it will
    // continue returning false. If it returns true, there are no guarantees.
    return jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping;
```
Reviewer (Member):

Do we need something stronger than relaxed ordering here, because sleep_check_state is meant to ensure consistency of other loads/stores?

@Keno (Member Author):

The actual condition variable is protected by a lock, so we don't depend on any memory access dependencies here, I don't think. That's why I picked relaxed. Do you disagree?

Reviewer (Member):

> Do you disagree?

I don't disagree, I'm just unsure.

The condition variable seems fine. But I noticed that may_sleep is also used in a branch related to the event loop which is protected with jl_uv_mutex. So may_sleep doesn't always go along with the same lock. Which seems potentially confusing but perhaps harmless.

```diff
@@ -358,7 +355,7 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid)
     JULIA_DEBUG_SLEEPWAKE( wakeup_enter = cycleclock() );
     if (tid == self || tid == -1) {
         // we're already awake, but make sure we'll exit uv_run
-        if (ptls->sleep_check_state != not_sleeping)
+        if (jl_atomic_load_relaxed(&ptls->sleep_check_state) == sleeping)
             jl_atomic_store(&ptls->sleep_check_state, not_sleeping);
```
Reviewer (Member):

So this pair of operations looks like a CAS, but done non-atomically. Can that be a problem?

@Keno (Member Author):

It's not a data race, because the only other writer is monotonic (makes the same change). In fact, we can probably relax the atomic store here.

Reviewer (Member):

Right I think that makes sense. I wondered whether the store could be less strict.

@Keno (Member Author):

Upon further reflection the atomic does guard the memory contents of the work queue, so we might want something stronger. I need to think about it.

@vtjnash (Member) commented Jul 24, 2020:

Yeah, this now reads like a CAS, but it's not. I'd suggest writing this as != sleeping, so it's clearer that it's just a regular (relaxed) store.

Reviewer (Member):

I prefer the version which avoids the double negative. To be more explanatory it would be helpful to factor the uses of sleep_check_state into a minimum of locations and expand the comments describing which state/invariant it's meant to protect and how this implies the atomic ops which are used.

@JeffBezanson (Member):

Comparing to 1.4, the test case in the issue (if you put it inside a function) goes from 3.45 seconds to 3.0 seconds, so that's also excellent 🎉 (4 threads in my case)

@Keno (Member Author) commented Jul 24, 2020:

> At a surface level, I can see that there are various worrying-looking patterns going on here in terms of threads loading from sleep_check_state without using atomic operations

Yes, I think this code could use a bit of a refactor. However, I would like to do that more systematically and also make use of tsan support to check these things under the test suite. This PR is just here to get the bug fixed.

@c42f (Member) commented Jul 24, 2020:

> This PR is just here to get the bug fixed.

Fine by me, this is my first look into this code.

@JeffBezanson (Member):

We should add the test case from the issue, just to check that it doesn't hang.

@Keno (Member Author) commented Jul 24, 2020:

I've updated the description with my root cause analysis of this issue.

@Keno (Member Author) commented Jul 24, 2020:

> We should add the test case from the issue, just to check that it doesn't hang.

I think you're inviting hangs on non-x86 platforms that have an even weaker memory model. I would certainly recommend against backporting such a test.

@JeffBezanson JeffBezanson merged commit d81f044 into master Jul 24, 2020
@JeffBezanson JeffBezanson deleted the kf/36699 branch July 24, 2020 18:14
JeffBezanson pushed a commit that referenced this pull request Jul 24, 2020
simeonschaub pushed a commit to simeonschaub/julia that referenced this pull request Aug 11, 2020
Labels: bugfix (This change fixes an existing bug), multithreading (Base.Threads and related functionality)

Successfully merging this pull request may close these issues:
Deadlock from simple repeated @spawn and wait (#36699)