Deadlock from simple repeated @spawn and wait #36699

Closed
tkf opened this issue Jul 16, 2020 · 11 comments · Fixed by #36785

tkf (Member) commented Jul 16, 2020

Simply repeating @spawn and wait causes a deadlock:

julia> for _ in 1:1000
           for _ in 1:1000
               ts = [(Threads.@spawn nothing) for _ in 1:Threads.nthreads()]
               foreach(wait, ts)
           end
           print('.')
           flush(stdout)
       end
............................................^CERROR: InterruptException:
Stacktrace:
 [1] poptask(W::Base.InvasiveLinkedListSynchronized{Task})
   @ Base ./task.jl:706
 [2] wait
   @ ./task.jl:714 [inlined]
 [3] wait(c::Base.GenericCondition{Base.Threads.SpinLock})
   @ Base ./condition.jl:106
 [4] _wait(t::Task)
   @ Base ./task.jl:240
 [5] wait
   @ ./task.jl:267 [inlined]
 [6] foreach(f::typeof(wait), itr::Vector{Task})
   @ Base ./abstractarray.jl:2072
 [7] top-level scope
   @ REPL[4]:4

To edit a specific method, type the corresponding number into the REPL and press Ctrl+Q

julia> versioninfo()
Julia Version 1.6.0-DEV.477
Commit b07594dac0 (2020-07-16 18:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> Threads.nthreads()
4

In the above session, I saw '.'s printed fairly rapidly until it hung. I then terminated the evaluation with Ctrl-C.

I see this in 1.5.0-rc1.0 but not in 1.4.

(I tried JULIA_RR_RECORD_ARGS='--chaos --num-cores=4' JULIA_NUM_THREADS=4 julia --startup-file=no --bug-report=rr etc. to capture the deadlock in rr, but so far I haven't been able to reproduce it under rr.)

tkf added the multithreading label Jul 16, 2020
tkf (Member, Author) commented Jul 16, 2020

I can reproduce this quite robustly on my laptop with 4 physical cores using JULIA_NUM_THREADS=2,3,4,5,6 (perhaps less frequently with 2 and 6). On the other hand, I couldn't get the hang on another laptop with 2 physical cores.
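
In case it helps others reproduce this, here is a minimal sketch of sweeping the thread count by launching a fresh process per value (the repro.jl file name and the 2:6 range are my own choices; repro.jl is assumed to contain the loop from the original report):

# Run the reproducer once per thread count; a hang at a given count is the
# symptom reported above. Child processes inherit the modified ENV.
for n in 2:6
    @info "Running reproducer" nthreads = n
    withenv("JULIA_NUM_THREADS" => string(n)) do
        run(`julia --startup-file=no repro.jl`)
    end
end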

tkf added this to the 1.5 milestone Jul 16, 2020
tkf (Member, Author) commented Jul 16, 2020

I just added it to the 1.5 milestone. Feel free to remove it if it's not reproducible.

MasonProtter (Contributor) commented Jul 16, 2020

Yep, I'm seeing this on

julia> versioninfo()
Julia Version 1.5.0-rc1.0
Commit 24f033c951* (2020-06-26 20:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 3

for nthreads() in 3:7. I have 6 physical / 12 logical cores. I haven't managed to reproduce it for nthreads() in 1:2 or nthreads() in 8:12.

tkf (Member, Author) commented Jul 16, 2020

Thanks!

So it looks like robustly reproducing this requires that JULIA_NUM_THREADS is around the number of physical cores and greater than 2.

mcabbott (Contributor) commented

I see this too, with 6 cores. When it runs, it takes about 5s. Sometimes it pauses and can be restarted by pressing Enter, and then takes 8s or 20s depending on when I react. And sometimes it needs to be interrupted:

julia> versioninfo()
Julia Version 1.5.0-rc1.0
Commit 24f033c951 (2020-06-26 20:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.4.0)
  CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 6

On 1.4, no problem; it also takes about 5s. (I always used @time for _ in 1:1000 ....)
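
That timing harness is presumably just the reproducer from the original report wrapped in @time, i.e. roughly:

@time for _ in 1:1000
    for _ in 1:1000
        ts = [(Threads.@spawn nothing) for _ in 1:Threads.nthreads()]
        foreach(wait, ts)
    end
    print('.')
    flush(stdout)
end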

On 1.6.0-DEV.305, it instead runs very slowly, about 90s every time, with no pauses and no stops.

And on a 2-core machine, no problems.

c42f (Member) commented Jul 17, 2020

I can reproduce this on a recent 1.6.0-DEV.464 (Commit 29826c2), though it seems to take more than 1000 outer iterations on average. (Linux 4.15.0-101-generic #102-Ubuntu, Intel i7-7700HQ laptop CPU)

JeffBezanson added the regression label Jul 21, 2020
JeffBezanson (Member) commented

Looks like this is probably 65b8e7e.

JeffBezanson (Member) commented

This seems to fix it:

--- a/src/partr.c
+++ b/src/partr.c
@@ -367,8 +367,7 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid)
         // something added to the sticky-queue: notify that thread
         wake_thread(tid);
         // check if we need to notify uv_run too
-        unsigned long system_tid = jl_all_tls_states[tid]->system_id;
-        if (uvlock != system_self && jl_atomic_load(&jl_uv_mutex.owner) == system_tid)
+        if (uvlock != system_self)
             wake_libuv();
     }

tkf (Member, Author) commented Jul 22, 2020

I tried the patch and it fixes the issue for me; even with a 10x longer outer loop, it finishes without deadlocking.
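
A sketch of that longer stress run, with the outer loop length as a parameter (the 10_000 count is one reading of "10x longer", not necessarily the exact number used):

# Same reproducer as in the original report, with a configurable outer loop
# and a coarse progress line so a hang is easy to localize.
function stress(outer = 10_000, inner = 1_000)
    t0 = time()
    for i in 1:outer
        for _ in 1:inner
            ts = [(Threads.@spawn nothing) for _ in 1:Threads.nthreads()]
            foreach(wait, ts)
        end
        i % 100 == 0 && println("outer iteration $i, $(round(time() - t0; digits = 1))s elapsed")
    end
end

stress()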

vtjnash (Member) commented Jul 22, 2020

If that is sufficient, then I think this should also have been sufficient (but it's not):

diff --git a/src/partr.c b/src/partr.c
index 61e28814c2..fdf25b3312 100644
--- a/src/partr.c
+++ b/src/partr.c
@@ -327,7 +327,7 @@ static int sleep_check_after_threshold(uint64_t *start_cycles)
 }
 
 
-static void wake_thread(int16_t tid)
+static int wake_thread(int16_t tid)
 {
     jl_ptls_t other = jl_all_tls_states[tid];
     if (jl_atomic_load(&other->sleep_check_state) != not_sleeping) {
@@ -336,8 +336,10 @@ static void wake_thread(int16_t tid)
             uv_mutex_lock(&other->sleep_lock);
             uv_cond_signal(&other->wake_signal);
             uv_mutex_unlock(&other->sleep_lock);
+            return 1;
         }
     }
+    return 0;
 }
 
 
@@ -365,10 +367,8 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid)
     }
     else {
         // something added to the sticky-queue: notify that thread
-        wake_thread(tid);
         // check if we need to notify uv_run too
-        unsigned long system_tid = jl_all_tls_states[tid]->system_id;
-        if (uvlock != system_self && jl_atomic_load(&jl_uv_mutex.owner) == system_tid)
+        if (wake_thread(tid) && uvlock != system_self && tid == 0)
             wake_libuv();
     }
     // check if the other threads might be sleeping

Keno added a commit that referenced this issue Jul 24, 2020:

    Fixes #36699 I believe. I think I know why this fixes it, but I need to
    validate that I got it correct before writing it up. Nevertheless, since
    I think this fixes the issue and the issue is release blocking, here's
    the fix ahead of the writeup.
Keno (Member) commented Jul 24, 2020

See if #36785 helps.

JeffBezanson pushed a commit that referenced this issue Jul 24, 2020
simeonschaub pushed a commit to simeonschaub/julia that referenced this issue Aug 11, 2020