
PR #32599 introduced deadlocks #35441

Closed
tkf opened this issue Apr 12, 2020 · 9 comments
Labels
multithreading Base.Threads and related functionality

Comments

@tkf
Member

tkf commented Apr 12, 2020

I think it is very likely that PR #32599 introduced a bug that causes deadlocks. I have two MWEs below.

Since these MWEs require running for a very long time, and there is a non-negligible chance that they trigger the segfault fixed by #35387, I back-ported #35387 onto the versions just before and just after #32599:

The MWEs below cause a deadlock on post32599-pr35387 and master, but not on pre32599-pr35387 (where they run to completion).

(I used JULIA_NUM_THREADS=13 when running the following MWEs. The specific number of threads needed to trigger this deadlock seems to differ between machines; @chriselrod seems to need only JULIA_NUM_THREADS=8.)

MWE 1

This snippet is based on the one in #35387 (comment) by @chriselrod. On my machine, a single @benchmark tproduct_fm($x) didn't cause the deadlock, so I put it in a loop. On the first run, it deadlocked after printing i = 84.

using Base.Threads: @spawn, @sync, nthreads
using LinearAlgebra, BenchmarkTools

function tproduct_fm(x)
    @fastmath begin
        X3 = cbrt.(x)
        X9 = cbrt.(X3).^2
    end
    lx = length(x)
    H = Matrix{eltype(x)}(undef, lx, lx)
    @sync for x_index_chunk in Iterators.partition(eachindex(x), lx ÷ 2nthreads())
        @spawn begin
            for i in x_index_chunk
                x3 = X3[i]
                x9 = X9[i]
                js = firstindex(x):i
                @fastmath for j in js
                    A = x3 + X3[j]
                    B = x3 * X3[j]
                    C = x9 + X9[j]
                    f = A^2 * C
                    g = -(B*A)
                    H[j, i] = f * g
                end
            end
        end
    end
    return Symmetric(H)
end
x = rand(1000);

for i in 1:1000
    @show i
    @benchmark tproduct_fm($x)
end
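For context on how many tasks the @sync loop above spawns, here is the chunk-size arithmetic, assuming 13 threads as in my runs (the stand-in value for nthreads() is the only assumption):

```julia
# Chunk-size arithmetic for the @sync loop above, with lx = 1000 and a
# stand-in nthreads() of 13 (the value used on the reporter's machine).
lx = 1000
nt = 13
chunk = lx ÷ (2nt)                                  # 1000 ÷ 26 = 38
chunks = collect(Iterators.partition(1:lx, chunk))
length(chunks)                                      # 27 tasks: 26 full chunks plus a 12-element tail
```

So each call spawns a few tasks more than 2 × nthreads(), which keeps all worker threads busy while the scheduler is under load.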
Example backtrace after Ctrl-C
^C                                                       
signal (2): Interrupt                                   
in expression starting at /tmp/tproduct_fm.jl:32         
pthread_cond_wait at /lib/x86_64-linux-gnu/libpthread.so.0 (unknown line)
uv_cond_wait at /workspace/srcdir/libuv/src/unix/thread.c:827
jl_task_get_next at /post32599-pr35387/src/partr.c:475                     
poptaskref at ./task.jl:702                                  
wait at ./task.jl:709                                                    
task_done_hook at ./task.jl:444                                
_jl_invoke at /post32599-pr35387/src/gf.c:2144 [inlined]     
jl_apply_generic at /post32599-pr35387/src/gf.c:2328            
jl_apply at /post32599-pr35387/src/julia.h:1695 [inlined]          
jl_finish_task at /post32599-pr35387/src/task.c:198   
start_task at /post32599-pr35387/src/task.c:697          
unknown function (ip: (nil))                                      
unknown function (ip: (nil))                             
[the same frames — pthread_cond_wait → uv_cond_wait → jl_task_get_next → poptaskref → wait → task_done_hook → jl_finish_task → start_task — repeat for each of the remaining threads, i.e. every thread is parked in jl_task_get_next]
...

MWE 2

This is the MWE I reported in #35341. On the first run, it deadlocked after printing seed = 287.

using BenchmarkTools, ThreadsX, Random
for seed in 1:1000
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000)))
end
Example backtrace after Ctrl-C
^CERROR: InterruptException:
Stacktrace:
 [1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
 [2] wait() at ./task.jl:709
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Array{Any,1}) at ./task.jl:294
 [6] macro expansion at ./task.jl:335 [inlined]
 [7] maptasks(::ThreadsX.Implementations.var"#97#98"{Float64,Base.Order.ForwardOrdering}, ::Base.Iterators.Zip{Tuple{Base.Iterators.PartitionIterator{SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}},Base.Iterators.PartitionIterator{Array{Int8,1}}}}) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
 [8] _quicksort!(::Array{Float64,1}, ::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Int64,Int64}, ::Base.Order.ForwardOrdering, ::Array{Int8,1}, ::Bool, ::Bool) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
 [9] sort!(::Array{Float64,1}, ::Int64, ::Int64, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}, ::Base.Order.ForwardOrdering) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:22
 [10] _sort! at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:130 [inlined]
 [11] #sort!#86 at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:170 [inlined]
 [12] sort! at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:156 [inlined]
 [13] #sort#85 at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:143 [inlined]
 [14] sort at /root/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:143 [inlined]
 [15] ##core#1405(::Array{Float64,1}) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
 [16] ##sample#1406(::BenchmarkTools.Parameters) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
 [17] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#1404")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String,
kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:411
 [18] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#1404")}, ::BenchmarkTools.Parameters) at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:399
 [19] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [20] invokelatest at ./essentials.jl:709 [inlined]
 [21] #run_result#37 at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
 [22] run_result at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined] (repeats 2 times)
 [23] top-level scope at /root/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:483
 [24] top-level scope at none:1

How to solve this?

The next step for me is to try rr, as it helped a lot in fixing one of the bugs I reported in #35341.

Meanwhile, it'd be nice if core devs tried reproducing the deadlock on their machines. I know @Keno tried doing this, but I don't know if he tried different values of JULIA_NUM_THREADS. I'll also try to find some relationship between JULIA_NUM_THREADS and the average time/probability of the deadlock (e.g., does it have to be within a "hotspot", or does it just have to be large enough?).
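As a rough sketch of the measurement I have in mind (the script path, trial count, and timeout below are placeholders, not part of any actual experiment), a harness could run the MWE as a subprocess and treat runs that exceed a deadline as suspected deadlocks:

```julia
# Hypothetical stress harness: run an MWE script under a given
# JULIA_NUM_THREADS, kill any run exceeding `timeout` seconds, and report
# the fraction of runs that hung or crashed.  All parameters are guesses.
function deadlock_fraction(script; nthreads, trials = 10, timeout = 3600)
    hangs = 0
    for _ in 1:trials
        proc = withenv("JULIA_NUM_THREADS" => string(nthreads)) do
            run(`julia --startup-file=no $script`; wait = false)
        end
        timer = Timer(timeout) do _
            process_running(proc) && kill(proc)  # count overruns as deadlocks
        end
        wait(proc)
        close(timer)
        success(proc) || (hangs += 1)
    end
    return hangs / trials
end
```

Sweeping `nthreads` over, say, 4:16 with this would give a crude probability curve per machine.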

Also, it'd be nice if people familiar with the threading internals could review #32599 again. I think it's fair to say that this is not the most thoroughly reviewed PR. (Needless to say, this is not a complaint; I understand that very few people can review it, and I imagine they are very busy.)

Finally, I think it makes sense to take this bug into account in the release strategy for 1.5. Maybe revert #32599 if it cannot be fixed before the feature freeze? Or, in the 1.5-RC announcement, ask people with multi-threaded code bases to stress-test with the RC?

@Keno
Member

Keno commented Apr 12, 2020

You could also try building julia with thread sanitizer. That way if it's a data race, you might be more likely to see it.

@tkf
Member Author

tkf commented Apr 12, 2020

Do you mean #27167? I thought it was still WIP.

@tkf
Member Author

tkf commented Apr 12, 2020

Or I can just put CXXFLAGS += -fsanitize=thread in Make.user?
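For concreteness, the Make.user I have in mind would be something like the following; only the CXXFLAGS line comes from the question above, the rest is guesswork on my part, not a tested configuration:

```make
# Hypothetical Make.user for a TSan build of julia.
USECLANG := 1
CFLAGS   += -fsanitize=thread
CXXFLAGS += -fsanitize=thread
LDFLAGS  += -fsanitize=thread
```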

@Keno
Member

Keno commented Apr 12, 2020

Paging @vchuravy for the latest work on tsan.

@tkf
Member Author

tkf commented Apr 12, 2020

Or I can just put CXXFLAGS += -fsanitize=thread in Make.user?

I just tried this anyway. But

make cleanall
make -C deps install-llvm
PATH="$PWD/usr/tools:$PATH" make USECLANG=1

failed with

...
    CC src/llvm-ptls.o
    CC src/jltypes.o
/julia/src/jltypes.c:1768:39: error: initializer element is not a compile-time constant
                                      jl_typename_type,
                                      ^~~~~~~~~~~~~~~~
./julia_internal.h:336:41: note: expanded from macro 'jl_svec'
                n == sizeof((void *[]){ __VA_ARGS__ })/sizeof(void *),        \
                                        ^~~~~~~~~~~ 
/usr/include/x86_64-linux-gnu/sys/cdefs.h:424:53: note: expanded from macro '_Static_assert'
      [!!sizeof (struct { int __error_if_negative: (expr) ? 2 : -1; })]
                                                    ^~~~
/julia/src/jltypes.c:1793:42: error: initializer element is not a compile-time constant
    jl_typename_type->types = jl_svec(8, jl_symbol_type, jl_any_type, jl_simplevector_type,
                                         ^~~~~~~~~~~~~~
./julia_internal.h:336:41: note: expanded from macro 'jl_svec'
                n == sizeof((void *[]){ __VA_ARGS__ })/sizeof(void *),        \

...

/julia/src/jltypes.c:2058:36: error: initializer element is not a compile-time constant
                        jl_svec(1, jl_array_any_type), 0, 0, 1);
                                   ^~~~~~~~~~~~~~~~~
./julia_internal.h:336:41: note: expanded from macro 'jl_svec'
                n == sizeof((void *[]){ __VA_ARGS__ })/sizeof(void *),        \
                                        ^~~~~~~~~~~ 
/usr/include/x86_64-linux-gnu/sys/cdefs.h:424:53: note: expanded from macro '_Static_assert'
      [!!sizeof (struct { int __error_if_negative: (expr) ? 2 : -1; })]
                                                    ^~~~
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.
Makefile:161: recipe for target 'jltypes.o' failed
make[1]: *** [jltypes.o] Error 1
Makefile:75: recipe for target 'julia-src-release' failed
make: *** [julia-src-release] Error 2

@vchuravy
Member

I got it to the point where one could sanitize user code, but then got stuck on making it user-friendly. Rebasing that PR shouldn't be that bad, especially since the ASAN fixes have since landed.

@JeffBezanson
Member

I think it's fair to say that this is not the most thoroughly reviewed PR.

We went through it in person, it's just very hard to imagine all the possible cases and interactions.

Thanks for your help in debugging this.

@JeffBezanson JeffBezanson added the multithreading Base.Threads and related functionality label Apr 13, 2020
@tkf
Member Author

tkf commented Apr 15, 2020

Sorry, I should've realized that reviews and discussions can happen behind the scenes, even when they're invisible on GitHub. I can also imagine that it's very hard to catch multithreading bugs in a review. I hope something like TSan can help us here.

@vtjnash
Member

vtjnash commented Aug 28, 2020

fixed by #36785
