
QuickSort hangs #104

Open
tkf opened this issue Mar 29, 2020 · 26 comments
tkf commented Mar 29, 2020

On @chriselrod's machine, running the quicksort benchmark hangs with 8 threads. When it hangs, the CPU usage is 0%. Originally reported here: https://discourse.julialang.org/t/ann-threadsx-jl-parallelized-base-functions/36666/14

At the moment, it seems that the MWE is while true; @btime ThreadsX.sort($(rand(0:0.01:1, 1_000_000))); end.

Stacktrace:

[ Info: Running: `scaling_nthreads_target.jl` with 8 thread(s)                                                                                                                                                                                                                                                                                                                       [160/1900]
[ Info: Benchmark: sum_sin
[ Info: Parameter: (datasize = 1000000,)
[ Info: Parameter: (datasize = 10000000,)
[ Info: Benchmark: foreach_symmetrize
[ Info: Parameter: (datasize = 2000,)
[ Info: Parameter: (datasize = 6000,)
[ Info: Benchmark: sort
[ Info: Parameter: (datasize = 1000000, distribution = "wide", alg = "ThreadsX.MergeSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 10000000, distribution = "wide", alg = "ThreadsX.MergeSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 1000000, distribution = "narrow", alg = "ThreadsX.MergeSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 10000000, distribution = "narrow", alg = "ThreadsX.MergeSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 1000000, distribution = "wide", alg = "ThreadsX.QuickSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 10000000, distribution = "wide", alg = "ThreadsX.QuickSort", basesize = nothing, smallsize = nothing)
[ Info: Parameter: (datasize = 1000000, distribution = "narrow", alg = "ThreadsX.QuickSort", basesize = nothing, smallsize = nothing)
^CERROR: ERROR: InterruptException:
Stacktrace:LoadError:
 [1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
 [2] wait at ./task.jl:709 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] wait(::Base.Process) at ./process.jl:622
 [5] success at ./process.jl:483 [inlined]
 [6] run(::Cmd; wait::Bool) at ./process.jl:440
 [7] run at ./process.jl:438 [inlined]
 [8] runscript(::String, ::Array{String,1}; env::Array{Pair{String,String},1}) at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/ThreadsXBenchmarks.jl:52
 [9] run_nthreads(::String; nthreads_range::UnitRange{Int64}) at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/ThreadsXBenchmarks.jl:82
 [10] #run_all#10 at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/ThreadsXBenchmarks.jl:112 [inlined]
 [11] run_all(::String) at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/ThreadsXBenchmarks.jl:112
 [12] top-level scope at REPL[3]:1
 [13] eval(::Module, ::Any) at ./boot.jl:331
 [14] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:130
 [15] run_backend(::REPL.REPLBackend) at /home/chriselrod/.julia/packages/Revise/2K7IK/src/Revise.jl:1070
 [16] top-level scope at none:0

julia> InterruptException:
Stacktrace:
 [1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
 [2] wait at ./task.jl:709 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Array{Any,1}) at ./task.jl:294
 [6] macro expansion at ./task.jl:335 [inlined]
 [7] maptasks(::ThreadsX.Implementations.var"#97#98"{Float64,Base.Order.ForwardOrdering}, ::Base.Iterators.Zip{Tuple{Base.Iterators.PartitionIterator{SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}},Base.Iterators.PartitionIterator{Array{Int8,1}}}}) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
 [8] _quicksort!(::Array{Float64,1}, ::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Int64,Int64}, ::Base.Order.ForwardOrdering, ::Array{Int8,1}, ::Bool, ::Bool) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
 [9] sort!(::Array{Float64,1}, ::Int64, ::Int64, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}, ::Base.Order.ForwardOrdering) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:22
 [10] _sort! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:130 [inlined]
 [11] #sort!#86 at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:170 [inlined]
 [12] #3 at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:21 [inlined]
 [13] ##core#345(::NamedTuple{(:datasize, :dist, :alg, :basesize, :smallsize),Tuple{Int64,NamedTuple{(:label, :value),Tuple{String,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}},NamedTuple{(:label, :value),Tuple{String,ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}}},Nothing,Nothing}}, ::MersenneTwister, ::Array{Float64,1}, ::var"#3#14") at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
 [14] ##sample#346(::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:379
 [15] sample at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:394 [inlined]
 [16] _lineartrial(::BenchmarkTools.Benchmark{Symbol("##benchmark#344")}, ::BenchmarkTools.Parameters; maxevals::Int64, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:133
 [17] _lineartrial(::BenchmarkTools.Benchmark{Symbol("##benchmark#344")}, ::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:125
 [18] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [19] invokelatest at ./essentials.jl:709 [inlined]
 [20] #lineartrial#38 at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:33 [inlined]
 [21] lineartrial at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:33 [inlined]
 [22] tune!(::BenchmarkTools.Benchmark{Symbol("##benchmark#344")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:209
 [23] macro expansion at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:116 [inlined]
 [24] macro expansion at /home/chriselrod/.julia/packages/ProgressLogging/g8xnW/src/ProgressLogging.jl:470 [inlined]
 [25] macro expansion at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:104 [inlined]
 [26] macro expansion at /home/chriselrod/.julia/packages/ProgressLogging/g8xnW/src/ProgressLogging.jl:470 [inlined]
 [27] main(::Array{String,1}; benchmark_definitions::Dict{Symbol,NamedTuple{(:prepare, :run, :paramasdata, :paramaxes),T} where T<:Tuple}, scriptname::String, tags::Array{String,1}) at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:100
 [28] main at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:96 [inlined] (repeats 2 times)
 [29] top-level scope at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:144
 [30] include(::String) at ./client.jl:442
 [31] top-level scope at none:10
in expression starting at /home/chriselrod/.julia/packages/ThreadsXBenchmarks/3cFLt/src/scripts/scaling_nthreads_target.jl:143

tkf commented Mar 29, 2020

@chriselrod I'm wondering if the hang is data-dependent. If that's the case, can you get the hang more quickly with

seed = Ref(0)
@btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[] += 1), 0:0.01:1, 1_000_000))

and then using the same seed

@btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[]), 0:0.01:1, 1_000_000))

?

Also, it would be great if you could try the debug branch #103 of ThreadsX and run one of the MWEs to cause the hang. I'm wondering if one of the tasks threw something.


chriselrod commented Mar 30, 2020

julia> while true; @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)); end
...
seed[] += 1 = 34162
seed[] += 1 = 34163
seed[] += 1 = 34164
seed[] += 1 = 34165
seed[] += 1 = 34166
seed[] += 1 = 34167
seed[] += 1 = 34168
seed[] += 1 = 34169
seed[] += 1 = 34170
seed[] += 1 = 34171
seed[] += 1 = 34172
seed[] += 1 = 34173
seed[] += 1 = 34174
seed[] += 1 = 34175
seed[] += 1 = 34176
seed[] += 1 = 34177
seed[] += 1 = 34178
seed[] += 1 = 34179
seed[] += 1 = 34180
seed[] += 1 = 34181
seed[] += 1 = 34182
seed[] += 1 = 34183
seed[] += 1 = 34184
seed[] += 1 = 34185
seed[] += 1 = 34186
seed[] += 1 = 34187
seed[] += 1 = 34188
seed[] += 1 = 34189
seed[] += 1 = 34190
seed[] += 1 = 34191
seed[] += 1 = 34192
seed[] += 1 = 34193
seed[] += 1 = 34194
seed[] += 1 = 34195
seed[] += 1 = 34196
seed[] += 1 = 34197
seed[] += 1 = 34198
seed[] += 1 = 34199
^Cfatal: error thrown and no exception handler available.
InterruptException()
jl_mutex_unlock at /home/chriselrod/Documents/languages/julia/src/locks.h:143 [inlined]
jl_task_get_next at /home/chriselrod/Documents/languages/julia/src/partr.c:441
poptaskref at ./task.jl:702
wait at ./task.jl:709 [inlined]
task_done_hook at ./task.jl:444
jl_apply at /home/chriselrod/Documents/languages/julia/src/julia.h:1685 [inlined]
jl_finish_task at /home/chriselrod/Documents/languages/julia/src/task.c:198
start_task at /home/chriselrod/Documents/languages/julia/src/task.c:697
unknown function (ip: (nil))

It finally hung after seed[] += 1 returned 34199.
Starting a new Julia session after the crash (it crashes when I ctrl-C)...

julia> using ThreadsX, BenchmarkTools, Random

julia> seed = Ref(34199)
Base.RefValue{Int64}(34199)

julia> @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(seed[]), 0:0.01:1, 1_000_000));
  2.749 ms (21066 allocations: 10.11 MiB)


tkf commented Mar 30, 2020

Thanks for trying it out! So it's more like a timing thing...?

I asked people in the #multithreading Slack channel if they have any ideas about this.

@chriselrod

Have you tried to reproduce with

using BenchmarkTools, ThreadsX, Random
seed = Ref(0);
while true; @btime ThreadsX.sort!(xs) setup=(xs=rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)); end

?


tkf commented Mar 30, 2020

I tried it a bit and then switched to while true; @btime ThreadsX.sort($(rand(0:0.01:1, 1_000_000))); end after seeing your last comment, since it seemed more like a scheduler issue. So far no hang for me.

Do you mind sharing lscpu output, by the way?


tkf commented Mar 30, 2020

it crashes when I ctrl-C

I couldn't get the hang, but I do get this part too; ctrl-C during the while loop kills the Julia REPL.

@chriselrod

No hang? Interesting. For how long have you been running it?
Normally it doesn't take too long to hang for me. The last three hangs were after 10, 8, and 34 successful @btimes.

I can provide the full lscpu if you need it, but for now I truncated the details on vulnerabilities and flags:

> lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          36
On-line CPU(s) list:             0-35
Thread(s) per core:              2
Core(s) per socket:              18
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         4080.897
CPU max MHz:                     4600.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        6000.00
Virtualization:                  VT-x
L1d cache:                       576 KiB
L1i cache:                       576 KiB
L2 cache:                        18 MiB
L3 cache:                        24.8 MiB
NUMA node0 CPU(s):               0-35

It doesn't always crash. I just got:

^CERROR: InterruptException:
Stacktrace:
 [1] poptaskref(::Base.InvasiveLinkedListSynchronized{Task}) at ./task.jl:702
 [2] wait at ./task.jl:709 [inlined]
 [3] wait(::Base.GenericCondition{Base.Threads.SpinLock}) at ./condition.jl:106
 [4] _wait(::Task) at ./task.jl:238
 [5] sync_end(::Array{Any,1}) at ./task.jl:294
 [6] macro expansion at ./task.jl:335 [inlined]
 [7] _quicksort!(::Array{Float64,1}, ::SubArray{Float64,1,Array{Float64,1},Tuple{UnitRange{Int64}},true}, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Int64,Int64}, ::Base.Order.ForwardOrdering, ::Array{Int8,1}, ::Bool, ::Bool) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:97
 [8] sort!(::Array{Float64,1}, ::Int64, ::Int64, ::ThreadsX.Implementations.ParallelQuickSortAlg{Base.Sort.QuickSortAlg,Nothing,Int64}, ::Base.Order.ForwardOrdering) at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:22
 [9] _sort! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:130 [inlined]
 [10] #sort!#86 at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:170 [inlined]
 [11] sort! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/mergesort.jl:156 [inlined]
 [12] ##core#354(::Array{Float64,1}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
 [13] ##sample#355(::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
 [14] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#353")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:411
 [15] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#353")}, ::BenchmarkTools.Parameters) at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:399
 [16] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [17] invokelatest at ./essentials.jl:709 [inlined]
 [18] #run_result#37 at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
 [19] run_result at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined] (repeats 2 times)
 [20] top-level scope at /home/chriselrod/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:483
 [21] top-level scope at REPL[16]:1
 [22] eval(::Module, ::Any) at ./boot.jl:331
 [23] eval_user_input(::Any, ::REPL.REPLBackend) at /home/chriselrod/Documents/languages/julia/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:130
 [24] run_backend(::REPL.REPLBackend) at /home/chriselrod/.julia/packages/Revise/2K7IK/src/Revise.jl:1070
 [25] top-level scope at none:0

This is an issue for crashing on the interrupt exceptions: JuliaLang/julia#34184


tkf commented Mar 30, 2020

No hang? Interesting. For how long have you been running it?

Sadly no. I ran ~40 min last time and then gave up.

I can provide the full lscpu if you need it, but for now I truncated the details on vulnerabilities and flags:

Thanks, this is enough. I was mostly wondering about the NUMA topology, as I saw some strange crashes before when using multiple NUMA nodes (not sure if it's related at all, though).


tkf commented Mar 30, 2020

Ah, I was actually using a julia nightly I downloaded a while ago (1.5.0-DEV.464). Now that I've switched to 1.5.0-DEV.526, I get the hang! I get it with JULIA_NUM_THREADS=13 but not with JULIA_NUM_THREADS=8 so far.

@chriselrod Thanks for the help! Sorry for the extra work that could have been avoided if I had been a bit more cautious.

@chriselrod

Did you try JULIA_NUM_THREADS=13 with 1.5.0-DEV.464?

I wanted to bisect, so I jumped back to 1.5.0-DEV.460, and got the hang there.


tkf commented Mar 30, 2020

You are right. I tried to bisect the bug and it hangs with 1.5.0-DEV.465 (and so presumably with 1.5.0-DEV.464) too.

While trying the bisection I noticed that it can sometimes take more than 20 minutes to get a hang on my machine. So maybe I was too impatient before. I also switched to docker to run this (since the host OS is too old to build julia), so that may affect things somewhat... Anyway, I'll run this overnight with a larger range while no one is using the machine.

Bisection script (used with git bisect run env JULIA_NUM_THREADS=13 ./julia PATH/TO/script.jl):

run(`make -j$(Sys.CPU_THREADS)`)

code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""

const EXIT_CODE = Ref(2)
jl = run(
    pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
    wait = false,
)
try
    pid = getpid(jl)
    start = time_ns()
    top = open(pipeline(`top -b -d 10 -p$pid`; stderr = stderr))
    try
        unknown_line = 0
        while true
            local i
            while true
                ln = readline(top)
                if match(r"^ *PID ", ln) !== nothing
                    i = findfirst(==("%CPU"), split(ln))
                    if i === nothing
                        error("Cannot parse `top` header:\n", ln)
                    end
                    break
                end
            end
            ln = readline(top)
            columns = split(ln)
            if string(pid) in columns
                unknown_line = 0
                pcpu = parse(Float64, columns[i])
                @info "%CPU = $pcpu"
                minutes = (time_ns() - start) / 1000^3 / 60
                if minutes < 1
                    @info "Ignoring %CPU for first 1 minute... $((time_ns() - start) / 1000^3) seconds passed."
                    continue
                end
                if pcpu < 10.0
                    EXIT_CODE[] = 1
                    break
                end
                if minutes > 30
                    @info "More than 30 minutes without a hang."
                    EXIT_CODE[] = 0
                    break
                end
                @info "$minutes minutes without a hang...."
            else
                if jl.exitcode >= 0
                    error("Unexpected exit")
                end
                unknown_line > 10 && error("Too many parse failures")
                unknown_line += 1
                @error "Cannot parse `top` output. Continuing..." ln
            end
        end
    finally
        kill(top)
    end
finally
    kill(jl)
end
exit(EXIT_CODE[])


chriselrod commented Mar 31, 2020

That's incredible -- really neat script!

I ran while true; @btime ThreadsX.sort($(rand(0:0.01:1, 1_000_000))); end for more than half an hour without a problem. I picked a commit on the master branch from back in November (where I again ran it for over half an hour without the issue) as the good starting point.

I'm running your script now, and will let you know how it goes.


tkf commented Mar 31, 2020

Great! This is good news! Thanks for figuring out a version that works. I did think the bug might be in the scheduler, but if it weren't, I'd be hopelessly lost.


chriselrod commented Mar 31, 2020

I modified the script to use ps instead of top, because my OS provides htop instead of top (i.e., /usr/bin/top is actually htop), which doesn't have the -b option and whose output isn't really parseable:

run(`make -j$(Sys.CPU_THREADS)`)

code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""

const EXIT_CODE = Ref(2)
jl = run(
    pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
    wait = false,
)
try
    pid = getpid(jl)
    io = IOBuffer()
    cmd = pipeline(`ps -p$pid -h -o %cpu`, stdout = io)
    start = time_ns()
    while true
        sleep(10)
        run(cmd)
        pcpu = parse(Float64, String(take!(io)))
        @info "%CPU = $pcpu"
        minutes = (time_ns() - start) / 1000^3 / 60
        if minutes < 1
            @info "Ignoring %CPU for first 1 minute... $((time_ns() - start) / 1000^3) seconds passed."
            continue
        end
        if pcpu < 10.0
            EXIT_CODE[] = 1
            break
        end
        if minutes > 30
            @info "More than 30 minutes without a hang."
            EXIT_CODE[] = 0
            break
        end
        @info "$minutes minutes without a hang...."
    end
finally
    kill(jl)
end
exit(EXIT_CODE[])

Yeah, it's also fortunate that this bug doesn't seem to impact 1.4 or any released version of Julia.


tkf commented Mar 31, 2020

Hmm... It's too bad that we can't rely on top really being top.

I modified the script to use ps instead of top

This may be OS-dependent, but when I tried using ps before, its %cpu output was not really useful (not as accurate as top's). See: https://unix.stackexchange.com/questions/58539/top-and-ps-not-showing-the-same-cpu-result
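The underlying issue is that ps's %cpu column is averaged over the whole lifetime of the process, while top reports usage over the last sampling interval. One workaround is to sample cumulative CPU time (e.g., what ps reports via cputimes) and difference it over the sampling interval yourself; a minimal sketch, with made-up sample values:

```julia
# Interval %CPU from two cumulative-CPU-time samples, rather than trusting
# ps's lifetime-average %cpu column. The numbers below are made up for
# illustration; in practice they'd come from repeated `ps -o cputimes` calls.
cpu_then, t_then = 100.0, 50.0   # CPU-seconds used, wall-clock seconds
cpu_now,  t_now  = 172.0, 60.0

pcpu = 100 * (cpu_now - cpu_then) / (t_now - t_then)
println("interval %CPU = ", pcpu)   # 720.0 on these made-up samples
```

A hung-but-slowly-declining lifetime average would never trip a threshold, whereas the interval figure drops to near zero as soon as the process stops consuming CPU.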


chriselrod commented Mar 31, 2020

You're right, I just ran into this problem. It hung, but the %CPU was declining only slowly from >700% (with 8 threads). It wouldn't have gotten below 10% before the 30 minutes were up.


chriselrod commented Mar 31, 2020

I think this may work:

run(`make -j$(Sys.CPU_THREADS)`)

code = raw"""
using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end
"""

const EXIT_CODE = Ref(2)
jl = run(
    pipeline(`$(Base.julia_cmd()) -e "$code"`; stdout = stdout, stderr = stderr);
    wait = false,
)
try
    start = time_ns()
    pid = getpid(jl)
    io = IOBuffer()
    cmd = pipeline(`ps -p$pid -h -o cputimes`, stdout = io)
    current_cpu = last_cpu = 0.0;
    last_time = current_time = 0.0;
    while true
        sleep(10)
        run(cmd)
        current_cpu = parse(Float64, String(take!(io)))
        current_time = (time_ns() - start) * 1e-9
        pcpu = 100.0 * (current_cpu - last_cpu) / (current_time - last_time)
        @info "%CPU = $pcpu"
        minutes = current_time / 60
        if minutes < 1
            @info "Ignoring %CPU for first 1 minute... $(current_time) seconds passed."
            continue
        end
        if pcpu < 10.0
            EXIT_CODE[] = 1
            break
        end
        if minutes > 120
            @info "More than 120 minutes without a hang."
            EXIT_CODE[] = 0
            break
        end
        @info "$minutes minutes without a hang...."
        last_time = current_time; last_cpu = current_cpu;
    end
finally
    kill(jl)
end
exit(EXIT_CODE[])

Above, it's set to go for 2 hours before deciding that things work.
That's probably not long enough.

I decided to try the following commit a few times:
JuliaLang/julia@65b8e7e
I got the deadlock at seed[] += 1 = 12 and = 73, and it's currently at 196 (after 33 minutes) without yet locking up.
I think I'd need to run it for a very long time before being reasonably confident that it won't.

It may also be worth modifying the script to run multiple instances in parallel. I get the deadlock with 8 threads, and the CPU has 18 physical cores. It's probably worth running three instances of 8 threads in parallel, and declaring the commit bad if any of them locks up.

EDIT:
Finally hung at 239. As a rough estimate from these 3 data points (assuming a geometric distribution), anything under 500 is plausible, so we'd have to run for at least that many before finding it likely that it won't deadlock eventually.
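The geometric-distribution estimate above can be made concrete. As a rough sketch (assuming each @btime run is an independent trial, using the three observed hang points 12, 73, and 239 from this thread):

```julia
# Back-of-the-envelope for the geometric-distribution estimate above.
# Observed numbers of @btime runs before each hang (from this thread):
hangs = [12, 73, 239]

# Maximum-likelihood estimate of the per-run hang probability:
p = length(hangs) / sum(hangs)   # 3 / 324, ≈ 0.0093

# Smallest n such that P(no hang in n runs) = (1 - p)^n falls below 5%:
n95 = ceil(Int, log(0.05) / log(1 - p))

println("p ≈ ", round(p, digits = 4), "; 95% bound ≈ ", n95, " runs")
```

Under these assumptions the 95% bound comes out in the low 300s of runs, so the ~500 figure above adds a healthy extra margin.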


chriselrod commented Mar 31, 2020

Could this be a problem?

signal (11): Segmentation fault
in expression starting at REPL[3]:1
- at ./int.jl:52 [inlined]
unsafe_length at ./range.jl:517 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x7f67b598804f)
unknown function (ip: (nil))
Allocations: 10113947179 (Pool: 10113915735; Big: 31444); GC: 29811

That is:
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
There'd have to have been something weird with the subarrays.
EDIT: I'm running Julia with --check-bounds=yes to see if I eventually get a BoundsError.
EDIT: As an update, with --check-bounds=yes, it's been running for about 2 and a half hours, having completed 861 @btimes without hanging, running into an error, or crashing.


tkf commented Mar 31, 2020

I decided to try the following commit a few times:
JuliaLang/julia@65b8e7e

Yes, my bisection points to JuliaLang/julia#32599 (which includes JuliaLang/julia@65b8e7e), too. After this PR (JuliaLang/julia@dc46ddd) I can get the hang but not before the PR (JuliaLang/julia@f5dbc47).

Somehow it was hard to get a hang with JuliaLang/julia@dc46ddd. It took 18 minutes to get a deadlock one time. I missed this commit the first time since it didn't cause the deadlock for 30 minutes. I ran the script for one hour for JuliaLang/julia@f5dbc47 once and it didn't cause the hang.


tkf commented Mar 31, 2020

Which Julia revision did you use to get the segmentation fault?


tkf commented Mar 31, 2020

The last run with JuliaLang/julia@f5dbc47 (before JuliaLang/julia#32599) caused another segmentation fault:

seed[] += 1 = 247
[ Info: %CPU = 1167.0
[ Info: 45.550472802933335 minutes without a hang....

signal (11): Segmentation fault
in expression starting at none:3
setindex! at ./Base.jl:0 [inlined]
setindex! at ./subarray.jl:301 [inlined]
partition_sizes! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:167
#97 at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158 [inlined]
#3 at ./threadingconstructs.jl:126
unknown function (ip: 0x7f54cc050ee6)
_jl_invoke at /julia/src/gf.c:2144 [inlined]
jl_apply_generic at /julia/src/gf.c:2328
jl_apply at /julia/src/julia.h:1695 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 6273831512 (Pool: 6273798676; Big: 32836); GC: 17817

But, staring at the code,

function partition_sizes!(xs, cs, pivot, order)
    nbelows = 0
    nequals = 0
    @inbounds for i in eachindex(xs, cs)
        x = xs[i]
        b = Base.lt(order, x, pivot)
        a = Base.lt(order, pivot, x)
        cs[i] = ifelse(b, -Int8(1), ifelse(a, Int8(1), Int8(0)))
        nbelows += Int(b)
        nequals += Int(!(a | b))
    end
    return (nbelows, nequals)
end

I don't know how it'd cause a segfault.
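For what it's worth, the function's invariants are easy to check serially in isolation. The sketch below copies the quoted body (spelling lt out as Base.Order.lt; whether that matches the package's exact imports at this revision is an assumption) and verifies the counts against a direct recount:

```julia
using Base.Order: Forward, lt   # lt(Forward, a, b) is isless(a, b)

# Standalone copy of the quoted partition_sizes! for sanity-checking.
function partition_sizes!(xs, cs, pivot, order)
    nbelows = 0
    nequals = 0
    @inbounds for i in eachindex(xs, cs)
        x = xs[i]
        b = lt(order, x, pivot)
        a = lt(order, pivot, x)
        cs[i] = ifelse(b, -Int8(1), ifelse(a, Int8(1), Int8(0)))
        nbelows += Int(b)
        nequals += Int(!(a | b))
    end
    return (nbelows, nequals)
end

xs = rand(0:0.01:1, 1_000)
cs = similar(xs, Int8)
pivot = 0.5
nbelows, nequals = partition_sizes!(xs, cs, pivot, Forward)

# Run serially, the counts agree with a direct recount, and cs holds the
# sign of each comparison against the pivot.
@assert nbelows == count(<(pivot), xs)
@assert nequals == count(==(pivot), xs)
@assert all(i -> cs[i] == cmp(xs[i], pivot), eachindex(xs))
```

Since the serial logic checks out, any fault here would seemingly have to involve the views or tasks it runs under, not the loop body itself.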


tkf commented Mar 31, 2020

Looking at the code in your stacktrace #104 (comment), I'm puzzled as to why it'd cause a segmentation fault. Is it possible that the stacktrace is broken?

- at ./int.jl:52
unsafe_length at ./range.jl:517
unsafe_indices at ./abstractarray.jl:99
_indices_sub at ./subarray.jl:409
axes at ./subarray.jl:404
axes1 at ./abstractarray.jl:95
eachindex at ./abstractarray.jl:267
eachindex at ./abstractarray.jl:270
eachindex at ./abstractarray.jl:260


tkf commented Mar 31, 2020

I'm running the MWE with --check-bounds=yes and it's seed[] += 1 = 423 now. No hang or segfault so far.

Edit: I gave up after seed[] += 1 = 1077


tkf commented Apr 1, 2020

So far, I think we discovered that:

  1. ThreadsX.jl and/or julia before the "threads: further optimize scheduler based on #32551" PR (JuliaLang/julia#32599) have a bug that manifests as a segfault when quicksort is run for a very long time.
  2. That PR (JuliaLang/julia#32599) makes this bug manifest as a hang, which can happen more easily than before the PR.

@chriselrod Do you agree?


chriselrod commented Apr 1, 2020

Which Julia revision you used to get the segmentation fault?

That was 1.5.0-DEV.300. JuliaLang/julia@65b8e7e
I have no idea why it would segfault either. In my ignorance, a broken stack trace seems possible.

Edit: I gave up after seed[] += 1 = 1077

I gave up after seed[] += 1 = 1435. So it seems like --check-bounds=yes makes the problems much less likely (perhaps impossible). I'm still using 8 threads, while you were using 13.
I'll try with other numbers of threads. I don't know how brittle this is.

I think you may be right, but it's really hard to be confident given how sporadic these hangs are.

EDIT:
I just got two hangs on JuliaLang/julia@65b8e7e
at seed[] += 1 = 460 and = 470 with 8 and 18 threads, respectively.


tkf commented Apr 2, 2020

So I just filed an issue, JuliaLang/julia#35341, with a summary of the things we discovered so far.

I'm also trying to build julia with ASAN (JuliaLang/julia#35338 (comment)), hoping that it finds some memory-related bug. But I don't know anything about LLVM or C++, so I'm not sure how much I can find.
