Hang #1 in Eager API usage #282

Closed
krynju opened this issue Sep 16, 2021 · 7 comments · Fixed by #284

Comments

@krynju
Member

krynju commented Sep 16, 2021

Starting an effort to document any somewhat reproducible hangs.

Observations/conditions:

  1. Can be interrupted with Ctrl-C; stacktrace below (the other hangs, not so much)
  2. Some Julia instances will hang almost immediately (1st or 2nd run of groupby), others will never hang no matter how many runs (consistent with other hangs)
  3. Julia master with all available fixes merged and Dagger with all available fixes merged
  4. Running with threads only

Thread usage during the hang: none (screenshot omitted)

Stacktrace:

PS C:\Users\krynjupc\.julia\dev\Dagger> julia -t16
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-DEV.490 (2021-09-13)
 _/ |\__'_|_|_|\__'_|  |  kr/distributed-ref-count-race/dfd4724ce3 (fork: 2 commits, 8 days)
|__/                   |

(@v1.8) pkg> activate .
  Activating project at `C:\Users\krynjupc\.julia\dev\Dagger`

julia> using Dagger, DataFrames, Arrow, OnlineStats

julia> d = DTable(Arrow.Table, "data/".*readdir("data"))
DTable with 100 partitions
Tabletype: unknown (use `tabletype!(::DTable)`)

julia> g = Dagger.groupby(d, x->round(x.a, digits=1));
ERROR: InterruptException:
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base .\task.jl:764
  [2] wait()
    @ Base .\task.jl:824
  [3] wait(c::Base.GenericCondition{ReentrantLock})
    @ Base .\condition.jl:112
  [4] fetch_buffered(c::Channel{Any})
    @ Base .\channels.jl:366
  [5] fetch(c::Channel{Any})
    @ Base .\channels.jl:360
  [6] fetch_ref(::Distributed.RRID)
    @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:593
  [7] call_on_owner
    @ C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:546 [inlined]
  [8] fetch(r::Distributed.Future)
    @ Distributed C:\cygwin64\home\krynjupc\julia\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:587
  [9] (::Dagger.var"#73#74"{OSProc, Dagger.ThunkFuture})()
    @ Dagger C:\Users\krynjupc\.julia\dev\Dagger\src\thunk.jl:132
 [10] thunk_yield(f::Dagger.var"#73#74"{OSProc, Dagger.ThunkFuture})
    @ Dagger.Sch C:\Users\krynjupc\.julia\dev\Dagger\src\sch\eager.jl:63
 [11] fetch(t::Dagger.ThunkFuture; proc::OSProc)
    @ Dagger C:\Users\krynjupc\.julia\dev\Dagger\src\thunk.jl:131
 [12] fetch
    @ C:\Users\krynjupc\.julia\dev\Dagger\src\thunk.jl:131 [inlined]
 [13] fetch(t::Dagger.EagerThunk)
    @ Dagger C:\Users\krynjupc\.julia\dev\Dagger\src\thunk.jl:193
 [14] groupby(d::DTable, f::Function; merge::Bool, chunksize::Int64)
    @ Dagger C:\Users\krynjupc\.julia\dev\Dagger\src\table\groupby.jl:70
 [15] groupby(d::DTable, f::Function)
    @ Dagger C:\Users\krynjupc\.julia\dev\Dagger\src\table\groupby.jl:57
 [16] top-level scope
    @ REPL[4]:1
@krynju
Member Author

krynju commented Sep 17, 2021

2nd hang type on same example:

Notes:

  • no stacktrace (can't be interrupted with Ctrl-C)
  • killed all the threads and still couldn't do anything; no diagnostic output either
  • happened after multiple runs of groupby

Thread usage: (screenshot omitted)

@krynju
Member Author

krynju commented Sep 19, 2021

I have an example which locks up in exactly the same place every time in a single-threaded environment!

branch https://github.com/krynju/Dagger.jl/tree/kr/dtable-groupby
commit d019e45

Steps:

  1. $Env:JULIA_DEBUG = "Dagger"
  2. Put the following in a file (tst.jl here)
using Pkg
Pkg.activate(".")
using Dagger, Random
println("run start")
rng = MersenneTwister(1111)
s = 1_000_000 * 100; d = DTable(NamedTuple((a=rand(rng, s), b=rand(rng, s), c=rand(rng, s), d=rand(rng, s))), 1_000_000); GC.gc()
println("dtable create ok")
g = Dagger.groupby(d, x->round(x.a, digits=1), chunksize=1_000_000)
println("dtable groupby ok")
  3. Run it from the terminal: julia -t1 .\tst.jl

  4. There are two possible results (logs attached):

    • a hang at the same thunk every time
    • (rare) a Julia crash, which I've seen many times but whose error message I couldn't capture because my terminal kept closing

log_hang.txt
log_with_error.txt

@krynju
Member Author

krynju commented Sep 19, 2021

MWE

PS C:\Users\krynjupc\.julia\dev\Dagger> julia -t1
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.0-DEV.548 (2021-09-16)
 _/ |\__'_|_|_|\__'_|  |  Commit c5f348726c* (2 days old master)
|__/                   |

julia> using Dagger

julia> f = (x) -> 10 + x
#1 (generic function with 1 method)

julia> g = (x) -> fetch(Dagger.spawn(f, x))
#3 (generic function with 1 method)

julia> fetch(Dagger.spawn(g, 10))
┌ Debug: (1) eager_thunk (1) Using available Dagger.ThreadProc(1, 1): 1000000000 | 0/1000000000
└ @ Dagger.Sch C:\Users\krynjupc\.julia\packages\Dagger\qst7O\src\sch\Sch.jl:949
┌ Debug: (1) #3 (2) Using available Dagger.ThreadProc(1, 1): 1000000000 | 0/1000000000
└ @ Dagger.Sch C:\Users\krynjupc\.julia\packages\Dagger\qst7O\src\sch\Sch.jl:949
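
For intuition only, here is a minimal sketch of how a nested blocking fetch can stall when there is a single worker. It uses plain Base tasks and channels. The names `jobs`, `submit`, `inner` and `outer` are hypothetical, and this is an analogy, not Dagger's scheduler and not a confirmed root cause of the hang above.

```julia
# Toy single-worker executor (hypothetical; NOT Dagger's scheduler).
jobs = Channel{Function}(32)
worker = @async for job in jobs
    job()
end

# Queue `f` for the worker and return a Channel that will hold its result.
function submit(f)
    result = Channel{Any}(1)
    put!(jobs, () -> put!(result, f()))
    return result
end

inner() = 10 + 10
# `outer` runs on the only worker and blocks there waiting for `inner`,
# which is queued behind it and can therefore never start.
outer() = fetch(submit(inner))

r = submit(outer)
# Watchdog: wait at most one second for the result instead of blocking forever.
status = timedwait(() -> isready(r), 1.0; pollint=0.05)
println(status)   # prints "timed_out"
close(jobs)
```

With a second worker draining `jobs`, the inner job would get picked up and the fetch would return, which is at least consistent with the single-threaded case being the one that hangs deterministically.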

@krynju
Member Author

krynju commented Sep 20, 2021

MWE for the multithreaded hang; run with julia -t16 .\tst2.jl

using Pkg
Pkg.activate(".")
ENV["JULIA_DEBUG"] = "Dagger"
using Dagger, Random, Logging
io = open("log.txt", "w+")
logger = SimpleLogger(io, Debug)
global_logger(logger)
println("run start")

f = (x) -> 10 + x
g = (vs...) -> begin 
    s = fetch.(Dagger.spawn.(f, vs))
    sum(s)
end

h = (x) -> x+10+rand(Int)%10
vs = [Dagger.@spawn h(10) for _ in 1:500]
ff = fetch(Dagger.@spawn g(vs...))
println("run end ", ff)
flush(io)

log.txt

@krynju
Member Author

krynju commented Sep 22, 2021

MWE without Dagger
run with multiple threads

using Distributed
create_future = () -> Future()
put_future = (f) -> put!(f, Threads.threadid())
fetch_future = (f) -> fetch(f)

for x in 1:10
    _f = [create_future() for i in 1:10]
    Threads.@spawn put_future.(_f)
    t = Threads.@spawn fetch_future.(_f)
    wait(t)
end
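
A possible way to make this easier to iterate on is to put a watchdog around each round, so a stalled round gets counted instead of blocking the session. This is only a sketch: `handoff_round` and its `timeout` parameter are hypothetical names, and a hard hang (the kind that can't be interrupted) would presumably not be caught by it.

```julia
using Distributed

# Hypothetical harness: one put!/fetch handoff over `n` Futures, guarded by a watchdog.
# Returns :ok if the reader task finished in time, :timed_out otherwise.
function handoff_round(n::Int; timeout::Float64=5.0)
    futures = [Future() for _ in 1:n]
    # Writer task fills every future with the id of the thread it runs on.
    Threads.@spawn foreach(f -> put!(f, Threads.threadid()), futures)
    # Reader task fetches every future; this is where a hang would show up.
    reader = Threads.@spawn map(fetch, futures)
    return timedwait(() -> istaskdone(reader), timeout; pollint=0.05)
end

# Count how many of 10 rounds stalled (run with multiple threads, e.g. julia -t16).
stalled = count(_ -> handoff_round(10) === :timed_out, 1:10)
println("stalled rounds: ", stalled)
```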

@krynju
Member Author

krynju commented Oct 2, 2021

Update:

Very heavy multithreaded Dagger usage still hangs, though very rarely. I suspect it's connected to the single-threaded issue that we have.
No matter how hard I try to hang pure multithreaded Futures, they just won't hang.

@krynju
Member Author

krynju commented Oct 14, 2021

The hangs under very heavy multithreaded usage mentioned in the last comment were addressed in JuliaData/MemPool.jl#55

The above PRs are not merged yet, but they already provide a hang-free experience 🎉
