StackOverflowError when creating RemoteChannels #30679

Closed
martinbiel opened this issue Jan 10, 2019 · 7 comments
Labels: bug (Indicates an unexpected problem or unintended behavior), parallelism (Parallel or distributed computation), regression (Regression in behavior compared to a previous version)

Comments

@martinbiel

I have noticed some strange behavior with RemoteChannel in 1.0.3 that did not occur for me in 1.0.2. I think I have identified an MWE. The following works:

julia> using Distributed

julia> addprocs(1)
1-element Array{Int64,1}:
 2

julia> @everywhere struct A end

julia> RemoteChannel(()->Channel{A}(1), 2)
RemoteChannel{Channel{A}}(2, 1, 5)

However, if I import some module before the struct definition, the following happens:

julia> using Distributed

julia> addprocs(1)
1-element Array{Int64,1}:
 2

julia> using JuMP

julia> @everywhere struct A end

julia> RemoteChannel(()->Channel{A}(1), 2)
ERROR: StackOverflowError:
deserialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Type{RemoteChannel{Channel{A}}}) at /usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:310 (repeats 100 times)
Stacktrace:
 [1] #remotecall_fetch#149(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.Worker, ::Function, ::Vararg{Any,N} where N) at /usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:379
 [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any,N} where N) at /usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:371
 [3] #remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64, ::Function, ::Vararg{Any,N} where N) at /usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:406
 [4] remotecall_fetch at /usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:406 [inlined]
 [5] RemoteChannel(::Function, ::Int64) at /usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:108
 [6] top-level scope at none:0

The error repeats consistently after consecutive attempts of RemoteChannel(()->Channel{A}(1), 2), but if I run some code snippet involving the A type on the second node I can suddenly create channels again:

julia> @fetchfrom 2 A()
A()

julia> RemoteChannel(()->Channel{A}(1), 2)
RemoteChannel{Channel{A}}(2, 1, 16)

The error does not occur after every module import. For example, importing MacroTools, Statistics, or seemingly any standard library does not lead to this error. Other large modules I have tried that do lead to the error are Plots and Distributions. I have not found any common denominator among the module imports that cause this error. My version info:

julia> versioninfo()
Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
@KristofferC KristofferC added parallelism Parallel or distributed computation regression Regression in behavior compared to a previous version bug Indicates an unexpected problem or unintended behavior labels Jan 10, 2019
@dwfmarchant

dwfmarchant commented Jan 15, 2019

I've run into the same issue when using DataFrames and CSV with this example:

using DataFrames
using Distributed
addprocs(1)
out = RemoteChannel(()->Channel(1), 2)
println(out)

The way it crashes seems to vary depending on the platform I'm running it on. On Linux (Ubuntu 16.04.4) it does not crash, but hangs on the RemoteChannel line while slowly ramping memory use to 100%. On macOS I get either a similar error:

ERROR: LoadError: StackOverflowError:
deserialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Type{RemoteChannel{Channel{Any}}}) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:310 (repeats 100 times)
Stacktrace:
 [1] #remotecall_fetch#149(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.Worker, ::Function, ::Vararg{Any,N} where N) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:379
 [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any,N} where N) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:371
 [3] #remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64, ::Function, ::Vararg{Any,N} where N) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:406
 [4] remotecall_fetch at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:406 [inlined]
 [5] RemoteChannel(::Function, ::Int64) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:108
 [6] top-level scope at none:0
 [7] include at ./boot.jl:317 [inlined]
 [8] include_relative(::Module, ::String) at ./loading.jl:1044
 [9] include(::Module, ::String) at ./sysimg.jl:29
 [10] exec_options(::Base.JLOptions) at ./client.jl:266
 [11] _start() at ./client.jl:425

or a segmentation fault.

@dwfmarchant

I just tested this in 1.1.0-rc2.0 and on macOS the StackOverflowError I was getting in 1.0.3 has changed to

[1]    75471 bus error  /Applications/Julia-1.1.app/Contents/Resources/julia/bin/julia

@ghost

ghost commented Jan 23, 2019

1bd2334 is the first bad commit according to git bisect, reverting the commit from v1.1.0 fixes the issue for me.

@JeffBezanson
Member

Thanks for doing the bisect. I don't know how that could have caused this but I'll look into it.

@JeffBezanson
Member

I can reproduce this on 1.0.3 but not on master.

@JeffBezanson
Member

Ok, progress. This is hitting this case (julia/src/gf.c, line 53 in e87b19b):

// TODO: if `meth` came from an `invoke` call, we should make sure

causing Distributed's deserialize method to call itself instead of the intended invoke target. @vtjnash changing it to just unconditionally call fptr fixes the problem. I see from the comment you may have intended a different fix, but could we change this for now?
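
For readers unfamiliar with the pattern involved, here is a minimal sketch of a specialized method delegating to a more general one through `invoke`. The names `AbstractReader`, `ClusterReader`, and `decode` are hypothetical stand-ins and are not Distributed's actual types or methods. When `invoke` honors the signature it is given, the call reaches the general method; if the runtime instead resolved the call by the runtime argument types, the specialized method would call itself without end, which would match the repeated `deserialize` frame in the stack traces above.

```julia
# Hypothetical types for illustration only (not part of Distributed).
abstract type AbstractReader end
struct ClusterReader <: AbstractReader end

# General method: the intended `invoke` target.
decode(r::AbstractReader, x::Any) = "generic($x)"

# Specialized method: does some extra work, then explicitly delegates
# to the general method by naming its signature in `invoke`.
function decode(r::ClusterReader, x::Integer)
    inner = invoke(decode, Tuple{AbstractReader, Any}, r, x)
    return "cluster[" * inner * "]"
end

# With correct `invoke` semantics this terminates after one delegation.
println(decode(ClusterReader(), 1))   # prints cluster[generic(1)]
```

Replacing the `invoke` line with a plain `decode(r, x)` call would dispatch back to the same specialized method and overflow the stack, which is roughly the behavior described in this comment.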

@Pbellive
Contributor

Thanks for looking at this, @JeffBezanson. In case it's useful information, I can reproduce this on master. I built a fresh clone from source yesterday (commit e87b19b) on Ubuntu 16.04. Running @dwfmarchant's example from earlier in this thread I get:

              _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.2.0-DEV.208 (2019-01-23)
 _/ |\__'_|_|_|\__'_|  |  Commit e87b19b* (0 days old master)
|__/                   |

julia> using Distributed

julia> using DataFrames

julia> addprocs(1)

1-element Array{Int64,1}:
 2

julia> out = RemoteChannel(()->Channel(1), 2)
ERROR: StackOverflowError:
deserialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Type{RemoteChannel{Channel{Any}}}) at /disk1/common/juliaLang/julia-master/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:310 (repeats 100 times)
Stacktrace:
 [1] #remotecall_fetch#149(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.Worker, ::Function, ::Vararg{Any,N} where N) at /disk1/common/juliaLang/julia-master/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:379
 [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any,N} where N) at /disk1/common/juliaLang/julia-master/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:371
 [3] #remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64, ::Function, ::Vararg{Any,N} where N) at /disk1/common/juliaLang/julia-master/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:406
 [4] remotecall_fetch at /disk1/common/juliaLang/julia-master/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:406 [inlined]
 [5] RemoteChannel(::Function, ::Int64) at /disk1/common/juliaLang/julia-master/usr/share/julia/stdlib/v1.2/Distributed/src/remotecall.jl:108
 [6] top-level scope at REPL[4]:1

Like @martinbiel, I've found that this error occurs when I have some packages loaded but not others. I can't find any common denominator among the packages that cause this crash. It's been cropping up in some of my company's parallel Julia codes when trying to compute values on remote worker Julia processes and store them in RemoteChannels. I've found one example of a lightweight, pure-Julia registered package (GeometricalPredicates) with no dependencies that is affected by this bug. This time I tested on Julia 1.0.3 on a machine running Ubuntu 17.10:

              _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.3 (2018-12-18)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Distributed

julia> addprocs(2)
2-element Array{Int64,1}:
 2
 3

julia> using GeometricalPredicates

julia> out = RemoteChannel(()->Channel(1), 2)
ERROR: StackOverflowError:
deserialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Type{RemoteChannel{Channel{Any}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:310 (repeats 100 times)
Stacktrace:
 [1] #remotecall_fetch#149(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.Worker, ::Function, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:379
 [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:371
 [3] #remotecall_fetch#152(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64, ::Function, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:406
 [4] remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:406 [inlined]
 [5] RemoteChannel(::Function, ::Int64) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:108
 [6] top-level scope at none:0

Like @GAIKA, I find that this problem goes away if I revert commit 1bd2334, starting from yesterday's master in my case.

@JeffBezanson JeffBezanson self-assigned this Jan 28, 2019
@KristofferC KristofferC mentioned this issue Feb 4, 2019
KristofferC pushed a commit that referenced this issue Feb 11, 2019
@KristofferC KristofferC mentioned this issue Feb 11, 2019
JeffBezanson added a commit that referenced this issue Apr 26, 2019
KristofferC pushed a commit that referenced this issue Apr 27, 2019
KristofferC pushed a commit that referenced this issue Feb 20, 2020