fickle segmentation fault or bus error when using pmap #13806

Closed
omalled opened this issue Oct 28, 2015 · 2 comments
Labels
parallelism (Parallel or distributed computation), regression (Regression in behavior compared to a previous version)

Comments

omalled commented Oct 28, 2015

I ran into some problems with Julia crashing during calls to pmap. I put together the smallest example I could come up with that reproduces the bug, but it isn't all that small. The issue seems pretty fickle. First, a module is required (I put it in M.jl, in the directory where the test code will be run):

module M
function untransform(y::Vector, transformparams::Vector)
    return y
end
function transformfunction(f::Function, transformparams::Vector)
    function transformedf(y::Vector)
        x = untransform(y, transformparams)
        return f(x)
    end
    return transformedf
end
end

These functions need to be in a module, or the bug does not appear. Here's the code to reproduce the bug. It needs to be run in parallel for the bug to appear (e.g., julia -p 2):

@everywhere push!(LOAD_PATH, "./")
import M
function makef()
    g(x) = x
    function thisf(p::Vector)
        println("a")
        result = g(1)
        println("b")
        return 1
    end
    return thisf
end
function callpmap2(h)
    pmap(h, fill(zeros(2), 2))
end
function callpmap1(h)
    h(zeros(2))
    pmap(h, fill(zeros(2), 2))#all hell breaks loose if we call h before doing the pmap
end
f = makef()
f_trans = M.transformfunction(f, zeros(2))
callpmap2(f_trans)#works
callpmap1(f_trans)#bus error (Mac), segmentation fault (Ubuntu)
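
For anyone reproducing this from an already running session rather than with `julia -p 2`, the workers can also be started with `addprocs`. This is a minimal sketch, assuming Julia 0.7 or later, where the parallel functions live in the Distributed standard library (in 0.4 they were part of Base and no `using` line was needed):

```julia
using Distributed  # assumption: Julia 0.7 or later; not needed on 0.4

# Start two local worker processes, as `julia -p 2` would do at launch.
addprocs(2)

# The master is process 1; the workers are the additional processes.
println(nprocs())    # total number of processes, master included
println(nworkers())  # number of worker processes
```

The `@everywhere push!(LOAD_PATH, "./")` line has to run after the workers exist, since `@everywhere` only reaches processes that are already running.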

If I run it on Mac OS X 10.10.5 with Julia 0.4.0, I get:

    From worker 2:  a
    From worker 2:  b
    From worker 3:  a
    From worker 3:  b
a
b

signal (10): Bus error: 10

signal (10): Bus error: 10
_ZL17jl_add_linfo_rootP17_jl_lambda_info_tP11_jl_value_t at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:1704
_ZL9emit_exprP11_jl_value_tP12jl_codectx_tbb at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:3232
_ZL17jl_add_linfo_rootP17_jl_lambda_info_tP11_jl_value_t at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:1704
_ZL11emit_jlcallPN4llvm5ValueES1_PP11_jl_value_tmP12jl_codectx_t at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:2519
_ZL9emit_exprP11_jl_value_tP12jl_codectx_tbb at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:3232
_ZL9emit_callPP11_jl_value_tmP12jl_codectx_tS0_ at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:2679
_ZL11emit_jlcallPN4llvm5ValueES1_PP11_jl_value_tmP12jl_codectx_t at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:2519
_ZL13emit_functionP17_jl_lambda_info_t at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:4802
_ZL9emit_callPP11_jl_value_tmP12jl_codectx_tS0_ at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:2679
_Z19jl_eh_restore_stateP13_jl_handler_t at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1410
_ZL13emit_functionP17_jl_lambda_info_t at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:4802
jl_compile at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:808
jl_trampoline_compile_function at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/builtins.c:1025
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
anonymous at /localpath/M.jl:8
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
anonymous at multi.jl:892
run_work_thunk at multi.jl:645
jlcall_run_work_thunk_21375 at  (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
anonymous at multi.jl:892
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/task.c:241
_Z19jl_eh_restore_stateP13_jl_handler_t at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1410
jl_compile at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/codegen.cpp:808
jl_trampoline_compile_function at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/builtins.c:1025
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
anonymous at /localpath/M.jl:8
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
anonymous at multi.jl:892
run_work_thunk at multi.jl:645
jlcall_run_work_thunk_21342 at  (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
anonymous at multi.jl:892
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/task.c:241
Worker 3 terminated.
ERROR (unhandled task failure): EOFError: read end of file
Worker 2 terminated.

If I run it on Ubuntu 14.04 with Julia 0.4.0, I get:

    From worker 3:  a
    From worker 3:  b
    From worker 2:  a
    From worker 2:  b
a
b

signal (11): Segmentation fault

signal (11): Segmentation fault
unknown function (ip: 0x7f6302a61b78)
unknown function (ip: 0x7f6302a81b33)
unknown function (ip: 0x7f6302a891b7)
unknown function (ip: 0x7f6302a897c2)
unknown function (ip: 0x7f6302a830f8)
unknown function (ip: 0x7f6302a7524b)
unknown function (ip: 0x7f6302a77861)
unknown function (ip: 0x7f6302a77a3c)
unknown function (ip: 0x7f0e4c9b1b78)
jl_trampoline at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x7f0e4c9d1b33)
jl_apply_generic at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x7f0e4c9d91b7)
anonymous at /localpath/M.jl:8
unknown function (ip: 0x7f0e4c9d97c2)
jl_apply_generic at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x7f0e4c9d30f8)
jl_f_apply at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:892
unknown function (ip: 0x7f0e4c9c524b)
run_work_thunk at multi.jl:645
unknown function (ip: 0x7f0e4c9c7861)
jlcall_run_work_thunk_21214 at  (unknown line)
unknown function (ip: 0x7f0e4c9c7a3c)
jl_apply_generic at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
jl_trampoline at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:892
jl_apply_generic at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x7f6302aaa6a1)
unknown function (ip: (nil))
anonymous at /localpath/M.jl:8
jl_apply_generic at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
jl_f_apply at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:892
run_work_thunk at multi.jl:645
jlcall_run_work_thunk_21214 at  (unknown line)
jl_apply_generic at /pathtojulia/julia-0ff703b40a/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:892
unknown function (ip: 0x7f0e4c9fa6a1)
unknown function (ip: (nil))
Worker 3 terminated.ArgumentError: stream is closed or unusable

omalled (Author) commented Oct 28, 2015

One more note: this code works on the Mac with Julia 0.3.11. I don't have a 0.3.11 binary for Ubuntu lying around, so I couldn't try it there. I do know that the code this example is derived from worked on Ubuntu with 0.3.11, though.

malmaud added the parallelism label Oct 28, 2015

vtjnash (Member) commented Oct 29, 2015

I added some type-assertion code to catch it early:
https://github.com/JuliaLang/julia/compare/jn/worker_stderr?expand=1

Now the backtrace shows that the roots array is being deserialized as an Expr instead of the Vector{Any} that was sent. I suspect an error in the deserialize_cycles code:

julia> callpmap1(f_trans)#bus error (Mac), segmentation fault (Ubuntu)
a
b
fatal error on 2: ERROR: TypeError: deserialize: in typeassert, expected Array{Any,1}, got Expr
 [inlined code] from essentials.jl:58
 in deserialize at serialize.jl:557
 in handle_deserialize at serialize.jl:477
 [inlined code] from essentials.jl:58
...
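
The idea behind that check can be sketched in isolation: assert the expected type right where a value is deserialized, so a mismatch surfaces as a TypeError at that point instead of as a crash later in codegen. This is only an illustration, not the code from the linked branch. It assumes Julia 0.7 or later, where `serialize` and `deserialize` live in the Serialization standard library (in 0.4 they were in Base), and `roundtrip` is a hypothetical helper name:

```julia
using Serialization  # assumption: Julia 0.7 or later; in 0.4 these were in Base

# Hypothetical helper: write a value to an in-memory buffer, read it back,
# and assert that the decoded value has the expected type T.
function roundtrip(x, ::Type{T}) where {T}
    io = IOBuffer()
    serialize(io, x)
    seekstart(io)
    return deserialize(io)::T  # throws TypeError if the decoded value is not a T
end

roots = Any[:a, 1, "s"]
same = roundtrip(roots, Vector{Any})  # passes: a Vector{Any} comes back

# Asking for the wrong type fails at the assertion itself.
caught = try
    roundtrip(:(a + b), Vector{Any})
    nothing
catch err
    err
end
println(typeof(caught))
```

Here the mismatch is forced by deliberately sending an Expr. In the bug above the sender did send a Vector{Any}, and the wrong type only appeared on the receiving side.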

JeffBezanson added the regression and backport pending 0.4 labels Oct 29, 2015
vtjnash closed this as completed in 843ab66 Nov 1, 2015
vtjnash added a commit that referenced this issue Nov 1, 2015