
Segfault when using SharedArray on OS X #14295

Closed
jdrugo opened this issue Dec 6, 2015 · 7 comments
Labels
parallelism (Parallel or distributed computation), system:mac (Affects only macOS)

Comments

@jdrugo
Contributor

jdrugo commented Dec 6, 2015

While working on a parallel implementation of a particle filter, Julia started segfaulting under heavy load. A minimal example that reproduces this behavior is:

@everywhere begin
    type A                              # 0.4-era syntax (`mutable struct` on later versions)
        x::SharedArray{Float64,1}
        A(N) = new(SharedArray(Float64, N))
    end
    localf(x::SharedArray) = nothing    # trivial work on the shared array
    function f(a::A)
        # spawn localf on every worker and fetch the (empty) results
        map(fetch, Any[(@spawnat i localf(a.x)) for i in workers()])
    end
end

a = A(1000)
for n = 1:10^8
    f(a)
end

On my MacBook Pro under OS X El Capitan, this results in:

Jans-MacBook-Pro:~ jdrugo$ /Applications/Julia-0.4.1.app/Contents/Resources/julia/bin/julia -p 7 ./crash_example.jl

signal (11): Segmentation fault: 11
__pool_alloc at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gc.c:1053
_new_array_ at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/array.c:84
_new_array at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/array.c:333
call at /Applications/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
def_rv_channel at multi.jl:619
jlcall_def_rv_channel_21329 at  (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
lookup_ref at multi.jl:513
remotecall_fetch at multi.jl:727
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
call_on_owner at multi.jl:778
fetch at multi.jl:796
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
map at /Applications/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
f at /Users/jdrugo/crash_example.jl:9
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
anonymous at /Users/jdrugo/crash_example.jl:15
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
jl_parse_eval_all at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/toplevel.c:577
jl_load at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/toplevel.c:620
include at /Applications/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
include_from_node1 at /Applications/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
process_options at /Applications/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
_start at /Applications/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
jlcall__start_18614 at /Applications/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1325
true_main at /Applications/Julia-0.4.1.app/Contents/Resources/julia/bin/julia (unknown line)
main at /Applications/Julia-0.4.1.app/Contents/Resources/julia/bin/julia (unknown line)
Segmentation fault: 11

Julia version:

julia> versioninfo()
Julia Version 0.4.1
Commit cbe1bee* (2015-11-08 10:33 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
@sbromberger
Contributor

I can reproduce this on my system as well:

julia> versioninfo()
Julia Version 0.4.2-pre+18
Commit eb31eef (2015-11-26 08:03 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.0.0)
  CPU: Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-svn

@kshyatt added the parallelism (Parallel or distributed computation) and system:mac (Affects only macOS) labels on Dec 7, 2015
@amitmurthy
Contributor

On 0.4.2 without any workers:

ERROR: UndefRefError: access to undefined reference
 in ht_keyindex2 at dict.jl:602
 in setindex! at dict.jl:643
 in schedule_call at multi.jl:660
 in remotecall at multi.jl:703
 in f at none:8
 [inlined code] from none:2
 in anonymous at no file:0

On 0.4.2 with 2 workers:

fatal error on 2: ERROR: BoundsError: attempt to access 0-element Array{Any,1}
  at index [2]
 in notify at /Volumes/Julia/Julia-0.4.2.app/Contents/Resources/julia/lib/julia/sys.dylib
 in __notify#32__ at /Volumes/Julia/Julia-0.4.2.app/Contents/Resources/julia/lib/julia/sys.dylib
 in send_add_client at multi.jl:592
 in serialize at serialize.jl:185
 in serialize at sharedarray.jl:269
 in serialize_any at serialize.jl:422
 in serialize at serialize.jl:405
 in serialize at serialize.jl:127
 in serialize at serialize.jl:310
 in serialize_any at serialize.jl:422
 in send_msg_ at multi.jl:222
 in remotecall at multi.jl:710
 in f at none:8
 [inlined code] from none:2
 in anonymous at no file:0

Works fine on 0.5 with 2 workers.

On 0.5 with no workers, I see a series of:

error in running finalizer: UndefRefError()
error in running finalizer: UndefRefError()
error in running finalizer: UndefRefError()

I suspect memory corruption in the shared-memory code. It could possibly be the same issue as #14186 (comment).

@amitmurthy
Contributor

Reduced case with no workers on 0.5 (with no workers added, `workers()` contains only the master process, so `@spawnat` targets pid 1):

for n = 1:10^8
    map(fetch, Any[(@spawnat i i) for i in workers()])
end

No error when run with workers added. It errors out on 0.4.

@tkelman
Contributor

tkelman commented Dec 22, 2015

Is this closed on master by #14456?

@amitmurthy
Contributor

Yes.

@tkelman closed this as completed Dec 22, 2015
@amitmurthy
Contributor

Why did you close it? The bug reported was for 0.4, and that will remain open until 0.4.3 is released.

@tkelman
Contributor

tkelman commented Dec 22, 2015

Generally issues should be closed by fixes getting merged to master, unless it's an issue that was never a problem for master.

vtjnash added a commit that referenced this issue May 4, 2016
this made it unreliable for the WeakKeyDict these are typically put in (client_refs): it could have trouble finding them to clean up the dictionary later, since their hash and identity had changed

fixes #15923
reverts workaround #14456 (doesn't break #14295 due to previous commit)
may also fix #16091
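
As a hypothetical illustration of the failure mode this commit message describes (the `Key` type and plain `Dict` below are illustrative stand-ins written in current Julia syntax, not the actual RemoteRef/client_refs code): if a key's hash depends on state that changes after insertion, the dictionary can typically no longer locate the entry, so later cleanup silently fails.

# hypothetical sketch, not the actual RemoteRef code: a key type whose
# hash depends on mutable state
mutable struct Key
    v::Int
end
Base.hash(k::Key, h::UInt) = hash(k.v, h)
Base.:(==)(a::Key, b::Key) = a.v == b.v

d = Dict{Key,String}()
k = Key(1)
d[k] = "entry"

k.v = 2          # the key's hash changes while it is still stored in the dict
haskey(d, k)     # typically false: the lookup now probes the wrong slot
delete!(d, k)    # typically fails to remove the stale entry
length(d)        # still 1 -- the entry has become unreachable for cleanup

The actual client_refs structure is the WeakKeyDict the commit message mentions rather than a plain Dict, but the same principle applies: once the hash or identity of a stored reference changes, the dictionary can no longer find it to remove the entry.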
vtjnash added a commit that referenced this issue Jul 26, 2016
also use this `client_refs.lock` to protect other data structures in the multi.jl logic from being interrupted by finalizers

we may want to start indicating which mutable data structures are safe to call from finalizers, since generally that isn't possible

to make a finalizer API gc-safe, that code should observe the standard thread-safety restrictions (there's no guarantee of which thread it will run on),

plus, if the data structure uses locks for synchronization, use the `islocked` pattern (demonstrated herein) in the `finalizer` to re-schedule the finalizer when the mutable data structure is not available for mutation.
this ensures that the lock cannot be acquired recursively, and furthermore, this pattern will continue to work if finalizers get moved to their own separate thread.

close #14445
fix #16550
reverts workaround #14456 (shouldn't break #14295, due to new locks)
should fix #16091 (with #17619)
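
A minimal sketch of the `islocked` finalizer pattern described in that commit message, written in current Julia syntax with hypothetical names (`refs_lock`, `client_map`, `MyRef`, `finalize_ref`, and `register_ref` are illustrative, not the actual multi.jl identifiers): a finalizer that finds the structure busy re-registers itself and retries on a later GC pass instead of mutating it.

# hypothetical sketch of the finalizer/islocked pattern; assumed names,
# not the actual Distributed/multi.jl code
const refs_lock  = ReentrantLock()
const client_map = IdDict{Any,Nothing}()    # stand-in for client_refs

mutable struct MyRef                        # stand-in for a remote reference
    id::Int
end

function finalize_ref(r::MyRef)
    # never block or recurse into the lock from a finalizer: if the map is
    # busy (possibly locked by the very task we interrupted), defer cleanup
    if islocked(refs_lock) || !trylock(refs_lock)
        finalizer(finalize_ref, r)          # re-schedule for a later GC pass
        return nothing
    end
    try
        delete!(client_map, r)              # safe: we now hold the lock
    finally
        unlock(refs_lock)
    end
    return nothing
end

function register_ref(r::MyRef)
    lock(refs_lock) do
        client_map[r] = nothing             # ordinary (non-finalizer) mutation
    end
    finalizer(finalize_ref, r)
    return r
end

Checking `islocked` before `trylock` matters because a `ReentrantLock` can be re-acquired by the task that already holds it; without the check, the finalizer could fire in the middle of a locked mutation and corrupt the map. Deferring also keeps the pattern valid if finalizers are ever moved to their own thread, as the commit message notes.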
vtjnash added a commit that referenced this issue Jul 26, 2016
vtjnash added a commit that referenced this issue Aug 4, 2016
vtjnash added a commit that referenced this issue Aug 5, 2016
tkelman pushed a commit that referenced this issue Aug 11, 2016

(cherry picked from commit cd8be65)
ref #16204
mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016