
Rare deadlock and segmentation fault in a parallel quicksort implementation (with an MWE) #35341

Closed
tkf opened this issue Apr 2, 2020 · 53 comments · Fixed by #35387
Labels: multithreading (Base.Threads and related functionality), rr trace included

tkf (Member) commented Apr 2, 2020

I'm having trouble finding the cause of a rare deadlock in my parallel quicksort. I've tried my best, with a lot of help from @chriselrod, to track down the problem, but I still have no idea how to fix it.

I am reporting this here since I think there is some chance that this is a bug in julia (scheduler? compiler?). (In other words, I'm hopelessly lost and all I can do now is "blame" upstream.)

MWE: Install the latest ThreadsX (v0.1.1) and BenchmarkTools. Running the following code eventually causes a deadlock when JULIA_NUM_THREADS is set appropriately (for me it's 13; for @chriselrod it's 8, but not 7):

using BenchmarkTools, ThreadsX, Random
seed = Ref(0)
while true
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000)))
end

I've tried git bisect with this script and found that #32599 changes the way the problem(s) manifest:

With f5dbc47, I observed this segmentation fault:

seed[] += 1 = 247
[ Info: %CPU = 1167.0
[ Info: 45.550472802933335 minutes without a hang....

signal (11): Segmentation fault
in expression starting at none:3
setindex! at ./Base.jl:0 [inlined]
setindex! at ./subarray.jl:301 [inlined]
partition_sizes! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:167
#97 at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158 [inlined]
#3 at ./threadingconstructs.jl:126
unknown function (ip: 0x7f54cc050ee6)
_jl_invoke at /julia/src/gf.c:2144 [inlined]
jl_apply_generic at /julia/src/gf.c:2328
jl_apply at /julia/src/julia.h:1695 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 6273831512 (Pool: 6273798676; Big: 32836); GC: 17817

However, I don't understand how this function, partition_sizes!, could cause a segfault.
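For context, that loop has roughly the following shape (a simplified sketch reconstructed from the stack trace, not the actual ThreadsX source; the names, return values, and classification codes are my guesses):

function partition_sizes_sketch!(xs, cs, pivot, order)
    # xs is a SubArray view into the data being sorted; cs is a
    # same-length SubArray recording which side of the pivot each
    # element falls on.
    nbelows = 0
    naboves = 0
    @inbounds for i in eachindex(xs, cs)        # quicksort.jl:163 in the trace
        x = xs[i]
        b = Base.Order.lt(order, x, pivot)      # x < pivot under `order`
        a = Base.Order.lt(order, pivot, x)      # x > pivot under `order`
        cs[i] = ifelse(b, Int8(1), ifelse(a, Int8(3), Int8(2)))   # :167 is a store like this
        nbelows += Int(b)
        naboves += Int(a)
    end
    return nbelows, naboves
end

Plain loads and stores over eachindex(xs, cs) should not be able to fault on valid arrays, which is exactly what makes this trace so confusing.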

@chriselrod also observed another segfault with 65b8e7e:

signal (11): Segmentation fault
in expression starting at REPL[3]:1
- at ./int.jl:52 [inlined]
unsafe_length at ./range.jl:517 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
axes1 at ./abstractarray.jl:95 [inlined]
eachindex at ./abstractarray.jl:267 [inlined]
eachindex at ./abstractarray.jl:270 [inlined]
eachindex at ./abstractarray.jl:260 [inlined]
partition_sizes! at /home/chriselrod/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
unknown function (ip: 0x7f67b598804f)
unknown function (ip: (nil))
Allocations: 10113947179 (Pool: 10113915735; Big: 31444); GC: 29811

It is puzzling to me how the code involved in the above stack trace could cause a segfault:

- at ./int.jl:52
unsafe_length at ./range.jl:517
unsafe_indices at ./abstractarray.jl:99
_indices_sub at ./subarray.jl:409
axes at ./subarray.jl:404
axes1 at ./abstractarray.jl:95
eachindex at ./abstractarray.jl:267
eachindex at ./abstractarray.jl:270
eachindex at ./abstractarray.jl:260
partition_sizes! at ThreadsX.jl/src/quicksort.jl:163

We (@chriselrod and I) tried to get a better error report by running the MWE with --check-bounds=yes, but we couldn't reproduce a hang or segfault even after running it for hours.

I would highly appreciate any help/advice from core devs or anyone who knows how threading works in Julia. I'm currently trying to build an ASAN-enabled julia with help from @vchuravy (#35338), hoping that it gives us some information. But I'm not sure how much I can discover this way, as I don't know anything about LLVM or C++.

Keno (Member) commented Apr 2, 2020

If you can reproduce it under rr, that'll be easiest to debug.

tkf (Member, Author) commented Apr 2, 2020

Thanks! But https://github.com/mozilla/rr/wiki/Usage says

rr forces your application's threads to execute on a single core, so your application can see an additional slowdown if it takes advantage of multicore parallelism.

So I guess it cannot be used to debug a multi-threading bug? We never observed the bug with a single thread.

I tried it anyway, but it failed to start:
root@459e9488a96a:/julia# rr --version
rr version 5.3.0
root@459e9488a96a:/julia# rr record ./julia -e 'using BenchmarkTools, ThreadsX, Random; seed = Ref(0); while true; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000))); end'
[FATAL /home/roc/rr/rr/src/PerfCounters.cc:317:start_counter() errno: EPERM] Failed to initialize counter
=== Start rr backtrace:
rr(_ZN2rr13dump_rr_stackEv+0x3b)[0x5f229b]
rr(_ZN2rr15notifying_abortEv+0x47)[0x5f2327]
rr[0x52cba5]
rr[0x52dede]
rr(_ZN2rr12PerfCounters23default_ticks_semanticsEv+0x1a)[0x52e9ca]
rr(_ZN2rr7SessionC1Ev+0x139)[0x5bc369]
rr(_ZN2rr13RecordSessionC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorIS6_SaIS6_EESD_RKNS_20DisableCPUIDFeaturesENS0_16SyscallBufferingEiNS_7BindCPUES8_PKNS_9TraceUuidE+0x4d)[0x54069d] 
rr(_ZN2rr13RecordSession6createERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EESB_RKNS_20DisableCPUIDFeaturesENS0_16SyscallBufferingEhNS_7BindCPUERKS7_PKNS_9TraceUuidE+0xba9)[0x541769] 
rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0xe14)[0x535c24]
rr(main+0x353)[0x4ac743]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f935c454b97]
rr(_start+0x29)[0x4acb59]
=== End rr backtrace
Aborted (core dumped)

(Edit: it looks like this is because I'm using docker? https://github.com/mozilla/rr/wiki/Docker)

tkf (Member, Author) commented Apr 2, 2020

Ah, sorry, I missed that https://docs.julialang.org/en/v1/devdocs/debuggingtips/#Reproducing-concurrency-bugs-with-rr-1 mentions rr record --chaos.

Keno (Member) commented Apr 2, 2020

[FATAL /home/roc/rr/rr/src/PerfCounters.cc:317:start_counter() errno: EPERM] Failed to initialize counter

This might be due to docker; you can try using a privileged container to get access to the perf counters.

tkf (Member, Author) commented Apr 2, 2020

Thanks! I followed https://github.com/mozilla/rr/wiki/Docker and then just started the MWE with

JULIA_NUM_THREADS=13 rr record --num-cores=13 --chaos ./julia -e 'using BenchmarkTools, ThreadsX, Random; seed = Ref(0); while true; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000))); end'

It runs as usual (though of course more slowly). Let's hope it hangs.

tkf (Member, Author) commented Apr 2, 2020

So I ran this for more than three hours (up to seed = 1000) and rr generated 17 GB of trace, but there was no hang or segfault.

Keno (Member) commented Apr 2, 2020

If it previously hung only after 45 minutes, you'll probably need to run it for at least 10 hours with that many threads. You may also want to kick off multiple runs in parallel to maximize your chances of catching it.

JeffBezanson added the multithreading (Base.Threads and related functionality) label Apr 2, 2020
tkf (Member, Author) commented Apr 2, 2020

The hang can happen within a few minutes if I'm lucky; it's the segfault that is super rare. I was worried that rr would generate enough data to eat up all the disk space, so I limited the loop when I started it yesterday. Now that I know how much space it takes, I'll run more of them.

tkf (Member, Author) commented Apr 3, 2020

With ASAN (#35338), running the MWE with

JULIA_NUM_THREADS=13 ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ./julia -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'

now somehow gives me the segfault that @chriselrod observed (see the OP) reliably within a short amount of time (a few minutes):

...
seed = 13

signal (11): Segmentation fault
in expression starting at none:1
- at ./int.jl:85 [inlined]
unsafe_length at ./range.jl:540 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
size at ./subarray.jl:71 [inlined]
length at ./abstractarray.jl:206 [inlined]
#89 at ./none:0 [inlined]
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:664
_quicksort! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:77
macro expansion at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136 [inlined]
#92 at ./threadingconstructs.jl:126
jl_apply at /julia/src/./julia.h:1687 [inlined]
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 99406804 (Pool: 0; Big: 99406804); GC: 1180
Segmentation fault

Though nothing specific seems to be printed by the ASAN runtime itself.

I get a different kind of segfault too:

...
  12.312 ms (21628 allocations: 19.69 MiB)
seed = 17

signal (11): Segmentation fault

signal (11): Segmentation fault
in expression starting at none:1
in expression starting at none:1

signal (11): Segmentation fault
in expression starting at none:1
in expres11): Segmentation fault
in expression starting at none:1
in expression starting at none:1

signal (11): Segmentation fault
in expression starting at none:1

signal (11): Segmentation fault

signal (11): Segmentation fault
in expression starting at none:1
in expression starting at none:1
in expression starting at none:1

signal (11): Segmentation fault
in expression starting at none:1

signal (11): Segmentation fault
in expression starting at none:1

signal (11): Segmentation fault
in expression starting at none:1
unknown function (ip: 0x7fa637bee020)
unknown function (ip: 0x7fa637bee020)
Segmentation fault

tkf (Member, Author) commented Apr 3, 2020

Running without multi-threading (without JULIA_NUM_THREADS=13) now gives me:

seed = 6
  42.666 ms (19937 allocations: 19.42 MiB)
seed = 7
=================================================================
==21519==ERROR: AddressSanitizer: heap-use-after-free on address 0x610017ba6578 at pc 0x7f4ce9302a5b bp 0x7f4caa960570 sp 0x
7f4caa960568
READ of size 8 at 0x610017ba6578 thread T0
    #0 0x7f4ce9302a5a  (/julia/usr/bin/../lib/libjulia.so.1+0x2b3a5a)
    #1 0x7f4ce930672c  (/julia/usr/bin/../lib/libjulia.so.1+0x2b772c)
    #2 0x7f4ce9304ab0  (/julia/usr/bin/../lib/libjulia.so.1+0x2b5ab0)
    #3 0x7f4ce92f9def  (/julia/usr/bin/../lib/libjulia.so.1+0x2aadef)
    #4 0x7f4cbfb230dc  (/memfd:julia-codegen (deleted)+0x260dc)
    #5 0x7f4cbfb2c1d4  (/memfd:julia-codegen (deleted)+0x2f1d4)
    #6 0x7f4cbfb3009d  (/memfd:julia-codegen (deleted)+0x3309d)
    #7 0x7f4cbfb32d8b  (/memfd:julia-codegen (deleted)+0x35d8b)
    #8 0x7f4ce92942b4  (/julia/usr/bin/../lib/libjulia.so.1+0x2452b4)

0x610017ba6578 is located 56 bytes inside of 128-byte region [0x610017ba6540,0x610017ba65c0)
freed by thread T0 here:
    #0 0x4a89d0  (/julia/usr/bin/julia+0x4a89d0)
    #1 0x7f4ce9310f77  (/julia/usr/bin/../lib/libjulia.so.1+0x2c1f77)
    #2 0x7f4ce930e57f  (/julia/usr/bin/../lib/libjulia.so.1+0x2bf57f)
    #3 0x7f4ce9307cb0  (/julia/usr/bin/../lib/libjulia.so.1+0x2b8cb0)
    #4 0x7f4ce9304ab0  (/julia/usr/bin/../lib/libjulia.so.1+0x2b5ab0)
    #5 0x7f4ce92f9def  (/julia/usr/bin/../lib/libjulia.so.1+0x2aadef)
    #6 0x7f4cbfb2bf90  (/memfd:julia-codegen (deleted)+0x2ef90)
    #7 0x7f4cbfb3009d  (/memfd:julia-codegen (deleted)+0x3309d)
    #8 0x7f4cbfb32d8b  (/memfd:julia-codegen (deleted)+0x35d8b)

previously allocated by thread T0 here:
    #0 0x4a993d  (/julia/usr/bin/julia+0x4a993d)
    #1 0x7f4ce92f9e4d  (/julia/usr/bin/../lib/libjulia.so.1+0x2aae4d)
    #2 0x7f4cbfb230dc  (/memfd:julia-codegen (deleted)+0x260dc)

SUMMARY: AddressSanitizer: heap-use-after-free (/julia/usr/bin/../lib/libjulia.so.1+0x2b3a5a) 
Shadow bytes around the buggy address:
  0x0c2082f6cc50: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x0c2082f6cc60: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c2082f6cc70: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x0c2082f6cc80: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c2082f6cc90: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
=>0x0c2082f6cca0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd[fd]
  0x0c2082f6ccb0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x0c2082f6ccc0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c2082f6ccd0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x0c2082f6cce0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c2082f6ccf0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==21519==ABORTING

tkf (Member, Author) commented Apr 3, 2020

What's the best next step? I'm trying this with rr record ATM.

maleadt (Member) commented Apr 3, 2020

If you use a debug build and set ASAN_SYMBOLIZER_PATH to the llvm-symbolizer you've built in the process, you'll get debug info in the ASAN trace. Also see https://github.com/google/sanitizers/wiki/AddressSanitizerAndDebugger; ASAN_OPTIONS=abort_on_error=1 makes it easier to use ASAN with rr.

tkf (Member, Author) commented Apr 3, 2020

Thank you @maleadt!

So here is the output with llvm-symbolizer. It looks like ASAN is detecting some error from the GC?

root@919704e44cd5:/julia# ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ASAN_SYMBOLIZER_PATH=$PWD/usr/tools/llvm-symbolizer usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
seed = 1
  45.948 ms (19980 allocations: 19.43 MiB)
seed = 2
  45.620 ms (20313 allocations: 19.48 MiB)
seed = 3
  48.309 ms (21386 allocations: 19.66 MiB)
seed = 4
  46.958 ms (21283 allocations: 19.63 MiB)
seed = 5
  46.954 ms (20681 allocations: 19.54 MiB)
seed = 6
  46.379 ms (19937 allocations: 19.42 MiB)
seed = 7
  47.260 ms (20512 allocations: 19.52 MiB)
seed = 8
=================================================================
==23720==ERROR: AddressSanitizer: heap-use-after-free on address 0x610000c1ee78 at pc 0x7fd6fa45ff42 bp 0x7fd6c73219e0 sp 0x
7fd6c73219d8
READ of size 8 at 0x610000c1ee78 thread T0
    #0 0x7fd6fa45ff41 in gc_try_setmark /julia/src/gc.c:1642:24
    #1 0x7fd6fa45f218 in gc_mark_scan_obj8 /julia/src/gc.c:1836:14
    #2 0x7fd6fa457224 in gc_mark_loop /julia/src/gc.c:2117:9
    #3 0x7fd6fa462bea in _jl_gc_collect /julia/src/gc.c:2903:5
    #4 0x7fd6fa462175 in jl_gc_collect /julia/src/gc.c:3109:13
    #5 0x7fd6fa454156 in maybe_collect /julia/src/gc.c:827:9
    #6 0x7fd6fa453da8 in jl_gc_big_alloc /julia/src/gc.c:883:5
    #7 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
LLVMSymbolizer: error reading file: No such file or directory
    #8 0x7fd6d08185b3  (/memfd:julia-codegen (deleted)+0x325b3)
    #9 0x7fd6d0818f32  (/memfd:julia-codegen (deleted)+0x32f32)
    #10 0x7fd6d081ed89  (/memfd:julia-codegen (deleted)+0x38d89)
    #11 0x7fd6fa32f342 in jl_fptr_args /julia/src/gf.c:2009:12
    #12 0x7fd6fa3458a5 in _jl_invoke /julia/src/gf.c:2230:31
    #13 0x7fd6fa3474ba in jl_apply_generic /julia/src/gf.c:2414:12
    #14 0x7fd6fa3a979d in jl_apply /julia/src/./julia.h:1687:12
    #15 0x7fd6fa3ae8fd in start_task /julia/src/task.c:687:19

0x610000c1ee78 is located 56 bytes inside of 128-byte region [0x610000c1ee40,0x610000c1eec0)
freed by thread T0 here:
    #0 0x4a8a20 in free /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cc:123
    #1 0x7fd6fa46c181 in jl_free_aligned /julia/src/gc.c:255:5
    #2 0x7fd6fa46c652 in sweep_big_list /julia/src/gc.c:942:13
    #3 0x7fd6fa46bc49 in sweep_big /julia/src/gc.c:954:9
    #4 0x7fd6fa46aa7c in gc_sweep_other /julia/src/gc.c:1407:5
    #5 0x7fd6fa46341c in _jl_gc_collect /julia/src/gc.c:3013:5
    #6 0x7fd6fa462175 in jl_gc_collect /julia/src/gc.c:3109:13
    #7 0x7fd6fa454156 in maybe_collect /julia/src/gc.c:827:9
    #8 0x7fd6fa453da8 in jl_gc_big_alloc /julia/src/gc.c:883:5
    #9 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
    #10 0x7fd6d0818311  (/memfd:julia-codegen (deleted)+0x32311)
    #11 0x7fd6d0818f32  (/memfd:julia-codegen (deleted)+0x32f32)
    #12 0x7fd6d081ed89  (/memfd:julia-codegen (deleted)+0x38d89)
    #13 0x7fd6fa3458a5 in _jl_invoke /julia/src/gf.c:2230:31
    #14 0x7fd6fa3474ba in jl_apply_generic /julia/src/gf.c:2414:12
    #15 0x7fd6fa3a979d in jl_apply /julia/src/./julia.h:1687:12
    #16 0x7fd6fa3ae8fd in start_task /julia/src/task.c:687:19

previously allocated by thread T0 here:
    #0 0x4a998d in posix_memalign /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cc:226
    #1 0x7fd6fa45423a in jl_malloc_aligned /julia/src/gc.c:235:9
    #2 0x7fd6fa453e6d in jl_gc_big_alloc /julia/src/gc.c:891:30
    #3 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
    #4 0x7fd6d080ed92  (/memfd:julia-codegen (deleted)+0x28d92)

SUMMARY: AddressSanitizer: heap-use-after-free /julia/src/gc.c:1642:24 in gc_try_setmark
Shadow bytes around the buggy address:
  0x0c208017bd70: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
  0x0c208017bd80: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c208017bd90: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
  0x0c208017bda0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c208017bdb0: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
=>0x0c208017bdc0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd[fd]
  0x0c208017bdd0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x0c208017bde0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c208017bdf0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x0c208017be00: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c208017be10: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==23720==ABORTING

Here is a backtrace in gdb:

root@919704e44cd5:/julia# ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ASAN_SYMBOLIZER_PATH=$PWD/usr/tools/llvm-sym
bolizer gdb --args usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort(
$(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
GNU gdb (Ubuntu 8.1-0ubuntu3.2) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from usr/bin/julia-debug...done.
(gdb) r
Starting program: /julia/usr/bin/julia-debug -e using\ BenchmarkTools,\ ThreadsX,\ Random\;\ for\ seed\ in\ 1:5000\;\ @btime\ ThreadsX.sort\(\$\(rand\(MersenneTwister\(@show\(seed\)\),\ 0:0.01:1,\ 1_000_000\)\)\)\;\ end
[Thread debugging using libthread_db enabled] 
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffef933700 (LWP 23800)]
[New Thread 0x7fffe247e700 (LWP 23801)]
[New Thread 0x7fffe1c67700 (LWP 23802)]
[New Thread 0x7fffdf450700 (LWP 23803)]
[New Thread 0x7fffdcc39700 (LWP 23804)]
[New Thread 0x7fffda422700 (LWP 23805)]
[New Thread 0x7fffd7c0b700 (LWP 23806)]
[New Thread 0x7fffd53f4700 (LWP 23807)]
seed = 1
  44.897 ms (19980 allocations: 19.43 MiB)
seed = 2
  44.839 ms (20313 allocations: 19.48 MiB)
seed = 3
  46.575 ms (21386 allocations: 19.66 MiB)
seed = 4
  46.976 ms (21283 allocations: 19.63 MiB)
seed = 5
  47.982 ms (20681 allocations: 19.54 MiB)
seed = 6

Thread 1 "julia-debug" received signal SIGSEGV, Segmentation fault.
julia_partition_sizes!_569 (xs=<error reading variable: Cannot access memory at address 0xbbbbbbbbbbbbbbbb>, cs=..., 
    pivot=4581421828931458171, order=...) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
163         @inbounds for i in eachindex(xs, cs)
(gdb) bt
#0  julia_partition_sizes!_569 (xs=<error reading variable: Cannot access memory at address 0xbbbbbbbbbbbbbbbb>, cs=..., 
    pivot=4581421828931458171, order=...) at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:163
#1  0x00007fffcd643789 in #97 () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:158
#2  julia_#3_567 () at threadingconstructs.jl:126
#3  0x00007fffcd6437e7 in jfptr_#3_568 ()
#4  0x00007ffff71728a6 in _jl_invoke (F=0x61001445a980, args=0x617000297510, nargs=0, mfunc=0x610000666280, world=27564)
    at gf.c:2230
#5  0x00007ffff71744bb in jl_apply_generic (F=0x61001445a980, args=0x617000297510, nargs=0) at gf.c:2414
#6  0x00007ffff71d679e in jl_apply (args=0x617000297508, nargs=1) at ./julia.h:1687
#7  0x00007ffff71db8fe in start_task () at task.c:687
#8  0x0000000000000000 in ?? ()

Does xs=<error reading variable: Cannot access memory at address 0xbbbbbbbbbbbbbbbb> mean that the object xs is somehow GC'ed prematurely?

ASAN_OPTIONS=abort_on_error=1 makes it easier to use ASAN with rr.

Thanks for the tips! I tried rr, but it didn't even hit the first @show statement after one hour, so I terminated the process.

tkf (Member, Author) commented Apr 3, 2020

It looks like I was very lucky in the first runs. I often get a segfault rather than an abort from ASAN:

...
seed = 8
  46.673 ms (19959 allocations: 19.42 MiB)
seed = 9

signal (11): Segmentation fault
in expression starting at none:1
- at ./int.jl:85 [inlined]
unsafe_length at ./range.jl:540 [inlined]
unsafe_indices at ./abstractarray.jl:99 [inlined]
_indices_sub at ./subarray.jl:409 [inlined]
axes at ./subarray.jl:404 [inlined]
size at ./subarray.jl:71 [inlined]
length at ./abstractarray.jl:206 [inlined]
#89 at ./none:0 [inlined]
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:664
_quicksort! at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:77
macro expansion at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136 [inlined]
#92 at ./threadingconstructs.jl:126
jl_fptr_args at /julia/src/gf.c:2009
_jl_invoke at /julia/src/gf.c:2230
jl_apply_generic at /julia/src/gf.c:2414
jl_apply at /julia/src/./julia.h:1687
start_task at /julia/src/task.c:687
unknown function (ip: (nil))
Allocations: 44257146 (Pool: 0; Big: 44257146); GC: 643
Segmentation fault

Keno (Member) commented Apr 3, 2020

Does xs=<error reading variable: Cannot access memory at address 0xbbbbbbbbbbbbbbbb> mean that the object xs is somehow GC'ed prematurely?

Potentially, I'd look to see if you're maybe accessing it somewhere without rooting it. That said, an rr trace is still the best way to debug these things, so I'd just kick it off in the background and let it run even if it takes a day or two to crash.
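(For background: at the Julia level, rooting only becomes the user's responsibility when raw pointers escape the GC's view; here is a minimal sketch of the usual pattern, with a made-up function name:)

function sum_via_pointer(xs::Vector{Float64})
    s = 0.0
    # GC.@preserve keeps xs rooted for the duration of the block;
    # without it, once only the raw pointer is live, the GC would be
    # free to collect xs in the middle of the loop.
    GC.@preserve xs begin
        p = pointer(xs)
        for i in 1:length(xs)
            s += unsafe_load(p, i)
        end
    end
    return s
end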

tkf (Member, Author) commented Apr 3, 2020

I'd look to see if you're maybe accessing it somewhere without rooting it.

I think the only "unsafe" thing I'm doing is @inbounds (e.g., there is no unsafe_wrap). But I'll double-check the code.
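(For what it's worth, eachindex over multiple arrays checks up front that their indices agree, which is what normally makes @inbounds over it safe; a toy example:)

xs = rand(4)
cs = zeros(Int8, 4)
# eachindex(xs, cs) is Base.OneTo(4) here and would throw a
# DimensionMismatch if the lengths differed, so every i is in bounds.
for i in eachindex(xs, cs)
    cs[i] = xs[i] < 0.5
end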

I also started rr again. I'll also see if I can speed it up somehow (it's suspicious that it doesn't even reach the first @show).

Keno (Member) commented Apr 3, 2020

I can try running it myself to see if there's some sort of pathological rr performance problem, but I suspect it's probably just the high thread count.

tkf (Member, Author) commented Apr 3, 2020

Thanks! Actually, with ASAN I can reproduce the problem with a single thread. Maybe rr is slow with ASAN? Currently I'm using:

ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1:abort_on_error=1 ASAN_SYMBOLIZER_PATH=$PWD/usr/tools/llvm-symbolizer rr record usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'

tkf (Member, Author) commented Apr 3, 2020

Without ASAN, rr record ./julia ... was printing the @show outputs.

Keno (Member) commented Apr 3, 2020

It's possible that there's a bad interaction with rr. Let me take a look.

tkf (Member, Author) commented Apr 4, 2020

Thanks! FYI, I needed a bit of manual intervention to build the current julia with ASAN:

#35338 (comment)
#35338 (comment)

Keno (Member) commented Apr 4, 2020

The incompatibility between asan/LLVM/rr is resolved by https://reviews.llvm.org/D70581. You can grab LLVM 10 from Yggdrasil and build with that to get an appropriate version of compiler-rt that has this fix.

Keno (Member) commented Apr 4, 2020

Also, if you apply rr-debugger/rr#2488 to your rr installation, you'll get significantly lower overhead on this workload (2x rather than 4x). Unfortunately, I still can't reproduce the crash here (I saw it once, but didn't have rr attached).

tkf (Member, Author) commented Apr 5, 2020

You can grab LLVM 10 from Yggdrasil and build with that to get an appropriate version of compiler-rt that has this fix.

I installed LLVM_full_jll via ] add LLVM_full_jll@10 and then ran

tooldir=/root/.julia/artifacts/8cd56e3e81e5770720c31304026f3b23f129153e/tools
make clean
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 make CC=$tooldir/clang CXX=$tooldir/clang LLVM_CONFIG=$tooldir/llvm-config USECLANG=1 SANITIZE=1

but it failed with the following error:

...
    LINK usr/bin/julia
clang-10: warning: argument unused during compilation: '-mllvm -asan-stack=0' [-Wunused-command-line-argument]
/julia/usr/lib/libjulia.so: undefined reference to `llvm::Intrinsic::getIntrinsicInfoTableEntries(unsigned int, llvm::SmallVectorImpl<llvm::Intrinsic::IITDescriptor>&)'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyAsmParser'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::IRBuilderBase::CreateMemCpy(llvm::Value*, llvm::MaybeAlign, llvm::Value*, llvm::MaybeAlign, llvm::Value*, bool, llvm::MDNode*, llvm::MDNode*, llvm::MDNode*, llvm::MDNode*)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::LegacyJITSymbolResolver::lookup(std::set<llvm::StringRef, std::less<llvm::StringRef>, std::allocator<llvm::StringRef> > const&, llvm::unique_function<void (llvm::Expected<std::map<llvm::StringRef, llvm::JITEvaluatedSymbol, std::less<llvm::StringRef>, std::allocator<std::pair<llvm::StringRef const, llvm::JITEvaluatedSymbol> > > >)>)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::ExceptionBehaviorToStr(llvm::fp::ExceptionBehavior)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::Intrinsic::getDeclaration(llvm::Module*, unsigned int, llvm::ArrayRef<llvm::Type*>)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::AttrBuilder::addAlignmentAttr(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::GlobalObject::setAlignment(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::Attribute::getWithAlignment(llvm::LLVMContext&, llvm::Align)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::IRBuilderBase::CreateIntrinsic(unsigned int, llvm::ArrayRef<llvm::Type*>, llvm::ArrayRef<llvm::Value*>, llvm::Instruction*, llvm::Twine const&)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::DIBuilder::createTypedef(llvm::DIType*, llvm::StringRef, llvm::DIFile*, unsigned int, llvm::DIScope*, unsigned int)'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyTargetMC'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyAsmPrinter'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::MachineModuleInfoWrapperPass::MachineModuleInfoWrapperPass(llvm::LLVMTargetMachine const*)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::MCContext::MCContext(llvm::MCAsmInfo const*, llvm::MCRegisterInfo
const*, llvm::MCObjectFileInfo const*, llvm::SourceMgr const*, llvm::MCTargetOptions const*, bool)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::LoadInst::setAlignment(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::Module::global_objects()'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::ConstantExpr::getAsInstruction() const'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::StoreInst::setAlignment(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::AllocaInst::setAlignment(llvm::MaybeAlign)'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyTargetInfo'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::RoundingModeToStr(llvm::fp::RoundingMode)'
/julia/usr/lib/libjulia.so: undefined reference to `llvm::IRBuilderBase::CreateMemSet(llvm::Value*, llvm::Value*, llvm::Value*, llvm::MaybeAlign, bool, llvm::MDNode*, llvm::MDNode*, llvm::MDNode*)'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyTarget'
/julia/usr/lib/libjulia.so: undefined reference to `LLVMInitializeWebAssemblyDisassembler'
clang-10: error: linker command failed with exit code 1 (use -v to see invocation)
Makefile:86: recipe for target '/julia/usr/bin/julia' failed
make[1]: *** [/julia/usr/bin/julia] Error 1
Makefile:78: recipe for target 'julia-ui-release' failed
make: *** [julia-ui-release] Error 2

Is this the correct way to use LLVM from Yggdrasil?

I also tried using

cd $tooldir
ln -s clang-10 clang++
cd -
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 PATH="$tooldir:$PATH" make USECLANG=1 SANITIZE=1

but I got the same error.

tkf (Member, Author) commented Apr 5, 2020

... PATH="$tooldir:$PATH" make ... worked after make clean

tkf (Member, Author) commented Apr 5, 2020

I'm trying rr+ASAN with rr master and LLVM 10. I'm now seeing @show outputs.

tkf (Member, Author) commented Apr 5, 2020

I can get the errors under rr (with --chaos; maybe that helps), but julia does not exit on error. Forcefully terminating rr seems to leave the trace in an incomplete state (it leaves ~/.local/share/rr/julia-*/incomplete).

I tried ASAN_OPTIONS both with and without abort_on_error=1, but it didn't change anything. I also tried killing julia with several signals (TERM, HUP, INT, QUIT, PIPE), but that didn't help.

Keno (Member) commented Apr 5, 2020

Now that we have an rr trace, debugging it will be quite easy. I'll take a look in the morning.

tkf (Member, Author) commented Apr 5, 2020

Awesome. Thanks in advance!

Keno (Member) commented Apr 6, 2020

Ok, I've finally managed to get my setup back into a state where I can successfully replay this. However, the debug symbols seem to be missing from the trace, so it's a bit hard to debug (not impossible, but it would be easier with symbols). Did you rr pack the trace directory before uploading it? I was looking at julia-debug-8 in particular.

Keno (Member) commented Apr 6, 2020

Also, could you show the output of rr sources on one of the trace directories on the machine that recorded it?

Keno (Member) commented Apr 6, 2020

Alright, got it working again. Here's what rr spits out. The object in question is a SubArray; here's its life:

Thread 1 hit Hardware access (read/write) watchpoint 1: *0x61000bc09e78

Old value = -1094795586
New value = -286331154
0x00005ff8616b8f95 in ?? ()
#0  0x00005ff8616b8f95 in ?? ()
#1  0x00000000004a950c in __asan_memset (block=0x61000bc09e40, c=238, size=128) at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:26
#2  0x00007fb56b9d43c4 in jl_gc_big_alloc (ptls=0x404919615520, sz=48) at gc.c:899
#3  0x00007fb56b9d541e in jl_gc_pool_alloc (ptls=0x404919615520, pool_offset=1448, osize=48) at gc.c:1140
#4  0x000044b53225ced3 in SubArray () at subarray.jl:22
#5  SubArray () at subarray.jl:38
#6  SubArray () at subarray.jl:28
#7  _maybe_reindex () at subarray.jl:189
#8  _maybe_reindex () at subarray.jl:185
#9  _maybe_reindex () at subarray.jl:179
#10 unsafe_view () at subarray.jl:178
#11 view () at subarray.jl:164
#12 julia_iterate_524 (itr=..., state=<optimized out>) at iterators.jl:1137
#13 0x000044b532265e34 in iterate () at iterators.jl:1135
#14 _zip_iterate_some () at iterators.jl:351
#15 _zip_iterate_all () at iterators.jl:343
#16 iterate () at iterators.jl:333
#17 macro expansion () at task.jl:334
#18 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0x3fe851eb851eb852>, xs=...) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#19 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000', mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#20 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#21 japi1_#92_589 () at threadingconstructs.jl:126
#22 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#23 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606) at gf.c:2230
#24 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#25 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#26 0x00007fb56b92f26e in start_task () at task.c:687
#27 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000

Thread 1 hit Hardware access (read/write) watchpoint 1: *0x61000bc09e78

Old value = -286331154
New value = 1552986768
0x000044b53225cee7 in SubArray () at subarray.jl:22
22	subarray.jl: No such file or directory.
#0  0x000044b53225cee7 in SubArray () at subarray.jl:22
#1  SubArray () at subarray.jl:38
#2  SubArray () at subarray.jl:28
#3  _maybe_reindex () at subarray.jl:189
#4  _maybe_reindex () at subarray.jl:185
#5  _maybe_reindex () at subarray.jl:179
#6  unsafe_view () at subarray.jl:178
#7  view () at subarray.jl:164
#8  julia_iterate_524 (itr=..., state=<optimized out>) at iterators.jl:1137
#9  0x000044b532265e34 in iterate () at iterators.jl:1135
#10 _zip_iterate_some () at iterators.jl:351
#11 _zip_iterate_all () at iterators.jl:343
#12 iterate () at iterators.jl:333
#13 macro expansion () at task.jl:334
#14 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0x3fe851eb851eb852>, xs=...) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#15 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000', mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#16 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#17 japi1_#92_589 () at threadingconstructs.jl:126
#18 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#19 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606) at gf.c:2230
#20 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#21 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#22 0x00007fb56b92f26e in start_task () at task.c:687
#23 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000

Thread 1 hit Hardware access (read/write) watchpoint 1: *0x61000bc09e78

Value = 1552986768
0x00007fb56b9ec609 in sweep_big_list (sweep_full=0, pv=0x404919615760) at gc.c:914
914	gc.c: No such file or directory.
#0  0x00007fb56b9ec609 in sweep_big_list (sweep_full=0, pv=0x404919615760) at gc.c:914
#1  0x00007fb56b9ebfca in sweep_big (ptls=0x404919615520, sweep_full=0) at gc.c:954
#2  0x00007fb56b9eadfd in gc_sweep_other (ptls=0x404919615520, sweep_full=0) at gc.c:1407
#3  0x00007fb56b9e37ed in _jl_gc_collect (ptls=0x404919615520, collection=JL_GC_AUTO) at gc.c:3013
#4  0x00007fb56b9e2546 in jl_gc_collect (collection=JL_GC_AUTO) at gc.c:3109
#5  0x00007fb56b9d4527 in maybe_collect (ptls=0x404919615520) at gc.c:827
#6  0x00007fb56b9d4179 in jl_gc_big_alloc (ptls=0x404919615520, sz=32) at gc.c:883
#7  0x00007fb56b9d541e in jl_gc_pool_alloc (ptls=0x404919615520, pool_offset=1424, osize=32) at gc.c:1140
#8  0x000044b532265ea2 in _zip_iterate_interleave () at iterators.jl:363
#9  _zip_iterate_all () at iterators.jl:347
#10 iterate () at iterators.jl:333
#11 macro expansion () at task.jl:334
#12 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0x3fe851eb851eb852>, xs=...) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#13 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000', mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#14 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#15 japi1_#92_589 () at threadingconstructs.jl:126
#16 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#17 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606) at gf.c:2230
#18 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#19 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#20 0x00007fb56b92f26e in start_task () at task.c:687
#21 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000

Thread 1 hit Hardware access (read/write) watchpoint 1: *0x61000bc09e78

Old value = 1552986768
New value = -1145324613
0x00005ff8616b8f95 in ?? ()
#0  0x00005ff8616b8f95 in ?? ()
#1  0x00000000004a950c in __asan_memset (block=0x61000bc09e40, c=187, size=128) at /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:26
#2  0x00007fb56b9ec920 in sweep_big_list (sweep_full=0, pv=0x404919615760) at gc.c:938
#3  0x00007fb56b9ebfca in sweep_big (ptls=0x404919615520, sweep_full=0) at gc.c:954
#4  0x00007fb56b9eadfd in gc_sweep_other (ptls=0x404919615520, sweep_full=0) at gc.c:1407
#5  0x00007fb56b9e37ed in _jl_gc_collect (ptls=0x404919615520, collection=JL_GC_AUTO) at gc.c:3013
#6  0x00007fb56b9e2546 in jl_gc_collect (collection=JL_GC_AUTO) at gc.c:3109
#7  0x00007fb56b9d4527 in maybe_collect (ptls=0x404919615520) at gc.c:827
#8  0x00007fb56b9d4179 in jl_gc_big_alloc (ptls=0x404919615520, sz=32) at gc.c:883
#9  0x00007fb56b9d541e in jl_gc_pool_alloc (ptls=0x404919615520, pool_offset=1424, osize=32) at gc.c:1140
#10 0x000044b532265ea2 in _zip_iterate_interleave () at iterators.jl:363
#11 _zip_iterate_all () at iterators.jl:347
#12 iterate () at iterators.jl:333
#13 macro expansion () at task.jl:334
#14 julia_maptasks_612 (f=<error reading variable: Cannot access memory at address 0x3fe851eb851eb852>, xs=...) at /root/.julia/packages/ThreadsX/OsJPr/src/utils.jl:49
#15 0x000044b53226a153 in julia__quicksort!_615 (ys=..., xs=..., alg=..., order=..., cs=..., ys_is_result=0 '\000', mutable_xs=1 '\001') at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:74
#16 0x000044b53226ceee in macro expansion () at /root/.julia/packages/ThreadsX/OsJPr/src/quicksort.jl:136
#17 japi1_#92_589 () at threadingconstructs.jl:126
#18 0x00007fb56b8afd33 in jl_fptr_args (f=0x61000bbf2580, args=0x617000cd0010, nargs=0, m=0x6110040cbd80) at gf.c:2009
#19 0x00007fb56b8c6296 in _jl_invoke (F=0x61000bbf2580, args=0x617000cd0010, nargs=0, mfunc=0x61001a14f380, world=27606) at gf.c:2230
#20 0x00007fb56b8c7eab in jl_apply_generic (F=0x61000bbf2580, args=0x617000cd0010, nargs=0) at gf.c:2414
#21 0x00007fb56b92a10e in jl_apply (args=0x617000cd0008, nargs=1) at ./julia.h:1687
#22 0x00007fb56b92f26e in start_task () at task.c:687
#23 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x6576803a7000
=================================================================
==10918==ERROR: AddressSanitizer: heap-use-after-free on address 0x61000bc09e78 at pc 0x7fb56b9e0312 bp 0x6576803a49e0 sp 0x6576803a49d8
READ of size 8 at 0x61000bc09e78 thread T0
    #0 0x7fb56b9e0311 in gc_try_setmark /julia/src/gc.c:1642:24
    #1 0x7fb56b9df5e8 in gc_mark_scan_obj8 /julia/src/gc.c:1836:14
    #2 0x7fb56b9d75f4 in gc_mark_loop /julia/src/gc.c:2117:9
    #3 0x7fb56b9e2fba in _jl_gc_collect /julia/src/gc.c:2903:5
    #4 0x7fb56b9e2545 in jl_gc_collect /julia/src/gc.c:3109:13
    #5 0x7fb56b9d4526 in maybe_collect /julia/src/gc.c:827:9
    #6 0x7fb56b9d4178 in jl_gc_big_alloc /julia/src/gc.c:883:5
    #7 0x7fb56b9d541d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
LLVMSymbolizer: error reading file: No such file or directory
    #8 0x44b532265f37  (/memfd:julia-codegen (deleted)+0x31f37)
    #9 0x44b53226a152  (/memfd:julia-codegen (deleted)+0x36152)
    #10 0x44b53226ceed  (/memfd:julia-codegen (deleted)+0x38eed)
    #11 0x7fb56b8afd32 in jl_fptr_args /julia/src/gf.c:2009:12
    #12 0x7fb56b8c6295 in _jl_invoke /julia/src/gf.c:2230:31
    #13 0x7fb56b8c7eaa in jl_apply_generic /julia/src/gf.c:2414:12
    #14 0x7fb56b92a10d in jl_apply /julia/src/./julia.h:1687:12
    #15 0x7fb56b92f26d in start_task /julia/src/task.c:687:19

Keno (Member) commented Apr 6, 2020

This definitely looks like the compiler failing to insert a GC root. However, when I just ask it for the LLVM IR, the allocation whose root is deleted doesn't happen (it's optimized out), so there must be something else going on.
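(For reference, this is the kind of inspection I mean; shown with a toy function since the real method signature is long, and the point is precisely that the relevant allocation is already gone at this level:)

using InteractiveUtils   # provides code_llvm

function f(xs, cs)
    @inbounds for i in eachindex(xs, cs)
        cs[i] = xs[i] > 0
    end
    return cs
end

# Print the optimized LLVM IR that julia generates for f at these argument types.
code_llvm(stdout, f, Tuple{Vector{Float64}, Vector{Bool}})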

Keno (Member) commented Apr 6, 2020

Yeah, this is a regression in LateLowerGCFrame, caused by 4776a8d#diff-1b7282e2ae2a9c71d694e66b58d43e07

Reproducer:

define i8 @lost_select_decayed(i1 %arg1) {
; CHECK-LABEL: @lost_select_decayed
; CHECK: %gcframe = alloca %jl_value_t addrspace(10)*, i32 3
; CHECK: [[GEP0:%.*]] = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 2
; CHECK: store %jl_value_t addrspace(10)* %obj1, %jl_value_t addrspace(10)** [[GEP0]]
top:
    %ptls = call %jl_value_t*** @julia.ptls_states()
    %obj1 = call %jl_value_t addrspace(10) *@alloc()
    %decayed = addrspacecast %jl_value_t addrspace(10) *%obj1 to %jl_value_t addrspace(11)*
    %selected = select i1 %arg1, %jl_value_t addrspace(11)* null, %jl_value_t addrspace(11)* %decayed
    %casted = bitcast %jl_value_t addrspace(11)* %selected to i8 addrspace(11)*
    call void @jl_safepoint()
    %val = load i8, i8 addrspace(11)* %casted
    ret i8 %val
}

It'll fail to protect the allocation across the safepoint. Thanks for getting me the rr trace, @tkf. This would have been extremely difficult without it, since the problem doesn't show up by just looking at code_llvm for that function. The fix is fairly straightforward, but it's getting early here, so I'll fix it in the afternoon.

tkf (Member, Author) commented Apr 6, 2020

Thank you so, so much! I'm glad the rr trace was useful. It's magic!

So the bug is now found, and there is no need to re-upload the trace? Looking at the shell history, it seems I did run rr pack. I can try this again or create a new one.

Keno (Member) commented Apr 6, 2020

Yes, you did run rr pack, but ASAN was confusing gdb, so it looked like there were no symbols. I was able to work around that, get it working, and find the bug.

tkf (Member, Author) commented Apr 6, 2020

Great!

Keno (Member) commented Apr 7, 2020

Is the hang that requires SIGKILLing the tracee reproducible? I'm trying to reproduce, but failing. It'd be good to know what rr is up to there, so I can fix it.

tkf (Member, Author) commented Apr 7, 2020

That's julia-debug-8 and julia-debug-5.

You can click "Original output" and see what happened around when I sent the SIGKILL. The julia-debug processes hang after an abort in julia-debug-8 (from ASAN) and a segfault in julia-debug-5 (not from ASAN). I guess you need to pass those signals through in gdb to debug that? (Is that possible?)

By the way, in case it helps to guess what happened: I also remember that rr and julia-debug did not show up in top when it happened, so I was imagining some kind of deadlock during teardown in julia.

Keno (Member) commented Apr 7, 2020

Yes, something weird happened in rr when julia died. I can't reproduce it locally. If you see it again, try to attach gdb to the rr process (by finding it in ps aux; don't use the gdb line it prints, which tries to use rr to attach to julia). I'd like to know where it deadlocked.

tkf (Member, Author) commented Apr 7, 2020

I have one process like julia-debug-5 still hanging. Here is its backtrace:

(gdb) attach 10880                 
Attaching to process 10880                                                                           
[New LWP 10881]                                                                      
[New LWP 10882]                                                                    
[New LWP 10883]                                                            
[New LWP 10884]                                                                                    
[New LWP 10885]                              
[New LWP 10886]                                                                           
[New LWP 10887]
[New LWP 10888]
[New LWP 10889]
[New LWP 10890]
[New LWP 10891]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f215e68a7e8 in __waitid (idtype=P_PID, id=10893, infop=0x7ffcdd62b950, options=16777218)
    at ../sysdeps/unix/sysv/linux/waitid.c:29 
29      ../sysdeps/unix/sysv/linux/waitid.c: No such file or directory.
(gdb) bt
#0  0x00007f215e68a7e8 in __waitid (idtype=P_PID, id=10893, infop=0x7ffcdd62b950, options=16777218)
    at ../sysdeps/unix/sysv/linux/waitid.c:29
#1  0x000055b25a41d86a in rr::Task::wait_exit (this=0x55b25bf2e5c0) at /rr/src/Task.cc:149
#2  0x000055b25a41da4f in rr::Task::proceed_to_exit (this=0x55b25bf2e5c0) at /rr/src/Task.cc:163
#3  0x000055b25a336ea1 in rr::handle_ptrace_exit_event (t=0x55b25bf2e5c0) at /rr/src/RecordSession.cc:198
#4  0x000055b25a3404e4 in rr::RecordSession::record_step (this=0x55b25bf132d0) at /rr/src/RecordSession.cc:2056
#5  0x000055b25a332ef0 in rr::record (args=std::vector of length 3, capacity 8 = {...}, flags=...)
    at /rr/src/RecordCommand.cc:617
#6  0x000055b25a3339bf in rr::RecordCommand::run (this=0x55b25a8272b0 <rr::RecordCommand::singleton>,
    args=std::vector of length 3, capacity 8 = {...}) at /rr/src/RecordCommand.cc:740
#7  0x000055b25a473dde in main (argc=6, argv=0x7ffcdd62c118) at /rr/src/main.cc:268

Keno (Member) commented Apr 7, 2020

Can you show the contents of /proc/10893/status?

tkf (Member, Author) commented Apr 7, 2020

Name:   julia-debug
State:  D (disk sleep)
Tgid:   10892
Ngid:   0
Pid:    10893
PPid:   10880
TracerPid:      10880
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 1024
Groups:
NStgid: 10892
NSpid:  10893
NSpgid: 10880
NSsid:  7094
VmPeak: 21476554264 kB
VmSize: 21476304084 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   1974604 kB
VmRSS:   1781988 kB
VmData: 21475976676 kB
VmStk:         0 kB
VmExe:      1168 kB
VmLib:    110412 kB
VmPTE:    108684 kB
VmPMD:     46816 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
Threads:        5
SigQ:   1/515037
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000004226
SigIgn: 0000000000001010
SigCgt: 00000001c0000aea
CapInh: 00000000a80c25fb
CapPrm: 00000000a80c25fb
CapEff: 00000000a80c25fb
CapBnd: 00000000a80c25fb
CapAmb: 0000000000000000
Seccomp:        2
Cpus_allowed:   0000,00000010
Cpus_allowed_list:      4
Mems_allowed:   00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        57
nonvoluntary_ctxt_switches:     0

Output of `lsof -p 10892`:
COMMAND     PID USER   FD      TYPE             DEVICE  SIZE/OFF      NODE NAME     
julia-deb 10892 root  cwd       DIR               0,56      4096       215 /julia    
julia-deb 10892 root  rtd       DIR               0,56      4096         2 /         
julia-deb 10892 root  txt       REG               0,56   9595832     85057 /julia/usr/bin/julia-debug
julia-deb 10892 root  mem       REG               0,56        32     80648 /obj/share/rr/rr_page_64
julia-deb 10892 root  DEL       REG               0,56               85344 /tmp/rr-shared-preload_thread_locals-10892-0
julia-deb 10892 root  DEL       REG                0,5           233876681 /memfd:julia-codegen
julia-deb 10892 root  mem       REG               0,56    229957     17909 /julia/usr/lib/libopenlibm.so.3.0
julia-deb 10892 root  mem       REG               0,56 153989760     85059 /julia/usr/lib/julia/sys-debug.so
julia-deb 10892 root  mem       REG               0,56     17136     19069 /julia/usr/lib/libsuitesparse_wrapper.so
julia-deb 10892 root  mem       REG               0,56    655808     17922 /julia/usr/lib/libgmp.so.10.3.2
julia-deb 10892 root  mem       REG               0,56   2030544        33 /lib/x86_64-linux-gnu/libc-2.27.so
julia-deb 10892 root  mem       REG               0,56   1995888     22380 /julia/usr/lib/libgfortran.so.4
julia-deb 10892 root  mem       REG               0,56     47652     19050 /julia/usr/lib/libccolamd.so.2.9.6
julia-deb 10892 root  mem       REG               0,56     31250     19051 /julia/usr/lib/libcolamd.so.2.9.6
julia-deb 10892 root  mem       REG               0,56  31767024     85053 /julia/usr/lib/libjulia-debug.so.1.5
julia-deb 10892 root  mem       REG               0,56     39059     19040 /julia/usr/lib/libamd.so.2.4.6
julia-deb 10892 root  mem       REG               0,56   2406176     19346 /julia/usr/lib/libmpfr.so.6.0.2         
julia-deb 10892 root  mem       REG               0,56  30301432     19502 /julia/usr/lib/libopenblas64_.0.3.7.so   
julia-deb 10892 root  mem       REG               0,56   1005880     19054 /julia/usr/lib/libcholmod.so.3.0.13    
julia-deb 10892 root  mem       REG               0,56     10896     19072 /julia/usr/lib/libsuitesparseconfig.so.5.4.0
julia-deb 10892 root  mem       REG               0,56     18080     17907 /julia/usr/lib/libdSFMT.so   
julia-deb 10892 root  mem       REG               0,56    577384     19087 /julia/usr/lib/libpcre2-8.so.0.7.0
julia-deb 10892 root  mem       REG               0,56     43470     19046 /julia/usr/lib/libcamd.so.2.4.6
julia-deb 10892 root  mem       REG               0,56    388656     80784 /obj/lib/rr/librrpreload.so
julia-deb 10892 root  mem       REG               0,56     14560        31 /lib/x86_64-linux-gnu/libdl-2.27.so
julia-deb 10892 root  mem       REG               0,56    144976        77 /lib/x86_64-linux-gnu/libpthread-2.27.so
julia-deb 10892 root  mem       REG               0,56    101488     22379 /julia/usr/lib/libgcc_s.so.1
julia-deb 10892 root  mem       REG               0,56  61870856     21896 /julia/usr/lib/libLLVM-9jl.so
julia-deb 10892 root  DEL       REG               0,56               85355 /tmp/rr-shared-syscallbuf.10896-10892-5
julia-deb 10892 root  DEL       REG               0,56               85354 /tmp/rr-shared-syscallbuf.10895-10892-4
julia-deb 10892 root  DEL       REG               0,56               85353 /tmp/rr-shared-syscallbuf.10894-10892-3
julia-deb 10892 root  DEL       REG               0,56               85352 /tmp/rr-shared-syscallbuf.10893-10892-2
julia-deb 10892 root  mem       REG               0,56    170960        26 /lib/x86_64-linux-gnu/ld-2.27.so
julia-deb 10892 root  DEL       REG               0,56               85345 /tmp/rr-shared-syscallbuf.10892-10892-1
julia-deb 10892 root  mem       REG               0,56     31680        95 /lib/x86_64-linux-gnu/librt-2.27.so
julia-deb 10892 root  mem       REG               0,56   1594864       840 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
julia-deb 10892 root  mem       REG               0,56   1700792        93 /lib/x86_64-linux-gnu/libm-2.27.so
julia-deb 10892 root  mem       REG               0,56    264208     22381 /julia/usr/lib/libquadmath.so.0
julia-deb 10892 root    0u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root    1u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root    2u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root    3u  a_inode               0,11         0      8416 [eventpoll]
julia-deb 10892 root    4r     FIFO               0,10       0t0 233876667 pipe
julia-deb 10892 root    5w     FIFO               0,10       0t0 233876667 pipe
julia-deb 10892 root    6r     FIFO               0,10       0t0 233876668 pipe
julia-deb 10892 root    7w     FIFO               0,10       0t0 233876668 pipe
julia-deb 10892 root    8u  a_inode               0,11         0      8416 [eventfd]
julia-deb 10892 root    9u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root   10u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root   11r      CHR                1,3       0t0         6 /dev/null
julia-deb 10892 root   12u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root   13u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root   14u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root   15u      CHR              136,5       0t0         8 /dev/pts/5
julia-deb 10892 root   16u      REG                0,5 134217728 233876681 /memfd:julia-codegen (deleted)
julia-deb 10892 root   17r      CHR                1,9       0t0        11 /dev/urandom
julia-deb 10892 root   18r     FIFO               0,10       0t0 233876709 pipe
julia-deb 10892 root   19w     FIFO               0,10       0t0 233876709 pipe
julia-deb 10892 root  100u  a_inode               0,11         0      8416 [perf_event]
julia-deb 10892 root  101u  a_inode               0,11         0      8416 [perf_event]
julia-deb 10892 root  102u  a_inode               0,11         0      8416 [perf_event]
julia-deb 10892 root  103u  a_inode               0,11         0      8416 [perf_event]
julia-deb 10892 root  104u  a_inode               0,11         0      8416 [perf_event]
julia-deb 10892 root  999w      CHR                1,3       0t0         6 /dev/null
julia-deb 10892 root 1001u     unix 0x0000000000000000       0t0 233892917 type=STREAM

Keno (Member) commented Apr 7, 2020

rr fix in rr-debugger/rr#2493
