Rare deadlock and segmentation fault in a parallel quicksort implementation (with an MWE) #35341
If you can reproduce it under rr, that'll be easiest to debug. |
Thanks! But https://github.com/mozilla/rr/wiki/Usage says that rr runs all tracee threads on a single core. So I guess it cannot be used to debug a multi-threading bug? We never observed the bug with a single thread.

root@459e9488a96a:/julia# rr --version
rr version 5.3.0
root@459e9488a96a:/julia# rr record ./julia -e 'using BenchmarkTools, ThreadsX, Random; seed = Ref(0); while true; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed[] += 1)), 0:0.01:1, 1_000_000))); end'
[FATAL /home/roc/rr/rr/src/PerfCounters.cc:317:start_counter() errno: EPERM] Failed to initialize counter
=== Start rr backtrace:
rr(_ZN2rr13dump_rr_stackEv+0x3b)[0x5f229b]
rr(_ZN2rr15notifying_abortEv+0x47)[0x5f2327]
rr[0x52cba5]
rr[0x52dede]
rr(_ZN2rr12PerfCounters23default_ticks_semanticsEv+0x1a)[0x52e9ca]
rr(_ZN2rr7SessionC1Ev+0x139)[0x5bc369]
rr(_ZN2rr13RecordSessionC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorIS6_SaIS6_EESD_RKNS_20DisableCPUIDFeaturesENS0_16SyscallBufferingEiNS_7BindCPUES8_PKNS_9TraceUuidE+0x4d)[0x54069d]
rr(_ZN2rr13RecordSession6createERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EESB_RKNS_20DisableCPUIDFeaturesENS0_16SyscallBufferingEhNS_7BindCPUERKS7_PKNS_9TraceUuidE+0xba9)[0x541769]
rr(_ZN2rr13RecordCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0xe14)[0x535c24]
rr(main+0x353)[0x4ac743]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f935c454b97]
rr(_start+0x29)[0x4acb59]
=== End rr backtrace
Aborted (core dumped)
|
Ah sorry, I missed that https://docs.julialang.org/en/v1/devdocs/debuggingtips/#Reproducing-concurrency-bugs-with-rr-1 already mentions this. |
This might be due to Docker; you can try using a privileged container to get access to the perf counters. |
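(For reference, a minimal way to get working perf counters inside a container; the image name below is a placeholder and exact flags depend on your setup:)

```sh
# On the host: let processes read the CPU performance counters rr needs
# (rr requires kernel.perf_event_paranoid <= 1).
sudo sysctl kernel.perf_event_paranoid=1

# Run the container with extended privileges so perf_event_open works inside.
docker run --privileged -it my-julia-image /bin/bash
```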
Thanks! I followed https://github.com/mozilla/rr/wiki/Docker and then just started the MWE under rr. It runs as usual (though of course slowly). Let's hope it hangs. |
So I ran this for more than three hours with no hang so far. |
If before it hung only after 45 minutes, you'll probably need to run it at least 10 hours with that many threads. You may also want to kick off multiple runs in parallel to maximize your chances of catching it. |
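(Kicking off parallel recordings could look roughly like this; the script name, trace directory, and run count are placeholders:)

```sh
# Record several independent runs of the MWE in parallel; each run gets its
# own trace directory and log file so a crash in any of them is preserved.
for i in 1 2 3 4; do
  mkdir -p /tmp/traces/$i
  _RR_TRACE_DIR=/tmp/traces/$i JULIA_NUM_THREADS=13 \
    rr record ./julia mwe.jl > /tmp/traces/run-$i.log 2>&1 &
done
wait
```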
The hang can happen within a few minutes if I'm lucky. It's the segfault that is super rare. I was worried that running under rr would make it even harder to trigger. |
With ASAN (#35338), running the MWE now somehow gives me the segfault that @chriselrod observed (see the OP) reliably within a short amount of time (a few minutes):

Though nothing specific seems to be printed by the ASAN runtime itself. I get a different kind of segfault too:
|
Running without multi-threading (without setting JULIA_NUM_THREADS) gives:
|
What's the best next step? I'm trying this with |
If you use a debug build and set ASAN_SYMBOLIZER_PATH to the llvm-symbolizer you've built in the process, you'll get debug info in the ASAN trace. Also see https://github.com/google/sanitizers/wiki/AddressSanitizerAndDebugger; ASAN_OPTIONS=abort_on_error=1 makes it easier to use ASAN with rr. |
Thank you @maleadt! So here is an output with julia-debug:

root@919704e44cd5:/julia# ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 ASAN_SYMBOLIZER_PATH=$PWD/usr/tools/llvm-symbolizer usr/bin/julia-debug -e 'using BenchmarkTools, ThreadsX, Random; for seed in 1:5000; @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000))); end'
seed = 1
45.948 ms (19980 allocations: 19.43 MiB)
seed = 2
45.620 ms (20313 allocations: 19.48 MiB)
seed = 3
48.309 ms (21386 allocations: 19.66 MiB)
seed = 4
46.958 ms (21283 allocations: 19.63 MiB)
seed = 5
46.954 ms (20681 allocations: 19.54 MiB)
seed = 6
46.379 ms (19937 allocations: 19.42 MiB)
seed = 7
47.260 ms (20512 allocations: 19.52 MiB)
seed = 8
=================================================================
==23720==ERROR: AddressSanitizer: heap-use-after-free on address 0x610000c1ee78 at pc 0x7fd6fa45ff42 bp 0x7fd6c73219e0 sp 0x7fd6c73219d8
READ of size 8 at 0x610000c1ee78 thread T0
#0 0x7fd6fa45ff41 in gc_try_setmark /julia/src/gc.c:1642:24
#1 0x7fd6fa45f218 in gc_mark_scan_obj8 /julia/src/gc.c:1836:14
#2 0x7fd6fa457224 in gc_mark_loop /julia/src/gc.c:2117:9
#3 0x7fd6fa462bea in _jl_gc_collect /julia/src/gc.c:2903:5
#4 0x7fd6fa462175 in jl_gc_collect /julia/src/gc.c:3109:13
#5 0x7fd6fa454156 in maybe_collect /julia/src/gc.c:827:9
#6 0x7fd6fa453da8 in jl_gc_big_alloc /julia/src/gc.c:883:5
#7 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
LLVMSymbolizer: error reading file: No such file or directory
#8 0x7fd6d08185b3 (/memfd:julia-codegen (deleted)+0x325b3)
#9 0x7fd6d0818f32 (/memfd:julia-codegen (deleted)+0x32f32)
#10 0x7fd6d081ed89 (/memfd:julia-codegen (deleted)+0x38d89)
#11 0x7fd6fa32f342 in jl_fptr_args /julia/src/gf.c:2009:12
#12 0x7fd6fa3458a5 in _jl_invoke /julia/src/gf.c:2230:31
#13 0x7fd6fa3474ba in jl_apply_generic /julia/src/gf.c:2414:12
#14 0x7fd6fa3a979d in jl_apply /julia/src/./julia.h:1687:12
#15 0x7fd6fa3ae8fd in start_task /julia/src/task.c:687:19
0x610000c1ee78 is located 56 bytes inside of 128-byte region [0x610000c1ee40,0x610000c1eec0)
freed by thread T0 here:
#0 0x4a8a20 in free /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cc:123
#1 0x7fd6fa46c181 in jl_free_aligned /julia/src/gc.c:255:5
#2 0x7fd6fa46c652 in sweep_big_list /julia/src/gc.c:942:13
#3 0x7fd6fa46bc49 in sweep_big /julia/src/gc.c:954:9
#4 0x7fd6fa46aa7c in gc_sweep_other /julia/src/gc.c:1407:5
#5 0x7fd6fa46341c in _jl_gc_collect /julia/src/gc.c:3013:5
#6 0x7fd6fa462175 in jl_gc_collect /julia/src/gc.c:3109:13
#7 0x7fd6fa454156 in maybe_collect /julia/src/gc.c:827:9
#8 0x7fd6fa453da8 in jl_gc_big_alloc /julia/src/gc.c:883:5
#9 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
#10 0x7fd6d0818311 (/memfd:julia-codegen (deleted)+0x32311)
#11 0x7fd6d0818f32 (/memfd:julia-codegen (deleted)+0x32f32)
#12 0x7fd6d081ed89 (/memfd:julia-codegen (deleted)+0x38d89)
#13 0x7fd6fa3458a5 in _jl_invoke /julia/src/gf.c:2230:31
#14 0x7fd6fa3474ba in jl_apply_generic /julia/src/gf.c:2414:12
#15 0x7fd6fa3a979d in jl_apply /julia/src/./julia.h:1687:12
#16 0x7fd6fa3ae8fd in start_task /julia/src/task.c:687:19
previously allocated by thread T0 here:
#0 0x4a998d in posix_memalign /workspace/srcdir/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cc:226
#1 0x7fd6fa45423a in jl_malloc_aligned /julia/src/gc.c:235:9
#2 0x7fd6fa453e6d in jl_gc_big_alloc /julia/src/gc.c:891:30
#3 0x7fd6fa45504d in jl_gc_pool_alloc /julia/src/gc.c:1140:12
#4 0x7fd6d080ed92 (/memfd:julia-codegen (deleted)+0x28d92)
SUMMARY: AddressSanitizer: heap-use-after-free /julia/src/gc.c:1642:24 in gc_try_setmark
Shadow bytes around the buggy address:
0x0c208017bd70: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c208017bd80: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c208017bd90: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
0x0c208017bda0: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
0x0c208017bdb0: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
=>0x0c208017bdc0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd[fd]
0x0c208017bdd0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c208017bde0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c208017bdf0: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
0x0c208017be00: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
0x0c208017be10: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==23720==ABORTING

Here is a backtrace:
Thanks for the tips! I tried them. |
It looks like I was very lucky in the first runs. I often get a segfault rather than the abort from ASAN:
|
Potentially, I'd look to see if you're maybe accessing it somewhere without rooting it. That said, an rr trace is still the best way to debug these things, so I'd just kick it off in the background and let it run even if it takes a day or two to crash. |
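(For context, "rooting" means keeping an object visibly alive to the GC while its memory is being used; here is a minimal sketch of doing that explicitly in user code, not code from this issue:)

```julia
function sum_via_pointer(v::Vector{Float64})
    s = 0.0
    p = pointer(v)
    # GC.@preserve roots `v` for the duration of the block, so the GC cannot
    # free its memory while we read through the raw pointer below.
    GC.@preserve v begin
        for i in eachindex(v)
            s += unsafe_load(p, i)
        end
    end
    return s
end
```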
I think the only "unsafe" thing I'm doing is using `@inbounds`. I also started an rr run in the background. |
I can try running it myself to see if there's some sort of pathological rr performance problem, but I suspect it's probably just the high thread count. |
Thanks! Actually, with ASAN I can reproduce the problem with a single thread. Maybe that makes it easier to debug.
|
Without ASAN, I never observed the problem with a single thread. |
It's possible that there's a bad interaction with rr. Let me take a look. |
Thanks! FYI, I needed a bit of manual intervention to build the current master with ASAN. |
The incompatibility between asan/LLVM/rr is resolved by https://reviews.llvm.org/D70581. You can grab LLVM 10 from Yggdrasil and build with that to get an appropriate version of compiler-rt that has this fix. |
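(One way to grab it, assuming the `LLVM_full_jll` package from Yggdrasil provides the v10 build; the package name is my assumption, not from this thread:)

```julia
# Install the LLVM binaries from Yggdrasil and print where they live;
# the tools/ subdirectory contains clang and llvm-config.
using Pkg
Pkg.add("LLVM_full_jll")  # assumes a v10 build is available
using LLVM_full_jll
println(LLVM_full_jll.artifact_dir)
```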
Also, if you apply rr-debugger/rr#2488 to your rr installation, you'll get significantly lower overhead on this workload (2x rather than 4x). Unfortunately, I still can't reproduce the crash here (I saw it once, but didn't have rr attached). |
I installed the LLVM 10 artifact and tried:

tooldir=/root/.julia/artifacts/8cd56e3e81e5770720c31304026f3b23f129153e/tools
make clean
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 make CC=$tooldir/clang CXX=$tooldir/clang LLVM_CONFIG=$tooldir/llvm-config USECLANG=1 SANITIZE=1

but it failed with the following error:
Is this the correct way to use LLVM from Yggdrasil? I also tried creating a clang++ symlink:

cd $tooldir
ln -s clang-10 clang++
cd -
ASAN_OPTIONS=detect_leaks=0:allow_user_segv_handler=1 PATH="$tooldir:$PATH" make USECLANG=1 SANITIZE=1

but I got the same error. |
|
I'm trying rr+ASAN with the patched rr. |
I can get the errors under rr now. I tried both with and without the rr patch. |
Now that we have an rr trace, debugging it will be quite easy. I'll take a look in the morning. |
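(Replaying a trace is then deterministic; roughly:)

```sh
# Replay the most recent recording; rr drops you into gdb at the first event.
rr replay
# Inside gdb: run forward to the crash, then execute backwards from it.
(gdb) continue
(gdb) reverse-continue
```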
Awesome. Thanks in advance! |
Ok, I've finally managed to get my setup back into a state where I can successfully replay this. However, the debug symbols seem to be missing from the trace, so it's a bit hard to debug (not impossible, but it would be easier with symbols). Did you run `rr pack` before uploading? |
Also, could you show the output of |
Alright, got it working again. Here's what rr spits out. The object in question is a subarray. Here's its life:
|
This definitely looks like the compiler is failing to insert a GC root. However, when I just ask it for the LLVM IR, that allocation where the root is deleted doesn't happen (it's optimized out), so there must be something else going on. |
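(That inspection step looks roughly like this; `f` here is a stand-in, not the actual reproducer:)

```julia
using InteractiveUtils

f(xs) = sum(view(xs, 2:length(xs)))  # stand-in function, not the real reproducer

# Optimized IR: the allocation may be elided entirely, hiding the bug.
@code_llvm optimize=true f(rand(10))
# Unoptimized IR keeps the allocation (and any GC frame slots) visible.
@code_llvm optimize=false f(rand(10))
```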
Yeah, this is a regression in LateLowerGCFrame, caused by 4776a8d#diff-1b7282e2ae2a9c71d694e66b58d43e07. Reproducer:
It'll fail to protect the allocation across the safepoint. Thanks for getting me the rr trace @tkf - this would have been extremely difficult without it, since the problem doesn't show up by just looking at the LLVM IR. |
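(Schematically, the miscompiled pattern is of this shape; a hypothetical sketch, not the actual reproducer:)

```julia
mutable struct Box
    x::Int
end

@noinline opaque() = nothing  # opaque call containing a safepoint

function use_after_safepoint()
    b = Box(42)   # heap allocation that needs a GC root
    opaque()      # safepoint: the GC may run here; if the root for `b` is
                  # wrongly dropped by LateLowerGCFrame, `b` can be freed
    return b.x    # ...making this read a use-after-free
end
```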
Thank you so, so much! I'm glad the trace helped. So the bug is now found and there is no need to re-upload the trace? Though looking at the shell history, it looks like I did run `rr pack`. |
Yes, you did do `rr pack`. |
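(For anyone following along, `rr pack` copies the files a trace references into the trace directory itself so it stays replayable on another machine; paths below are placeholders:)

```sh
# Make the trace self-contained, then archive it for upload.
rr pack /path/to/trace
tar czf trace.tar.gz -C /path/to trace
```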
Great! |
Is the hang that requires SIGKILLing the tracee reproducible? I'm trying to reproduce, but failing. It'd be good to know what rr is up to there, so I can fix it. |
That's in the output I uploaded; you can click "Original output" and see what happened around when I sent the SIGKILL. By the way, just in case it helps to guess what happened, I also remember that some julia-debug processes were left hanging around. |
Yes, something weird happened in rr when julia died. I can't reproduce it locally. If you see it again, try to attach gdb to the rr process (by finding it in the process list) and grab a backtrace. |
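(Roughly:)

```sh
# Find the stuck rr process and attach to it.
pgrep -a rr
gdb -p <rr-pid>
# Then, inside gdb, capture what every thread is doing:
(gdb) thread apply all bt
```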
I have one process like julia-debug-5 still hanging. Here is its backtrace:
|
Can you show the output of /proc/10893/status? |
Output of `lsof -p 10892`:
|
The rr fix is in rr-debugger/rr#2493. |
I'm having trouble finding the cause of a rare deadlock in my parallel quicksort. I tried my best, with big help from @chriselrod, to track down the problem, but I still have no idea how to fix it.
I am reporting this here since I think there is some chance that there is a bug in `julia` (scheduler? compiler?). (In other words, I'm hopelessly lost now and all I can do is "blame" the upstream.)

MWE: Install the latest ThreadsX (v0.1.1) and BenchmarkTools. Then running the following code eventually causes a deadlock when `JULIA_NUM_THREADS` is appropriate (for me it's 13; for @chriselrod it's 8 but not 7):
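(The code itself is elided above; the commands shown later in the thread indicate it was essentially this loop:)

```julia
using BenchmarkTools, ThreadsX, Random

# Repeatedly sort a large random vector, printing each seed so that a
# failing run can be reported and replayed.
for seed in 1:5000
    @btime ThreadsX.sort($(rand(MersenneTwister(@show(seed)), 0:0.01:1, 1_000_000)))
end
```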
I've tried `git bisect` with this script and found that #32599 changes the way the problem(s) manifest. With f5dbc47, I observed this segmentation fault:
However, I don't understand how this function `partition_sizes!` can cause a segfault.

@chriselrod also observed another segfault with 65b8e7e:
It is puzzling to me how the code involved in the above stacktrace can cause a segfault:
- `-` at ./int.jl:52
- `unsafe_length` at ./range.jl:517
- `unsafe_indices` at ./abstractarray.jl:99
- `_indices_sub` at ./subarray.jl:409
- `axes` at ./subarray.jl:404
- `axes1` at ./abstractarray.jl:95
- `eachindex` at ./abstractarray.jl:267
- `eachindex` at ./abstractarray.jl:270
- `eachindex` at ./abstractarray.jl:260
- `partition_sizes!` at ThreadsX.jl/src/quicksort.jl:163

We (@chriselrod and I) tried to get a better error report by running the MWE with `--check-bounds=yes`, but we couldn't get a hang or segfault even after running the MWE for hours.

I highly appreciate any help/advice from core devs or anyone who knows how threading works in Julia. I'm currently trying to build an ASAN-enabled `julia` with help from @vchuravy (#35338), hoping that it gives us some information. But I'm not sure how much I can discover by doing this, as I don't know anything about LLVM or C++.