64-bit Travis tests failing non-deterministically after Travis upgrade #2641
Running the full suite I can't reproduce quite the same thing, but on a Trusty VM with 4.4.0-93 the linux.fib-conflict* tests (and unit_tests) fail consistently. We have several issues revealed by looking at those:
relocate_dynamorio is not preserving rsi or rdi even though they're ...
This is true on laptop too: but the munmap of old_libdr_base just fails b/c ...
So the assert can't run that early.
Another issue from vvar+vdso inside libdynamorio: handle is_in_dynamo_dll(). The problem is that this is all pre-heap, which is why we aren't using the general code for app libs that makes an array of segments and handles gaps.
I tried a Travis run fixing the non-NULL old_libdr_base and disabling the asserts and it does get further: the handful of failures are now like this:
They're all very similar: they exit around 224 fragments in.
Since I could not reproduce anywhere else, I resorted to running all tests at -loglevel 2 and adding scripting in runsuite to look through all the logs and print relevant info from the failing tests. Here's the info:
I found that block in my VM:
It's ld-linux.so, reading from vdso + 0x38:
Adding the maps info to what my script gathers:
Where are vvar and vdso?
The crash is referencing the gap in libdynamorio ...
It looks like code from #1659 is the culprit:
The non-determinism must be from the kernel, whether or not ...
I don't think we want to try and avoid the gap via linker flags ...
I put in a hack to reload DR if vvar is in the gap (plus a hack ...
So now to figure out a clean way to do this. If it's actually ...
Fixes 3 problems related to vvar+vdso in libdynamorio's text-data gap: 1) Ensure old_libdr_base is NULL by not relying on the calling convention during early injection. 2) Handle asserts/curiosities inside memquery code by adding is_readable_without_exception_query_os_noblock() and memquery_from_os_will_block() to avoid deadlock on locks used on UNIX for memory queries. 3) Detect something in the text-data gap and reload libdynamorio during early injection as the rest of DR assumes there's nothing there. Fixes #2641
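Item 3 above hinges on noticing that some foreign mapping sits between libdynamorio's own segments. As a rough illustration only (this is not DR's code, which has to do the check pre-heap in C), here is a minimal Python sketch that parses text in the /proc/<pid>/maps format and reports anything inside a library's address span that does not belong to that library. The helper names, the library path, and every address in SAMPLE are made up for the example.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str  # "" for anonymous mappings


def parse_maps(text: str) -> List[Mapping]:
    """Parse /proc/<pid>/maps-style text into Mapping tuples."""
    out = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a maps line
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        out.append(Mapping(start, end, fields[1], path))
    return out


def intruders_in_lib_span(maps: List[Mapping], lib_suffix: str) -> List[Mapping]:
    """Return mappings that overlap the library's file-backed span
    (first segment start to last segment end) but are not the library."""
    segs = [m for m in maps if m.path.endswith(lib_suffix)]
    if not segs:
        return []
    lo = min(m.start for m in segs)
    hi = max(m.end for m in segs)
    return [m for m in maps
            if not m.path.endswith(lib_suffix) and m.start < hi and m.end > lo]


# Hypothetical layout: vvar and vdso landed between the text and data segments.
SAMPLE = """\
7f0000000000-7f0000200000 r-xp 00000000 08:01 1234 /opt/dr/lib64/libdynamorio.so
7f00002fb000-7f00002fd000 r--p 00000000 00:00 0 [vvar]
7f00002fd000-7f00002ff000 r-xp 00000000 00:00 0 [vdso]
7f0000400000-7f0000440000 rw-p 00200000 08:01 1234 /opt/dr/lib64/libdynamorio.so
7f0000440000-7f0000480000 rw-p 00000000 00:00 0
"""

if __name__ == "__main__":
    for m in intruders_in_lib_span(parse_maps(SAMPLE), "libdynamorio.so"):
        print(f"{m.start:#x}-{m.end:#x} {m.perms} {m.path or '<anon>'}")
```

The anonymous region that starts exactly at the end of the last file-backed segment is not reported, since it lies after the span rather than inside it.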
While the PR run had everything pass, the final commit run had one test fail with the same unmapped-vdso app-crash symptoms! https://travis-ci.org/DynamoRIO/dynamorio/jobs/280172168
So there's still some scenario my fix doesn't handle??
Re-opening to finish the missing piece here. First I added an assert that the gap is empty when we clobber it for #1659.
What is that rw anon page: To compare, here's from my VM:
(gdb) p /x 0x00007f64a9b1d000-0x00007f64a9ad4000
So the .data and .bss have the same sizes: this page seems separate from them.
Another issue is that on a reload the old .bss is not being freed. Even a check for an empty comment is not sufficient because the kernel incorrectly labels our .bss as "[heap]" in some cases, as shown here:
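A minimal sketch of a label-independent way to locate the .bss, using made-up addresses rather than the listing from the affected machine: instead of trusting the name column, look for the writable, non-file-backed region that starts exactly where the library's last file-backed segment ends. The adjacency assumption, the function name, and the library path are all hypothetical and not taken from DR's implementation.

```python
from typing import List, Optional, Tuple

# (start, end, perms, label) as read from a maps-style listing
Mapping = Tuple[int, int, str, str]


def find_lib_bss(maps: List[Mapping], lib_suffix: str) -> Optional[Tuple[int, int]]:
    """Locate the library's .bss by adjacency, ignoring whether the kernel
    labelled it "" or "[heap]"."""
    file_backed_ends = [end for (_s, end, _p, label) in maps
                        if label.endswith(lib_suffix)]
    if not file_backed_ends:
        return None
    data_end = max(file_backed_ends)
    for start, end, perms, label in maps:
        # Accept any non-file-backed rw region glued to the end of .data.
        if start == data_end and perms.startswith("rw") and not label.startswith("/"):
            return (start, end)
    return None


# Hypothetical layout where the .bss has been mis-labelled as "[heap]".
MAPS = [
    (0x7f0000000000, 0x7f0000200000, "r-xp", "/opt/dr/lib64/libdynamorio.so"),
    (0x7f0000400000, 0x7f0000440000, "rw-p", "/opt/dr/lib64/libdynamorio.so"),
    (0x7f0000440000, 0x7f0000480000, "rw-p", "[heap]"),
]

if __name__ == "__main__":
    print(find_lib_bss(MAPS, "libdynamorio.so"))
```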
I found a machine where I can reproduce the vvar clobber even with my fix (if I run hundreds of times in a loop): I don't see the weird single page listed above, but I do see why it's being clobbered. I merged the reload_dynamorio case for the gap with the app conflict but didn't adjust the conflict bounds, so the reload for the gap ends up making a temp mmap that is very large, covering from the app to the old lib. This can clobber vvar and vdso.
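To make the bounds problem concrete, here is a small sketch of an overlap check between a proposed temporary blocking range and existing mappings. It only illustrates the arithmetic; the function name and all addresses are invented for the example, and it says nothing about how the real conflict-mmap code is structured.

```python
from typing import List, Tuple

Region = Tuple[int, int]          # [start, end)
Named = Tuple[int, int, str]      # (start, end, name)


def clobbered_by(block: Region, mappings: List[Named]) -> List[str]:
    """Names of existing mappings that a fixed mapping over `block` would overwrite."""
    lo, hi = block
    return [name for (start, end, name) in mappings if start < hi and end > lo]


# Hypothetical layout: vvar/vdso sit between the app image and the old library.
APP: Region = (0x7f0000000000, 0x7f0000010000)
OLD_LIB: Region = (0x7f0000200000, 0x7f0000600000)
EXISTING: List[Named] = [
    (0x7f0000100000, 0x7f0000102000, "[vvar]"),
    (0x7f0000102000, 0x7f0000104000, "[vdso]"),
]

wide = (APP[0], OLD_LIB[1])   # bounds left over from the app-conflict case
narrow = OLD_LIB              # bounds limited to the old library itself

if __name__ == "__main__":
    print("wide:", clobbered_by(wide, EXISTING))
    print("narrow:", clobbered_by(narrow, EXISTING))
```

With these made-up numbers the wide range overlaps both kernel pages while the narrow one overlaps neither, which is the effect described above.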
I also reproduced the mysterious extra page at the end of the gap. It's our emulated brk! Breaking at mmap_syscall: it's init_emulated_brk() called from
The top-down packing of the libs by the kernel means we should request extra space along with the exe's ELF space to put the brk after it.
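As a back-of-the-envelope sketch of that sizing, assuming 4 KiB pages and a caller-chosen brk reserve (both assumptions, as is the function name), the single request would cover the page-aligned ELF span plus the reserve, with the brk starting right after the image:

```python
PAGE_SIZE = 4096  # assumed page size


def align_up(value: int, alignment: int = PAGE_SIZE) -> int:
    """Round value up to a multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def app_reservation(elf_span: int, brk_reserve: int):
    """Return (total_size, brk_offset) for one mapping request that holds
    the executable image followed by room for the emulated brk."""
    image_size = align_up(elf_span)
    total_size = image_size + align_up(brk_reserve)
    return total_size, image_size


if __name__ == "__main__":
    total, brk_off = app_reservation(elf_span=0x2345, brk_reserve=0x1000)
    print(hex(total), hex(brk_off))
```

Reserving both in one request means the brk cannot be squeezed out by whatever the kernel packs directly against the image.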
Fixes 3 problems related to the kernel placing vvar+vdso in libdynamorio's text-data gap: 1) Ensures old_libdr_base is NULL by not relying on the calling convention during early injection. 2) Handles asserts/curiosities inside memquery code by adding is_readable_without_exception_query_os_noblock() and memquery_from_os_will_block() to avoid deadlock on locks used on UNIX for memory queries. 3) Detects something in the text-data gap and reloads libdynamorio during early injection as the rest of DR assumes there's nothing there. Fixes #2641
Solves several remaining issues beyond what the first #2641 commit addressed:
Moves the libdynamorio text-data gap filling earlier to prevent our own mmaps, such as for the brk, from landing there.
Adds explicit space for the brk in the initial app mmap, to deal with top-down packing by the kernel for PIE which otherwise leaves no space for the brk.
Refactors the scheme of "conflict mmaps" used for reloading libdynamorio to avoid clobbering existing mappings.
Adds more sanity check asserts regarding the libdynamorio text-data gap.
On a reload, properly unmaps the libdynamorio .bss, even when the kernel mis-labels it as "[heap]".
Fixes #2641
Split from the 32-bit crashes of #2634
For 64-bit, a completely different subset of tests fails each run. The tests have zero output as #2640 shows. I cannot reproduce in a VM that is similar to the Travis VM.
https://travis-ci.org/DynamoRIO/dynamorio/jobs/276032337
https://travis-ci.org/DynamoRIO/dynamorio/jobs/277148078