Unloading a Rust dylib with TLS used segfaults on OSX #28794

Closed
alexcrichton opened this issue Oct 1, 2015 · 17 comments
Labels
A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug. O-macos Operating system: macOS

Comments

@alexcrichton
Member

Example code

The problem here is that we register a TLS destructor via _tlv_atexit the first time the TLS is accessed (e.g. when the dylib's function is called). Once dlclose has unloaded the library, the destructor function is no longer mapped, so a fault happens when the thread exits and tries to run its destructors.

I'm not entirely sure how we might handle this. Perhaps there's a way to compile dylibs such that the TLS access is OK? Perhaps we should hook an "unload" event and deregister (i.e. leak) TLS destructors? Either way, this seems like a good thing to track!
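As a minimal sketch of the mechanism described above (not the code from the linked example), the following program shows that a Rust thread-local's destructor is only registered once the value is first accessed, and that it runs when the thread exits. The names `Guard`, `GUARD`, `DROPS`, and `drops_after_thread` are hypothetical and exist only for this illustration. In a dylib, the body of `drop` would live in the library's own code, which is why unloading the library before the thread exits leaves a dangling destructor.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Counts how many times the thread-local destructor has run.
static DROPS: AtomicUsize = AtomicUsize::new(0);

struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        // In a dylib, this function's code belongs to the library image.
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

thread_local! {
    // Lazily initialized: the destructor is registered on first access.
    static GUARD: Guard = Guard;
}

/// Spawns a thread, optionally touches the thread-local, and returns how many
/// destructor runs were observed once the thread has been joined.
fn drops_after_thread(touch: bool) -> usize {
    let before = DROPS.load(Ordering::SeqCst);
    thread::spawn(move || {
        if touch {
            GUARD.with(|_| ());
        }
    })
    .join()
    .unwrap();
    DROPS.load(Ordering::SeqCst) - before
}

fn main() {
    println!("untouched: {}", drops_after_thread(false));
    println!("touched:   {}", drops_after_thread(true));
}
```

A thread that never touches the thread-local registers nothing, which matches the observation later in the thread that removing the `println!` call from the library avoids the crash.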

@alexcrichton alexcrichton added the O-macos Operating system: macOS label Oct 1, 2015
@ranma42
Contributor

ranma42 commented Oct 2, 2015

_dyld_register_func_for_remove_image might be the hook we need.
Manpage available at https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/dyld.3.html

@emoon
Contributor

emoon commented Feb 23, 2016

cc @emoon

@emoon
Contributor

emoon commented Mar 5, 2016

Has there been any progress on this issue? It can be worked around by not releasing the lib, but it would be nice to have a proper solution.

emoon added a commit to emoon/ProDBG that referenced this issue Jun 17, 2016
ProDBG crashes on exit but that is due to rust-lang/rust#28794

Closes #177
@noeleont

noeleont commented Sep 1, 2016

I had the same issue, one workaround was loading the lib with RTLD_NODELETE.

@solarretrace

I think I'm having the same issue? But the RTLD_NODELETE option doesn't seem to help. It's also probably not appropriate for my use-case, which requires unloading and reloading the library repeatedly.

Stack looks like this:

Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000000115ab5cb0

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_malloc.dylib 0x00007fff8e8c5059 free + 58
1 dyld 0x00007fff6542cec1 ImageLoaderMachOCompressed::~ImageLoaderMachOCompressed() + 33
2 dyld 0x00007fff6541afc4 dyld::garbageCollectImages() + 831
3 dyld 0x00007fff65422428 dlclose + 134
4 libdyld.dylib 0x00007fff91143808 dlclose + 61
5 brightlab 0x000000010c54cfdb libloading::os::unix::{{impl}}::drop::{{closure}} + 43 (mod.rs:38)
6 brightlab 0x000000010c54cd56 libloading::os::unix::with_dlerror<(),closure> + 134 (mod.rs:38)
7 brightlab 0x000000010c54cf7d _$LT$libloading..os..unix..Library$u20$as$u20$core..ops..Drop$GT$::drop::hd8137c4da21c7d9d + 45 (mod.rs:38)
8 brightlab 0x000000010c4908e1 drop::h9ed5b642e5309eab + 17
9 brightlab 0x000000010c48e409 drop::h2bbca3905b6b58a1 + 9
10 brightlab 0x000000010c490e28 drop::haca7bed875860903 + 72
11 brightlab 0x000000010c48dee1 drop::h1e1054b6f31067c0 + 17
12 brightlab 0x000000010c48f385 drop::h529da6d106e4933f + 149
13 brightlab 0x000000010c4b2313 brightlab::main + 835 (main.rs:37)
14 brightlab 0x000000010c55997b __rust_maybe_catch_panic + 27 (lib.rs:106)
15 brightlab 0x000000010c558ec7 std::rt::lang_start::hefd96b70277e8a4a + 391 (rt.rs:57)
16 brightlab 0x000000010c4b245a main + 42
17 libdyld.dylib 0x00007fff911445c9 start + 1

@nagisa
Member

nagisa commented Dec 22, 2016

The lack of _tlv_atexit in the backtrace seems to suggest that yours is a different issue. This comment might help to give some pointers.

@solarretrace

solarretrace commented Dec 23, 2016

Hmm, I got the impression from that comment that it would be fixed by default... I'll look into it, thanks.

@Mark-Simulacrum
Member

Copying code into here so it doesn't get lost; nagisa/rust_libloading#5 is potentially relevant.

test.rs:

#[no_mangle]
pub extern "system" fn test_fn() -> i32 {
    // Removing this line prevents the segfault.
    // I've tried flushing stdout as well but it doesn't change anything
    println!("In library!");
    123456
}

main.c:

#include <dlfcn.h>
#include <stdio.h>

int main() {
    printf("running\n");
    void* handle = dlopen("./libtest.dylib", RTLD_LAZY);
    printf("opened: %p\n", handle);

    int (*test_fn)() = dlsym(handle, "test_fn");
    printf("test_fn: %d\n", test_fn());

    printf("Closing...\n");
    int code = dlclose(handle); // Removing this line prevents the segfault upon exit?
    printf("Closed: %d.\n", code);
}

$ rustc --crate-type=dylib test.rs
$ gcc main.c
$ ./a.out
running
opened: 0x7f8a93c02640
In library!
test_fn: 123456
Closing...
Closed: 0.
Segmentation fault: 11

@Mark-Simulacrum Mark-Simulacrum added the C-bug Category: This is a bug. label Jul 24, 2017
@alexcrichton alexcrichton added the A-thread-locals Area: Thread local storage (TLS) label Aug 25, 2017
@aidanhs
Member

aidanhs commented Nov 14, 2017

@mitsuhiko
Contributor

I ran into this and looked at ways to work around it. It comes up with Python extension modules, and so far we have just decided to leak the module. To the best of my knowledge, the reason this cannot really be fixed is that _tlv_atexit (which is somewhat of an undocumented API as far as I can tell) does not have a way to unregister the callback.

Since the only callback that Rust can reasonably place here comes from the dylib itself, we can't really register something that does not crash if the dylib goes away. One would need to find a trampoline that can be used and does not unload. I'm unsure what the fix here is. This seems like a bug in macOS, albeit one that has low chances of being fixed.
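The "leak the module" workaround can be sketched roughly as follows. `LibHandle` is a hypothetical stand-in for a real library handle such as `libloading::Library`, whose `Drop` implementation calls `dlclose` (as the backtrace earlier in this thread shows). Here the drop only bumps a counter, so the example runs without any actual dylib, and `UNLOADS` and `unloads_caused` are likewise names invented for this illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times the simulated unload has happened.
static UNLOADS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a library handle; a real handle would call
/// `dlclose` in `drop` instead of incrementing a counter.
struct LibHandle;

impl Drop for LibHandle {
    fn drop(&mut self) {
        UNLOADS.fetch_add(1, Ordering::SeqCst);
    }
}

/// Either drops or leaks a handle and reports how many unloads that caused.
fn unloads_caused(leak: bool) -> usize {
    let before = UNLOADS.load(Ordering::SeqCst);
    let handle = LibHandle;
    if leak {
        // `Drop` never runs, so the library would stay mapped until exit.
        std::mem::forget(handle);
    } else {
        drop(handle);
    }
    UNLOADS.load(Ordering::SeqCst) - before
}

fn main() {
    println!("dropped: {} unload(s)", unloads_caused(false));
    println!("leaked:  {} unload(s)", unloads_caused(true));
}
```

The trade-off is that a leaked library can never be reloaded with fresh code, so this does not help use cases that need to unload and reload repeatedly.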

@BurntPizza
Contributor

BurntPizza commented Feb 3, 2018

I'm getting something similar on Arch: https://github.com/BurntPizza/dylib_tls_crash

$ cargo build && valgrind target/debug/dylib_crash
    Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
==27114== Memcheck, a memory error detector
==27114== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==27114== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==27114== Command: target/debug/dylib_crash
==27114== 
Dropping lib
Lib is dropped
Success: No thread
Dropping lib
Lib is dropped
==27114== Thread 2:
==27114== Jump to the invalid address stated on the next line
==27114==    at 0x6EC23D0: ???
==27114==    by 0x524B1B7: __nptl_deallocate_tsd.part.5 (in /usr/lib/libpthread-2.26.so)
==27114==    by 0x524C1DC: start_thread (in /usr/lib/libpthread-2.26.so)
==27114==    by 0x577042E: clone (in /usr/lib/libc-2.26.so)
==27114==  Address 0x6ec23d0 is not stack'd, malloc'd or (recently) free'd
==27114== 
==27114== Can't extend stack to 0x402a138 during signal delivery for thread 2:
==27114==   no stack segment
==27114== 
==27114== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==27114==  Access not within mapped region at address 0x402A138
==27114==    at 0x6EC23D0: ???
==27114==    by 0x524B1B7: __nptl_deallocate_tsd.part.5 (in /usr/lib/libpthread-2.26.so)
==27114==    by 0x524C1DC: start_thread (in /usr/lib/libpthread-2.26.so)
==27114==    by 0x577042E: clone (in /usr/lib/libc-2.26.so)
==27114==  If you believe this happened as a result of a stack
==27114==  overflow in your program's main thread (unlikely but
==27114==  possible), you can try to increase the size of the
==27114==  main thread stack using the --main-stacksize= flag.
==27114==  The main thread stack size used in this run was 8388608.
==27114== Invalid write of size 8
==27114==    at 0x4A27630: _vgnU_freeres (in /usr/lib/valgrind/vgpreload_core-amd64-linux.so)
==27114==  Address 0x402aff8 is on thread 2's stack
==27114== 
==27114== 
==27114== Process terminating with default action of signal 11 (SIGSEGV)
==27114==  Access not within mapped region at address 0x402AFF8
==27114==    at 0x4A27630: _vgnU_freeres (in /usr/lib/valgrind/vgpreload_core-amd64-linux.so)
==27114==  If you believe this happened as a result of a stack
==27114==  overflow in your program's main thread (unlikely but
==27114==  possible), you can try to increase the size of the
==27114==  main thread stack using the --main-stacksize= flag.
==27114== 
==27114== HEAP SUMMARY:
==27114==     in use at exit: 384 bytes in 4 blocks
==27114==   total heap usage: 33 allocs, 29 frees, 9,316 bytes allocated
==27114== 
==27114== LEAK SUMMARY:
==27114==    definitely lost: 0 bytes in 0 blocks
==27114==    indirectly lost: 0 bytes in 0 blocks
==27114==      possibly lost: 288 bytes in 1 blocks
==27114==    still reachable: 96 bytes in 3 blocks
==27114==         suppressed: 0 bytes in 0 blocks
==27114== Rerun with --leak-check=full to see details of leaked memory
==27114== 
==27114== For counts of detected and suppressed errors, rerun with: -v
==27114== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
[1]    27114 segmentation fault (core dumped)  valgrind target/debug/dylib_crash

The difference here is that the segfault happens when a (non-main) thread exits, if there has been a dylib dropped in that thread, even if the lib doesn't contain anything (source-wise). mem::forget "works", but interestingly so does setting the lib's crate-type to cdylib. What are the differences between dylib and cdylib in regards to TLS destructors, at least with an empty rust library?

Various things can be found by searching for __nptl_deallocate_tsd but this seems above my pay grade. The most promising thing was this comment and the patch after it: https://bugzilla.redhat.com/show_bug.cgi?id=1065695#c12

Info:

$ rustc -V
rustc 1.23.0 (766bd11c8 2018-01-01)

$ uname -srv 
Linux 4.14.11-1-ARCH #1 SMP PREEMPT Wed Jan 3 07:02:42 UTC 2018

Should I make a separate issue?

@ubolonton

This seems to have been fixed by some dyld changes in High Sierra.
Looks like dyld now marks a dylib as "never unload" if it has the MH_HAS_TLV_DESCRIPTORS flag in the header.

On the other hand, does this mean that Rust dylibs will never be unloaded on OS X?
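To check whether a given dylib carries that flag, one could inspect the Mach-O header directly. The sketch below assumes a thin (non-fat) 64-bit little-endian image and uses the `MH_MAGIC_64` and `MH_HAS_TLV_DESCRIPTORS` values from `<mach-o/loader.h>`. The helper names (`read_u32_le`, `has_tlv_descriptors`, `fake_header`) are hypothetical, and the header in `main` is synthetic rather than read from a real library. It only reports the flag; whether dyld then refuses to unload the image is the behaviour described in the comment above, not something this code verifies.

```rust
/// Header flag bit from <mach-o/loader.h>.
const MH_HAS_TLV_DESCRIPTORS: u32 = 0x0080_0000;
/// Magic number of a 64-bit Mach-O file, as read in little-endian order.
const MH_MAGIC_64: u32 = 0xfeed_facf;

/// Reads a little-endian u32 at `offset`, or None if the slice is too short.
fn read_u32_le(bytes: &[u8], offset: usize) -> Option<u32> {
    let raw = bytes.get(offset..offset + 4)?;
    Some(u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]))
}

/// Returns Some(true/false) for a thin 64-bit Mach-O image depending on the
/// TLV flag, or None if the bytes do not look like such an image.
fn has_tlv_descriptors(image: &[u8]) -> Option<bool> {
    if read_u32_le(image, 0)? != MH_MAGIC_64 {
        return None;
    }
    // mach_header_64 layout: magic, cputype, cpusubtype, filetype,
    // ncmds, sizeofcmds, flags (byte offset 24), reserved.
    let flags = read_u32_le(image, 24)?;
    Some(flags & MH_HAS_TLV_DESCRIPTORS != 0)
}

/// Builds a synthetic 32-byte header with the given flags (demo only).
fn fake_header(flags: u32) -> Vec<u8> {
    let mut header = vec![0u8; 32];
    header[0..4].copy_from_slice(&MH_MAGIC_64.to_le_bytes());
    header[24..28].copy_from_slice(&flags.to_le_bytes());
    header
}

fn main() {
    println!("{:?}", has_tlv_descriptors(&fake_header(MH_HAS_TLV_DESCRIPTORS)));
    println!("{:?}", has_tlv_descriptors(&fake_header(0)));
}
```

For a real library, the bytes could come from `std::fs::read` on the dylib path; universal (fat) binaries would need their architecture slices located first.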

@scottjmaddox

Are we 100% sure the issue is with TLS? I've got TLS working without segfault through a dlclose (and reopen) on macOS Sierra (10.12.6), but only if the dylib statically links to libstd. If the dylib is dynamically linked to libstd, then I get a segfault on dlclose regardless of whether or not I'm using TLS.

This is on stable rustc 1.24.0.

This is good enough for my use case (code hot reloading during development), but it would be nice to remove the requirement to statically link libstd into the dylib.

@mitsuhiko
Contributor

@scottjmaddox this particular crash is from what's registered to _tlv_atexit which is exclusively used by thread locals. If you have a different crash that might be interesting as well.

@nanotech

WWDC 2017's Session 413 @ 29:36 mentions that using TLS prevents a dylib from being unloaded:

There are also a number of features on our platforms that prevent dylibs from unloading, and I'd like to go through a few of those because maybe you do them.

You can have Objective-C classes in your dylib. That will make it not unloadable.

You could have Swift classes. That will also make it not unloadable.

And you can have C __thread or C++ thread local variables, all of which make it impossible to unload a dylib.

So on macOS, where there's a number of existing Unix apps, obviously we will keep this working, but because almost every dylib on all of our other platforms does one of these things, effectively it hasn't really worked on any of them ever.

So we are considering making it just a straight up no-op, that will not do anything on any of those platforms. If there's a reason why that's a problem, please, we want to hear about it.

@nagisa
Member

nagisa commented Mar 28, 2018

@alexcrichton seems like the macOS people have fixed this. Should we close?

@alexcrichton
Member Author

Sure!

zicklag added a commit to zicklag/rust-dlopen that referenced this issue May 25, 2019
The bug was fixed, no need for the note now:

rust-lang/rust#28794