Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault when thread using dynamically loaded Rust library exits #91979

Open
devongovett opened this issue Dec 15, 2021 · 8 comments
Open
Labels
A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@devongovett
Copy link

Scenario: I have a Rust cdylib, which is loaded by a C program via dlopen. The C program creates a thread, and loads the Rust module inside it. It proceeds to call one of the Rust functions, and closes the library via dlclose. Then the thread exits. The Rust program has a thread local variable with a struct that implements Drop, which it modifies in the function called from C.

Full reproduction here: https://github.com/devongovett/rust-threadlocal-bug

On CentOS 7, which uses glibc 2.17, it segfaults at __nptl_deallocate_tsd() inside pthread_create.c. With later versions of glibc, there is no crash. I believe the crash occurs because Rust creates a thread local key with pthread_key_create but never calls pthread_key_delete (the call in the destructor is commented out):

impl Drop for Key {
fn drop(&mut self) {
// Right now Windows doesn't support TLS key destruction, but this also
// isn't used anywhere other than tests, so just leak the TLS key.
// unsafe { imp::destroy(self.key) }
}
}

When the thread exits, glibc tries to call the destructor for the key, but because the dynamic library has already been unloaded via dlclose at this point, the function no longer exists and we get a crash.

My theory is that this only occurs with glibc 2.17 and not later versions is due to __cxa_thread_atexit_impl not existing in these older versions. This function is used when available to register destructors, otherwise a fallback implementation is used:

if !__cxa_thread_atexit_impl.is_null() {
type F = unsafe extern "C" fn(
dtor: unsafe extern "C" fn(*mut u8),
arg: *mut u8,
dso_handle: *mut u8,
) -> libc::c_int;
mem::transmute::<*const libc::c_void, F>(__cxa_thread_atexit_impl)(
dtor,
t,
&__dso_handle as *const _ as *mut _,
);
return;
}
However, I'm not sure about that. It could be some other change in glibc.

I have not tested, but I think the bug could potentially be fixed if the commented out destructor linked above were actually called. The comment indicates something about windows not supporting this, so maybe it could be called conditionally?

glibc 2.17 is indeed pretty old, however, it is the version used by the current CentOS 7 version which is not EOL until 2024, so I do think this bug should be fixed.

Meta

rustc --version --verbose:

rustc 1.57.0 (f1edd0429 2021-11-29)
binary: rustc
commit-hash: f1edd0429582dd29cccacaf50fd134b05593bd9c
commit-date: 2021-11-29
host: x86_64-unknown-linux-gnu
release: 1.57.0
LLVM version: 13.0.0
@devongovett devongovett added the C-bug Category: This is a bug. label Dec 15, 2021
@devongovett
Copy link
Author

Upon further research, I'm pretty sure the reason it "works" with newer glibc versions is that the library is actually never fully unloaded if there are TLS destructors registered via __cxa_thread_atexit_impl. In the source code, l_tls_dtor_count is incremented when a destructor is registered: https://github.com/bminor/glibc/blob/91cc803d27bda34919717b496b53cf279e44a922/stdlib/cxa_thread_atexit_impl.c#L137

And in dlclose, this is checked: https://github.com/bminor/glibc/blob/90b37cac8b5a3e1548c29d91e3e0bff1014d2e5c/elf/dl-close.c#L186

This can be verified by running my reproduction program above with the LD_DEBUG=files environment variable. In glibc 2.17, it calls the .fini section immediately on dlclose (just before segfaulting), but in newer versions this is deferred until process exit (essentially a leak?!).

@Brooooooklyn
Copy link

Brooooooklyn/canvas#377 maybe relate to it

@hkratz
Copy link
Contributor

hkratz commented Dec 16, 2021

@rustbot label T-libs A-thread-locals

@rustbot rustbot added A-thread-locals Area: Thread local storage (TLS) T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Dec 16, 2021
@follower
Copy link
Contributor

follower commented Dec 16, 2021

It may be useful/informative to read some of the prior discussion of dynamic libraries & thread local storage in this & linked issues: #28794

FWIW based on my experience the only "reliable" approach has been to simply never allow dylibs to be unloaded: follower/foreigner@3845586

Edit: Especially nagisa/rust_libloading#41 & also https://sourceware.org/glibc/wiki/Destructor%20support%20for%20thread_local%20variables

(I first encountered this issue when using wasmtime from C & C++.)

@Aaron1011
Copy link
Member

See also this upstream glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=21032

@devongovett
Copy link
Author

devongovett commented Dec 16, 2021

A bit more context about how this affects real world code. This currently happens to all native Node.js modules that include thread locals, when loaded within Node's worker threads. Node controls when dlopen and dlclose are called, so there isn't really a great way for module authors to simply prevent unloading the dylib. One way is for users to also load the module from the main thread in addition to workers, which in effect delays unloading until process exit, but this is not very obvious or ergonomic. And this is far from the only case where this could occur.

One potential solution is to register a destructor function in the .fini_array section of the ELF binary, which will be called when the library is unloaded by dlclose. In C this is possible via the __attribute__((destructor)) syntax, which was used to solve this related bug: https://bugzilla.redhat.com/show_bug.cgi?id=1065695. Perhaps Rust could do something similar to ensure the thread local key is dropped when the dylib is unloaded?

@joshtriplett joshtriplett added the E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. label Jan 26, 2022
@joshtriplett joshtriplett removed the E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. label Mar 1, 2022
@thomcc
Copy link
Member

thomcc commented Mar 1, 2022

Hmm, is this possibly caused by #88737 ? (edit: no, but it is related). That bug shouldn't be that hard to fix, if so. (edit: ☹️)

@mcollina
Copy link

Apprently, this was made significantly worse in v1.83.0, which has starting to segfault for me when using native addons written in Rust in Node.js.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

9 participants