Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear safety of std::thread::catch_panic #25662

Closed
tbu- opened this issue May 20, 2015 · 30 comments
Closed

Unclear safety of std::thread::catch_panic #25662

tbu- opened this issue May 20, 2015 · 30 comments

Comments

@tbu-
Copy link
Contributor

tbu- commented May 20, 2015

If I get it correctly, this function cannot be safe as is, due to everyone in a thread implicitely having a reference to TLS.

Created this issue because I didn't want discussion about this to be lost in IRC.

@lilyball
Copy link
Contributor

If RefCell were changed to poison itself on panic, would that fix this issue? Immutable TLS isn't a problem, and offhand I believe that mutable TLS requires either RefCell or some inter-thread data primitive (such as Mutex, which already handles this case).

@steveklabnik
Copy link
Member

/cc @alexcrichton

@Stebalien
Copy link
Contributor

@kballard I'm not entirely convinced that would be safe but I can't prove it unsafe either. In the multithreaded case, data protected by a lock must not violate any invariants any time the lock isn't held because one generally can't make statements about the state of other threads. However, in single threaded environments one can reasonably make scope-based invariants. For example, my horrorshow template macro emulates scoped TLS as follows:

pub fn __with_template_reentrant<F: FnMut(&mut Template)>(mut f: F) {
    // The scoped variant is unstable so we do this ourselves...
    __TEMPLATE.with(|template| {
        let mut local_template = None;
        ::std::mem::swap(&mut *template.borrow_mut(), &mut local_template);
        (f)(local_template.as_mut().unwrap());
        ::std::mem::swap(&mut *template.borrow_mut(), &mut local_template);
    });
}```

If the last swap isn't run, the invariant is violated and my library will likely panic at some future point.


However, this example doesn't violate memory safety and I can't find a way to do so without using `unsafe`.

@lilyball
Copy link
Contributor

@Stebalien You're effectively poisoning the TLS value while that code is being run. If the thread panics and avoids the last swap, a subsequent attempt to access it will panic. And that's perfectly fine (although you might want to consider the idea of explicit poisoning so you can let the user detect this case without panicking). Panicking is not memory unsafety.

In the case of RefCell, a panic will run the destructors on the extant borrows, which will return the RefCell to the unborrowed state, but the value it contains may be violating invariants (in a manner that can lead to memory unsafety). This is the same issue that Mutex had, which is why Mutex poisons itself if the thread panics while a lock is held. RefCell doesn't have poisoning because it can't be shared between multiple threads without some thread synchronization primitive (e.g. Mutex). But std::thread::catch_panic() plus TLS does give rise to the same situation.

@bluss
Copy link
Member

bluss commented May 21, 2015

Can't you use the binaryheap again? It was used to show unsoundness off thread scoped (the first time)

@bluss
Copy link
Member

bluss commented May 24, 2015

Ok, so I put myself to work since it was my suggestion.

Do you remember soundness issue the first time with std::thread::scoped (issue #20807) and why we made it stop catching panics? The following is (possibly?) a reason: It allows violating invariants in a type, and still go back to using that same value.

The following code example segfaults in safe rust. I'm using a mutex and ignoring poison (this is safe rust). I think we have other ways to share memory that would avoid mutex altogether, so I think we can probably ignore mutex in the discussion(?).

Note: This segfault works just as well with thread::spawn!

rustc version: rustc 1.2.0-nightly (0cc99f9 2015-05-17) (built 2015-05-18)

(playpen link)

#![feature(catch_panic, collections, std_misc)]
use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::collections::BTreeSet;
use std::thread;
use std::sync::{Arc, Mutex};

#[derive(PartialEq, Eq, Ord, Debug, Clone)]
struct Panicker<T>(T);

impl<T: PartialEq> PartialOrd for Panicker<T>
{
    fn partial_cmp(&self, _: &Self) -> Option<Ordering>
    {
        panic!()
    }
}

fn main() {
    let heap = Arc::new(Mutex::new(BinaryHeap::new()));
    let heap_ref = heap.clone();
    let guard = thread::catch_panic(move || {
        let mut local_heap = heap.lock().unwrap();
        local_heap.push(Panicker(BTreeSet::<i32>::new()));
        local_heap.push(Panicker(BTreeSet::<i32>::new()));
    });
    println!("Thread result: {:?}", guard);

    let mutex_guard = match heap_ref.lock() {
        Ok(inner) => inner,
        Err(e) => e.into_inner(),
    };

    let vector_from_binary_heap = mutex_guard.clone().into_vec();
    println!("{:?}", vector_from_binary_heap);
}

Q: Why use BTreeSet as the element type?
A: BinaryHeap will mark an element as dropped (zeroing or filling) before starting its sift down algorithm and comparing values. We catch the panic to get the binaryheap in this state. BTreeSet is used for a value that we know is really corrupted / memory unsafe when used in zeroed/filled state.

You have to ask the question: Is it catching panics or BinaryHeap that is broken? If we fix BinaryHeap, what other type invariants can we break for fun & unsoundness?

Edit: as tomaka says below, is it ignoring poisoning that is unsafe?

@bluss
Copy link
Member

bluss commented May 24, 2015

Time to call in the cavalry, cc @nikomatsakis @aturon

@bluss
Copy link
Member

bluss commented May 24, 2015

oh and cc @tomaka

@tomaka
Copy link
Contributor

tomaka commented May 24, 2015

std::sync::PoisonError::into_inner() should definitely be unsafe.

@bluss
Copy link
Member

bluss commented May 24, 2015

I want to be sure we solve the fundamental issues and don't just patch the holes in an unsound bag. I don't know which of the two categories ignoring poisoning belongs to.

@tomaka
Copy link
Contributor

tomaka commented May 24, 2015

The golden rule is that when a panic occurs, any object that is mutably borrowed must no longer be used in the future (except for its destructor) because its internal state is possibly messed up.

Rust cleans up local variables inside the closure passed to catch_panic and spawn, so the only problem comes when accessing variables from the outside:

  • References/mutable references.
  • Rc/Arc. (Gc in the future?)
  • Anything static or static-like such as thread local storage.
  • Raw pointers, but these are unsafe anyway so we don't care.

References are explicitely forbidden because the closure is required to be Send and 'static (by the way I don't see why Send is required for catch_panic, but I may be missing something).

Only Rc, Arc and statics are problematic, but they only allow immutable access to their content. Therefore the problem lies in types that allow interior mutability: Cell, RefCell, Mutex, Atomic.
Cell and Atomic can't panic during their mutations, and RefCell and Mutex use/should use poisoning.

@tomaka
Copy link
Contributor

tomaka commented May 24, 2015

One thing that I neglected in the previous post are destructors. Maybe with a lot of hacky (but safe) code it could possible to trigger a crash by writing a destructor that is being run during a panic (that's just a supposition, I have absolutely no idea how). Destructors are some kind of gray area when it comes to exception-safety.

@tomaka
Copy link
Contributor

tomaka commented May 24, 2015

This problem is really specific to TLS, because RefCells can't be put inside Arcs and that thread::spawn requires a closure that implements Send.

That's probably the reason why catch_panic requires Send as well, but it didn't account for TLS.

@bluss
Copy link
Member

bluss commented May 24, 2015

For completeness, the TLS + RefCell version of the same segfault.

@alexcrichton
Copy link
Member

@tbu- can you clarify what you're looking for in this issue? There's discussion about referencing RefCell after the thread has already panicked, and I just want to make sure we're all on the same page.


@kballard

If RefCell were changed to poison itself on panic, would that fix this issue?

It's currently an explicit design decision that RefCell is not poisoned like mutexes. The rationale behind this being that a panic propagation across threads can not be detected without poisoning, but within a thread it can be detected via other means. This rationale did predate catch_panic, however.


@bluss

Is it catching panics or BinaryHeap that is broken?

Looks like the implementation of BinaryHeap is the one at fault here, it is not panic-safe.


@tomaka

std::sync::PoisonError::into_inner() should definitely be unsafe.

It's a fundamental part of poisoning that this is not unsafe. It is only possible to break memory safety as a part of panic safety with unsafe code, if there is no unsafe code then all code is automatically panic-safe. As a result a poison error is simply an indication that a panic happened, which much of the time means a bug, but sometimes it can be safely ignored.

The golden rule is that when a panic occurs, any object that is mutably borrowed must no longer be used in the future (except for its destructor) because its internal state is possibly messed up.

This is only true for for unsafe code, safe code can only have logical invariants violated.

by the way I don't see why Send is required for catch_panic, but I may be missing something

The precise bounds for this API are a little squishy (hence the instability), but the goal was to prevent unnecessary leakage of accidental vectors of panic safety. It was thought that the bounds can be dropped possibly if good rationale comes up in the future.

That's probably the reason why catch_panic requires Send as well, but it didn't account for TLS.

Yes, TLS did not come up much in the initial discussions of catch_panic

@tomaka
Copy link
Contributor

tomaka commented May 27, 2015

It is only possible to break memory safety as a part of panic safety with unsafe code, if there is no unsafe code then all code is automatically panic-safe.

On one hand several changes have to been made to help unsafe code writers: forbidding transmuting &self to &mut self, removing the Send/Sync implementations from raw pointers, tying the lifetime of a slice produced from a raw pointer to the lifetime of the pointer, etc.

On the other hand, failure to write panic-safe code should be blamed on the writer?
One of the reasons why critical code is written in C and not C++ is because C doesn't have exceptions, and exception-safey is awefully hard to do right.
I know that the idea behind Rust is that any unsafe code should be carefully auditted, but this is almost impossible to do right, and without even looking I'm sure that there are panic-safety-related bugs in most unsafe code of the stdlib.

@bluss
Copy link
Member

bluss commented May 27, 2015

@alexcrichton BinaryHeap hasn't even been updated from zeroing to filling drop.. This is super broken!

If we choose to blame this particular location in libstd

  • That means we have a memory safety bug in 1.0 -- and why wasn't it fixed the first time it was pointed out?
  • You are implying that JoinGuard::join returning an Err is **really** unsafe #20807 was overblown and not an issue
  • We need to keep looking for type invariants that will break memory safety when broken. I'll try to help there.

@alexcrichton
Copy link
Member

@tomaka

On the other hand, failure to write panic-safe code should be blamed on the writer?

Note that like the abstractions of Send and Sync being removed from raw pointers, we strive to have you opt-in to extra features (e.g. explicitly requiring the marker trait annotations) for unsafe code, but we still allow it to happen. Another example is how for poisoning you have to opt-in to ignoring remote panics.

without even looking I'm sure that there are panic-safety-related bugs in most unsafe code of the stdlib.

Whatever the outcome of this bug is, however, this isn't super relevant as they all need to be fixed regardless.


@bluss

That means we have a memory safety bug in 1.0 -- and why wasn't it fixed the first time it was pointed out?

We're all quite busy and things fall under the radar from time to time, it's not like we all sat down at a table and decided that a memory safety bug in BinaryHeap was OK to make it into 1.0.

You are implying that #20807 was overblown and not an issue

I mean only to clarify what the outcome of the issue was. We definitely realize that exception safety is a problem for unsafe code, but this is not an issue that can just be ignored by sweeping it under an unsafe rug, it always needs to be dealt with no matter what.

@bluss
Copy link
Member

bluss commented May 28, 2015

I was apparently not on the same page. I think it sounds good that we tackle exception safety head on, but it seems to be a different kind of direction than what #20807 concluded.

@nikomatsakis
Copy link
Contributor

I pretty much agree with what @alexcrichton said, and I think that was my takeaway from #20807 as well. That is, exception safety is "still a thing" -- but we do our best to minimize its impact on your life, particularly if you are not developing unsafe code. We opted to change the API for #20807 not because we believed it would render exception safety unnecessary, but because the current API was effectively adding an (ineffecient, but otherwise relatively usable) "catch" keyword to the language, which wasn't really the intention. (If we want catch, we should just add catch.) This is also the intention of poisoning mutexes (but letting you bypass it). It is not intended to absolutely prevent you from obtaining bad data, but it does make you think about it at least.

@bluss
Copy link
Member

bluss commented May 28, 2015

Ok. I'll just say, total days during which binaryheap was known to be broken: 1 single day*.

*Known defined as having bug reported

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jun 22, 2015
This function has remained in std for quite some time now without modifications,
and it's a core building block for robust FFI, so this commit stabilizes the
signature as-is.

It is possible to relax the `Send` or `'static` bounds in the future
additionally.

Closes rust-lang#25662
@lhecker
Copy link

lhecker commented May 1, 2016

@alexcrichton We (@zonyitoo and me) noticed another issue with the current behaviour of std::panic::catch_unwind: It assumes that the smallest "scope" in which something can be processed is a thread. Which is why it uses a thread-local variable to count the current number of "active" unwinds, right?
But what if you have a smaller scope like... coroutines? A coroutine can migrate between threads (using e.g. a work stealing scheduler). The PANIC_COUNT however is thread-local, while for coroutines it would be "coroutine-local". For instance catch_unwind(func) might be called in one thread while func completes under another one. At this point the wrong PANIC_COUNT (from the original thread) will be set to a invalid value without any synchronization primitives.
Will this be fixed one day in Rust (since it does work in C++), or will this forever be broken, under the assumption that there are no userland-threads? Because that would be quite sad. 😢

@zonyitoo
Copy link

zonyitoo commented May 1, 2016

There is one easy solution for this: Don't allow panic in Drop::drop!

@lhecker
Copy link

lhecker commented May 1, 2016

Preventing panics in drop() is akin to how C++ works...

The only alternative I see right now is to introduce some additional API which allows us to "register a stack". We could pass a callback to that register method which could be invoked whenever the panic & unwind modules need the panic count. What is at the momen the thread-local PANIC_COUNT would serve as the default handler in std::thread::Builder::spawn().
Thus in coroutine implementations a custom handler could be passed. This handler could query the currently executing coroutine on the current thread and poll it's own panic count.

IMHO this would be the least intrusive and most flexible approach right now. Closing down Rust to only support threads and nothing more seems to me like an unnecessary limitation though.
Any opinions on that? I'd value especially yours, @alexcrichton.

Apart from that I think there are only 2 solutions: One of which would be to adapt the C++ model and the other would be to have "stack-local" instead of "thread-local" data.

@sfackler
Copy link
Member

sfackler commented May 1, 2016

@lhecker The concern is a userland context switch in the middle of unwinding? Things should work if you panic::catch_unwind in the initial thread and then panic::propagate in the new thread when a coroutine is migrating.

@zonyitoo
Copy link

zonyitoo commented May 2, 2016

@sfackler Nope. Not in the middle of unwinding.

As you can see in here, the f is the coroutine's callback, what if we transfer the coroutine to a new thread before it finished? Yes, the s still points to the original thread's TLS!!! And after the coroutine is finished, s.set(prev) will cause data race!

@alexcrichton
Copy link
Member

@lhecker, @zonyitoo

This is unfortunately one of the major drawbacks of coroutines in a systems language right now, which is that they don't support native TLS easily. It sounds like this interaction with PANIC_COUNT is indeed one location where things can go awry, but it's not the only one use of TLS in the standard library. For example this is also used for poisoning on mutexes. Additionally, many external libraries rely on TLS for various bits and pieces of an implementation.

I don't think this problem is necessarily contained to catch_unwind in particular. This would either be the responsibility of a coroutine library to swap out TLS tables as well as the stack, or the standard library would provide an API to save/restore all thread-local context whenever a switch happened.

Also note that TLS has bad interactions with optimizations in LLVM right now, specifically the usage of TLS across a function call that can change threads may cause segfaults. (some info about this is in 12c5fc5)

@zonyitoo
Copy link

zonyitoo commented May 2, 2016

@alexcrichton Hmm, unfortunately, switching TLS tables could not solve this problem, because after switching, the current executing context still holds pointers to values in the original TLS (for example, the s I mentioned before in here).

It is not easy to make coroutine compatible with thread_local!. But .. is it possible to override the behavior of thread_local!? Which is a solution for "swap out TLS tables as well as the stack".

@lhecker
Copy link

lhecker commented May 2, 2016

@alexcrichton The only thing with Rust where TLS is used in a way which does not work with coroutines is std::panic (even if we wanted to). Reason being that it holds a reference to a TLS store while calling a callback function, which could move from thread to thread. While this might be reasonable with threads it's a major error with coroutines.

For instance the poison flag you mentioned? That's not inherently a problem! The poison flag only calls thread::panicking(), which is in fact alright, because only one coroutine can run on a single thread at a time and if a panic happens while it's currently already panicking it's obviously an error.

But what happens if you catch that panic? The PANIC_COUNT of the current thread won't be reset to 0, but instead the PANIC_COUNT of the thread where catch_unwind was called first will be reset to 0.

AFAIK this is the only major problem with TLS & Coroutines in Rust right now. And as I said earlier this could either be solved by:

  • Disallowing catch_unwind() while a panic is currently active. That way the PANIC_COUNT will always be 0 before calling the callback. You could then simply set the PANIC_COUNT to 0 after intrinsic::try() catched an unwind.
  • Adding a API similiar to panic::set_hook() which allows providing a custom PANIC_COUNT

Edit: You're right though about it being dangerous... For instance a ThreadRng should not be used with coroutines. But it does not change the fact that it and all those external libraries have some alternatives at least while std::panic has none. 😢

@alexcrichton
Copy link
Member

Thinking more on this, the bug pointed out in 12c5fc5 essentially means that any use of TLS is subject to segfaults with coroutines currently. That's unfortunately not the only problem that needs to be solved (e.g. also this business with PANIC_COUNT), but no local solution with PANIC_COUNT will solve the entire problem, however.

The gist of that bug is that this code can segfault:

fn foo() -> bool {
    let a = thread::panicking();
    bar();
    a && thread::panicking()
}

As a side note, I just realized that this issue is basically now long since closed. We've discussed the whole std::panic story quite thoroughly now! I'm gonna close this, and we may want to continue discussing coroutines/TLS elsewhere (as this may not be the most appropriate place)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests