Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is this „nasty“ code around GIL sound? #1019

Closed
vorner opened this issue Jul 4, 2020 · 9 comments · Fixed by #1036
Closed

Is this „nasty“ code around GIL sound? #1019

vorner opened this issue Jul 4, 2020 · 9 comments · Fixed by #1036

Comments

@vorner
Copy link
Contributor

vorner commented Jul 4, 2020

Hello

When reading through the docs, I sometimes have the habit of clicking on the [src] button and looking under the hood. And I've come up with some code I'm not sure is not UB (besides me not using unsafe). It might be fine, but I think it makes sense to share it and ask why it is fine.

use pyo3::prelude::*;
use pyo3::wrap_pyfunction;

#[pyfunction]
fn messed_gil_cnt(py: Python<'_>) {
    eprintln!("Start");
    let guard = py.allow_threads(|| {
        eprintln!("Released");
        let guard = GILGuard::acquire();
        eprintln!("Re-acquired");
        guard
    });
    eprintln!("Returned");
    drop(guard);
}

#[pyfunction]
fn cross_release() {
    let g1 = GILGuard::acquire();
    let g2 = GILGuard::acquire();

    drop(g1);
    drop(g2);
}

#[pymodule]
fn broken(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_wrapped(wrap_pyfunction!(messed_gil_cnt))?;
    m.add_wrapped(wrap_pyfunction!(cross_release))?;

    Ok(())
}

The thing I worry about the cross_release is, the documentation for PyGILState_Ensure seems to hint (not very explicitly) that the corresponding PyGILState_Release should be called in reverse order:

After this call, Python’s state will be the same as it was prior to the corresponding PyGILState_Ensure() call (but generally this state will be unknown to the caller, hence the use of the GILState API).

If taken literally, the interpreter should be left in the state after acquiring g1, because it's the state it should return to after releasing g2. I don't know how to „prove“ anything wrong is happening, though, I haven't seen anything weird manifest.

The messed_gil_cnt is similar, but I've noticed one further problem. When I'm looking at the source code here: https://docs.rs/pyo3/0.11.1/src/pyo3/python.rs.html#129-150, I think this is happening:

  • Starts with GIL_COUNT set to some unknown n.
  • GIL_COUNT is set to 0.
  • A new GILGuard is created, increasing GIL_COUNT to 1. Furthermore, as it is 0, it creates the GILPool (because the gil_is_acquired works based on the GIL_COUNT, https://docs.rs/pyo3/0.11.1/src/pyo3/gil.rs.html#167), but at least the comment above says there should not be two pools on the same thread.
  • When leaving the scope, the GIL_COUNT is reset back to n, losing the one from the created GILGuard
  • But the `GILGuard is „exported“ out through the return value, so that number is wrong.

I don't know if it is related, but the code deadlocks, it never prints Returned.

@kngwyu
Copy link
Member

kngwyu commented Jul 4, 2020

Interesting, thank you for reporting.

The thing I worry about the cross_release is, the documentation for PyGILState_Ensure seems to hint (not very explicitly) that the corresponding PyGILState_Release should be called in reverse order:

I understand the problem but now I have no idea to fix this 🤔

The messed_gil_cnt is similar, but I've noticed one further problem

I think we can handle the situation better by using PyThreadState.gilstate_counter instead of our own GIL_COUNT, though maybe I need more investigation.

@davidhewitt
Copy link
Member

I think we can handle the situation better by using PyThreadState.gilstate_counter instead of our own GIL_COUNT, though maybe I need more investigation.

If we're expecting the upcoming Python limited API to make structs opaque, it may not be a good idea to rely on the internals of PyThreadState to solve our problem. I think I see a simple fix for this that I'll push in a moment.

The thing I worry about the cross_release is, the documentation for PyGILState_Ensure seems to hint (not very explicitly) that the corresponding PyGILState_Release should be called in reverse order:

I think I can put together a test case - will report back.

@davidhewitt
Copy link
Member

davidhewitt commented Jul 4, 2020

I think I see a simple fix for this that I'll push in a moment.

Ok, after my simple fix failed I think that we cannot safely allow return of a GILGuard from allow_threads. For a couple of reasons:

  • PyEval_RestoreThread (which we call in allow_threads after running the closure) will deadlock if the GIL is already acquired. This is actually the deadlock we see in messed_gil_cnt.
  • The GILGuard evaluated inside allow_threads will contain a GILPool. If we allow it to exist outside of allow_threads then we risk opening up exactly the kind of two GILPool / dangling object problems we disabled with Close soundness hole with acquire_gil #893

I see three possible solutions. Which do you guys think is best?

  1. We leave as-is, but add documentation saying that returning the GIL from the allow_threads closure will deadlock.
  2. We change to a runtime panic! if GIL_COUNT is not zero after the allow_threads closure.
  3. We change to a compile-time error by restricting T: Send for the allow_threads return value.

As (3) is overly restrictive in my opinion, I think (2) is probably best.

I've got to dash now, so I will put together the test for cross_release later.

@vorner
Copy link
Contributor Author

vorner commented Jul 4, 2020

I don't know if the PyEval_RestoreThread is guaranteed to deadlock. If it ever ceases to deadlock in eg. python 3.10, the 1. would lead to UB. So from these, I think 2. would be the safer option.

I was also thinking if changing the GILGuard API to be more like crossbeam's scoped threads would work, something like:

GILGuard::with(|guard: &GILGuard| {
    ...
})

(And one would newer be able to get hold of the GILGuard ownership directly in safe code)

That would make sure one can't do ugly things like pass it around through other GIL guards, reorder their drops, store it for later in a structure or mem::forget them (which maybe could also do weird things?). I don't know if that would prohibit some legitimate usecase and a clear downside is this is a change that a lot of code would have to adapt to.

@kngwyu
Copy link
Member

kngwyu commented Jul 4, 2020

I was also thinking if changing the GILGuard API to be more like crossbeam's scoped threads would work, something like:

That's really appealing to me, though it can require lots of code changes in user codes... 🤔

@vorner
Copy link
Contributor Author

vorner commented Jul 4, 2020

That's really appealing to me, though it can require lots of code changes in user codes...

Would some kind of deprecation notice/transition period help smooth that out? Eg:

impl GILGuard {
    #[doc(hidden)]
    #[deprecated(message = "Use GILGuard::with instead, this one is broken and will be removed in next major version")]
    pub fn acquire() -> Self {
        ...
    }
    pub fn with<T, F: FnOnce(&GILGuard) -> T>(f: F) -> T {
       ...
    }
}

@davidhewitt
Copy link
Member

I really love this proposal! Though I think we might be able to go further and make GILGuard an implementation detail. The API can just be:

Python::with(|py: Python| {
    ...
})

Deprecation notice on Python::acquire_gil() / GILGuard::acquire() would make sense. 0.12 we could deprecate, 0.13 we could remove (if no users create significant objection).

@konstin
Copy link
Member

konstin commented Jul 4, 2020

Would it be possible to solve the allow_threads by checking GIL_COUNT or PyGILState_GetThisThreadState before returning from allow_threads, and solve cross_release by letting GILGuard have a usize field that is initialized with the GIL_COUNT on creation and check on drop? I'd still like to have something like with_gil(|py: Python| { ... }), but it would be nice if GILGuard::acquire could be fixed with some small changes.

@davidhewitt
Copy link
Member

I wrote this test for cross_release, and can confirm that with Python 3.8 I can observe that dropping g1 releases the GIL. The assertion fails, and worse, dropping g2 causes Python to abort the process.

    #[test]
    fn test_gilguard_drop_order() {
        let g1 = Python::acquire_gil();
        let g2 = Python::acquire_gil();

        let safe = Arc::new(Mutex::new(false));

        std::thread::spawn({
            let safe = safe.clone();
            move || {
                let _ = Python::acquire_gil();
                assert!(*safe.lock().unwrap());
            }
        });

        drop(g1);
        std::thread::sleep(std::time::Duration::from_secs(2));

        *safe.lock().unwrap() = true;
        drop(g2);
    }

it would be nice if GILGuard::acquire could be fixed with some small changes.

Yeah you're probably right, we could avoid breaking user code. All documentation could change to use Python::with or whatever we call this new API.

We can fix allow_threads by adding the panic as discussed.

For cross_release, I see two choices:

  1. We store the gil states in a thread-local stack, and whichever order the guards are dropped, we always pop the last value from the stack (runtime cost).
  2. We store the GIL_COUNT and panic on drop if it's out of order.

I'd be tempted to go for (2) here. It will cause occasional surprises, but I'm not 100% confident (1) will not have other edge cases and it probably is higher overhead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants