
Fix exception swallow in fixture due to parent cancel scope #77

Conversation

touilleMan
Member

Following #75, I've finally isolated the original issue:

import pytest
import trio
from contextlib import asynccontextmanager


async def die_soon(*, task_status=trio.TASK_STATUS_IGNORED):
    task_status.started()
    raise RuntimeError('Ooops !')

@asynccontextmanager
async def async_finalizer():
    try:
        yield
    finally:
        await trio.sleep(0)

@pytest.fixture
async def fixture(nursery):
    async with trio.open_nursery() as nursery1:
        async with async_finalizer():
            async with trio.open_nursery() as nursery2:
                await nursery2.start(die_soon)
                yield
                # Comment next line to make the test pass :'(
                nursery1.cancel_scope.cancel()

@pytest.mark.trio
async def test_try(fixture):
    await trio.sleep_forever()

  1. die_soon causes the fixture to crash
  2. Because of how pytest-trio works, the fact that die_soon has crashed doesn't make the yield return with an exception
  3. Hence the code after the yield runs and nursery1 is cancelled
  4. Any async teardown operation in async_finalizer ends up with a Cancelled exception
  5. This Cancelled exception reaches nursery1, which concludes that no other exception has occurred, so the original RuntimeError is swallowed

The sad part is that if we convert the fixture into a regular async context manager, everything works fine.

I'm not really sure how we could fix that (well, I'm not even sure about the steps I've listed 😄). @njsmith, any idea?

@touilleMan touilleMan force-pushed the yield-fixture-crash-swallowed-with-parent-cancel branch from beff403 to 4f24fb2 on January 12, 2019
@touilleMan touilleMan changed the title Add test_async_yield_fixture_crashed_then_parent_cancel Showcase exception swallow in fixture due to parent cancel scope Jan 12, 2019
This test showcases a bug where an exception occurring in a yield fixture
is silently swallowed when a parent cancel_scope gets cancelled during
teardown.
@touilleMan touilleMan force-pushed the yield-fixture-crash-swallowed-with-parent-cancel branch from d9b4eb5 to ab985c7 on May 30, 2019
@codecov

codecov bot commented May 30, 2019

Codecov Report

Merging #77 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master      #77      +/-   ##
==========================================
+ Coverage   99.78%   99.79%   +<.01%     
==========================================
  Files          19       19              
  Lines         476      483       +7     
  Branches       41       42       +1     
==========================================
+ Hits          475      482       +7     
  Misses          1        1
Impacted Files                                     Coverage Δ
pytest_trio/_tests/test_async_yield_fixture.py     100% <100%> (ø) ⬆️
pytest_trio/plugin.py                              99.48% <100%> (ø) ⬆️

@touilleMan
Member Author

@njsmith Would you have a look at this fix?
The bug is really nasty, and I fear this only covers part of the trouble (or may lead to incorrect exception reporting in some cases...)

@touilleMan touilleMan changed the title Showcase exception swallow in fixture due to parent cancel scope Fix exception swallow in fixture due to parent cancel scope May 30, 2019
@oremanj
Member

oremanj commented May 30, 2019

I think this is the same issue as python-trio/trio#455. An exception is propagating in a cancelled context; some aexit or finally executes a checkpoint; you wind up with a Cancelled propagating with the original exception as __context__; and when Trio swallows the Cancelled it can't tell whether the exception in __context__ was in the middle of being handled or what.
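
To make that concrete, here's a minimal standalone sketch of the same mechanism (just an illustration of the pattern, not code taken from trio#455):

import trio

async def main():
    with trio.CancelScope() as scope:
        scope.cancel()
        try:
            raise RuntimeError("the error we actually care about")
        finally:
            # Checkpoint inside an already-cancelled scope: raises Cancelled,
            # with the RuntimeError attached as its __context__.
            await trio.sleep(0)
    # The cancel scope absorbs its own Cancelled, and the RuntimeError in
    # __context__ is lost with it.
    print("no exception escaped")

trio.run(main)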

I'm not super comfortable with the proposed fix -- it feels special-cased against the particular example you've used to demonstrate the problem, rather than fixing the underlying problem.

The sad part is that if we convert the fixture into a regular async context manager, everything works fine.

I have some doubts about that. Note that the @pytest.fixture protocol is different from the @asynccontextmanager protocol, because the former doesn't throw in exceptions. That means that for an apples-to-apples comparison, the @asynccontextmanager version would have to say

try:
    yield
finally:
    nursery1.cancel_scope.cancel()

at which point I believe the same error will recur, demonstrating that this isn't really a pytest-trio problem.
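
For reference, a sketch of what that apples-to-apples version could look like, reusing die_soon and async_finalizer from the snippet at the top of this PR (the name fixture_as_context_manager is just for illustration):

from contextlib import asynccontextmanager

import trio

@asynccontextmanager
async def fixture_as_context_manager():
    async with trio.open_nursery() as nursery1:
        async with async_finalizer():
            async with trio.open_nursery() as nursery2:
                await nursery2.start(die_soon)
                try:
                    yield
                finally:
                    # Mirror pytest's "teardown always runs" behaviour: the
                    # cancel happens even if the body raised.
                    nursery1.cancel_scope.cancel()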

@touilleMan
Member Author

Thanks for the feedback @oremanj

I think this is the same issue as python-trio/trio#455.

I didn't know about this issue, but it seems pretty close indeed. My guess is that the weird yield in pytest-trio makes the issue more likely to occur there, which is why I've only encountered this kind of issue in pytest-trio so far.

I'm not super comfortable with the proposed fix -- it feels special-cased against the particular example you've used to demonstrate the problem, rather than fixing the underlying problem.

I totally agree; the point is that currently we have tests that swallow errors, which is the worst kind of behavior (it's a pain to detect the bug, and an even bigger pain to find out the cause).

So I would say a hacky fix is still better than the current situation...

@njsmith
Member

njsmith commented Jun 7, 2019

This is a tough issue. @touilleMan can you share the real example where you ran into this?

The problem with the minimal example is that it's artificial, so it isn't obvious what it should do. Arguably it's already working correctly – you explicitly requested that the code inside the nursery1 block be cancelled, that code immediately stopped what it was doing (in this case, handling an exception), that's what you asked for, what's the problem? And if you change the example to something like this, then the proposed fix seems very weird (it means we exit with RuntimeError even though we caught it):

async def myfixture():
    try:
        async with open_nursery() as nursery:
            await nursery.start(die_soon)
            yield
    except RuntimeError:
        pass

So we probably need to do something, but the minimal example doesn't make it obvious what change to make.

@touilleMan
Member Author

@njsmith We encountered this issue multiple times with the unit tests of our project Parsec.
To put it shortly, Parsec is a client/server application for sharing files with client-side encryption. This means a typical use case in our tests is:

  • init a server with the running_backend fixture
  • init a client with the alice_core fixture
  • in the actual test make the client and the server communicate

One important thing about the client (called core in our codebase) is that it spawns multiple coroutines to run background tasks called monitors (typically listening to the backend in case another client modifies a file we are working on, or determining when we should upload to the server the files that have been modified locally).
To do that, the core is created by a context manager which ends with a scope cancellation to stop the monitor coroutines.

Now going back to our test: if a monitor is buggy and its coroutine blows up, the alice_core fixture doesn't get the exception and does its regular teardown, cancelling the core's monitors nursery.
This makes a trivial bug pretty complex to spot, because we have to comment out the cancellation in the fixture (and there can be multiple suspect cancellations in different fixtures!) in order to reveal the exception.

I've created a branch in the project that demonstrates this nicely:

$ pip install -e .[all]
[...]
$ py.test tests/test_showcase_pytest_trio_issue_75.py -s
[...]
tests/test_showcase_pytest_trio_issue_75.py Test started...
Crash in messages_monitor :(
.
[...]
======= 1 passed in 0.25 seconds =======
$ py.test tests/test_showcase_pytest_trio_issue_75.py -s --detect-pytest-trio-issue-75
tests/test_showcase_pytest_trio_issue_75.py Test started...
Crash in messages_monitor :(
F
[...]
======= 1 failed in 0.33 seconds =======

The more I think about this issue, the more I wonder about this "a fixture never raises exceptions" policy. It is useful to make trio fixtures work like regular pytest fixtures, but I don't see a good reason for pytest to do this in the first place (I guess the reason is that teardown should happen no matter the outcome of the test; however, this is what try/finally is for, and Python is all about consenting adults...).
So wouldn't it be better to just have trio fixtures raise the exception?

@njsmith
Member

njsmith commented Jun 7, 2019

Hmm, OK, I think I get part of it though. What's anomalous about myfixture is that it does crash the test – the yield never raises Cancelled, but it is a cancel point in the critical sense that if it gets cancelled, then the whole test immediately gets cancelled. So even if we catch the exception in the fixture, it's already had side-effects outside the fixture. I think I finally get what you and @vxgmichel were saying in #75.

Ick.

So first, yeah, if crash has been called, then the test definitely needs to fail one way or another. That seems clear.

The simplest solution would be that if a fixture calls crash(None) but never follows up with an actual exception, then we raise a new exception like RuntimeError(f"fixture {fixture_name} crashed"). (This is sort of what @vxgmichel is doing here.) That would get us to a point where at least your tests reliably fail. And of course we'd have some docs to explain that this error happens when a fixture yield gets cancelled, usually because of a background task crashing, but then the fixture doesn't raise an exception, usually because it somehow caught the background task's exception.
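
Roughly, the bookkeeping could look like this (a sketch only; the names below are invented and don't match pytest_trio/plugin.py):

class FixtureCrashRecord:
    # Tracks whether a fixture's yield got cancelled and what, if anything,
    # was reported for it afterwards.

    def __init__(self, fixture_name):
        self.fixture_name = fixture_name
        self.cancelled = False
        self.exception = None

    def crash(self, exc):
        if exc is None:
            # The fixture's yield was cancelled, but no exception has been
            # reported (yet).
            self.cancelled = True
        else:
            self.exception = exc

    def outcome_exception(self):
        if self.exception is not None:
            return self.exception
        if self.cancelled:
            # crash(None) was never followed by a real exception: fail the
            # test loudly instead of letting the crash go unreported.
            return RuntimeError(
                "fixture {!r} crashed, but the original exception was lost "
                "(probably swallowed by a cancel scope during teardown)"
                .format(self.fixture_name)
            )
        return None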

If we want to do better, it would be by giving more information to help debug the problem. For something like the myfixture example above, I guess it will be pretty rare to get into this situation and won't be hard for users to figure out what they did wrong, so maybe that's good enough...?

There are also cases where you probably can't do any better. For example:

async def silly_fixture():
    async def cancel_me(cancel_scope):
        cancel_scope.cancel()

    async with trio.open_nursery() as nursery:
        nursery.start_soon(cancel_me, nursery.cancel_scope)
        yield

In this case, the fixture crashes the test, but there is no exception to report.

I guess this is an example of a case where it would be useful if Cancelled exceptions carried some hint about what triggered them – @smurfix keeps asking for them to capture the stack where cancel was called, which is expensive but might even be OK in a test setting. Or even just "someone called cancel()" could be a useful hint here, versus "there was an unhandled exception propagating".

But myfixture and silly_fixture are both super super artificial situations. In the actual situations you're hitting it's a lot more confusing, so probably some kind of additional hints would be helpful, but to figure out what those are, we need a better understanding of how exactly people get into this situation.

For fixtures using the nursery fixture, I guess this isn't an issue? I think it reliably passes any background exceptions to crash. And AFAICT the only case where it could be cancelled from outside is if one of our TrioFixture.run raised an exception, in which case you'd see that exception. So we're looking at fixtures that open their own nurseries (perhaps hidden inside some async context manager that came from a library).

In any case where this problem happens, the first step is for the fixture yield to be cancelled. The only ways this can happen are if some code inside the fixture explicitly calls .cancel() (like in silly_fixture), or else there's an unhandled exception that hits a nursery. If there's an unhandled exception, either it gets caught explicitly (like in myfixture), or else it gets accidentally swallowed like in @touilleMan's die_soon fixture, which requires an exception handler be cancelled. So, this reduces to the previous case: for an exception to be accidentally swallowed, you either need an explicit call to .cancel(), or else you need a second unhandled exception.

The case where one unhandled exception wipes out another one is important, but it does still result in the test reporting an exception, and in the multierror v2 design we have an idea for how to at least report the lost exception as a __context__ on the surviving exception.

So... conclusion: in order to get into the really confusing case, I think you need to either explicitly call .cancel() on a scope that's still running, or else explicitly catch an exception like in the myfixture example.

And, it's not just any cancel() that can cause problems: if the cancel() is the very last thing inside the cancel scope/nursery, then it's safe. (So for example, the nursery fixture is fine, and I think all the nurseries used implicitly inside trio-websocket are fine.)
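
To make the placement rule concrete, here's a small made-up illustration (background_task is a stand-in for something like Parsec's monitors; run either function with trio.run()):

import trio

async def background_task():
    # Stand-in for a monitor coroutine that might crash at any moment.
    await trio.sleep_forever()

async def risky():
    async with trio.open_nursery() as root:
        async with trio.open_nursery() as inner:
            inner.start_soon(background_task)
            await trio.sleep(0)  # "body" of the fixture / context manager
            # Cancels `root` while we're still inside `inner`: if a background
            # task crashed, the exception unwinding out of `inner` can be
            # replaced by Cancelled and silently lost.
            root.cancel_scope.cancel()

async def safe():
    async with trio.open_nursery() as root:
        async with trio.open_nursery() as inner:
            inner.start_soon(background_task)
            await trio.sleep(0)  # "body"
            inner.cancel_scope.cancel()  # last statement inside `inner`
        root.cancel_scope.cancel()  # last statement inside `root`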

So I guess the general advice is:

  • Either you're explicitly catching and throwing away some exception; don't do that
  • Or you're cancelling a nursery while it's still running; maybe you should move the cancel to the end of the nursery instead.

Does that sound right?

@njsmith
Member

njsmith commented Jun 7, 2019

@njsmith

Or you're cancelling a nursery while it's still running; maybe you should move the cancel to the end of the nursery instead.

@touilleMan

To do that, the core is created by a context manager which ends with a scope cancellation to stop the monitor coroutines.

Well, score one for logic I guess :-). This context manager indeed does an explicit cancel, but not at the end of the nursery:

https://github.com/Scille/parsec-cloud/blob/07ebd60ec4de94f6073d5640ea35974a50524394/parsec/core/logged_core.py#L102

So I predict that if you remove 4 levels of indentation from that last line, then it will fix your problem. (I guess you might also need to add a monitor.nursery.cancel_scope.cancel() at the appropriate level too?)

@touilleMan
Member Author

@njsmith I haven't read your answer yet, but seeing such a big answer 3 minutes after my post makes me feel like you are Batman 🦇 😄

@touilleMan
Member Author

So I predict that if you remove 4 levels of indentation from that last line, then it will fix your problem. (I guess you might also need to add a monitor.nursery.cancel_scope.cancel() at the appropriate level too?)

I've just tested, and indeed your prediction was true ;-)

On top of that, your explanation really helped me understand the logic of this swallowed-by-cancelled exception.
Now it seems totally logical to me to put the cancel() at the very end of the nursery context manager. However, this is still incredibly misleading (one indentation level seems so harmless...), and I can't see how to make it less likely to shoot yourself in the foot.

@njsmith
Member

njsmith commented Jun 7, 2019

@touilleMan I wish! But it took me like an hour to write, I just started sooner :-)

@njsmith
Member

njsmith commented Jun 8, 2019

OK, so we definitely understand the problem better than we did before, that's good. And it seems like we can at least say it would be better if pytest-trio detected when a fixture called crash(None) and then never followed up with a crash(actual_exception). Maybe make it crash(unique_fixture_id, <exc or None>), so it can track this? That would at least make the problem clearer when it happens, and for the blatantly silly cases (like silly_fixture) I don't think there's anything better we can hope to do.

But you're also right that this is a super subtle footgun.

I've been trying to figure out if this can happen when your code is run outside pytest-trio. I think the answer is that there is one rare situation where if you get unlucky enough, you can lose background exceptions. Consider what happens if a background task crashes just as the body of the async context manager finishes and your yield returns. Something like:

async with logged_core_factory(...):
    await checkpoint()
  1. The checkpoint suspends the task, checks for cancellation, then schedules it to run again later.
  2. Then the scheduler switches to your monitor_backend_connection task, which exits with an exception. (I assume that something like that is happening in your actual test cases, just not at that exact moment?) This marks the monitor_nursery as cancelled, and we make a note to inject a Cancelled exception at the next chance we get.
  3. But before that, we run the main task again, and it returns from the checkpoint (which already checked for cancellation, it's done with that), then exits the async with logged_core_factory block
  4. So the main task goes back into the body of logged_core_factory and executes root_nursery.cancel_scope.cancel().

And now your monitor_backend_connection exception is going to be lost.

Now... in your application, I don't know if this is a problem or not. Once the body of the async with logged_core_factory exits, I guess you want to cancel those background tasks like monitor_backend_connection, and if that cancellation kills them while they were in the process of raising an exception, or just about to raise an exception, then maybe that's fine, you don't care about that anymore because you're exiting.

What makes the pytest-trio case so weird is that the background task exception causes the body of the async with logged_core_factory to exit early by cancelling the inner nursery, but pytest-trio suppresses the exception, so the outer nursery still gets cancelled, which hides the original exception.

.....I really don't know what to do about this! You're totally right about this part:

The more I think about this issue, the more I wonder about this "a fixture never raises exceptions" policy. It is useful to make trio fixtures work like regular pytest fixtures, but I don't see a good reason for pytest to do this in the first place (I guess the reason is that teardown should happen no matter the outcome of the test; however, this is what try/finally is for, and Python is all about consenting adults...).
So wouldn't it be better to just have trio fixtures raise the exception?

If pytest fixtures had followed the usual @contextmanager convention, then this wouldn't be an issue. But... I don't think there's anything we can do about it now. We can't just unilaterally declare that whenever pytest-trio is loaded, all pytest fixtures have different semantics, because people have existing fixtures that put teardown logic outside of finally blocks, and they'll break if the yield can start raising. We could declare that trio fixtures can raise Cancelled, while regular fixtures can't. But then people have to remember that some fixtures have one semantics, and some have the other semantics, and that seems confusing too. That's why I kept the pytest fixture semantics in the first place. But I hadn't realized then how easy it would be for people to write fixtures like

@fixture
async def my_fixture():
    async with my_existing_context_manager:
        yield

and now we're basically taking the weird pytest fixture semantics and forcing them onto my_existing_context_manager, which its authors might not be expecting at all.

Maybe it wouldn't be so bad if literally the only exception that can be raised from a fixture yield is Cancelled, and only when the fixture has a nursery and a background task and the background task crashed?

Can you look at your fixtures and see if there are any that rely on running code after yield, even if a background task crashes?
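
For concreteness, the kind of fixture that question is asking about would look something like the first one below; the second is the try/finally version that keeps working either way (open_resource is just a made-up setup helper):

import pytest
import trio

async def open_resource():
    # Trivial stand-in for real setup; a memory channel just gives us
    # something with an aclose() method.
    send_channel, _receive_channel = trio.open_memory_channel(0)
    return send_channel

@pytest.fixture
async def fragile_fixture():
    resource = await open_resource()
    yield resource
    # Runs only if the yield returns normally; if the yield could raise
    # Cancelled, this teardown would be skipped.
    await resource.aclose()

@pytest.fixture
async def robust_fixture():
    resource = await open_resource()
    try:
        yield resource
    finally:
        # Runs whether or not the yield raises.
        await resource.aclose()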

Other than that... I think the only option is to document it. You do seem to need a pretty specific combination of factors to trigger this, so I guess if the error message pointed people to a detailed writeup of what to look for, then most people would be able to find the cause pretty quickly?

@njsmith
Member

njsmith commented Jun 23, 2019

I think this can be closed now that #83 is merged... please re-open if I'm wrong
