AsyncIO's wait_for can hide cancellation in a rare race condition #86296
Hi, during migration to Python 3.8.6 we encountered a behavior change from previous versions: wait_for ignored the cancellation request and returned normally instead. After investigation, it appears to be related to the change in bpo-32751, and it reproduces only if the awaited task has already finished when wait_for is cancelled (the code mistakes an external CancelledError for a timeout). The following example reproduces the behavior on both 3.8.6 and 3.9.0 for me:
Changing the wait time before cancellation slightly restores the correct behavior, and CancelledError is raised.
Hi Chris, |
I also found this deficiency of asyncio.wait_for while debugging an obscure hang in an application. Back then I quickly opened an issue about it: https://bugs.python.org/issue43389 I've just closed it as a duplicate, since this issue covers the same bug and has been around longer. I'm surprised this issue has not received more attention. This definitely needs a fix. Ignoring a CancelledError is heavily discouraged by the documentation in general. Silently swallowing a cancellation makes this construct unreliable, with no good workaround other than not using it. For a specific application this was so broken that I had to come up with a short-term fix. I made an asyncio.wait_for variant available as a library that fixes the problem: https://github.com/Traktormaster/wait-for2 The repository has a detailed description of the issue and of how the library fixes it. It also has test cases that assert the behaviour of both the builtin wait_for and the fixed wait_for from the library.
Echoing nmatravolgyi's comments - I got here after hitting this bug and I too am amazed it's been around so long and hasn't been addressed. It makes cancellation in my application very unreliable, and the reason I need well-controlled cancellation semantics is to allow emergency stop of physical movement. |
It has been addressed; the PR should be merged this week: #28149. Like most open source projects, it just needed someone to actually propose a fix.
Can this be fixed by implementing wait_for with:

    from . import timeouts

    async def wait_for(aw, timeout):
        async with timeouts.timeout(timeout):
            return await aw
I have just recently found out that one of my libraries had always been affected by this bug. Here's a script that reliably reproduces the bug on Python 3.9 - 3.11, without relying on timing coincidence:
Actual output:
Using |
Wow, this has certainly been reported enough times -- a fix was even reported but apparently it's still not fixed. My head dazzles trying to understand the code of |
One of the new tests may fail intermittently due to python#86296.
@twisteroidambassador If you want a fix, can you first try @graingert's suggestion from #86296 (comment)? If that works, it'd be great! |
Yes, as far as I can see replacing |
On Python < 3.11 I think using |
I think this is the crux of the issue, and an important decision to make for the behavior of When
The current code's behavior is 1: See Lines 470 - 472 (EDIT: and I believe it's intentional, since nothing could be cancelling Lines 466 to 479 in ed827d5
while my PR #98431 does 2. |
A much smaller-scale fix can also achieve behavior 2 above: #98432 |
I've been lying awake thinking this through. In regards to your question about what should happen when
My reason is derived from what @graingert convinced me was the correct outcome in some other cancellation edge case (I think it was for either But wait, there's more! In 3.11 there's another wrinkle, The cancel-uncancel dance is intended to make it possible to reliably distinguish between an external cancellation and a cancellation you initiated, even if both occurred simultaneously on the same object. You have to set a flag that remembers the result of your I have a feeling that much of the complexity in With all that, I still haven't fully grokked either of your proposed solutions, for which I apologize (all the above was just mental preparation). I also notice that you haven't added a new test yet that demonstrates the problem. Finally, when I try the naive version of @graingert's idea to use timeout(), I get failing tests in test_waitfor.py, so either the tests are over-constraining a certain undesirable behavior, or there's a bug in |
I like this proposal. It avoids the most unexpected behavior of (1), in which If we go with this, then I think we have a complete specification for There are 3 things that can happen once Existing comments in
The proposed behavior is:
Does this look reasonable?
I wonder whether the following sequence of events is possible:
If it's possible, does it matter? |
Your proposal looks reasonable. The devil is in the details, though. How can we tell we're being cancelled externally? I suppose the current wait_for() code can tell because it has an outer future and an inner task, and we consider only cancellation of the outer future to be external. I sort of wish we didn't have to have both a future and a task, but in the end timeout() uses that too, so I should just be okay with that. Re: the sequence of cancel/uncancel calls, it's certain that we may not be able to tell whether A or B cancelled first, but we do know in which order they are handled, and the convention is that whoever handles last should get to decide. Everyone who handles prior to that will see uncancel() return > 0, and by convention has to propagate the outer cancellation. There are still some odd corner cases though, e.g. when both A and B cancel, then the coroutine catches the cancellation and returns None. The first handler (typically the last one to cancel, so B) ought to still call uncancel(), will find it > 0, and then what? Raise a fresh CancelledError? Or do whatever it normally does when the coroutine returns None? FWIW on how to proceed, let's fix it in a way that's compatible with 3.10 first (so no uncancel() calls), and then in a later PR get the uncancel() calls in, possibly by using timeout(), e.g. as in Kumar's PR (#98518). PS. A final wrinkle with |
While writing tests for the proposed behavior, I realized that it will make This also means that to pass these tests, the "easy path" of Does this still sound like a good idea? |
Hmm, that does sound like a problem. I've been see-sawing on this quite a bit. At this point I believe Kumar's PR is the gold standard for behavior, and it follows what (Note that I'm currently severely backed up and it may take a few days to dig myself out, until then I'm not sure you should believe anything I say.) |
I'll write down a new set of proposed behavior at the bottom. These will always raise CancelledError when cancelled externally, discarding any result / exception of the inner awaitable if any. This makes I have also changed my latest PR to match. If we end up preferring the previous behavior, we can always revert the one commit in the PR. New proposed behavior (changes are italicized):
|
I believe using
Gives the desired output:
This suggests the issue is to do with the exception being swallowed somewhere deep in the call stack. |
This test fails, but not very reliably. The cause of failure is that `asyncio.wait_for` sometimes swallows cancellation: python/cpython#86296
And fixed in 3.12 by gh-96764. (We're not 100.00% sure that the fix doesn't disturb some workaround, so we're not backporting the fix to 3.11 even though that has timeout(), and we would have to devise a totally different fix for 3.10, which we're not inclined to do, sorry. But going forward it should be fixed.)
wasted some time because asyncio.wait_for() was suppressing cancellations. [0][1][2] deja vu... [3] Looks like this is finally getting fixed in cpython 3.12 [4] So far away... In attempt to avoid encountering this again, let's try using asyncio.timeout in 3.11, which is how upstream reimplemented wait_for in 3.12 [4], and aiorpcx.timeout_after in 3.8-3.10. [0] python/cpython#86296 [1] https://bugs.python.org/issue42130 [2] https://bugs.python.org/issue45098 [3] kyuupichan/aiorpcX#44 [4] python/cpython#98518
Before 0ab7bfe, this test fails, but not very reliably. The cause of failure is that `asyncio.wait_for` sometimes swallows cancellation: python/cpython#86296 Fixed by switching to `asyncio.timeout`.
For feedback: I just wasted several partial days tracking down a production-only issue that happens only about once a day, almost certainly due to this (add logging, narrow down, add more logging, etc.). I'd reconsider the decision, as this bug can waste many development hours: the code at hand seems reasonable, until you conclude that the library must be swallowing the exception. Failing that, at a minimum update the docs to say not to use it on Python < 3.12, or there be dragons.