bpo-42130: Fix swallowing of cancellation by wait_for #26097
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing: our records indicate the following people have not signed the CLA. For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day. You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!
Does this actually solve the original issue where a connection/resource can be leaked, or are we just trading one variant of the bug for another? I agree this fixes some odd cancellation behaviour, but if I remember correctly we are back to having a likelihood of leaked resources. Following on from one of the comments in the second link: I'm still not sure why

As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut that cancels it? Likewise, the waiter func just checks whether the future is done and cancels it if not. My impression is that this would avoid much of the complexity within

# NOT TESTED: PoC pseudocode!
# Assumes the asyncio.tasks module context, where `events`, `exceptions`
# and `ensure_future` are already in scope.

def _optionally_cancel_fut(fut, event):
    # Timeout callback: record that the timeout fired and cancel the
    # future, but only if it hasn't completed yet.
    if not fut.done():
        event.set()
        fut.cancel()

async def wait_for(fut, timeout):
    loop = events.get_running_loop()
    fut = ensure_future(fut, loop=loop)
    timeout_occurred = asyncio.Event()  # Likely not optimal
    if timeout is not None and timeout > 0:
        timeout_handle = loop.call_later(
            timeout, _optionally_cancel_fut, fut, timeout_occurred)
        fut.add_done_callback(lambda fut: timeout_handle.cancel())
    elif timeout is not None:
        # Timeout must be negative or zero, cancel immediately
        _optionally_cancel_fut(fut, timeout_occurred)
    try:
        return await fut
    except exceptions.CancelledError as exc:
        if timeout_occurred.is_set():
            raise exceptions.TimeoutError() from exc
        raise
No, it doesn't solve bpo-37658. But the reverted change doesn't solve it either, at least not completely. And we would certainly have to work on leaked resources again. The main purpose of this PR is the test case.
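(For context, a minimal sketch of the bpo-37658 leak being referred to; the Resource/acquire names are hypothetical, and the leak only appears when the inner task finishes at essentially the same moment the timeout fires.)

import asyncio

class Resource:
    # Hypothetical resource that must be released explicitly.
    def close(self):
        print("released")

async def acquire():
    # Completes at (almost) exactly the moment the timeout fires.
    await asyncio.sleep(0.1)
    return Resource()

async def main():
    try:
        res = await asyncio.wait_for(acquire(), timeout=0.1)
    except asyncio.TimeoutError:
        # If acquire() actually finished before the cancellation was
        # processed, the Resource it returned is dropped here and
        # close() is never called: the leak from bpo-37658.
        res = None
    if res is not None:
        res.close()

asyncio.run(main())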
@@ -1147,22 +1147,6 @@ def gen():
         res = loop.run_until_complete(task)
         self.assertEqual(res, "ok")

-    def test_wait_for_cancellation_race_condition(self):
Would this test now fail if it weren't removed?
Exactly. The test is incorrect (the task must raise) and it doesn't reproduce the original problem.
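(For illustration only: a hedged sketch of what this reviewer's expectation might look like; the timings and helpers are assumptions, not the actual test that was removed.)

import asyncio
import unittest

class WaitForTests(unittest.TestCase):
    def test_wait_for_cancellation_race_condition(self):
        # The outer task is cancelled at roughly the same moment the inner
        # future completes; per the comment above, the cancellation must win
        # and the task must raise CancelledError instead of returning "ok".
        loop = asyncio.new_event_loop()
        self.addCleanup(loop.close)
        fut = loop.create_future()
        loop.call_later(0.1, fut.set_result, "ok")
        task = loop.create_task(asyncio.wait_for(fut, timeout=1))
        loop.call_later(0.1, task.cancel)
        with self.assertRaises(asyncio.CancelledError):
            loop.run_until_complete(task)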
It's worth a try at least, I'll take it tomorrow.
👍 I have checked the code in my comment above and it does what is expected at a high level (cancels on timeout, etc.), but I haven't checked it against the test suite nor against any of the bugs that led to this. If it works, then it's a whole lot easier to reason about than the current impl, but I've probably forgotten something important... 🙄
Based on pseudocode suggested by @aaliddell python#26097 (comment)
Here is my first attempt. With this change the following tests fail:
Let's put aside the arguably incompatible changes in behaviour and elaborate on the main goal. In test case
👍 Regarding the 'badly behaved' coroutines: I'd argue that it is behaving as well as can be expected, since
If we try to patch a poorly behaved coroutine by ensuring we raise the timeout regardless of whether the cancellation succeeds, we're masking the fact that the task underneath has been poorly behaved and exposing ourselves to the resource leak, but I can see this is perhaps going to be a point of debate 😄. I guess it comes down to two options:
Re
IMO, both ways are acceptable. But the important thing is that it's an incompatible change that may break someone's code.
To me, the behaviour it tests shouldn't be guaranteed. Such things should be done with patterns like RAII or context managers, but that's quite problematic with the current API. The whole point of trying the new approach was to remove the race in this particular case, and this goal has not been reached.
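(A minimal sketch of the pattern meant here, with a made-up _Conn/open_conn API; the comments spell out why it is still awkward with the current wait_for.)

import asyncio
import contextlib

class _Conn:
    # Stand-in for a real connection object.
    def close(self):
        print("closed")

async def open_conn():
    # Stand-in for real, potentially slow, connection setup.
    await asyncio.sleep(0.01)
    return _Conn()

@contextlib.asynccontextmanager
async def connection(timeout):
    conn = await asyncio.wait_for(open_conn(), timeout)
    try:
        yield conn
    finally:
        # Runs on normal exit, on errors and on cancellation of the body.
        conn.close()

async def main():
    async with connection(timeout=5) as conn:
        print("using", conn)

asyncio.run(main())

# Caveat, and why this is "problematic with the current API": if open_conn()
# finishes at the same moment the timeout fires, wait_for() may still raise
# TimeoutError, and the already-created _Conn never reaches the context
# manager, so it is leaked inside wait_for() itself.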
This PR is stale because it has been open for 30 days with no activity.
Re: the aiokafka library issue tagged above by @ods. I hope this will add some perspective to the issue. When we upgraded to Python 3.8.10, we were no longer able to reliably stop the consumer (close the connection and release resources). We wind up needing to toss the stop call into a task and move on, but this leaves us with unreleased resources. Our only solution has been to downgrade to Python 3.8.5, which isn't very sustainable in the long run.
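(A rough sketch of the workaround described above, assuming an aiokafka-style consumer with an async stop() method: the stop call is tossed into a fire-and-forget task so the application can move on, at the cost of possibly never releasing the consumer's resources.)

import asyncio

async def shutdown(consumer):
    # Fire-and-forget: don't await stop(), since on 3.8.10 it may never
    # complete reliably; the connection/resources may consequently leak.
    asyncio.create_task(consumer.stop())
    # ... continue shutting down the rest of the application ...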
I think I've fixed this properly for both cases in #28149.
I think each time

# Proposed replacement for asyncio.tasks.wait_for(); assumes the
# asyncio.tasks module context, where `events`, `exceptions` and
# `ensure_future` are in scope.
async def wait_for(fut, timeout):
    """Wait for the single Future or coroutine to complete, with timeout.

    Coroutine will be wrapped in Task.

    Returns result of the Future or coroutine.  When a timeout occurs,
    it cancels the task and raises TimeoutError.  To avoid the task
    cancellation, wrap it in shield().

    If the wait is cancelled, the task is also cancelled.

    This function is a coroutine.
    """
    loop = events.get_running_loop()
    if timeout is None:
        return await fut

    timed_out = False
    event = asyncio.Event()
    fut = ensure_future(fut, loop=loop)
    # Note: a done callback is called with the future as its argument,
    # so event.set cannot be registered directly.
    fut.add_done_callback(lambda _: event.set())

    if timeout <= 0:
        timed_out = True
        fut.cancel()
        while not fut.done():
            try:
                await event.wait()
            except asyncio.CancelledError:
                timed_out = False
                fut.cancel()
    else:
        def on_timeout():
            nonlocal timed_out
            timed_out = True
            fut.cancel()

        timeout_handle = loop.call_later(timeout, on_timeout)
        try:
            while not fut.done():
                try:
                    await event.wait()
                except asyncio.CancelledError:
                    timed_out = False
                    timeout_handle.cancel()
                    fut.cancel()
        finally:
            timeout_handle.cancel()

    if timed_out:
        try:
            return fut.result()
        except exceptions.CancelledError as exc:
            raise exceptions.TimeoutError() from exc
    return fut.result()
I think you may have missed the issue. There are various timing issues that result in situations where the future may already be complete when handling the cancellation. By only changing the code in

So far, the only proposal that might work would require making some changes to how the event loop runs tasks.
fut.remove_done_callback(cb)
# We must ensure that the task is not running
# after wait_for() returns.
# See https://bugs.python.org/issue32751
await _cancel_and_wait(fut, loop=loop)
raise
I think that the state wait_for is in right now (outside this PR) is about as bad as swallowing a KeyboardInterrupt in a utility function. It makes it unusable in my opinion. There is also no way for user code to counteract, whereas the resource leakage could actually be worked around.

This leads me to possibility 1 for how I think this issue could be solved: merge this PR, embrace the quirk and document it well. Tell the user something along the lines of:

Cancelling wait_for will leave the waited-for future/coroutine in any state between (and including) not started and finished. Make sure to provide any necessary clean-up on cancellation.

try:
    await asyncio.wait_for(foo, 123)
except asyncio.CancelledError:
    ...  # do clean-up
    raise
Or alternatively:

try:
    future = asyncio.ensure_future(foo)
    await asyncio.wait_for(future, 123)
except asyncio.CancelledError:
    if future.done() and not future.cancelled() and future.exception() is None:
        ...  # the future completed, yet we got cancelled
    raise
After all, if the wrapped future yields multiple times to the event loop, you don't get any guarantees as to how far your inner future got. So the user has to be prepared to clean up the cancellation mess one way or the other.
Possibility 2 I see is not to dismiss the cancellation, but to defer it:
Suggested change:

-fut.remove_done_callback(cb)
-# We must ensure that the task is not running
-# after wait_for() returns.
-# See https://bugs.python.org/issue32751
-await _cancel_and_wait(fut, loop=loop)
-raise
+if fut.done():
+    # We got cancelled, but we are already done. Therefore, we defer the
+    # cancellation until next time the task yields to the event loop.
+    current_task().cancel()
+    return fut.result()
+else:
+    fut.remove_done_callback(cb)
+    # We must ensure that the task is not running
+    # after wait_for() returns.
+    # See https://bugs.python.org/issue32751
+    await _cancel_and_wait(fut, loop=loop)
+    raise
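(A rough, self-contained sketch of the deferral mechanics this suggestion relies on: cancelling the currently running task only flags it, and the CancelledError is delivered the next time the task yields to the event loop. The function names here are made up.)

import asyncio

async def completes_then_defers():
    # Stand-in for the suggested wait_for() branch: the inner future is
    # already done, so re-arm the cancellation and return the result.
    asyncio.current_task().cancel()
    return "result"

async def caller():
    value = await completes_then_defers()  # no exception raised here
    print("got", value)
    try:
        await asyncio.sleep(0)  # next yield to the event loop
    except asyncio.CancelledError:
        print("deferred cancellation delivered here")
        raise

async def main():
    task = asyncio.create_task(caller())
    try:
        await task
    except asyncio.CancelledError:
        print("caller task ended up cancelled, as the outer cancel requested")

asyncio.run(main())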
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add to that: this is the issue that I think needs a quick resolution (using pytest here, but it should get the point across):
import asyncio
import pytest

@pytest.mark.asyncio
async def test_wait_for_cancellation_propagates():
    inner = asyncio.get_event_loop().create_future()
    outer = asyncio.wait_for(inner, 0.1)
    outer_future = asyncio.ensure_future(outer)
    await asyncio.sleep(0)
    inner.set_result(None)  # inner is done
    outer_future.cancel()   # AND outer got cancelled
    # this fails starting with Python 3.8
    with pytest.raises(asyncio.CancelledError):
        await outer_future
I think #31847 will fix this issue.
That might be true for 3.11, but 3.8-3.10 still need to be fixed.
3.9 now only receives security fixes (see https://www.python.org/downloads/release/python-3913/), so this could only be backported as far as 3.10.
@Dreamsorcerer I also had a go here: bf594bd, but found a bug in asyncio.timeout.
Superseded by #96764
Related:
https://bugs.python.org/issue42130