Close comm on CancelledError #5656
Conversation
distributed/core.py (outdated)
@@ -519,7 +519,7 @@ async def handle_comm(self, comm):
        result = asyncio.ensure_future(result)
        self._ongoing_coroutines.add(result)
        result = await result
-    except (CommClosedError, CancelledError):
+    except (CommClosedError, CancelledError, asyncio.CancelledError):
I'm not sure about this. What determines which exception is raised (CancelledError here is concurrent.futures.CancelledError, which is the one tornado uses)? The event loop being used, or the choice of @gen.coroutine vs. async def?
Anyway I didn't find any failure here in real life - this is just defensive programming.
I'd actually never expect a concurrent.futures.CancelledError here, but would expect an asyncio.CancelledError to be possible. If a task is ever cancelled while waiting on handle_comm, an asyncio.CancelledError should be raised. If we're seeing both types of cancelled errors here, we should probably figure out how to convert to only one (I'd prefer the asyncio version, which should be what's hit in most places in the code base).
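Not from the PR, but a minimal self-contained sketch of the behaviour described above: when a task is cancelled while it is awaiting a coroutine (handler() below is a made-up stand-in for handle_comm), the exception delivered at the await point is asyncio.CancelledError.

import asyncio

async def handler():
    # Stand-in for handle_comm(): blocks until cancelled.
    await asyncio.sleep(3600)

async def main():
    task = asyncio.create_task(handler())
    await asyncio.sleep(0)   # let the handler start and block
    task.cancel()            # e.g. the server is shutting down or a timeout fired
    try:
        await task
    except asyncio.CancelledError:
        print("cancellation surfaces as asyncio.CancelledError")

asyncio.run(main())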
Would you like me to change it to
except (CommClosedError, asyncio.CancelledError):
    ...
except CancelledError as e:  # pragma: nocover
    raise AssertionError("Unexpected exception") from e
and see if the path is ever hit? Or should I just clean it away?
I think it'd be good to remove the catch for the concurrent futures version, if for no other reason than readability. I'd probably delete it and see if any tests fail, but up to you.
done
distributed/core.py (outdated)
    force_close = True
    raise
except (CancelledError, asyncio.CancelledError):
Same as above, I don't have evidence that concurrent.futures.CancelledError is ever raised.
Ok, after reading the tornado code, there's no part of our async code that should ever hit concurrent.futures.CancelledError. Maybe this would happen for older versions of tornado, but not any version we're currently using.
removed concurrent.futures.CancelledError
    comm.abort()
elif please_close:
    await comm.close()
In case of OSError, this should halve the latency of scheduler.broadcast (which is the only method that uses close=True).
Force-pushed from 68c57c4 to fe5d749
I'm wondering if our application of reuse is safe
distributed/distributed/core.py, lines 885 to 889 in 7ebeda4:
try:
    result = await send_recv(comm=comm, op=key, **kwargs)
finally:
    self.pool.reuse(self.addr, comm)
    comm.name = name
- Everything works great -> reuse -> no need for the finally
- Comm broken -> reuse will not reuse the comm but will remove it from the pool
- CancelledError -> reusing the comm results in a broken stream
There are one or two more places where we're using ConnectionPool.reuse.
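Not the distributed code — a toy, self-contained sketch of the third case above (FakeComm, send_recv and the op names are made up): a reply left in flight by a cancelled request gets read by the next request that reuses the comm, which is the broken stream being described.

import asyncio

class FakeComm:
    # Toy stand-in for a comm: write() sends a request and the matching reply
    # arrives on the same stream a little later.
    def __init__(self):
        self._replies = asyncio.Queue()

    async def write(self, msg):
        asyncio.get_running_loop().call_later(
            0.1, self._replies.put_nowait, f"reply-to-{msg}"
        )

    async def read(self):
        return await self._replies.get()

async def send_recv(comm, op):
    await comm.write(op)
    return await comm.read()       # a cancellation can land here

async def main():
    comm = FakeComm()
    task = asyncio.create_task(send_recv(comm, "ping"))
    await asyncio.sleep(0)         # let the request go out and block in read()
    task.cancel()                  # e.g. a timeout fires mid-request
    try:
        await task
    except asyncio.CancelledError:
        pass                       # on master the comm went back into the pool here
    # The next request on the same comm now reads the stale reply:
    print(await send_recv(comm, "who_has"))   # -> "reply-to-ping"

asyncio.run(main())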
@@ -531,6 +532,27 @@ async def test_send_recv_args():
    server.stop()


@gen_test(timeout=5)
Is this safe for CI?
It takes 35ms on my machine, so I think so. I set it very low to detect a regression where everything gets stuck for 10-20 seconds, or whatever the default timeouts are.
I think it's safe. See the implementation of reuse: distributed/distributed/core.py, lines 1086 to 1091 in 7ebeda4.
In master, on CancelledError it's hitting lines 1089:1091. With this PR it hits line 1087 instead. I added a comment in reuse() to clarify what's happening.
Overall this seems good to me. Nice catch.
All test failures seem to be unrelated.
Fix a crash in tornado that would happen when a task waiting on read() got cancelled (typically by a timeout). The comm would go back into the pool of available connections without notifying the server, later causing two concurrent reads on the client side.
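For reference (not part of the PR), a minimal sketch of how such a cancellation typically arises: asyncio.wait_for cancels the awaited read when its deadline passes, leaving the reply in flight. read_forever() below is a made-up stand-in for comm.read().

import asyncio

async def read_forever():
    # Stand-in for a comm.read() whose reply is slow to arrive.
    await asyncio.sleep(3600)

async def main():
    try:
        # wait_for cancels the inner read when the timeout fires; this is the
        # "task waiting on read() got cancelled" case described above.
        await asyncio.wait_for(read_forever(), timeout=0.1)
    except asyncio.TimeoutError:
        print("the in-flight read was cancelled by the timeout")

asyncio.run(main())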