
Close comm on CancelledError #5656

Merged: 5 commits merged into dask:main from crusaderky:comm_cancel on Jan 13, 2022

Conversation

@crusaderky (Collaborator) commented Jan 13, 2022

Fix a crash in tornado that occurred when a task waiting on read() was cancelled (typically by a timeout). The comm would be returned to the pool of available connections without notifying the server, later causing two concurrent reads on the client side.

tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x16ac14400>>, <Task finished name='Task-25633' coro=<Client._update_scheduler_info() done, defined at /Users/fjetter/workspace/distributed/distributed/client.py:1111> exception=AssertionError('Already reading')>)
Traceback (most recent call last):
  File "/Users/fjetter/mambaforge/envs/dask-distributed/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/Users/fjetter/mambaforge/envs/dask-distributed/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/Users/fjetter/workspace/distributed/distributed/client.py", line 1115, in _update_scheduler_info
    self._scheduler_identity = SchedulerInfo(await self.scheduler.identity())
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 886, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/Users/fjetter/workspace/distributed/distributed/core.py", line 663, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/Users/fjetter/workspace/distributed/distributed/comm/tcp.py", line 205, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
  File "/Users/fjetter/mambaforge/envs/dask-distributed/lib/python3.8/site-packages/tornado/iostream.py", line 421, in read_bytes
    future = self._start_read()
  File "/Users/fjetter/mambaforge/envs/dask-distributed/lib/python3.8/site-packages/tornado/iostream.py", line 809, in _start_read
    assert self._read_future is None, "Already reading"
AssertionError: Already reading
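
For illustration, here is a toy reproduction of the failure mode (a fake stream standing in for tornado's IOStream; a hypothetical sketch, not the actual distributed or tornado code). Cancelling the task that awaits a read leaves the stream with a pending read future, so the next request on the same, reused comm trips the assertion above:

import asyncio

class FakeStream:
    # Toy stand-in for tornado.iostream.IOStream: only one read may be pending.
    def __init__(self):
        self._read_future = None

    async def read_bytes(self, n):
        assert self._read_future is None, "Already reading"
        self._read_future = asyncio.get_running_loop().create_future()
        # Block until a response arrives; if the awaiting task is cancelled,
        # the stream is left believing a read is still in progress.
        return await self._read_future

async def main():
    stream = FakeStream()
    try:
        # A timeout (e.g. asyncio.wait_for around an RPC) cancels the read.
        await asyncio.wait_for(stream.read_bytes(8), timeout=0.1)
    except asyncio.TimeoutError:
        pass
    # Before this PR, the comm went back into the connection pool, so the next
    # request started a second concurrent read on the same stream:
    try:
        await stream.read_bytes(8)
    except AssertionError as exc:
        print(exc)  # -> Already reading

asyncio.run(main())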

@@ -519,7 +519,7 @@ async def handle_comm(self, comm):
result = asyncio.ensure_future(result)
self._ongoing_coroutines.add(result)
result = await result
-except (CommClosedError, CancelledError):
+except (CommClosedError, CancelledError, asyncio.CancelledError):
@crusaderky (Collaborator Author), Jan 13, 2022

I'm not sure about this. What determines which exception is raised (CancelledError here is concurrent.futures.CancelledError, which is the one tornado uses)? The event loop being used, or the choice of @gen.coroutine vs. async def?

Anyway, I didn't find any failure here in real life; this is just defensive programming.

Member

I'd actually never expect a concurrent.futures.CancelledError here, but would expect an asyncio.CancelledError to be possible. If a task is ever cancelled while waiting on handle_comm, an asyncio.CancelledError should be raised. If we're seeing both types of cancelled errors here, we should probably figure out how to convert to only one (I'd prefer the asyncio version, which should be what's hit in most places in the code base).
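
For reference: on Python 3.8+ (the interpreter in the traceback above) the two exception types are distinct classes, so catching one does not catch the other. A quick check, assuming a standard CPython 3.8+ install:

import asyncio
import concurrent.futures

# Since Python 3.8, asyncio.CancelledError derives directly from BaseException
# and is no longer an alias of concurrent.futures.CancelledError.
print(asyncio.CancelledError is concurrent.futures.CancelledError)            # False
print(issubclass(asyncio.CancelledError, concurrent.futures.CancelledError))  # False
print(issubclass(asyncio.CancelledError, Exception))                          # False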

Collaborator Author (@crusaderky)

Would you like me to change it to

                    except (CommClosedError, asyncio.CancelledError):
                        ...
                    except CancelledError as e:  # pragma: nocover
                        raise AssertionError("Unexpected exception") from e

and see if the path is ever hit? Or should I just clean it away?

Member

I think it'd be good to remove the catch for the concurrent.futures version, if for no other reason than readability. I'd probably delete it and see if any tests fail, but up to you.

Collaborator Author (@crusaderky)

done

force_close = True
raise
except (CancelledError, asyncio.CancelledError):
Collaborator Author (@crusaderky)

Same as above, I don't have evidence that concurrent.futures.CancelledError is ever raised.

Member

Ok, after reading the tornado code, there's no part of our async code that should ever hit concurrent.futures.CancelledError. Maybe this would happen for older versions of tornado, but not any version we're currently using.

Collaborator Author (@crusaderky)

removed concurrent.futures.CancelledError

comm.abort()
elif please_close:
await comm.close()
Collaborator Author (@crusaderky)

In case of OSError, this should halve the latency of scheduler.broadcast (which is the only method that uses close=True).

@crusaderky crusaderky requested a review from jcrist January 13, 2022 13:22
@crusaderky crusaderky self-assigned this Jan 13, 2022
@crusaderky crusaderky marked this pull request as ready for review January 13, 2022 13:23
@crusaderky (Collaborator Author)

@jcrist could you take a look?
FYI @fjetter

@fjetter (Member) left a comment

I'm wondering if our application of reuse is safe

try:
    result = await send_recv(comm=comm, op=key, **kwargs)
finally:
    self.pool.reuse(self.addr, comm)
    comm.name = name

  1. Everything works great -> reuse -> no need for finally
  2. Comm broken -> reuse will not reuse but remove the comm from the pool
  3. CancelledError -> reusing the comm results in a broken stream

There are one or two more places where we're using ConnectionPool.reuse.

@@ -531,6 +532,27 @@ async def test_send_recv_args():
server.stop()


@gen_test(timeout=5)
Member

Is this safe for CI?

Collaborator Author (@crusaderky)

It takes 35 ms on my machine, so I think so. I set it very low to detect a regression where everything gets stuck for 10-20 seconds, or whatever the default timeouts are.

@crusaderky (Collaborator Author) commented Jan 13, 2022

> I'm wondering if our application of reuse is safe
>
> try:
>     result = await send_recv(comm=comm, op=key, **kwargs)
> finally:
>     self.pool.reuse(self.addr, comm)
>     comm.name = name
>
>   1. Everything works great -> reuse -> no need for finally
>   2. Comm broken -> reuse will not reuse but remove the comm from the pool
>   3. CancelledError -> reusing the comm results in a broken stream
>
> There are one or two more places where we're using ConnectionPool.reuse.

I think it's safe. See the implementation of reuse:

if comm.closed():
    self.semaphore.release()
else:
    self.available[addr].add(comm)
    if self.semaphore.locked() and self._n_connecting > 0:
        self.collect()

In master, on CancelledError it's hitting lines 1089:1091. With this PR it hits line 1087 instead.

I added a comment in reuse() to clarify what's happening
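
Putting the two halves together, a toy end-to-end sketch (simplified, hypothetical names; not the actual distributed code) of why the reuse() call in the finally block is now safe: by the time reuse() runs after a cancellation, the comm has already been aborted, so it is dropped instead of going back into the pool.

import asyncio

class FakeComm:
    # Toy stand-in for a distributed Comm object.
    def __init__(self):
        self._closed = False

    def abort(self):
        self._closed = True

    def closed(self):
        return self._closed

async def send_recv(comm):
    # Simplified version of the pattern this PR adds to send_recv().
    try:
        await asyncio.sleep(10)  # stands in for `await comm.read(...)`
    except asyncio.CancelledError:
        comm.abort()  # the fix: a half-read comm must never be reused
        raise

def reuse(available, addr, comm):
    # Simplified ConnectionPool.reuse(): drop closed comms, keep healthy ones.
    if not comm.closed():
        available.setdefault(addr, set()).add(comm)

async def main():
    available, comm = {}, FakeComm()
    try:
        try:
            await asyncio.wait_for(send_recv(comm), timeout=0.1)
        finally:
            reuse(available, "tcp://127.0.0.1:8786", comm)
    except asyncio.TimeoutError:
        pass
    # The cancelled comm was aborted inside send_recv, so it was not reused.
    assert comm not in available.get("tcp://127.0.0.1:8786", set())

asyncio.run(main())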

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jan 13, 2022
@jcrist (Member) left a comment

Overall this seems good to me. Nice catch.

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jan 13, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jan 13, 2022
@crusaderky (Collaborator Author)

All test failures seem to be unrelated

@crusaderky crusaderky merged commit bf4ecff into dask:main Jan 13, 2022
@crusaderky crusaderky deleted the comm_cancel branch January 13, 2022 20:55