
Error closing a local cluster when client still running #6087

Closed
gjoseph92 opened this issue Apr 7, 2022 · 6 comments · Fixed by #6120
Labels
bug (Something is broken) · p1 (Affects a large population and inhibits work) · regression

Comments

@gjoseph92
Collaborator

What happened:

When a local cluster (using processes) shuts down, a ton of errors are now spewed about scheduling new futures after shutdown.

I can't replicate it in my distributed dev environment, but in a different environment (which is quite similar, also running dask & distributed from main, just py3.9.1 instead of py3.9.5?) the process hangs and never terminates until a ctrl-C. In my distributed dev environment, the same errors are spewed, but it exits (with code 0, no less).

git bisect implicates #6031 @graingert.

Minimal Complete Verifiable Example:

# repro.py
import distributed
from distributed.deploy.local import LocalCluster


if __name__ == "__main__":
    cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=True)
    client = distributed.Client(cluster)

(dask-distributed) gabe dev/distributed ‹f4c52e9a› » python repro.py
2022-04-07 15:47:02,414 - distributed.utils - ERROR - cannot schedule new futures after shutdown
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 226, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1395, in _handle_report
    msgs = await self.scheduler_comm.comm.read()
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 150, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Client->Scheduler local=tcp://127.0.0.1:58800 remote=tcp://127.0.0.1:58791>: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/utils.py", line 693, in log_errors
    yield
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1225, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1254, in _ensure_connected
    comm = await connect(
  File "/Users/gabe/dev/distributed/distributed/comm/core.py", line 289, in connect
    comm = await asyncio.wait_for(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 439, in connect
    stream = await self.client.connect(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/tcpclient.py", line 265, in connect
    addrinfo = await self.resolver.resolve(host, port, af)
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 424, in resolve
    for fam, _, _, _, address in await asyncio.get_running_loop().getaddrinfo(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py", line 856, in getaddrinfo
    return await self.run_in_executor(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py", line 814, in run_in_executor
    executor.submit(func, *args), loop=self)
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 161, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
2022-04-07 15:47:02,417 - distributed.utils - ERROR - cannot schedule new futures after shutdown
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 226, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1395, in _handle_report
    msgs = await self.scheduler_comm.comm.read()
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 150, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Client->Scheduler local=tcp://127.0.0.1:58800 remote=tcp://127.0.0.1:58791>: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/utils.py", line 693, in log_errors
    yield
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1401, in _handle_report
    await self._reconnect()
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1225, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1254, in _ensure_connected
    comm = await connect(
  File "/Users/gabe/dev/distributed/distributed/comm/core.py", line 289, in connect
    comm = await asyncio.wait_for(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 439, in connect
    stream = await self.client.connect(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/tcpclient.py", line 265, in connect
    addrinfo = await self.resolver.resolve(host, port, af)
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 424, in resolve
    for fam, _, _, _, address in await asyncio.get_running_loop().getaddrinfo(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py", line 856, in getaddrinfo
    return await self.run_in_executor(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py", line 814, in run_in_executor
    executor.submit(func, *args), loop=self)
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 161, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
2022-04-07 15:47:02,471 - distributed.utils - ERROR - cannot schedule new futures after shutdown
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 226, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1395, in _handle_report
    msgs = await self.scheduler_comm.comm.read()
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 150, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Client->Scheduler local=tcp://127.0.0.1:58800 remote=tcp://127.0.0.1:58791>: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/utils.py", line 693, in log_errors
    yield
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1521, in _close
    await asyncio.wait_for(asyncio.shield(handle_report_task), 0.1)
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1401, in _handle_report
    await self._reconnect()
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1225, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1254, in _ensure_connected
    comm = await connect(
  File "/Users/gabe/dev/distributed/distributed/comm/core.py", line 289, in connect
    comm = await asyncio.wait_for(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 481, in wait_for
    return fut.result()
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 439, in connect
    stream = await self.client.connect(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/tcpclient.py", line 265, in connect
    addrinfo = await self.resolver.resolve(host, port, af)
  File "/Users/gabe/dev/distributed/distributed/comm/tcp.py", line 424, in resolve
    for fam, _, _, _, address in await asyncio.get_running_loop().getaddrinfo(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py", line 856, in getaddrinfo
    return await self.run_in_executor(
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/base_events.py", line 814, in run_in_executor
    executor.submit(func, *args), loop=self)
  File "/Users/gabe/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 161, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')

Environment:

  • Dask version: 6e30766
  • Python version: 3.9.5
  • Operating System: macOS
  • Install method (conda, pip, source): source
@gjoseph92 added the bug and regression labels on Apr 7, 2022
@graingert
Member

graingert commented Apr 8, 2022

This happens at Python interpreter finalization.

The reason this worked before is that our ExecutorResolver used our legacy copy of ThreadPoolExecutor which uses atexit instead of threading._register_atexit:

def _python_exit():
    global _shutdown
    _shutdown = True
    items = list(_threads_queues.items())
    for t, q in items:
        q.put(None)
    for t, q in items:
        t.join()

atexit.register(_python_exit)

A fix for this would be to ensure that clients, then workers, then clusters get shut down atexit.
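
To illustrate the ordering involved: on Python 3.9+ the stdlib ThreadPoolExecutor registers its shutdown hook via threading._register_atexit, which runs during threading shutdown and therefore before atexit callbacks, so work submitted to a stdlib pool from an atexit hook is refused. A minimal standalone sketch (stdlib only, not distributed code):

import atexit
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)


@atexit.register
def late_cleanup():
    # By the time atexit hooks run on 3.9+, the stdlib's _python_exit has
    # already executed via threading._register_atexit and set its shutdown
    # flag, so this submit raises a RuntimeError much like the one above.
    try:
        executor.submit(print, "cleanup work")
    except RuntimeError as exc:
        print(f"late submit failed: {exc}")


# Submitting while the interpreter is still running works fine.
print(executor.submit(lambda: "ok").result())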

@fjetter
Member

fjetter commented Apr 8, 2022

I've been running into all sorts of problems lately regarding future/task scheduling, cancellation, and blocking thread pools. Here are a couple of references, but I can't tell you whether any are actually related.

@fjetter added the p1 (Affects a large population and inhibits work) label on Apr 8, 2022
@mrocklin
Member

We should probably block release on this, yes? cc @jrbourbeau

@graingert is this something that you have time to own? (this seems like the sort of thing that you know best among the team)

@mrocklin
Member

Here is a possible solution: #6120

@vepadulano
Contributor

Hi, I'm still having the same issue with the following configuration:

  • Python 3.10.5
  • dask==2022.7.0, distributed==2022.7.0

Reproducer

from dask.distributed import LocalCluster, Client

if __name__ == '__main__':
    cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=True)
    cluster.close()

Admittedly, the original reproducer didn't manually close the cluster object, but I'm not sure how much it matters in this case. I would like to be able to do so, in general.

@jtlz2

jtlz2 commented Jul 5, 2023

Is the context manager the only way round here?

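For reference, the context-manager pattern being asked about is one workaround (not necessarily the only one): the with blocks close the client first, then the cluster, before the interpreter starts shutting down, so nothing is left to interpreter-exit cleanup. A sketch based on the original reproducer (the submit call is just illustrative):

import distributed
from distributed.deploy.local import LocalCluster

if __name__ == "__main__":
    with LocalCluster(n_workers=1, threads_per_worker=1, processes=True) as cluster:
        with distributed.Client(cluster) as client:
            # Exiting the inner block closes the client, then the outer block
            # closes the cluster (scheduler and workers).
            print(client.submit(sum, [1, 2, 3]).result())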