Update SSHCluster usage in benchmarks with new CUDAWorker #326
Conversation
This PR is still waiting on dask/distributed#3907 getting merged.
Still waiting on dask/distributed#3907.
This PR has been marked stale due to no recent activity in the past 30d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be marked rotten if there is no activity in the next 60d.
I updated the PR and the
Codecov Report
@@           Coverage Diff            @@
##        branch-21.12     #326   +/-   ##
===============================================
  Coverage           ?   70.01%
===============================================
  Files              ?       15
  Lines              ?     1914
  Branches           ?        0
===============================================
  Hits               ?     1340
  Misses             ?      574
  Partials           ?        0

Continue to review full report at Codecov.
Looks good overall, thanks for adding in the name fix 😄 just a couple questions:
"scheduler_options": {"protocol": args.protocol}, | ||
"worker_module": "dask_cuda.dask_cuda_worker", | ||
"worker_options": worker_options, | ||
"scheduler_options": {"protocol": args.protocol, "port": 8786}, |
Just so I understand better - why do we want to explicitly set the scheduler port to 8786?
Honestly, I can't remember the reason for that. I think it may have been because we need to have a scheduler_address, which is defined in https://github.com/rapidsai/dask-cuda/blob/branch-21.10/dask_cuda/benchmarks/utils.py#L207, and setting the port explicitly ensures they both match.
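To illustrate the port-matching point, here is a minimal sketch (variable names and host values are placeholders, not the benchmark's actual code): the benchmark derives a fixed scheduler_address, so the port passed in SSHCluster's scheduler_options has to agree with it.

# Minimal sketch; "protocol" and "hosts" are placeholder values.
protocol = "tcp"
hosts = ["dgx13", "dgx14"]

scheduler_port = 8786
# Roughly what the benchmark builds as scheduler_address in utils.py:
scheduler_address = f"{protocol}://{hosts[0]}:{scheduler_port}"

# The port handed to SSHCluster's scheduler_options must be the same one,
# otherwise clients constructed from scheduler_address would point at a port
# the scheduler never bound.
scheduler_options = {"protocol": protocol, "port": scheduler_port}

print(scheduler_address)  # tcp://dgx13:8786
print(scheduler_options)  # {'protocol': 'tcp', 'port': 8786}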
That makes sense - thanks for the clarification!
"scheduler_options": {"protocol": args.protocol, "port": 8786}, | ||
"worker_class": "dask_cuda.CUDAWorker", | ||
"worker_options": { | ||
"protocol": args.protocol, |
Is protocol being handled by Distributed when creating the workers? I don't see anything in CUDAWorker that suggests we need to pass this argument to the constructor.
Yes, it's passed via CUDAWorker's kwargs.
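A rough sketch of the forwarding being described, using a stand-in class rather than the real CUDAWorker, just to show how an entry like "protocol" in worker_options ends up in the worker's kwargs:

# Stand-in class to illustrate kwargs forwarding; the real CUDAWorker lives in
# dask_cuda and accepts many more arguments than shown here.
class FakeCUDAWorker:
    def __init__(self, scheduler=None, **kwargs):
        # Anything not handled explicitly (e.g. "protocol") lands in kwargs
        # and gets forwarded to the underlying Distributed worker/nanny.
        self.scheduler = scheduler
        self.forwarded_kwargs = kwargs

# SSHCluster effectively calls worker_class(scheduler, **worker_options):
worker_options = {"protocol": "tcp"}
worker = FakeCUDAWorker("tcp://dgx13:8786", **worker_options)
print(worker.forwarded_kwargs)  # {'protocol': 'tcp'}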
This is now ready for review. I've verified it works:

Sample result
However, after the above finishes, the cluster doesn't exit cleanly and dies 30 seconds later:

Close timeout traceback

tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: 'tcp://10.33.227.163:8786' processes=16 threads=16, memory=1.97 TiB>>
Traceback (most recent call last):
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/client.py", line 1172, in _heartbeat
self.scheduler_comm.send({"op": "heartbeat-client"})
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/batched.py", line 136, in send
raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Client->Scheduler local=tcp://127.0.0.1:57628 remote=tcp://dgx13:8786> already closed.
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f8e2597d6d0>>, <Task finished name='Task-126' coro=<SpecCluster._correct_state_internal() done, defined at /datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/spec.py:325> exception=OSError('Timed out trying to connect to tcp://10.33.227.163:8786 after 30 s')>)
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/comm/core.py", line 284, in connect
comm = await asyncio.wait_for(
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
return fut.result()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/comm/tcp.py", line 410, in connect
convert_stream_closed_error(self, e)
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x7f8dea2a65b0>: ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "dask_cuda/benchmarks/local_cupy.py", line 315, in run
await client.shutdown()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 439, in __aexit__
await f
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/spec.py", line 417, in _close
await self._correct_state()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/spec.py", line 332, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/core.py", line 785, in send_recv_from_rpc
comm = await self.live_comm()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/core.py", line 742, in live_comm
comm = await connect(
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://10.33.227.163:8786 after 30 s
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/comm/core.py", line 284, in connect
comm = await asyncio.wait_for(
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
return fut.result()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/comm/tcp.py", line 410, in connect
convert_stream_closed_error(self, e)
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x7f8dea2a65b0>: ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "dask_cuda/benchmarks/local_cupy.py", line 380, in <module>
main()
File "dask_cuda/benchmarks/local_cupy.py", line 376, in main
asyncio.get_event_loop().run_until_complete(run(args))
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "dask_cuda/benchmarks/local_cupy.py", line 315, in run
await client.shutdown()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 439, in __aexit__
await f
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/spec.py", line 417, in _close
await self._correct_state()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/spec.py", line 332, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/core.py", line 785, in send_recv_from_rpc
comm = await self.live_comm()
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/core.py", line 742, in live_comm
comm = await connect(
File "/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://10.33.227.163:8786 after 30 s
/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/spec.py:663: RuntimeWarning: coroutine 'wait_for' was never awaited
cluster.close(timeout=10)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
/datasets/pentschev/miniconda3/envs/rn-112-21.12.210930/lib/python3.8/site-packages/distributed/deploy/spec.py:663: RuntimeWarning: coroutine 'SpecCluster._close' was never awaited
cluster.close(timeout=10)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

Given this works despite the unclean exit, the low priority of this task, and prior knowledge of what a rabbit hole Distributed's closing/finalizers are, I won't go through that now. If anyone feels like doing that, have fun!
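For reference, a stripped-down sketch of the shutdown pattern the traceback points at (an approximation, not the benchmark's exact code, with LocalCluster standing in for SSHCluster): client.shutdown() appears to stop the scheduler before the cluster's own __aexit__ runs, so the cluster can no longer reach it to retire workers and times out after 30 s.

# Approximate shape of the benchmark's run() coroutine; LocalCluster stands in
# for SSHCluster so the sketch can run on a single machine.
import asyncio

from dask.distributed import Client, LocalCluster

async def run():
    async with LocalCluster(asynchronous=True, n_workers=1, threads_per_worker=1) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            # ... benchmark work would happen here ...
            # Shutting the client down also stops the scheduler, so the
            # cluster's __aexit__ afterwards cannot retire workers cleanly.
            await client.shutdown()

if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(run())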
Yeah, that is strange behavior. Considering there are certainly other areas of the benchmarks that have clean-up issues, and this doesn't impact the actual times, I'm in agreement to merge this and potentially dig down that rabbit hole later on with a follow-up PR in Distributed.
Thanks @pentschev 🙂
rerun tests
Ok, this is now passing, and since there are no objections to ignoring the unclean exit for now, I'm gonna go ahead and merge it. Thanks @charlesbluca for the review here!
@gpucibot merge
LGTM, thanks @pentschev
Thanks @madsbk!
Updates usage of SSHCluster according to changes in dask/distributed#5191.
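For context, a hedged example of the construction this PR moves the benchmarks to (hostnames and option values are placeholders): dask/distributed#5191 changed SSHCluster to accept a worker_class argument, so the benchmarks can point it at dask_cuda.CUDAWorker instead of the old worker_module spec.

# Placeholder hostnames and options; shown only to illustrate the new
# worker_class-based SSHCluster spec used by the benchmarks.
from dask.distributed import SSHCluster

cluster = SSHCluster(
    ["scheduler-host", "worker-host-1", "worker-host-2"],
    connect_options={"known_hosts": None},
    scheduler_options={"protocol": "tcp", "port": 8786},
    worker_class="dask_cuda.CUDAWorker",
    worker_options={"protocol": "tcp"},
)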