Consistent worker Client instance in get_client
#5467
base: main
Conversation
Fixes dask#4959

`get_client` was calling the private `Worker._get_client` method when it ran within a task. `_get_client` should really have been called `_make_client`, since it created a new client every time. The simplest correct thing to do instead would have been to use the `Worker.client` property, which caches this instance. In order to pass the `timeout` parameter through, though, I changed `Worker.get_client` to actually match its docstring and always return the same instance.
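Roughly, the difference looks like this. A minimal sketch with a hypothetical `WorkerSketch` class and a stand-in client object, not the actual `distributed.Worker` code:

```python
class WorkerSketch:
    """Hypothetical stand-in for distributed.Worker, for illustration only."""

    def __init__(self):
        self._client = None  # cached client instance

    def _make_client(self, timeout=None):
        # Old behaviour of the (misnamed) private method: a brand-new
        # client object on every call.
        return object()  # stand-in for constructing a real Client

    def get_client(self, timeout=None):
        # Fixed behaviour: create once, cache, and always hand back the
        # same instance, while still accepting a timeout for creation.
        if self._client is None:
            self._client = self._make_client(timeout=timeout)
        return self._client
```

Under this contract, two tasks running in the same worker that both call `get_client` see the identical object, which is what the docstring promises.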
```python
# must be lazy import otherwise cyclic import
from distributed.deploy.cluster import Cluster
try:
    from .client import default_client
```
I think it would be cleaner to move all imports to the top of the function.
Again, this is just existing code. But I can refactor it.
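For context, a generic sketch of the lazy-import pattern the comment refers to, using hypothetical modules `module_a` and `module_b` (not actual distributed modules):

```python
# module_a.py (hypothetical)
import module_b          # module-level import: runs while module_a loads

def something():
    return "from module_a"

# module_b.py (hypothetical)
def helper():
    # Lazy import: deferred until helper() is called, by which time
    # module_a has finished loading, so the a -> b -> a cycle is broken.
    import module_a
    return module_a.something()
```

Moving such imports to the top of the *function* keeps the cycle broken while making the function's dependencies visible at a glance, which is what the review comment suggests.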
```python
assert self._client.status == "running"
Worker._initialized_clients.add(self._client)
if not asynchronous:
    assert self._client.status == "running"
```
Why do you need this? Isn't this already done by the sync Client constructor?
I don't know. This is all existing code; I just indented it under the `with self._client_lock:`.
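For context, a minimal sketch of the pattern that `self._client_lock` guards, assuming a hypothetical `WorkerSketch` class rather than the actual `distributed.Worker` code: without the lock, two threads calling `get_client` concurrently could each construct a client.

```python
import threading

class WorkerSketch:
    def __init__(self):
        self._client = None
        self._client_lock = threading.Lock()

    def _make_client(self):
        return object()  # stand-in for constructing a real Client

    def get_client(self):
        # Double-checked creation: the lock ensures only one thread can
        # run the "create and cache" step, so every caller gets the
        # same cached instance.
        if self._client is None:
            with self._client_lock:
                if self._client is None:
                    self._client = self._make_client()
        return self._client
```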
Futures work fine, but there's all sorts of extra complexity with returning futures from futures. Don't want this to become a flaky test because of distributed reference-counting bugs with pickling futures, since that's not what this test is about.
Would this also potentially affect the […]? Mainly, it throws non-blocking errors saying that it does not have the event handler:

```
distributed.client - ERROR - No event handler known for topic some_topic.
Traceback (most recent call last):
  File "C:\venv\lib\site-packages\distributed\client.py", line 1253, in _handle_report
    await result
  File "C:\venv\lib\site-packages\distributed\client.py", line 3602, in _handle_event
    self.unsubscribe_topic(topic)
  File "C:\venv\lib\site-packages\distributed\client.py", line 3653, in unsubscribe_topic
    raise ValueError(f"No event handler known for topic {topic}.")
ValueError: No event handler known for topic some_topic.
```

I am currently using dask/distributed 2021.10.0.

Code to reproduce:

```python
# reproduce_worker_issue.py
import asyncio
import time

import dask
from dask.distributed import Client, get_client, Queue, Variable, get_worker


def event_func(event):
    client = get_client()
    ts, msg = event
    print(f"event_func {ts} {msg} id:{client.id} handlers:{client._event_handlers}")


async def long_running_event_logger():
    c = get_client()
    w = get_worker()
    print(f"log running c_id:{c.id} w_id:{w.client.id}")
    while True:
        for i in range(3):
            await c.log_event('some_topic', f"{i}")
        await asyncio.sleep(5)


if __name__ == '__main__':
    client = Client()
    single_worker = list(client.has_what().keys())[0]
    client.subscribe_topic('some_topic', event_func)
    print(single_worker)
    client.submit(
        long_running_event_logger,
        workers=[single_worker],
    )
    for i in range(10):
        time.sleep(1)
```

Error:

```
PS C:> python .\reproduce_worker_issue.py
tcp://127.0.0.1:58760
log running c_id:Client-worker-83aef8c7-41b1-11ec-a6f0-534e57000000 w_id:Client-worker-83aef8c7-41b1-11ec-a6f0-534e57000000
event_func 1636499121.6069167 0 id:Client-82930104-41b1-11ec-8ad0-534e57000000 handlers:{'print': <function _handle_print at 0x000002A1D5197310>, 'warn': <function _handle_warn at 0x000002A1D51973A0>, 'some_topic': <function event_func at 0x000002A1D1FB7160>}
event_func 1636499121.608122 1 id:Client-82930104-41b1-11ec-8ad0-534e57000000 handlers:{'print': <function _handle_print at 0x000002A1D5197310>, 'warn': <function _handle_warn at 0x000002A1D51973A0>, 'some_topic': <function event_func at 0x000002A1D1FB7160>}
event_func 1636499121.6089892 2 id:Client-82930104-41b1-11ec-8ad0-534e57000000 handlers:{'print': <function _handle_print at 0x000002A1D5197310>, 'warn': <function _handle_warn at 0x000002A1D51973A0>, 'some_topic': <function event_func at 0x000002A1D1FB7160>}
event_func 1636499126.5905163 0 id:Client-82930104-41b1-11ec-8ad0-534e57000000 handlers:{'print': <function _handle_print at 0x000002A1D5197310>, 'warn': <function _handle_warn at 0x000002A1D51973A0>, 'some_topic': <function event_func at 0x000002A1D1FB7160>}
distributed.client - ERROR - No event handler known for topic some_topic.
Traceback (most recent call last):
File "C:\venv\lib\site-packages\distributed\client.py", line 1253, in _handle_report
await result
File "C:\venv\lib\site-packages\distributed\client.py", line 3602, in _handle_event
self.unsubscribe_topic(topic)
File "C:\venv\lib\site-packages\distributed\client.py", line 3653, in unsubscribe_topic
raise ValueError(f"No event handler known for topic {topic}.")
ValueError: No event handler known for topic some_topic.
distributed.client - ERROR - No event handler known for topic some_topic.
Traceback (most recent call last):
File "C:\venv\lib\site-packages\distributed\client.py", line 1253, in _handle_report
await result
File "C:\venv\lib\site-packages\distributed\client.py", line 3602, in _handle_event
self.unsubscribe_topic(topic)
File "C:\venv\lib\site-packages\distributed\client.py", line 3653, in unsubscribe_topic
raise ValueError(f"No event handler known for topic {topic}.")
ValueError: No event handler known for topic some_topic.
event_func 1636499126.5940294 1 id:Client-82930104-41b1-11ec-8ad0-534e57000000 handlers:{'print': <function _handle_print at 0x000002A1D5197310>, 'warn': <function _handle_warn at 0x000002A1D51973A0>, 'some_topic': <function event_func at 0x000002A1D1FB7160>}
event_func 1636499126.6008162 2 id:Client-82930104-41b1-11ec-8ad0-534e57000000 handlers:{'print': <function _handle_print at 0x000002A1D5197310>, 'warn': <function _handle_warn at 0x000002A1D51973A0>, 'some_topic': <function event_func at 0x000002A1D1FB7160>}
distributed.client - ERROR - No event handler known for topic some_topic.
Traceback (most recent call last):
File "C:\venv\lib\site-packages\distributed\client.py", line 1253, in _handle_report
await result
File "C:\venv\lib\site-packages\distributed\client.py", line 3602, in _handle_event
self.unsubscribe_topic(topic)
File "C:\venv\lib\site-packages\distributed\client.py", line 3653, in unsubscribe_topic
raise ValueError(f"No event handler known for topic {topic}.")
ValueError: No event handler known for topic some_topic.
Traceback (most recent call last):
File ".\validation\dashboard\reproduce_worker_issue.py", line 39, in <module>
time.sleep(1)
KeyboardInterrupt
Task was destroyed but it is pending!
task: <Task pending name='Task-167' coro=<Cluster._sync_cluster_info() done, defined at C:\venv\lib\site-packages\distributed\deploy\cluster.py:104> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x000002A1F6A15940>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at
C:\venv\lib\site-packages\tornado\ioloop.py:688]>
Task was destroyed but it is pending!
task: <Task pending name='Task-168' coro=<BaseTCPConnector.connect() done, defined at C:\venv\lib\site-packages\distributed\comm\tcp.py:392> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x000002A1F6CD0160>()]> cb=[_release_waiter(<Future pendi...1F6A15940>()]>)() at C:\ProgramData\Anaconda3\lib\asyncio\tasks.py:429]>
```

pip venv:

```
(venv) PS C:\> pip list
Package Version
--------------------- -----------
altair 4.1.0
argon2-cffi 21.1.0
astor 0.8.1
attrs 21.2.0
backcall 0.2.0
backports.zoneinfo 0.2.1
base58 2.1.1
bleach 4.1.0
blinker 1.4
bokeh 2.4.1
cachetools 4.2.4
certifi 2021.10.8
cffi 1.15.0
charset-normalizer 2.0.7
click 8.0.3
cloudpickle 2.0.0
colorama 0.4.4
dask 2021.10.0
debugpy 1.5.1
decorator 5.1.0
defusedxml 0.7.1
distributed 2021.10.0
entrypoints 0.3
fsspec 2021.11.0
gitdb 4.0.9
GitPython 3.1.24
greenlet 1.1.2
HeapDict 1.0.1
idna 3.3
importlib-resources 5.4.0
ipykernel 6.5.0
ipython 7.29.0
ipython-genutils 0.2.0
ipywidgets 7.6.5
jedi 0.18.0
Jinja2 3.0.2
jsonschema 4.2.1
jupyter-client 7.0.6
jupyter-core 4.9.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.2
locket 0.2.1
MarkupSafe 2.0.1
matplotlib-inline 0.1.3
mistune 0.8.4
msgpack 1.0.2
nbclient 0.5.4
nbconvert 6.2.0
nbformat 5.1.3
nest-asyncio 1.5.1
notebook 6.4.5
npTDMS 1.4.0
numpy 1.21.4
packaging 21.2
pandas 1.3.4
pandocfilters 1.5.0
parso 0.8.2
partd 1.2.0
pickleshare 0.7.5
Pillow 8.4.0
pip 20.1.1
plotly 5.3.1
prometheus-client 0.12.0
prompt-toolkit 3.0.22
protobuf 3.19.1
psutil 5.8.0
pyarrow 6.0.0
pycparser 2.20
pydeck 0.7.1
Pygments 2.10.0
pyparsing 2.4.7
pyrsistent 0.18.0
python-dateutil 2.8.2
pytz 2021.3
pytz-deprecation-shim 0.1.0.post0
pywin32 302
pywinpty 1.1.5
PyYAML 6.0
pyzmq 22.3.0
requests 2.26.0
Send2Trash 1.8.0
setuptools 47.1.0
six 1.16.0
smmap 5.0.0
sortedcontainers 2.4.0
SQLAlchemy 1.4.26
streamlit 1.1.0
tblib 1.7.0
tenacity 8.0.1
terminado 0.12.1
testpath 0.5.0
toml 0.10.2
toolz 0.11.1
tornado 6.1
tqdm 4.62.3
traitlets 5.1.1
typing-extensions 3.10.0.2
tzdata 2021.5
tzlocal 4.1
urllib3 1.26.7
validators 0.18.2
watchdog 2.1.6
wcwidth 0.2.5
webencodings 0.5.1
wheel 0.37.0
widgetsnbextension 3.5.2
zict 2.0.0
zipp 3.6.0
```
`test_serialize_future` is consistently failing.
Looks to me like […]. In #3729 you made it so that unpickling a Client tries to use the contextvar, through:

distributed/distributed/client.py Lines 377 to 380 in d41c82b

Instead, if no contextvar is set, we'd fall back through the convoluted logic of […]. The issue is that:

distributed/distributed/worker.py Lines 3726 to 3732 in d41c82b

distributed/distributed/worker.py Lines 3667 to 3677 in d41c82b

If we are to believe the docstring, the contract of […]. BUT right now we don't do that. Instead, […].

Now, before this PR, calling […]:

distributed/distributed/worker.py Lines 3581 to 3586 in d41c82b

But with this PR, […]. What I think is broken is that […].

In general, getting a correct client instance is a mess: there are too many ways to do it, too many places where it's defined, and not a clear enough definition of the hierarchy between these systems.

I propose that […]. Ideally, there'd be only one way to set a global client: a […]. I'd much prefer that over this PR. This PR is just one small fix but leaves the overall system broken.
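For reference, a minimal sketch of the contextvar-based "one way to set a global client" idea. The names `_current_client`, `as_current_client`, and `get_current_client` are hypothetical illustrations, not the actual distributed API:

```python
import contextvars
from contextlib import contextmanager

# Hypothetical contextvar holding "the" current client; in the proposal,
# something like this would be the single source of truth.
_current_client = contextvars.ContextVar("_current_client", default=None)

@contextmanager
def as_current_client(client):
    # Entering the context makes `client` the current one; leaving it
    # restores whatever was current before, even under nesting.
    token = _current_client.set(client)
    try:
        yield client
    finally:
        _current_client.reset(token)

def get_current_client():
    # A single lookup path: no fallbacks through worker state or globals.
    client = _current_client.get()
    if client is None:
        raise ValueError("No client is current in this context")
    return client
```

Because contextvars are both thread-local and task-local, the same mechanism covers threads and asyncio tasks without extra bookkeeping.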
Progress towards dask#5485. This was surprisingly easy and actually seems to work. Not sure yet what it breaks. I was pleasantly surprised to find that Tornado `loop.add_callback`s, `PeriodicCallback`s, etc. all play correctly with asyncio's [built-in contextvar management](https://www.python.org/dev/peps/pep-0567/#asyncio). With that, just setting the contextvar during `__init__` and `start` probably catches almost all cases, because all the long-running callbacks/coroutines (including comms) will inherit the context that's set when they're created. Where else should we add this `as_current_worker` decorator? This gives me confidence we'll be able to use the same pattern for a single current client contextvar as mentioned in dask#5467 (comment).
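To illustrate the asyncio behaviour relied on above, here is a self-contained toy (not distributed code; `current_worker` is a hypothetical variable): tasks capture the context that is active when they are created, so a contextvar set before scheduling is visible inside them for their whole lifetime.

```python
import asyncio
import contextvars

# Hypothetical contextvar, standing in for a "current worker" variable.
current_worker = contextvars.ContextVar("current_worker", default=None)

async def background_loop():
    # asyncio copies the active context when a Task is created, so this
    # long-running coroutine keeps seeing the worker that was current
    # at creation time.
    for _ in range(3):
        print("background sees:", current_worker.get())
        await asyncio.sleep(0.01)

async def main():
    current_worker.set("worker-1")
    task = asyncio.create_task(background_loop())  # captures "worker-1"
    current_worker.set("worker-2")  # later change is not seen by the task
    await task

asyncio.run(main())
# prints "background sees: worker-1" three times
```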
Fixes #5466, #3827

`get_client` was calling the private `Worker._get_client` method when it ran within a task. `_get_client` should really have been called `_make_client`, since it created a new client every time. The simplest correct thing to do instead would have been to use the `Worker.client` property, which caches this instance.

In order to pass the `timeout` parameter through, though, I changed `Worker._get_client` to actually match its docstring and always return the same instance.

cc @crusaderky: I tested this on your reproducer from #3827 and it seems to fix it (it helps to bump `threads_per_worker` up to 16 or something to encourage race conditions). Since dask's `get_scheduler` uses `get_client` internally, I think this was actually the problem.

- Closes "`get_client` returns different Clients in different worker threads" #5466 and "Race condition in default client on multithreaded workers" #3827
- Passes `pre-commit run --all-files`