
Await on handle_stream raises missing delete_data await warning #3920

Closed
pentschev opened this issue Jun 22, 2020 · 6 comments · Fixed by #3922

@pentschev
Member

For increased visibility, I'm reposting https://github.com/dask/distributed/pull/3847/files#r443766556 as an issue here:

We have a few tests in dask-cuda that check the behavior of Device<->Host<->Disk spilling, and I noticed that after the 2.19 release one of them broke. I managed to track it down to one specific line of code, await gen.sleep(0), introduced by #3847. The test in question is at https://github.com/rapidsai/dask-cuda/blob/branch-0.15/dask_cuda/tests/test_spill.py#L409-L411, where we assert that the zict dictionaries are empty after deleting cdf2, which is the object being spilled. It seems that this happens because we're not awaiting Worker.delete_data somewhere, as per the warning below, which doesn't appear if I comment the await gen.sleep(0) out:

dask_cuda/tests/test_spill.py::test_cudf_device_spill[params0]
  /datasets/pentschev/miniconda3/envs/r-102-0.14/lib/python3.7/inspect.py:732: RuntimeWarning: coroutine 'Worker.delete_data' was never awaited
    for modname, module in list(sys.modules.items()):
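
For reference, this is the generic CPython warning for any coroutine object that is created but never awaited. A minimal standalone sketch (independent of distributed) that triggers the same message:

import asyncio

async def delete_data():
    return "OK"

async def main():
    # Calling the coroutine function only creates a coroutine object; since it
    # is never awaited, CPython emits the RuntimeWarning when the object is
    # garbage collected.
    delete_data()

asyncio.run(main())
# -> RuntimeWarning: coroutine 'delete_data' was never awaited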

I think that the only place where Worker.delete_data would be called and should be awaited is in worker_send:

def worker_send(self, worker, msg):
    """ Send message to worker
    This also handles connection failures by adding a callback to remove
    the worker on the next cycle.
    """
    try:
        self.stream_comms[worker].send(msg)
    except (CommClosedError, AttributeError):
        self.loop.add_callback(self.remove_worker, address=worker)

but I don't have anything better than my guess at this time because it's really hard for me to understand all the async black magic. I'm going to continue trying to figure this out, but any suggestions on how to pinpoint that are appreciated!
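
As a general aside (not specific to distributed), one way to pinpoint where an un-awaited coroutine is created is to enable coroutine origin tracking; the RuntimeWarning then also prints a "Coroutine created at ..." traceback pointing at the offending call site:

import sys

# Record up to 16 frames at the point where each coroutine object is created;
# the "was never awaited" RuntimeWarning will then include that creation
# traceback rather than only the spot where garbage collection happened.
sys.set_coroutine_origin_tracking_depth(16)

Running with asyncio's debug mode enabled (e.g. PYTHONASYNCIODEBUG=1) turns on the same tracking automatically.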

@jakirkham
Member

It's interesting as delete_data is an async method, but we don't really call it from the worker (only the scheduler). So it doesn't get awaited anywhere AFAIK. Compare this to update_data, which is a similar method, but is not actually made async. I wonder if delete_data should have the async part removed.
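
To illustrate the distinction with a generic sketch (not distributed's actual handler machinery): a plain function runs to completion as soon as it is called, while calling a coroutine function only creates a coroutine object that someone must await. A caller that wants to support both kinds of handlers has to check, roughly like this:

import inspect

async def dispatch(handler, **kwargs):
    # A plain def handler executes right here and returns its result; an
    # async def handler returns an awaitable that must be awaited explicitly,
    # otherwise it triggers the "was never awaited" warning at GC time.
    result = handler(**kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result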

@jakirkham
Member

cc @mrocklin @jrbourbeau (in case you have thoughts here 🙂)

@pentschev
Member Author

Thanks @jakirkham for looking at that. I verified that, with your suggestion applied, things work again:

diff --git a/distributed/worker.py b/distributed/worker.py
index 59cd285d..8bd45394 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1341,7 +1341,7 @@ class Worker(ServerNode):
         info = {"nbytes": {k: sizeof(v) for k, v in data.items()}, "status": "OK"}
         return info

-    async def delete_data(self, comm=None, keys=None, report=True):
+    def delete_data(self, comm=None, keys=None, report=True):
         if keys:
             for key in list(keys):
                 self.log.append((key, "delete"))
@@ -1355,7 +1355,7 @@ class Worker(ServerNode):
             if report:
                 logger.debug("Reporting loss of keys to scheduler")
                 # TODO: this route seems to not exist?
-                await self.scheduler.remove_keys(
+                self.scheduler.remove_keys(
                     address=self.contact_address, keys=list(keys)
                 )
         return "OK"

Possibly the second part should be removed or fixed, as the TODO comment above it suggests: there's no remove_keys anywhere in this repository.

Happy to file a PR if this change is reasonable.

@jakirkham
Member

Thanks for checking Peter! Let's see what others say 🙂

@jakirkham
Member

cc @quasiben (for vis)

@pentschev
Member Author

I was able to write a test that reproduces the issue independently of GPUs and dask-cuda, so I opened #3922 with the fix suggested by @jakirkham and a test for it.
