
Await on handle_stream raises missing delete_data await warning #3920

Closed
pentschev opened this issue Jun 22, 2020 · 6 comments · Fixed by #3922

@pentschev
Member

For increased visibility, I'm reposting https://github.com/dask/distributed/pull/3847/files#r443766556 as an issue here:

We have a few tests in dask-cuda that check the behavior of Device<->Host<->Disk spilling, and I noticed that after the 2.19 release one of them broke. I managed to track it down to one specific line of code, await gen.sleep(0), introduced by #3847. The test in question is at https://github.com/rapidsai/dask-cuda/blob/branch-0.15/dask_cuda/tests/test_spill.py#L409-L411, where we assert that the zict dictionaries are empty after deleting cdf2, which is the object being spilled. It seems that this happens because we're not awaiting Worker.delete_data somewhere, as per the warning below, which doesn't appear if I comment the await gen.sleep(0) out:

dask_cuda/tests/test_spill.py::test_cudf_device_spill[params0]
  /datasets/pentschev/miniconda3/envs/r-102-0.14/lib/python3.7/inspect.py:732: RuntimeWarning: coroutine 'Worker.delete_data' was never awaited
    for modname, module in list(sys.modules.items()):
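
For reference, this is the generic CPython warning for any coroutine object that is created but never awaited. A minimal standalone sketch (independent of distributed) that triggers the same message:

import asyncio

async def delete_data():
    return "OK"

async def main():
    # Calling the coroutine function only creates a coroutine object; since it
    # is never awaited, CPython emits the RuntimeWarning when the object is
    # garbage collected.
    delete_data()

asyncio.run(main())
# -> RuntimeWarning: coroutine 'delete_data' was never awaited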

I think that the only place where Worker.delete_data would be called and should be awaited is in worker_send:

def worker_send(self, worker, msg):
    """ Send message to worker
    This also handles connection failures by adding a callback to remove
    the worker on the next cycle.
    """
    try:
        self.stream_comms[worker].send(msg)
    except (CommClosedError, AttributeError):
        self.loop.add_callback(self.remove_worker, address=worker)

but I don't have anything better than my guess at this time because it's really hard for me to understand all the async black magic. I'm going to continue trying to figure this out, but any suggestions on how to pinpoint that are appreciated!
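
As a general aside (not specific to distributed), one way to pinpoint where an un-awaited coroutine is created is to enable coroutine origin tracking; the RuntimeWarning then also prints a "Coroutine created at ..." traceback pointing at the offending call site:

import sys

# Record up to 16 frames at the point where each coroutine object is created;
# the "was never awaited" RuntimeWarning will then include that creation
# traceback rather than only the spot where garbage collection happened.
sys.set_coroutine_origin_tracking_depth(16)

Running with asyncio's debug mode enabled (e.g. PYTHONASYNCIODEBUG=1) turns on the same tracking automatically.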

@jakirkham
Member

It's interesting as delete_data is an async method, but we don't really call it from the worker (only the scheduler). So it doesn't get awaited anywhere AFAIK. Compare this to update_data, which is a similar method, but is not actually made async. I wonder if delete_data should have the async part removed.
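
To illustrate the distinction with a generic sketch (not distributed's actual handler machinery): a plain function runs to completion as soon as it is called, while calling a coroutine function only creates a coroutine object that someone must await. A caller that wants to support both kinds of handlers has to check, roughly like this:

import inspect

async def dispatch(handler, **kwargs):
    # A plain def handler executes right here and returns its result; an
    # async def handler returns an awaitable that must be awaited explicitly,
    # otherwise it triggers the "was never awaited" warning at GC time.
    result = handler(**kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result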

@jakirkham
Member

cc @mrocklin @jrbourbeau (in case you have thoughts here 🙂)

@pentschev
Member Author

Thanks @jakirkham for looking at that. I verified that, with your suggestion applied, things work again:

diff --git a/distributed/worker.py b/distributed/worker.py
index 59cd285d..8bd45394 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1341,7 +1341,7 @@ class Worker(ServerNode):
         info = {"nbytes": {k: sizeof(v) for k, v in data.items()}, "status": "OK"}
         return info

-    async def delete_data(self, comm=None, keys=None, report=True):
+    def delete_data(self, comm=None, keys=None, report=True):
         if keys:
             for key in list(keys):
                 self.log.append((key, "delete"))
@@ -1355,7 +1355,7 @@ class Worker(ServerNode):
             if report:
                 logger.debug("Reporting loss of keys to scheduler")
                 # TODO: this route seems to not exist?
-                await self.scheduler.remove_keys(
+                self.scheduler.remove_keys(
                     address=self.contact_address, keys=list(keys)
                 )
         return "OK"

Possibly the second part should be removed or fixed, as the TODO comment above it suggests: there's no remove_keys anywhere in this repository.

Happy to file a PR if this change is reasonable.

@jakirkham
Member

Thanks for checking Peter! Let's see what others say 🙂

@jakirkham
Member

cc @quasiben (for vis)

@pentschev
Member Author

I was able to write a test that reproduces the issue independently of GPUs and dask-cuda, so I opened #3922 with the fix suggested by @jakirkham and a test for it.
