Restart worker on CommClosedError #4884
On a dask-gateway GKE cluster, I have a few workers that stop processing, and their logs show some combination of CommClosedError, missing-dependency warnings, and garbage collection. The scheduler seems happy with the worker, as its "Last seen" timestamp stays up to date.

I've attached an example log. This worker was still in "processing" for its current task, which I think should take well under a second, after 30 minutes. I killed the worker, the graph backed up to redo the lost work, and the computation eventually completed.

I have written dask-cluster-manager jobs which restart schedulers that are leaking memory. I see there is a `client.get_scheduler_logs()`, which could be parsed to detect this. Or is there some way to detect this and restart the worker (besides me searching the GKE dashboard and doing it manually)?
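For anyone wanting to automate this, here is a minimal watchdog sketch built only on public `Client` methods (`get_worker_logs` and `retire_workers`). The scheduler address, log-line threshold, and polling interval are placeholders, and whether a retired worker's pod gets replaced depends on your dask-gateway/GKE setup, so treat this as a starting point rather than a recipe:

```python
# A rough watchdog sketch, not an official recipe: poll each worker's recent
# log lines for CommClosedError and retire workers that exceed a threshold.
# The scheduler address, threshold, and interval below are placeholders.
import time

from distributed import Client


def watch_for_comm_errors(client, threshold=5, interval=60.0):
    while True:
        # Returns {worker_address: [(level, message), ...]} per worker.
        logs = client.get_worker_logs(n=200)
        for addr, records in logs.items():
            hits = sum("CommClosedError" in message for _level, message in records)
            if hits >= threshold:
                print(f"Retiring {addr}: {hits} CommClosedError lines in recent logs")
                # close_workers=True shuts the worker process down; on
                # dask-gateway/GKE, replacing the pod is up to the cluster
                # manager, so this may shrink the cluster until it rescales.
                client.retire_workers(workers=[addr], close_workers=True)
        # Naive polling: the same log tail is re-read every pass, so a real
        # watchdog should track which lines it has already counted.
        time.sleep(interval)


if __name__ == "__main__":
    client = Client("tcp://scheduler:8786")  # placeholder address
    watch_for_comm_errors(client)
```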
Is there a setting that can be used to somehow mitigate this? The `lifetime.{duration,stagger,restart}` options seem like a last resort.
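For reference, this is roughly what that last-resort option looks like in Dask's YAML configuration; the durations below are arbitrary examples, not recommendations:

```yaml
distributed:
  worker:
    lifetime:
      duration: "4 hours"     # shut the worker down after this long
      stagger: "10 minutes"   # random offset so workers don't all cycle at once
      restart: true           # have the nanny start a fresh worker afterwards
```

The same knobs are, as far as I know, also exposed as `dask-worker --lifetime`, `--lifetime-stagger`, and `--lifetime-restart` command-line flags.

Logs attached.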
Comments

I am currently investigating a deadlock issue which might be connected to this. Can you reproduce it? See #4784 (the issue is long and grew in scope).

Reproduction can be tricky. I see this sporadically on a GKE cluster. It'll take me a day or two to find time to test this usefully.

That's what I thought, so don't worry. I still wanted to ask in case this was my lucky day :) If you can afford it without slowing down your system too much, debug logs might indicate one or two useful things. Either way, I'm working on getting #4784 merged, which fixes two deadlock situations. Before you dive into overly sophisticated testing, I would suggest waiting for that to be released.

We've recently merged an important PR addressing a few error-handling edge cases which caused unrecoverable deadlocks: deadlock fix #4784.

Just wanted to say that things seem to have gotten better! I upgraded to the latest version and haven't seen this issue yet. Fingers crossed. Thank you!