Cluster hangs with a few tasks in "processing" state but no cpu load on any workers #4724
Sorry for the late response. My first suggestion is that you upgrade dask and distributed. This is kind of a shot in the dark, but 2020.12.0 had some pretty big changes to the scheduler, so it does seem possible that there would be issues. I will also ping @jrbourbeau on this in case he has encountered anything similar.
I agree with @jsignell that it would be good to confirm whether or not the issue is still present when using the latest release.
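One way to confirm which versions the client, scheduler, and workers are actually running is sketched below (the scheduler address is a hypothetical placeholder; in this issue the client would come from the LSF cluster setup instead):

```python
import dask
import distributed
from distributed import Client

print(dask.__version__, distributed.__version__)   # versions in the client environment

# Hypothetical scheduler address; substitute the real cluster/client.
client = Client("tcp://scheduler-address:8786")

# Ask the scheduler and every worker for their package versions;
# check=True raises if they disagree with the client's versions.
client.get_versions(check=True)
```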
Thanks for the replies @jsignell and @jrbourbeau, and of course thanks for dask and dask-distributed to begin with! I have confirmed that I still get the hanging behavior with 2021.04.0. I have also checked the logs and found something. It was difficult at first, since the problem was mostly occurring for jobs with hundreds of workers and there were lots of logs to comb through without really knowing what to look for. But I ran a simpler job with only 1 worker with 4 CPU cores and 2 processes, and that was really helpful.
About halfway down the logs are two lines saying the workers are being closed gracefully. It seems that both workers were closed, losing intermediate results required for subsequent tasks. An attempt at a data transfer to facilitate one of those tasks then fails. Then the scheduler tries to restart the workers, which is successful, but now the cluster is hung. The dashboard still says the (lost) intermediates are either in the "released" or "memory" state and the dependent tasks are just waiting. The restarted workers show 2-4% CPU load indefinitely. So I'm not sure why the scheduler closes the workers? And I assumed "gracefully" meant intermediate results would either be stored elsewhere before closing (though in this case all workers were restarted), or scheduled to be recomputed after the workers restart. (I'm not being critical, I just want to understand how things work, and I'm willing to help out with fixes if given some guidance; I've never looked at any of the distributed source. This is all assuming it's not something in my own code, of course, which is still possible.)
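For reference, a minimal sketch of pulling the scheduler and worker logs back through the client instead of combing through the LSF output files (the scheduler address is a hypothetical placeholder):

```python
from distributed import Client

# Hypothetical scheduler address; in the setup described in this issue the
# client is created from an LSFCluster instead.
client = Client("tcp://scheduler-address:8786")

# Logs are retrieved through the scheduler; worker logs come back as a dict
# keyed by worker address.
scheduler_logs = client.get_scheduler_logs()
worker_logs = client.get_worker_logs()

for address, records in worker_logs.items():
    print(address, "-", len(records), "log records")
```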
That is really helpful extra info @GFleishman! I am going to transfer this issue to https://github.com/dask/distributed where it'll get more eyes from the hard-core distributed people.
We’ve seen very similar problems with `map_overlap` (running on Kubernetes). In the end we prepended and appended data to each partition with our own logic to circumvent it...
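For context, a rough sketch of that kind of manual padding, assuming a dask.dataframe workload and a `func` that preserves row count; `manual_overlap`, the before/after row counts, and the trimming logic are all illustrative assumptions, not the actual workaround:

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed


@delayed
def _apply_with_neighbours(prev_tail, part, next_head, func):
    # Pad the partition with rows borrowed from its neighbours, apply the
    # function, then trim the result back to the original partition's rows.
    pieces = [p for p in (prev_tail, part, next_head) if p is not None]
    padded = pd.concat(pieces)
    result = func(padded)
    start = 0 if prev_tail is None else len(prev_tail)
    return result.iloc[start:start + len(part)]


def manual_overlap(ddf, func, before, after):
    parts = ddf.to_delayed()                 # one Delayed object per partition
    tails = [p.tail(before) for p in parts]  # lazy method calls on Delayed
    heads = [p.head(after) for p in parts]
    out = []
    for i, part in enumerate(parts):
        prev_tail = tails[i - 1] if i > 0 else None
        next_head = heads[i + 1] if i + 1 < len(parts) else None
        out.append(_apply_with_neighbours(prev_tail, part, next_head, func))
    return dd.from_delayed(out)              # meta inferred from the first partition
```

The idea is the same as `map_overlap`: each partition sees `before` rows from its left neighbour and `after` rows from its right neighbour, but the graph is built explicitly with `dask.delayed`.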
Hi @rubenvdg - thanks for the comment. This does happen most often with `map_overlap`.
Same here. In our case it also happened on ...
I agree that it's difficult to reproduce. On several occasions I've convinced myself that the problem was solved, only to find out on the next big run that it wasn't. I think it must have something to do with worker-to-worker communication of dependencies and/or task states, which can be disrupted for a number of reasons and then potentially is not reset properly by the scheduler. That's all speculation, but the error logs and behavior so far point that way. On some lucky runs I think it's possible that the disrupting events just don't occur (e.g. it might be dependent on network traffic).
@fjetter I think this may be another instance of the issues you're working on.
We've recently merged an important PR addressing a few error-handling edge cases which caused unrecoverable deadlocks: Deadlock fix #4784.
As @fjetter mentioned, there have been several stability updates since this issue was originally opened. Closing for now, but @GFleishman let us know if you're still experiencing the same issue on the latest release.
This problem is stochastic. It seems to occur more frequently when there is more sharing of data between workers; `map_overlap` calls seem particularly problematic.

The cluster is set up using `dask-jobqueue.LSFCluster` and `dask.distributed.Client`.
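A rough sketch of this kind of setup (all parameter values here are illustrative placeholders, not the actual job configuration; the `env_extra` entries correspond to the thread-count variables mentioned below):

```python
from dask_jobqueue import LSFCluster
from dask.distributed import Client

# Illustrative values only; the real jobs use project-specific queue, memory,
# and core counts.
cluster = LSFCluster(
    cores=4,                # CPU cores per LSF job
    processes=2,            # dask worker processes per job
    memory="16GB",
    env_extra=[
        # Pin MKL/BLAS/OpenMP to 2 threads, as described further down.
        "export MKL_NUM_THREADS=2",
        "export OPENBLAS_NUM_THREADS=2",
        "export OMP_NUM_THREADS=2",
    ],
)
cluster.scale(jobs=100)     # the failing runs used hundreds of workers
client = Client(cluster)
```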
Workers are all allocated properly, and the bash scripts invoking LSF all seem fine. The task graph starts to execute, but then gets hung up and sits indefinitely in this type of state:
No workers show any meaningful CPU activity (2-4% for all workers).
The `env_extra` above makes sure all MKL, BLAS, and OpenMP environment variables are set to 2 threads per core (should be fine with hyperthreading?). When I click on the red task on the left of the graph I see:
hung_cluster_last_task_left.pdf
When I click on the red task on the right of the graph (second to last column) I see:
hung_cluster_last_task.pdf
For the red task on the right, the two "workers with data" show:
I've let these hang for upwards of 30 minutes with no meaningful CPU activity on any workers before killing the cluster manually. I can't let it run any longer because I'm paying for cluster time, so I don't know if it's just (intractably) slow or totally hung. Comparatively, the entire rest of the task graph was executed in less than 180 seconds.
Any pointers as to what could be causing this or how to permanently avoid it would be really appreciated.