dask worker pods and nodes not removed by autoscaler #408
Moving a comment from the other issue tracker: Okay, I can reproduce the dask pods and nodes in limbo by logging onto esip.pangeo.io and launching a dask cluster. If no worker nodes are available, scale-up seems to take too long and there are network errors (scroll to bottom):
Then it becomes a kubernetes autoscaler problem (see my first comment) where "the node cannot be removed because the pod is not replicated". This seems relevant: kubernetes/autoscaler#351
Looked into this a bit more today since it came up again. The issue seems to be related to aws-cni bugs that should be fixed by upgrading the nodegroups (kubernetes >1.13.8): aws/amazon-vpc-cni-k8s#282 (comment). Confirmed that re-creating the nodegroups installed version 1.13.10, and we no longer have lingering dask nodes:
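As a quick sanity check after recreating nodegroups, something like the following can confirm the kubelet and aws-cni versions actually running on the cluster (a sketch; `aws-node` is the default name of the EKS CNI daemonset, adjust if your deployment differs):

```bash
# Kubelet version reported by each node (should be >= 1.13.8 per the comment above)
kubectl get nodes -o wide

# Image tag of the aws-vpc-cni plugin running on the cluster
kubectl describe daemonset aws-node -n kube-system | grep Image
```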
This has come up again where we have 5 worker nodes continuing to run 24 hours after a user has logged out:
pods:
The autoscaler won't remove these nodes, according to this log message:

I suspect this is a dask issue though, since these pods probably should not still be listed as
Pinging @jacobtomlinson, @TomAugspurger, and @jhamman for help sorting this out! The workaround for now is to delete the pods, and then the autoscaler removes the nodes a few minutes later:
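For reference, a minimal sketch of that workaround, assuming the lingering workers live in a hub namespace (here `esip-prod`, as in the kubectl output below) and that the pod names start with `dask-`, as dask-kubernetes uses by default:

```bash
# List the lingering dask worker pods in the hub namespace
kubectl get pods -n esip-prod -o wide | grep dask-

# Delete them; the cluster-autoscaler should drain and remove the empty nodes
# a few minutes later
kubectl get pods -n esip-prod -o name | grep dask- | xargs kubectl delete -n esip-prod
```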
Do you think the Python process is just hanging there, and hasn't actually exited?
That would be my guess. The workers should time out, but if they've hung they may not be able to run the timeout code.
Thanks @TomAugspurger and @jacobtomlinson. Is there a bit of logic to add (distributed? dask-kubernetes?) such that many occurrences of
Unfortunately, I'm not sure. It's possible that the event loop is blocked, and so the thread that should kill the process isn't able to run? Again, just speculation. If you have any context on what was going on when the worker timed out trying to reach the scheduler, that'd be great, but I'm guessing you don't have access to that info. cc @mrocklin if you have any guesses. A potential solution is to include
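The suggestion above is cut off, so it isn't clear which option it referred to. As one sketch of the general idea, the dask-worker CLI has a --death-timeout flag that makes a worker shut itself down if it cannot reach the scheduler within the given number of seconds (assuming the worker process is healthy enough to run that check at all):

```bash
# Worker exits on its own if the scheduler is unreachable for 60 seconds.
# The scheduler address below is just a placeholder.
dask-worker tcp://scheduler:8786 --death-timeout 60
```

In a dask-kubernetes deployment this flag would typically be added to the worker pod template's container arguments.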
I am having a similar problem on the Azure Pangeo. I noticed that many worker pods were hanging around for a long time after users logged off. To test this, I logged in and created a Dask cluster, then logged off immediately. The worker pods have been up all night:
The pods are throwing errors when trying to connect to the scheduler, but I think this is what they should be doing:
Or maybe this isn't supposed to be happening? @mrocklin, any ideas here? Thank you!
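For anyone trying to reproduce this, the connection errors mentioned above can be pulled straight from the lingering worker pods; a sketch, with placeholder pod name and namespace:

```bash
# Tail the recent log output of a lingering dask worker pod
kubectl logs <dask-worker-pod-name> -n <namespace> --tail=100
```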
The OOI configuration is here: https://github.com/pangeo-data/pangeo-cloud-federation/tree/staging/deployments/ooi
Looks like this issue has been fixed here: dask/distributed#2880. We will need to go onto a dev version of distributed, I think.
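For reference, a typical way to try an unreleased fix is to install distributed straight from GitHub in the user image (a sketch; in production you would usually pin to a specific commit or wait for the next release):

```bash
# Install the development version of distributed from the main repository
pip install git+https://github.com/dask/distributed
```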
Ideally dask/distributed#3250 fixed this, but if not LMK. I see you've already deployed with distributed dev, but I believe that the fix was included in distributed 2.8.1, released on Nov 22.
@tjcrone I'd recommend changing https://github.com/pangeo-data/pangeo-cloud-federation/blob/staging/deployments/ooi/image/binder/Dockerfile#L2 to the 2019.11.25 tag, which includes dask and distributed 2.8.1 (https://github.com/pangeo-data/pangeo-stacks).
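One way to double-check that a given pangeo-stacks tag ships the expected library versions before bumping the Dockerfile (the image name below is an assumption; substitute whichever pangeo-stacks image the Dockerfile is actually based on):

```bash
# Print the dask and distributed versions bundled in the candidate image tag
docker run --rm pangeo/pangeo-notebook-onbuild:2019.11.25 \
    python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"
```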
Ah! Great suggestion @scottyhq. Thank you. This is one of the updates I often forget to do when I move our image forward. I wonder if I can change it to LATEST or something like that so that I do not forget this in the future.
@TomAugspurger, it looks like dask/distributed#3250 did fix this issue! Thanks. One thing I noticed is that after a worker loses contact with the scheduler, the pod goes into a "Completed" state, rather than terminating as it would if the dask cluster were explicitly shut down. A pod in the Completed state still seems to retain an IP address, and it is not clear whether the kubernetes cluster will ever evict that pod and scale down. Any thoughts on whether it would be better to explicitly delete worker pods when they lose contact with the scheduler?
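Until that question is settled, the Completed worker pods can at least be found and cleaned up by phase (a sketch; `Succeeded` is the pod phase that kubectl displays as "Completed", and the namespace is a placeholder):

```bash
# List worker pods that have run to completion but are still hanging around
kubectl get pods -n <namespace> --field-selector=status.phase=Succeeded

# Remove them so the autoscaler can reclaim the nodes
kubectl delete pods -n <namespace> --field-selector=status.phase=Succeeded
```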
Just want to link this discussion regarding @tjcrone's point: @jacobtomlinson has floated the idea of refactoring dask workers as

Also, I want to note that we are still faced with old dask versions, in environments run on binderhub, that stick around:
So I'm wondering how to enforce removal of long-running pods within the pangeo helm chart configuration (#477). Perhaps this is yet another reason to incorporate dask-gateway? Thoughts, @jhamman?
@scottyhq - A few things you may consider doing:
Closing, since as of #577 we are now using dask-gateway to manage dask pods.
dask-worker nodes were left running on the AWS clusters even after user-notebook pods and nodes had shut down. Here are some relevant kubectl outputs (a dask worker node running for 9 days!):
kubectl get pods --all-namespaces -o wide
kubectl describe pod dask-rsignell-usgs-0b2bf9f0-1kc48v -n esip-prod
kubectl get nodes -o wide
kubectl logs cluster-autoscaler-6575548656-74955 | grep dask
and finally,
Originally posted by @scottyhq in pangeo-data/pangeo#712 (comment)