dask worker pods and nodes not removed by autoscaler #408

Closed
scottyhq opened this issue Sep 20, 2019 · 17 comments

@scottyhq
Member

scottyhq commented Sep 20, 2019

dask-worker nodes were left running on the AWS clusters even after the user-notebook pods and nodes shut down. Here are some relevant kubectl outputs (one dask worker node has been running for 9 days!):

kubectl get pods --all-namespaces -o wide

esip-prod dask-rsignell-usgs-0b2bf9f0-1kc48v 0/1 ContainerCreating 0 9d ip-192-168-38-94.us-west-2.compute.internal

kubectl describe pod dask-rsignell-usgs-0b2bf9f0-1kc48v -n esip-prod

State: Waiting
Reason: ContainerCreating

kubectl get nodes -o wide

ip-192-168-38-94.us-west-2.compute.internal Ready worker 9d v1.13.7-eks-c57ff8 192.168.38.94 34.223.66.115 Amazon Linux 2 4.14.128-112.105.amzn2.x86_64 docker://18.6.1

kubectl logs cluster-autoscaler-6575548656-74955 | grep dask

I0904 23:59:46.094549 1 cluster.go:104] Fast evaluation: node ip-192-168-38-94.us-west-2.compute.internal cannot be removed: esip-prod/dask-rsignell-usgs-0b2bf9f0-1kc48v is not replicated

and finally,

import dask
import dask.config
dask.config.config
{'distributed': {'logging': {'bokeh': 'critical'},
  'dashboard': {'link': '/user/{JUPYTERHUB_USER}/proxy/{port}/status'},
  'admin': {'tick': {'limit': '5s'}}},
 'kubernetes': {'name': 'dask-{JUPYTERHUB_USER}-{uuid}',
  'worker-template': {'spec': {'restartPolicy': 'Never',
    'containers': [{'name': 'dask-${JUPYTERHUB_USER}',
      'image': '${JUPYTER_IMAGE_SPEC}',
      'args': ['dask-worker',
       '--nthreads',
       '2',
       '--no-dashboard',
       '--memory-limit',
       '7GB',
       '--death-timeout',
       '60'],
      'resources': {'limits': {'cpu': '1.75', 'memory': '7G'},
       'requests': {'cpu': 1, 'memory': '7G'}}}]}}},
 'labextension': {'factory': {'module': 'dask_kubernetes',
   'class': 'KubeCluster',
   'args': [],
   'kwargs': {}}},
 'temporary-directory': None,
 'array': {'svg': {'size': 120}}}

Originally posted by @scottyhq in pangeo-data/pangeo#712 (comment)

@scottyhq
Member Author

scottyhq commented Sep 20, 2019

Moving a comment from the other issue tracker:

Okay, I can reproduce the dask pods and nodes stuck in limbo by logging onto esip.pangeo.io and launching a dask cluster. If no worker nodes are available, it seems to take too long to scale up and there are network errors (scroll to the bottom):

kubectl describe pod dask-scottyhq-527e3b25-ag48fq -n esip-prod

Name:         dask-scottyhq-527e3b25-ag48fq
Namespace:    esip-prod
Priority:     0
Node:         ip-192-168-78-224.us-west-2.compute.internal/192.168.78.224
Start Time:   Wed, 04 Sep 2019 19:55:16 -0700
Labels:       app=dask
              component=dask-worker
              dask.org/cluster-name=dask-scottyhq-527e3b25-a
              user=jovyan
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
Containers:
  dask-scottyhq:
    Container ID:  
    Image:         783380859522.dkr.ecr.us-west-2.amazonaws.com/pangeo-esip:e6ec0b9
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      dask-worker
      --nthreads
      2
      --no-dashboard
      --memory-limit
      7GB
      --death-timeout
      60
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1750m
      memory:  7G
    Requests:
      cpu:     1
      memory:  7G
    Environment:
      DASK_SCHEDULER_ADDRESS:  tcp://192.168.4.168:35189
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l8cgz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-l8cgz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l8cgz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     k8s.dask.org/dedicated=worker:NoSchedule
                 k8s.dask.org_dedicated=worker:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                    From                                                   Message
  ----     ------                  ----                   ----                                                   -------
  Warning  FailedScheduling        7m42s (x2 over 7m42s)  default-scheduler                                      0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
  Normal   TriggeredScaleUp        7m32s                  cluster-autoscaler                                     pod triggered scale-up: [{eksctl-pangeo-esip-nodegroup-dask-worker-v2-NodeGroup-UKEXL45VXVA7 0->1 (max: 100)}]
  Warning  FailedScheduling        6m12s (x3 over 6m14s)  default-scheduler                                      0/3 nodes are available: 1 Insufficient cpu, 1 node(s) had taints that the pod didn't tolerate, 2 Insufficient memory.
  Normal   Scheduled               6m4s                   default-scheduler                                      Successfully assigned esip-prod/dask-scottyhq-527e3b25-ag48fq to ip-192-168-78-224.us-west-2.compute.internal
  Warning  FailedCreatePodSandBox  6m2s                   kubelet, ip-192-168-78-224.us-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "77c187a18ff7974198302f452bfb89cf126689939ff7fe72bb4598c55ea4d360" network for pod "dask-scottyhq-527e3b25-ag48fq": NetworkPlugin cni failed to set up pod "dask-scottyhq-527e3b25-ag48fq_esip-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "77c187a18ff7974198302f452bfb89cf126689939ff7fe72bb4598c55ea4d360" network for pod "dask-scottyhq-527e3b25-ag48fq": NetworkPlugin cni failed to teardown pod "dask-scottyhq-527e3b25-ag48fq_esip-prod" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

Then it becomes a Kubernetes autoscaler problem (see my first comment), where "the node cannot be removed because the pod is not replicated". kubernetes/autoscaler#351 seems relevant.
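
For reference, the cluster-autoscaler has a per-pod escape hatch for exactly this "not replicated" check: the cluster-autoscaler.kubernetes.io/safe-to-evict annotation. A minimal sketch of adding it to the worker-template shown in the config dump above (the dump doesn't set this today; this is just illustrating the idea):

# Sketch: annotate dask worker pods so the autoscaler is allowed to evict them
# even though they are not managed by a replicating controller.
worker_template = {
    "metadata": {
        "annotations": {
            "cluster-autoscaler.kubernetes.io/safe-to-evict": "true",
        }
    },
    "spec": {
        # ... existing restartPolicy / containers spec from the config dump above ...
    },
}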

@scottyhq
Member Author

scottyhq commented Sep 20, 2019

Looked into this a bit more today since it came up again. The issue seems to be related to aws-cni bugs that should be fixed by upgrading the nodegroup versions (Kubernetes >1.13.8): aws/amazon-vpc-cni-k8s#282 (comment)

Confirmed that re-creating the nodegroups installed version 1.13.10, and we no longer have lingering dask nodes:
kubectl get node

NAME                                           STATUS                     ROLES    AGE     VERSION
ip-192-168-1-96.us-west-2.compute.internal     Ready                      user     11m     v1.13.10-eks-d6460e
ip-192-168-10-159.us-west-2.compute.internal   Ready                      worker   3m55s   v1.13.10-eks-d6460e

cc @rsignell-usgs

@scottyhq changed the title from "dask workers stuck in STATUS 'ContainerCreating' on AWS Hubs" to "dask worker pods and nodes not removed by autoscaler" on Nov 16, 2019
@scottyhq
Member Author

scottyhq commented Nov 16, 2019

This has come up again: we have 5 worker nodes still running 24 hours after the user logged out:

ip-X.us-west-2.compute.internal    Ready    worker   27h   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal    Ready    worker   26h   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal   Ready    core     17d   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal    Ready    worker   26h   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal   Ready    user     15m   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal   Ready    worker   26h   v1.14.7-eks-1861c5
ip-X.us-west-2.compute.internal    Ready    worker   26h   v1.14.7-eks-1861c5

pods:

NAMESPACE         NAME                                  READY   STATUS    RESTARTS   AGE
icesat2-prod      autohttps-59654c74c4-mnf9k            2/2     Running   0          14d
icesat2-prod      dask-rsignell-usgs-64057291-e7kzhs    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-e7qc7j    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-ehqqgt    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-ergjkp    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-evnqc7    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-exzv8p    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-ezv8lw    1/1     Running   0          26h
icesat2-prod      dask-rsignell-usgs-64057291-ezztnn    1/1     Running   0          26h
icesat2-prod      hub-dc99c9488-2czd8                   1/1     Running   0          14d
icesat2-prod      jupyter-robfatland                    1/1     Running   0          20m
icesat2-prod      proxy-8758bbc4c-g8lbj                 1/1     Running   0          14d
icesat2-prod      user-scheduler-9545b9788-fmgs8        1/1     Running   0          14d
icesat2-prod      user-scheduler-9545b9788-hp4kb        1/1     Running   0          14d

The autoscaler won't remove these nodes, according to this log message:

I1116 17:36:40.750318 1 cluster.go:90] Fast evaluation: ip-X.us-west-2.compute.internal for removal
I1116 17:36:40.750331 1 cluster.go:104] Fast evaluation: node ip-X.us-west-2.compute.internal cannot be removed: icesat2-prod/dask-rsignell-usgs-64057291-ezv8lw is not replicated

I suspect this is a dask issue though, since these pods probably should not still be listed as Running. Looking at the logs of a dask pod, I see errors such as:

tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fe21e4b8748>>, <Task finished coro=<Worker.heartbeat() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py:860> exception=OSError("Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time")>)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
    _raise(error)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 869, in heartbeat
    metrics=await self.get_metrics(),
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 745, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 879, in connect
    connection_args=self.connection_args,
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time

Pinging @jacobtomlinson @TomAugspurger and @jhamman for help sorting this out!

The workaround for now is to delete the pods, after which the autoscaler removes the nodes a few minutes later:

kubectl get pods -n icesat2-prod --no-headers=true | cut -d ' ' -f1 | grep dask --color=never | xargs kubectl delete pod -n icesat2-prod

@TomAugspurger
Member

I suspect this is a dask issue though, since these pods probably should not still be listed as Running

Do you think the Python process is just hanging there, and hasn't actually exited?

@jacobtomlinson
Member

That would be my guess. The workers should time out, but if they've hung they may not be able to run the timeout code.

@scottyhq
Member Author

scottyhq commented Nov 18, 2019

Thanks @TomAugspurger and @jacobtomlinson. Is there a bit of logic we could add (in distributed? dask-kubernetes?) so that many occurrences of OSError: Timed out trying to connect to 'tcp://192.168.183.87:38875' after 10 s: connect() didn't finish in time kill the process and transition the pod from Running to Error or Completed?
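
For illustration only, here is a minimal sketch of the kind of logic being asked about, written without assuming anything about distributed's internals: a daemon thread that probes the scheduler address and hard-exits the process after repeated connection failures, so the container stops and the pod leaves the Running state. The scheduler address, interval, and failure threshold below are placeholders.

import os
import socket
import threading
import time

SCHEDULER = ("192.168.183.87", 38875)  # scheduler host/port from the error above (example)
CHECK_INTERVAL = 30    # seconds between probes (placeholder)
MAX_FAILURES = 10      # consecutive failures before giving up (placeholder)

def scheduler_watchdog():
    """Exit the process if the scheduler stays unreachable for too long.

    Runs in its own thread so it can still fire even if the worker's
    event loop is blocked (one of the suspected failure modes above).
    """
    failures = 0
    while True:
        try:
            socket.create_connection(SCHEDULER, timeout=10).close()
            failures = 0
        except OSError:
            failures += 1
            if failures >= MAX_FAILURES:
                os._exit(1)  # hard exit -> container stops -> pod leaves Running
        time.sleep(CHECK_INTERVAL)

threading.Thread(target=scheduler_watchdog, daemon=True).start()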

@TomAugspurger
Member

TomAugspurger commented Nov 18, 2019

Unfortunately, I'm not sure. It's possible that the event loop is blocked, and so the thread that should kill the process isn't able to run? Again, just speculation. If you have any context on what was going on when the worker timed out trying to reach the scheduler, that'd be great, but I'm guessing you don't have access to that info.

cc @mrocklin if you have any guesses.

A potential solution is to include --lifetime="1 hour" in the dask-worker command used. This would be a bit of a hammer though: all workers will be shut down after an hour, regardless of whether they've been active recently, and replaced with fresh workers. If the scheduler really has gone away, presumably the new worker would fail to connect and the pod would be cleaned up appropriately by Kubernetes. See https://docs.dask.org/en/latest/setup/cli.html#cmdoption-dask-worker-lifetime
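
Applied to the worker-template shown in the first comment's config dump, that would just mean extending the container args, e.g. (a sketch; the "1 hour" value is only the example from this comment):

# Container args for the dask-kubernetes worker-template, with --lifetime added
args = [
    "dask-worker",
    "--nthreads", "2",
    "--no-dashboard",
    "--memory-limit", "7GB",
    "--death-timeout", "60",
    "--lifetime", "1 hour",   # shut down and replace every worker after an hour
]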

@tjcrone
Contributor

tjcrone commented Nov 27, 2019

I am having a similar problem on the Azure Pangeo. I noticed that many worker pods were hanging around for a long time after users logged off. To test this, I logged in and created a Dask cluster, then logged off immediately. The worker pods have been up all night:

(base) tjc@lulu:~$ kubectl get pods -o wide --all-namespaces --sort-by="{.spec.nodeName}" | grep ooi-prod | grep dask-tj
ooi-prod      dask-tjcrone-c82d1516-0smzlk           1/1     Running   0          8h      10.0.130.51    aks-nodepool1-41768535-vmss00001h   <none>           <none>
ooi-prod      dask-tjcrone-c82d1516-0z8hp8           1/1     Running   0          8h      10.0.128.75    aks-nodepool1-41768535-vmss000001   <none>           <none>

The pods are throwing errors when trying to connect to the scheduler, but I think this is what they should be doing:

distributed.nanny - INFO -         Start Nanny at: 'tcp://10.0.128.75:36773'
distributed.diskutils - INFO - Found stale lock file and directory '/home/jovyan/worker-plklaxl4', purging
distributed.worker - INFO -       Start worker at:    tcp://10.0.128.75:45553
distributed.worker - INFO -          Listening to:    tcp://10.0.128.75:45553
distributed.worker - INFO - Waiting to connect to:   tcp://10.0.128.211:39647
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    1.25 GB
distributed.worker - INFO -       Local Directory: /home/jovyan/worker-4vez2fnb
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:   tcp://10.0.128.211:39647
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Connection to scheduler broken.  Reconnecting...
distributed.worker - WARNING - Heartbeat to scheduler failed
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fa7247260f0>>, <Task finished coro=<Worker.heartbeat() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py:871> exception=OSError("Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused")>)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 221, in connect
    _raise(error)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 880, in heartbeat
    metrics=await self.get_metrics(),
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 735, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 869, in connect
    connection_args=self.connection_args,
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: Timed out trying to connect to 'tcp://10.0.128.211:39647' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7fa71c494e48>: ConnectionRefusedError: [Errno 111] Connection refused

Or maybe this isn't supposed to be happening? @mrocklin, any ideas here? Thank you!

@tjcrone
Contributor

tjcrone commented Nov 27, 2019

It looks like this issue has been fixed here: dask/distributed#2880. We will need to move to a dev version of distributed, I think.

@TomAugspurger
Member

Ideally dask/distributed#3250 fixed this, but if not LMK.

I see you've already deployed with distributed dev, but I believe that the fix was included in distributed 2.8.1 released on Nov 22.
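
A quick way to confirm what a given image or running pod actually has:

import distributed
print(distributed.__version__)  # the fix mentioned above shipped in 2.8.1 (released Nov 22, 2019)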

@scottyhq
Member Author

@tjcrone
Contributor

tjcrone commented Nov 27, 2019

Ah! Great suggestion @scottyhq. Thank you. This is one of the updates I often forget to do when I move our image forward. I wonder if I can change it to LATEST or something like that so that I do not forget this in the future.

@tjcrone
Contributor

tjcrone commented Nov 29, 2019

@TomAugspurger, it looks like dask/distributed#3250 did fix this issue! Thanks. One thing I noticed is that after a worker loses contact with the scheduler, the pod goes into a "Completed" state, rather than terminating as it would if the dask cluster were explicitly shut down. A pod in the Completed state still seems to retain an IP address, and it is not clear whether the Kubernetes cluster will kick that pod out and scale down. Any thoughts on whether it would be better to explicitly delete worker pods when they lose contact with the scheduler?

@scottyhq
Member Author

scottyhq commented Dec 2, 2019

Just want to link this discussion regarding @tjcrone's point: @jacobtomlinson has floated the idea of refactoring dask workers as Jobs instead of Pods in dask/dask-kubernetes#53 (comment).

Also want to note that we still face the situation where binderhub environments running old dask versions leave pods that stick around:
NCAR/cesm-lens-aws#27

binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-08f9zd                1/1     Running   0          4d14h
binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-08tvtq                1/1     Running   0          4d14h
binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0f9vrf                1/1     Running   0          4d14h
binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0fcpdn                1/1     Running   0          4d14h
binder-staging   dask-ncar-cesm-lens-aws-36dzuhnd-31978533-0splxm                1/1     Running   0          4d14h

So I'm wondering how to enforce removal of long-running dask pods within the pangeo helm chart configuration (#477). Perhaps this is yet another reason to incorporate dask-gateway? Thoughts, @jhamman?
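
As an interim measure (purely a sketch, not something the helm chart does), a scheduled cleanup script could delete dask worker pods past some age cutoff, using the labels the worker-template already applies (app=dask, component=dask-worker). The namespace and cutoff below are placeholders.

# Sketch: delete dask worker pods older than a cutoff (could run as a CronJob)
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

NAMESPACE = "binder-staging"   # example namespace from the listing above
MAX_AGE = timedelta(hours=24)  # arbitrary cutoff for this sketch

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(NAMESPACE, label_selector="app=dask,component=dask-worker")
now = datetime.now(timezone.utc)
for pod in pods.items:
    age = now - pod.metadata.creation_timestamp
    if age > MAX_AGE:
        print(f"deleting {pod.metadata.name} (age {age})")
        v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)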

@jhamman
Member

jhamman commented Dec 2, 2019

@scottyhq - A few things you may consider doing:

  1. Purge all old images in your container registry. This will force rebuilds, even for repos that don't have new changes.
  2. Help us get dask-gateway working on binder (WIP PR here: [WIP] add dask-gateway to binder chart pangeo-binder#87). There will be an interim period where we'll have to let both work, plus a PR campaign to educate people on the new way of creating and managing dask clusters.

@scottyhq
Member Author

scottyhq commented May 5, 2020

Closing, since as of #577 we are now using dask-gateway to manage dask pods.

@scottyhq scottyhq closed this as completed May 5, 2020