Pod Evictions (memory, node shutdowns) - requires manual hub restart! #223
I can confirm that we experience this frequently and my solution is always to restart the hub. We are on GKE.
@minrk @betatim @choldgraf this looks suspicious to me.

1. Only the DELETED state

We don't look for an EVICTED state, for example. I can look into exactly what states a pod can be in and which states overlap, because they will overlap; for example, I recall that a pod is considered Pending even while its status shows ContainerCreating. Perhaps we can find one check that covers both DELETED and EVICTED at the same time?

(kubespawner/kubespawner/reflector.py, lines 217 to 219 in 472a662)
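To illustrate the concern, here is a minimal hypothetical sketch (not kubespawner's actual reflector code) of a watch loop that treats an eviction, which arrives as a MODIFIED event on a pod with phase Failed and reason Evicted, the same way it treats a DELETED event:

```python
from kubernetes import client, config, watch

def watch_pods(namespace, pods):
    """Hypothetical reflector-style loop; `pods` is a dict acting as the local cache."""
    config.load_incluster_config()  # use load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(v1.list_namespaced_pod, namespace=namespace):
        pod = event["object"]
        if event["type"] == "DELETED":
            # The only case the reflector currently treats as "gone".
            pods.pop(pod.metadata.name, None)
        elif pod.status.phase == "Failed" and pod.status.reason == "Evicted":
            # An eviction arrives as a MODIFIED event, not a DELETED one,
            # so the pod would otherwise linger in the cache.
            pods.pop(pod.metadata.name, None)
        else:
            pods[pod.metadata.name] = pod
```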
2. Abort in finally

The finally clause contains a stop and a break, even though the except clauses have continue etc... Hmmm... UPDATE: oops, okay, hmmm, perhaps not an issue; I jumped the gun.

(kubespawner/kubespawner/reflector.py, lines 244 to 248 in 472a662)
For reference, this is the control flow if you have a try / except / else / finally block:

- Try -> Except -> Finally
- Try -> Else -> Finally
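A small self-contained example of that control flow, for reference (the print calls just mark which branch runs):

```python
def run(raise_error):
    try:
        if raise_error:
            raise ValueError("boom")
    except ValueError:
        print("except")   # Try -> Except -> Finally
    else:
        print("else")     # Try -> Else -> Finally
    finally:
        print("finally")  # runs on both paths

run(True)   # prints: except, finally
run(False)  # prints: else, finally
```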
@minrk helped me narrow down where this needs to be fixed: spawner.py@poll, or the functions called from it, including spawner.py@stop.
Hi, @consideRatio
Hi @kakal! I'm happy to see some activity on this issue, as I think it is very important. I have not done any additional work to resolve it since, but I hope to get back to it.
From check_spawner, kubespawner's poll is invoked, and after it returns, check_spawner will decide to kill the evicted pod.

(kubespawner/kubespawner/spawner.py, lines 1458 to 1494 in 0137336)
What needs to happen is that kubespawner or the hub realizes that pods have become evicted.

Questions raised
The pod reflector - code base investigation

The pod reflector will watch and update itself, and it will remove things from its list if it has seen a DELETED event for them.
A Python 3 memory bomb for debugging

No... This failed. The process was instead killed. Hmmm... I ran into the "out of memory killer" or "OOM killer".

```python
#!/usr/bin/env python
import sys
import time

if len(sys.argv) != 2:
    print("usage: fillmem <number-of-megabytes>")
    sys.exit()

count = int(sys.argv[1])
# a tuple of 128 * 1024 pointer-sized slots is roughly one megabyte
megabyte = (0,) * (1024 * 128)
data = megabyte * count

while True:
    time.sleep(1)
```
For reference, I'm quite sure I'm seeing this too. I don't have anything extra to help with debugging, other than that the database showed ...
I think what happens is that we consider the pod terminated, but it remains in a Failed state. See kubernetes/kubernetes#54525 (comment).

In the poll function there is this section. Note the check for ctr_status: I think that is missing for pods that have entered a Failed or Succeeded state, where all containers have terminated according to the specifications of these pod phases.

(kubespawner/kubespawner/spawner.py, lines 1476 to 1481 in 0137336)
If a pod is evicted due to memory pressure etc., the pod will enter a Failed state, but the key difference from being culled into a Succeeded (?) state is that the pod in the Failed state will remain and not be cleaned up quickly. It will have a pod phase reason of "Evicted", I think. So if we want to solve this surgically, I think we should look for that specific scenario and decide what to do based on it. If we delete the pod we may lose some debugging information, but if we don't, we will fail to spawn new pods, as it blocks the pod's name, I think. Perhaps we should look for this case, log the relevant information, and then delete it?
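As a rough sketch of what that surgical handling could look like (a hypothetical helper, not the actual kubespawner code), assuming we already hold the pod object from the reflector and a CoreV1Api client:

```python
from kubernetes.client import CoreV1Api
from kubernetes.client.rest import ApiException

def handle_if_evicted(v1: CoreV1Api, pod):
    """Log and delete a pod that was evicted (phase Failed, reason Evicted).

    Returns True if the pod was evicted and a delete was issued.
    """
    if pod.status.phase != "Failed" or pod.status.reason != "Evicted":
        return False
    # Keep the debugging information before the pod object disappears.
    print("pod %s evicted: %s" % (pod.metadata.name, pod.status.message))
    try:
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
    except ApiException as e:
        if e.status != 404:  # already gone is fine
            raise
    return True
```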
Below are three pods that will remain after stopping, which I can imagine could cause trouble by not being deleted and by remaining listed in the pod reflector's list.

Observations
- A pod quits and completes (pod phase: Succeeded)
- A pod quits with an error (pod phase: Failed)
- A pod quits by being evicted (pod phase: Failed, reason: Evicted)
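For reference, these observations can be reproduced with kubectl get pods, or with a small script along these lines (the namespace is an assumption):

```python
from kubernetes import client, config

def print_pod_phases(namespace="jhub"):
    """Print name, phase and reason (e.g. Failed / Evicted) for every pod."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod(namespace).items:
        print(pod.metadata.name, pod.status.phase, pod.status.reason)

print_pod_phases()
```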
Hmmm... But in the initial description of the issue, things remained in...
Which is what happens while the spawner is waiting to see the pod disappear from its resource list...

(kubespawner/kubespawner/spawner.py, lines 1798 to 1807 in 0137336)
But that only happens when...

(kubespawner/kubespawner/reflector.py, lines 229 to 235 in 0137336)
But hmm... Wasn't a delete call made just before this wait started?

(kubespawner/kubespawner/spawner.py, lines 1780 to 1797 in 0137336)
Yes, but it returned 404 as indicated by this log trace:
Why could we end up with a 404 on a pod delete request while the pod remains in the list? Has it been deleted already, but without ever triggering a DELETED V1WatchEvent?
Ah, hmmm, I realize that we can watch for k8s events about pods being Evicted and take action based on that... https://www.bluematador.com/blog/kubernetes-events-explained
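A minimal sketch, assuming the kubernetes Python client, of watching the event stream for Evicted events (the namespace and callback are placeholders):

```python
from kubernetes import client, config, watch

def watch_evictions(namespace, on_evicted):
    """Call on_evicted(pod_name, message) whenever an Evicted event is observed."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    stream = watch.Watch().stream(
        v1.list_namespaced_event,
        namespace=namespace,
        field_selector="reason=Evicted",
    )
    for event in stream:
        ev = event["object"]  # a core v1 Event object
        on_evicted(ev.involved_object.name, ev.message)

# e.g. watch_evictions("jhub", lambda name, msg: print(name, msg))
```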
We've been experiencing this issue when a user pod exceeds its resource limit (e.g. consumes >8 GB of RAM). In this scenario, k8s was restarting the user's pod with an ...

We have implemented a couple of updates that don't resolve the underlying issue, but do improve the user experience.
The above items certainly don't resolve the primary issue, but they have helped us reduce the impact on users. As for the root cause, would it be possible to modify the hub to track a unique label for the pod rather than its IP address? A unique label has the upside that it doesn't change, even if the IP address changes because the pod is rescheduled by k8s.
@tritab A similar issue was brought up in #386, which requires a persistent pod identifier for SSL certificates. The solution in that PR was to create a service. Presumably that should also solve your problem?
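For illustration, a per-user Service with a stable name could be created roughly like this (a hypothetical sketch; the label key, port, and naming are assumptions, not kubespawner's actual implementation):

```python
from kubernetes import client

def create_user_service(v1: client.CoreV1Api, namespace, username, port=8888):
    """Create a stably named Service that follows the user's pod by label, not by IP."""
    svc = client.V1Service(
        metadata=client.V1ObjectMeta(name="jupyter-%s" % username),
        spec=client.V1ServiceSpec(
            # Selecting by label means the route keeps working even if the pod
            # is rescheduled and gets a new IP address.
            selector={"hub.jupyter.org/username": username},
            ports=[client.V1ServicePort(port=port, target_port=port)],
        ),
    )
    return v1.create_namespaced_service(namespace=namespace, body=svc)
```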
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/weve-been-seeing-503s-occasionally-for-user-servers/3824/2
Yes, in particular, the contents of the attached log.
I think the current status is that the hub won't require a manual restart, but that it will require some time before its ...
This is the situation, and the issue is related to avoiding manual restarts specifically. Let's mark this as closed.
About
We have an issue when pods are removed unexpectedly that seems to require a hub restart. This could happen if a node fails, is preempted, or an admin deletes a user pod, I think. This is really bad, I'd say, as it requires the specific user that has this problem to contact an administrator, and then the administrator has to restart the hub to solve it.
I deem this vital to fix, as preemptible nodes (Google) and spot nodes (Amazon) are an awesome way to reduce cost, but using them risks causing this kind of huge trouble right now!
I'm still lacking some insight into what's going on in the proxy and the user management within JupyterHub though, so I'm hoping someone can pick this issue up from my writeup. @minrk or @yuvipanda, perhaps you know someone to /cc, or could give me a pointer on where to look to solve this?
Experience summary

- hub/admin
- stop pod

Experience log

1. A preemptible node was reclaimed and my user pod was lost
   - erik-2esundell
   - erik.sundell
2. Later - I visit the hub
3. Directly after - I visit hub/admin and press the stop server button in order to press start server again. If I press start server, this would show...
4. I revisit my singleuser server
What I see

What the hub log says after a while
5. I look again
This is now what I see

I think it is because of excessive refreshing of the pending-stop page, once for every log message entry in the point written above.
6. I refresh
This is what I see

This is what the proxy (chp) logs say - a route was added, but it is unreachable
7. I restart the hub
8. I log in again and everything works