Transient issues spinning up servers related to pod reflector errors #1103
I've bounced (aka restarted) all the hub pods on the pilot-hubs cluster with this line:
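A minimal sketch of such a command, assuming `kubectl` access to the cluster and the standard z2jh `component=hub` label on hub pods (an illustrative reconstruction, not necessarily the exact line used):

```bash
# Illustrative only: delete the hub pod in every namespace that has one;
# each hub's Deployment recreates its pod automatically.
for ns in $(kubectl get ns -o name | cut -d/ -f2); do
  kubectl delete pod -n "$ns" -l component=hub --ignore-not-found
done
```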
choose is an awesome alternative to |
In this case, I think restarting the hub was the right call - I think we were just seeing jupyterhub/kubespawner#525 again. I see the following line in the log:
But the hub does not halt :) I think what happened is:
Restarting the pods fixes this, and upgrading to the latest z2jh (which has jupyterhub/kubespawner#525 in it) will fix this for good.
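For reference, a minimal sketch of how one might check a hub's logs for the reflector failure (the namespace value is a placeholder, and the `grep` pattern is an assumption rather than the exact log line referenced above):

```bash
# Placeholder namespace; substitute the affected hub's namespace.
NAMESPACE=staging

# Find the hub pod and scan its recent logs for reflector-related errors.
HUB_POD=$(kubectl get pod -n "$NAMESPACE" -l component=hub -o name | head -n1)
kubectl logs -n "$NAMESPACE" "$HUB_POD" --tail=1000 | grep -i reflector
```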
I think upgrading z2jh will help fix this.
Just noting that this issue is tracking upgrading Z2JH / JupyterHub to 2.0 (though maybe you're suggesting we upgrade to a later 1.x branch?)
@choldgraf nope, 2.x should fix it.
@yuvipanda awesome, I've updated the top comment to clarify the next steps here.
Just noting that we have another incident related to this issue: it looks like we can't solve this for good until z2jh 2.0 is released. Since this has become a fairly common problem, as a stopgap maybe we can share the actions that often resolve it in the top comment of this issue (or somewhere else?). It seems like the easiest thing is to restart the hub pod, and it then works OK when it comes back. Is that right?
Yes |
2i2c-org#1103 is happening more frequently now, and it's a hard fail - many users just cannot start up their servers. The z2jh upgrade will involve more work (2i2c-org#1055), so let's just bump up kubespawner in our custom hub image until then. Fixes 2i2c-org#1103
@choldgraf we don't need to wait for z2jh 2.0 (which will involve other work to upgrade) to get this fixed, as there's a released version of kubespawner with the fix. #1137 should fix this.
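For illustration only, the kind of change this implies in a custom hub image build is roughly the following (the actual change is in the linked PR; the exact version to pin is not specified here):

```bash
# Hypothetical sketch: inside the custom hub image build, upgrade kubespawner
# to a release that contains jupyterhub/kubespawner#525.
pip install --no-cache-dir --upgrade jupyterhub-kubespawner
```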
Context
We often get support tickets along the lines of "can't spin up a user server" that appear to be transient in nature. Upon inspection of the logs, we see pod reflector errors, which can be caused either by a k8s master API outage or a race condition in the hub.

I have opened an issue in the jupyterhub/grafana-dashboards repo to ask for the k8s master API stats to be included, since this will help us debug these types of issues: jupyterhub/grafana-dashboards#34
Deleting the hub pod and allowing it to be recreated should also help if the cause was a race condition in the hub.
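A minimal sketch of that remediation, assuming the standard z2jh `component=hub` label and substituting the affected hub's namespace:

```bash
# Placeholder namespace value.
NAMESPACE=prod

# Delete the hub pod; its Deployment recreates it automatically.
kubectl delete pod -n "$NAMESPACE" -l component=hub

# Watch until the new hub pod is Running again.
kubectl get pod -n "$NAMESPACE" -l component=hub -w
```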
As for the k8s master API outages: the pilot-hubs cluster is zonal, not regional, which means its k8s master API is not highly available and is therefore more prone to these issues. See #1102.
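To check whether a given cluster is zonal or regional (assuming GKE and an authenticated `gcloud` CLI), one option is:

```bash
# A zonal cluster reports a zone (e.g. us-central1-b) as its location;
# a regional cluster reports a region (e.g. us-central1).
gcloud container clusters list --format="table(name,location)"
```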
Actions and updates