Transient issues spinning up servers related to pod reflector errors #1103
I've bounced (aka restarted) all the hub pods on the pilot-hubs cluster with this line:
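A minimal sketch of such a command, assuming `kubectl` access to the cluster and the standard z2jh `component=hub` label on hub pods (an illustrative reconstruction, not necessarily the exact line used):

```bash
# Illustrative only: delete the hub pod in every namespace that has one;
# each hub's Deployment recreates its pod automatically.
for ns in $(kubectl get ns -o name | cut -d/ -f2); do
  kubectl delete pod -n "$ns" -l component=hub --ignore-not-found
done
```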
choose is an awesome alternative to |
In this case, I think restarting the hub was the right call - I think we were just seeing jupyterhub/kubespawner#525 again. I see the following line in the log:
But the hub does not halt :) I think what happened is:
Restarting the pods fixes this, and upgrading to the latest z2jh (which has jupyterhub/kubespawner#525 in it) will fix this for good.
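For reference, a minimal sketch of how one might check a hub's logs for the reflector failure (the namespace value is a placeholder, and the `grep` pattern is an assumption rather than the exact log line referenced above):

```bash
# Placeholder namespace; substitute the affected hub's namespace.
NAMESPACE=staging

# Find the hub pod and scan its recent logs for reflector-related errors.
HUB_POD=$(kubectl get pod -n "$NAMESPACE" -l component=hub -o name | head -n1)
kubectl logs -n "$NAMESPACE" "$HUB_POD" --tail=1000 | grep -i reflector
```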
I think upgrading z2jh will help fix this.
Just noting that this issue is tracking upgrading Z2JH / JupyterHub to 2.0 (though maybe you're suggesting we upgrade to a later 1.x branch?)
@choldgraf nope, 2.x should fix it.
@yuvipanda awesome, I've updated the top comment to clarify the next steps here.
Just noting that we have another incident related to this issue: it looks like we can't solve this for good until z2jh 2.0 is released. Since this has become a fairly common problem, as a stopgap maybe we can share the actions that often resolve it in the top comment of this issue (or somewhere else?). It seems like the easiest thing is to restart the hub pod, and it then works OK when it comes back. Is that right?
Yes |
2i2c-org#1103 is happening more frequently now, and it's a hard fail - many users just cannot start up their servers. The z2jh upgrade will involve more work (2i2c-org#1055), so let's just bump up kubespawner in our custom hub image until then. Fixes 2i2c-org#1103
@choldgraf we don't need to wait for z2jh 2.0 (which will involve other work to upgrade) to get this fixed, as there's a released version of kubespawner with the fix. #1137 should fix this.
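For illustration only, the kind of change this implies in a custom hub image build is roughly the following (the actual change is in the linked PR; the exact version to pin is not specified here):

```bash
# Hypothetical sketch: inside the custom hub image build, upgrade kubespawner
# to a release that contains jupyterhub/kubespawner#525.
pip install --no-cache-dir --upgrade jupyterhub-kubespawner
```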
Context
We often get support tickets along the lines of "can't spin up a user server" that appear to be transient in nature. Upon inspection of the logs, we see pod reflector errors, which can be caused either by a k8s master API outage or a race condition in the hub.

I have opened an issue in the jupyterhub/grafana-dashboards repo to ask for the k8s master API stats to be included, since this will help us debug these types of issues: jupyterhub/grafana-dashboards#34
Deleting the hub pod and allowing it to be recreated should also help if the cause was a race condition in the hub.
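A minimal sketch of that remediation, assuming the standard z2jh `component=hub` label and substituting the affected hub's namespace:

```bash
# Placeholder namespace value.
NAMESPACE=prod

# Delete the hub pod; its Deployment recreates it automatically.
kubectl delete pod -n "$NAMESPACE" -l component=hub

# Watch until the new hub pod is Running again.
kubectl get pod -n "$NAMESPACE" -l component=hub -w
```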
As for the k8s master API outages: the pilot-hubs cluster is zonal, not regional, which means its k8s master API is not highly available and is therefore more prone to these issues. See #1102.
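To check whether a given cluster is zonal or regional (assuming GKE and an authenticated `gcloud` CLI), one option is:

```bash
# A zonal cluster reports a zone (e.g. us-central1-b) as its location;
# a regional cluster reports a region (e.g. us-central1).
gcloud container clusters list --format="table(name,location)"
```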
Actions and updates