Hub out of sync with k8s cluster #2043
Hi @ryanlovett, have you seen this behavior recently?
I have: see jupyterhub/kubespawner#223! If it is the same reason for getting out of sync as the one I have seen, I think this is very important to solve. @ellisonbg described how troublesome it can be when users run out of memory. That happened to a user on my cluster: their pod got evicted, and I believe it led to this state. It has also happened when a user pod was lost because a preemptible node shut down after 24 hours. A while back, perhaps a month ago, I think @minrk implemented a fix that cleaned up some routing during hub startup. @minrk, does this sound familiar? Perhaps that is the one I was thinking of. I have gathered some PRs that may be related or give good background.
See jupyterhub/kubespawner#223 for reproduction details and logs of this issue. I'm fairly confident they have the same solution, and this issue is probably reproducible using the steps there.
@willingc Yes, it happened 7/31, 7/30, 7/16, and 7/6. I'm happy to try out any potential fixes; however, the class will be over in a week or two.
Thanks, folks. @consideRatio, sounds reasonable.
Example hub log entries:
From the Gitter discussion:
Thanks @willingc! Found it: https://gitter.im/jupyterhub/jupyterhub?at=5b637310c917d40dc2340997. That will kill off all pods but the proxy. In our case, killing only the hub pod is mostly effective, though there can be a handful of users who require server restarts.
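For reference, a minimal sketch of that mitigation using the Kubernetes Python client, assuming the zero-to-jupyterhub hub label `component=hub` and a namespace named `jhub`; both are assumptions about the deployment, not values from this thread.

```python
# Minimal sketch: delete only the hub pod so its Deployment recreates it,
# leaving the proxy and user pods untouched. The namespace "jhub" and the
# label "component=hub" are assumptions about a typical z2jh deployment.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

namespace = "jhub"  # hypothetical namespace
hub_pods = v1.list_namespaced_pod(namespace, label_selector="component=hub")
for pod in hub_pods.items:
    print(f"deleting {pod.metadata.name}")
    v1.delete_namespaced_pod(pod.metadata.name, namespace)
```

The hub Deployment's replica set then starts a fresh hub pod, which is why this works around the stale state without touching user servers.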
@gedankenstuecke could you post the revision of the helm chart you have deployed? Just to make sure everyone knows which version you are on, because it is a very recent one, not a stable release. I think that is important compared to Ryan's deployment, which might be on a stable release.
On mybinder.org and @gedankenstuecke's deployment (openhumans.org) we saw this behaviour with revisions of the helm chart newer than the latest stable release. The symptoms were that a user logs in and their pod appears (…). We traced this to a problem in kubespawner, which uses event reflectors to be notified of events related to pods. This seems to sometimes break, and I'd say right now it isn't completely clear why or where the bug is. I think there is another bug, because @gedankenstuecke and mybinder.org are both on a (very) recent revision of the helm chart that includes the first round of fixes to kubespawner.
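For readers unfamiliar with the reflector pattern mentioned here: the idea is to keep a local cache of pod state updated from a Kubernetes watch stream. The snippet below is a minimal illustration of that pattern with the Kubernetes Python client, not kubespawner's actual reflector code; the namespace and label selector are assumptions. If a loop like this dies or stalls without being restarted, the cache goes stale, which matches the "hub out of sync" symptom.

```python
# Illustrative reflector loop (not kubespawner's implementation): keep a local
# cache of singleuser pods updated from a Kubernetes watch stream.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "jhub"                         # hypothetical namespace
selector = "component=singleuser-server"   # assumed label for user pods
pods = {}                                  # pod name -> last seen pod object

# Initial list, then watch for changes starting at the returned resourceVersion.
initial = v1.list_namespaced_pod(namespace, label_selector=selector)
for pod in initial.items:
    pods[pod.metadata.name] = pod

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace,
                      label_selector=selector,
                      resource_version=initial.metadata.resource_version):
    pod = event["object"]
    if event["type"] == "DELETED":
        pods.pop(pod.metadata.name, None)
    else:  # ADDED / MODIFIED
        pods[pod.metadata.name] = pod
    print(event["type"], pod.metadata.name, pod.status.phase)
```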
@betatim Yep, 0.6. I tried to upgrade to a 0.7 when we first saw this problem, but it didn't go cleanly, so I went back to 0.6.
Sorry for the delay in answering (I had my driving test this morning with the DMV ✅ 🚗). The revision we're using should be …
A quick update: we're continuing to investigate the behavior of the pod reflectors losing track, for which we have mitigations in place. The other behavior that we are investigating is what happens after a pod is evicted, and we believe that addressing it will involve modifying kubespawner. If you run into this issue and can share any relevant logs or behavior information, please add them here. cc @minrk
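To make the eviction case concrete (this is an illustration, not the eventual kubespawner fix): an evicted pod is typically left in phase Failed with reason Evicted until it is deleted, so it can be detected and treated as a stopped server. The namespace and label below are assumptions.

```python
# Sketch: list user pods that were evicted (e.g. after node memory pressure or
# a preemptible node shutdown). Evicted pods usually remain with
# status.phase == "Failed" and status.reason == "Evicted" until deleted,
# which is one way the hub's "running" state can drift from reality.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "jhub"                         # hypothetical namespace
selector = "component=singleuser-server"   # assumed label for user pods

for pod in v1.list_namespaced_pod(namespace, label_selector=selector).items:
    if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
        print(f"{pod.metadata.name} was evicted: {pod.status.message}")
```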
We're seeing this on a new hub running v0.7-92ccb36.
I've pasted a simple grep of the user's username from the hub logs. I attempted to start and stop their server from /hub/admin on occasion; my user is "rylo":
I believe this will be fixed by #2297, which verifies that a server is at the right URL during startup. There was a situation where spawners could be listed as running at the wrong URL if the Hub was restarted while the pod was mid-launch.
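A rough illustration of the kind of startup consistency check described here (this is not the code from #2297): compare the URL the hub has recorded for a server against the IP its pod actually has now, and treat the server as stopped on a mismatch. The pod-naming convention, namespace, and example values below are assumptions.

```python
# Sketch of a startup sanity check (illustrative only, not the #2297 patch):
# if the hub's recorded server URL no longer matches the pod's current IP,
# the server record is stale and the spawner should be marked as stopped.
from urllib.parse import urlparse
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()
v1 = client.CoreV1Api()

def server_url_matches_pod(server_url, pod_name, namespace="jhub"):
    """Return True if the recorded server URL points at the pod's current IP."""
    recorded_host = urlparse(server_url).hostname
    try:
        pod = v1.read_namespaced_pod(pod_name, namespace)
    except ApiException:
        return False  # the pod is gone entirely
    return pod.status.pod_ip == recorded_host

# Hypothetical usage: "jupyter-rylo" assumes the kubespawner default
# "jupyter-{username}" pod name template; the URL is a placeholder.
if not server_url_matches_pod("http://10.20.3.4:8888/user/rylo/", "jupyter-rylo"):
    print("stale server record: mark the spawner as stopped")
```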
Closing as outdated or assumed to be fixed by #2297. |
On chart 0.6. Our hub has been periodically having problems where it won't permit current or new user servers to start. Users see 503 errors when visiting /user-redirect/git-sync (their primary entry point). When this happens, the hub's view of user servers at /hub/admin is out of sync with the cluster's view of them via get pods.
Temporary fix
Killing the hub pod gets users past this problem, though there are lingering issues. For example, one user's pod was still up but the hub thought the server wasn't running.
I realize this isn't much to go on. I've asked the instructors to preserve the hub logs the next time this problem comes up.