Resilience needed against k8s master outages #627
Comments
@sgibson91 can you track down what JupyterHub / KubeSpawner versions were used when this was observed? Sorry, I could have done that myself - I conclude from this and the associated z2jh hub image with requirements.txt frozen here that jupyterhub is at version 2.3.1 and kubespawner at version 4.1.0. These are the latest stable versions of both jupyterhub and kubespawner, at least assuming that the docker image used by 2i2c was rebuilt when bumping the helm chart version.
Hence I believe it is safe to assume that the docker image was rebuilt when, and after, bumping the helm chart version.

I'm not sure if it's relevant, but the tag of the base image we build our hub image from doesn't match the z2jh chart version in our helm charts. So maybe our assumptions are wrong because we have a version mismatch?
Are you still seeing the 'halting hub...' message, but the hub does not halt? That's how kubespawner is meant to recover from things like this (when in doubt: restart). So if we're back in failing-to-stop territory (we may have reintroduced a form of #525 after the switch to kubernetes-asyncio), this may be a KubeSpawner or JupyterHub bug.
Yes, we have to manually restart the hub. EDIT: Actually, the last time we saw this, we saw the "stopping your server" message but the server never stopped. So it might not quite be the same thing.
OK, if you're seeing the 'halting hub...' message, that means kubespawner has recognized something's wrong, decided to do the right thing that will fix it (confirmed by the fact that a manual shutdown of the hub does fix it), and then it somehow didn't quite happen. Unfortunately, when I try to provoke this manually by inserting explicit failures, I get the exact messages @yuvipanda mentions here:
but then the Hub exits. This is with jupyterhub 2.3.1, kubespawner f3d44e6.
As I mentioned here, I'm not sure we can be confident about what versions we are running, since the base image we are building our hub image from has a different tag to the z2jh chart in our helm charts. I'm not sure how the two are linked, but there is a comment in our Chart.yaml that the base image tag needs bumping when the helm chart version is bumped.
One way to find out:
…
This is taken from the staging hub running on our main 2i2c cluster (so I'm confident it isn't using its own custom hub image). It seems we are a little ahead of the jupyterhub version we first surmised, but the kubespawner version is what we guessed.
And this is taken from the hub that last saw this issue (it's running on the same cluster, different namespace). It confirms the same versions.
…
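As a sketch of the kind of check meant here (my own illustration, assuming you can `kubectl exec` into the hub pod, rather than the exact command that was actually run), the installed versions can be printed from inside the hub container:

```python
# Hypothetical version check, to be run inside the hub container (e.g. via
# `kubectl exec` into the hub pod): prints the jupyterhub and kubespawner
# versions actually installed there.
import jupyterhub
import kubespawner

print("jupyterhub:", jupyterhub.__version__)
print("kubespawner:", kubespawner.__version__)
```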
OK, then it's got every fix I'm aware of. Is it possible to capture more detailed logs from the hub? Maybe even with debug-logging on? The trick is likely that there could be a long time between when the problem occurs and when it's observed, so a lot of logs might be needed to track it down.
My hunch is that the switch to kubernetes-asyncio away from threads for the reflectors has changed something in how these failures are observed. But I'm pretty stumped about how the …
I will confer with Yuvi as to whether this is possible for us.
This happened to the staging hub as well, right? I figure a reasonable choice is to enable debug logging for the staging hubs specifically. You can do it via z2jh config I think, debug=true.
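(For context, z2jh exposes this as the debug.enabled chart option; as far as I understand it, for the hub process this roughly amounts to the following jupyterhub_config.py setting. This is a sketch of the effect, not z2jh's exact wiring.)

```python
# Rough jupyterhub_config.py equivalent of enabling debug logging for the
# hub (illustrative; z2jh's debug.enabled option may wire this differently).
c.JupyterHub.log_level = "DEBUG"
```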
Yeah, when it happens, usually all hubs on the cluster go down, because it's caused by the k8s master outage. So I figure we enable it on staging and maybe one or two others as a fallback?
@sgibson91 I'm not confident about if and how long 2i2c retains logs, or about the amounts of data that would accumulate, so I'd go 👍 only on the staging hubs for now; but if you feel at ease with that aspect, then 👍 also for a production hub if you think it can be useful generally.
I discovered we have a logs backup in GCP! So I think retention and space are good, but I will double check.
@sgibson91 nice! I've been part of incidents where the volume of logs was so large that ingesting and storing them incurred very significant cost. Sometimes more logs is worse: it can incur costs like that, and it can be harder to read through logs that have too many lines of information added to them.
It's entirely possible regular logs over an extended period are sufficient, if you still have them surrounding any recent incidents. The main thing is capturing quite a bit of time to sift through.
We just had another incident happen about an hour ago so I'm looking through the logs now |
So I can see:
- a user server pending stop for a long time
- a server delete request that triggers a message about the server already being active
- then a hub shutdown, with complaints about leaving single-user servers running and not being able to clean up the proxy

This is when we start to see the asyncio/tornado errors that were reported in other issues.
I've scrolled through logs from 11:30 UTC up to 17:00 UTC. Everything looks normal until this 303 GET to gmanuch's lab endpoint comes in. Then we start seeing the message that that server is pending stop. As far as I can tell, it's the only call to /lab in that time frame. @abkfenris do you know if you have any JupyterLab modifications in your image? I see you have the default URL set to /lab for the Python image.
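(For reference, and as my own illustration of the kind of setting meant rather than anything taken from that deployment's actual config, a default URL of /lab is typically configured on the spawner like this; in z2jh it would normally come via singleuser.defaultUrl instead.)

```python
# Illustrative only: one common way the default URL ends up as /lab, e.g. in
# jupyterhub_config.py via KubeSpawner. The image/deployment in question may
# set it differently (for instance through z2jh's singleuser.defaultUrl).
c.KubeSpawner.default_url = "/lab"
```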
For our Python environment (lockfile), we should be getting JupyterLab via …
Small note that I've also spotted a …

Maybe the issue is an inconsistent state because of the following: …
Sorry about this, it's not complaining - this is normal behavior, chosen by z2jh configuration (as opposed to shutting down user servers when the hub shuts down).

I don't have access to the logs in context, but there should at least be an event about triggering the initial stop prior to 16:34:46. Are there any logs about the username leading up to this?

I think it's very unlikely that credentials are the issue, because the SpawnPendingHandler won't do anything if the credentials aren't up to date. Pending state is stored on the Spawner object itself, not related to authentication of any given request, and doesn't get registered until the stop has actually begun here.

But is this the same as the original issue? One user being stuck definitely could be caused by the same issue as the original (reflector problems, general cluster communication problems), but it could also be a different problem related to one call to KubeSpawner.stop never returning, or taking a long time. The timeout for this method is long (at least 5 minutes, I think, depending on configuration), so a minute or two may not mean anything is wrong, just slow.
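To make the "pending is stored on the Spawner" point concrete, here is a small sketch using JupyterHub's Spawner/User API as I understand it (illustrative only, not code from JupyterHub or this deployment):

```python
# Illustrative sketch: pending state lives on the Spawner object itself, so
# any handler can see that a stop has begun, regardless of which request or
# credentials triggered it. Attribute names per my reading of JupyterHub.
async def stop_if_not_already_stopping(user, server_name=""):
    spawner = user.spawners[server_name]
    if spawner.pending == "stop":
        # A stop has actually begun and has not yet finished.
        return "stop already pending"
    await user.stop(server_name)
    return "stop requested"
```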
Searching for the username (starting at 14:53, the earliest I checked), there are only logs like the one below every 5 minutes: …

And then the …
@minrk I invited you to the project with Logs Viewer role (I'm hoping that's enough for you to do what you need to, but let me know if it's not), so this link should work for you now. @GeorgianaElena is right though, I checked from 11am UTC and only found calls to the activity endpoint up until that 303 to /lab.
Why this stuff is hard: the relevant error occurred at 2022-08-24 20:00 UTC, while the issue was observed at 2022-08-25 16:35 UTC, fully 20 hours later.
Thank you @minrk 🙏
Also relevant is a longstanding issue in JupyterHub: jupyterhub/jupyterhub#1677. We don't have timeouts on Spawner.stop because it's unclear what the state should be when stop times out or fails. The fact that these …
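To make the trade-off concrete, here is a purely illustrative sketch (my own, not anything JupyterHub does today) of what a timeout around Spawner.stop could look like; the open question tracked in jupyterhub/jupyterhub#1677 is what to do in the except branch:

```python
# Illustrative only: JupyterHub does not currently wrap Spawner.stop in a
# timeout (see jupyterhub/jupyterhub#1677), because it is unclear what state
# the server should be left in if stop times out or fails.
import asyncio

async def stop_with_timeout(spawner, timeout=300):
    try:
        await asyncio.wait_for(spawner.stop(), timeout=timeout)
    except asyncio.TimeoutError:
        # The pod may still be terminating, already gone, or stuck, so we
        # cannot safely decide whether to mark the server as stopped.
        spawner.log.warning("Spawner.stop did not finish within %ss", timeout)
        raise
```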
Bug description
During a Kubernetes master outage, a few failed API calls leave JupyterHub in an unrecoverable state. This causes user servers to fail to spin up.
Expected behaviour
JupyterHub can correctly recover itself after failed API calls to the k8s master.
Actual behaviour
JupyterHub is left in an unrecoverable state and user servers cannot start.
Some logs are available in the following issues:
How to reproduce
This is an intermittent issue we are seeing on cloud-based deployments.
Your personal set up
We are running z2jh on GCP clusters. We see this issue most frequently on non-regional clusters. Our clusters that are regional do not experience this as often, if at all, because the k8s master has high availability for these clusters.
You can find more information about how we deploy our hubs here: https://github.com/2i2c-org/infrastructure
Related issues
Versions
Established in this comment: #627 (comment)