Resilience needed against k8s master outages #627

Closed · sgibson91 opened this issue Aug 12, 2022 · 30 comments · Fixed by #635

sgibson91 (Member) commented Aug 12, 2022

Bug description

During a Kubernetes master outage, a few failed API calls leave JupyterHub in an unrecoverable state. This causes user servers to fail to spin up.

Expected behaviour

JupyterHub correctly recovers itself after failed API calls to the k8s master.

Actual behaviour

JupyterHub is left in an unrecoverable state and user servers cannot start.

Some logs are available in the following issues:

How to reproduce

This is an intermittent issue we are seeing on cloud-based deployments.

Your personal set up

We are running z2jh on GCP clusters. We see this issue most frequently on non-regional clusters. Our regional clusters do not experience this as often, if at all, because the k8s master is highly available for those clusters.

You can find more information about how we deploy our hubs here: https://github.com/2i2c-org/infrastructure

Related issues

Versions

  • jupyterhub 2.3.1
  • kubespawner 4.1.0

Established in this comment: #627 (comment)

sgibson91 added the bug label Aug 12, 2022

consideRatio (Member) commented Aug 12, 2022

@sgibson91 can you track down what JupyterHub / KubeSpawner version was used when this was observed?

Sorry, I could have done that myself. From this and the associated z2jh hub image, with requirements.txt frozen here, I conclude that jupyterhub is at version 2.3.1 and kubespawner at 4.1.0. These are the latest stable versions of both jupyterhub and kubespawner, at least assuming that the docker image used by 2i2c was rebuilt when bumping the helm chart version.

sgibson91 (Member Author) commented Aug 15, 2022

Hence I believe it is safe to assume that the docker image was rebuilt when, and after, bumping the helm chart version.

I'm not sure if it's relevant, but the tag of the k8s-hub base image we define in our Dockerfile is behind that defined in the helm chart (I'm guessing these are the same thing based on comments?):

So maybe our assumptions are wrong because we have a version mismatch?

minrk (Member) commented Aug 17, 2022

Are you still seeing the 'halting hub...' message, but the hub does not halt? That's how kubespawner is meant to recover from things like this (when in doubt: restart). So if we're back in failing-to-stop territory (we may have reintroduced a form of #525 after the switch to kubernetes-asyncio), this may be a KubeSpawner or JupyterHub bug.

sgibson91 (Member Author) commented Aug 17, 2022

> Are you still seeing the 'halting hub...' message, but the hub does not halt?

Yes, we have to manually restart the hub.

EDIT: Actually the last time we saw this, we saw the "stopping your server" message but the server never stopped. So it might not quite be the same thing.

minrk (Member) commented Aug 17, 2022

OK, if you're seeing the 'halting hub...' message, that means kubespawner has recognized something's wrong, decided to do the right thing that will fix it (confirmed by the fact that a manual shutdown of the hub does fix it), and then it somehow didn't quite happen.

Unfortunately, when I try to provoke this manually by inserting explicit failures, I get the exact messages @yuvipanda mentions here:

[C 2022-08-17 15:12:43.350 JupyterHub spawner:2381] Pods reflector failed, halting Hub.
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending name='Task-15' coro=<shared_client.<locals>.close_client_task() running at /Users/minrk/dev/jpy/hub/kubespawner/kubespawner/clients.py:58> wait_for=<Future pending cb=[Task.task_wakeup()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x111cb6490>
RuntimeError: coroutine ignored GeneratorExit

but then the Hub exits. This is with jupyterhub 2.3.1, kubespawner f3d44e6.

sgibson91 (Member Author)

> I'm not sure if it's relevant, but the tag of the k8s-hub base image we define in our Dockerfile is behind that defined in the helm chart (I'm guessing these are the same thing based on comments?):
>
> So maybe our assumptions are wrong because we have a version mismatch?

As I mentioned here, I'm not sure we can be confident about which versions we are running, since the base image we build our hub image from has a different tag to the z2jh chart in our helm charts. I'm not sure how the two are linked, but there is a comment in our Chart.yaml that the base image tag needs bumping along with tag bumps in the helm chart.

minrk (Member) commented Aug 18, 2022

One way to find out:

kubectl exec -it hub-... -- pip list

sgibson91 (Member Author)

This is taken from the staging hub running on our main 2i2c cluster (so I'm confident it isn't using its own custom hub image). It seems we are a little ahead of the jupyterhub version we first surmised, but the kubespawner version is what we guessed.

$ k -n staging exec -it hub-7d6c86fcff-xhwh2 -c hub -- pip list
Package                          Version
-------------------------------- -----------
[CUT]
jupyterhub                       2.3.2.dev0
[CUT]
jupyterhub-kubespawner           4.1.0
[CUT]

And this is taken from the hub that last saw this issue (it's running on the same cluster, different namespace). It confirms the same versions.

$ k -n ohw exec -it hub-7cbc97f77f-br9jf -c hub -- pip list
Package                          Version
-------------------------------- -----------
[CUT]
jupyterhub                       2.3.2.dev0
[CUT]
jupyterhub-kubespawner           4.1.0
[CUT]

minrk (Member) commented Aug 18, 2022

OK, then it's got every fix I'm aware of. Is it possible to capture more detailed logs from the hub? Maybe even with debug-logging on? The tricky part is that there could be a long time between when the problem occurs and when it's observed, so a lot of logs might be needed to track it down.
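
For reference, a minimal sketch of what such a log capture could look like, assuming a standard z2jh layout where the hub runs as a Deployment named hub with a container also named hub (the namespace and time window are placeholders):

# Dump a day of hub logs, with timestamps, to a file for offline sifting
kubectl -n <namespace> logs deploy/hub -c hub --since=24h --timestamps > hub.log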

minrk (Member) commented Aug 18, 2022

My hunch is that the switch to kubernetes-asyncio away from threads for the reflectors has changed something in how these failures are observed.

But I'm pretty stumped about how the 'halting Hub' event is firing while the container is not actually exiting, even though it shows all the statements that lead up to a successful exit in my testing.

sgibson91 (Member Author)

> Is it possible to capture more detailed logs from the hub? Maybe even with debug-logging on?

I will confer with Yuvi as to whether this is possible for us.

consideRatio (Member)

This happened to the staging hub as well, right? I figure a reasonable choice is to enable debug logging for the staging hubs specifically.

You can do it via z2jh config, I think: debug=true.
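
For reference, a minimal sketch of what that could look like, assuming the hubs are deployed with helm and the z2jh chart repo is registered as jupyterhub/jupyterhub (release and namespace are placeholders):

# Turn on JupyterHub debug logging via the chart's debug.enabled value
helm upgrade <release> jupyterhub/jupyterhub \
  --namespace <namespace> --reuse-values \
  --set debug.enabled=true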

sgibson91 (Member Author)

> This happened to the staging hub as well, right?

Yeah, when it happens, usually all hubs on the cluster go down because it's caused by the k8s master outage. So I figure we enable it on staging and maybe one or two others as a fallback?

consideRatio (Member)

@sgibson91 I'm not confident about whether, and for how long, 2i2c retains logs, or how much data would accumulate, so I'd give a 👍 only for the staging hubs for now. But if you feel at ease with that aspect, then 👍 also for a production hub if you think it can be useful generally.

sgibson91 (Member Author)

I discovered we have a logs backup in GCP! So I think retention and space are good, but I will double check.

consideRatio (Member)

@sgibson91 nice!

I've been part of incidents where logs were produced in such large amounts that ingesting and storing them incurred a very significant cost. Sometimes more logging is worse: it can incur costs like that, and logs with too many lines of information can be harder to read through.

minrk (Member) commented Aug 19, 2022

It's entirely possible regular logs over an extended period are sufficient, if you still have them surrounding any recent incidents. The main thing is capturing quite a bit of time to sift through.

sgibson91 (Member Author)

We just had another incident about an hour ago, so I'm looking through the logs now.

sgibson91 (Member Author) commented Aug 25, 2022

[logs source]

So I can see a user server pending stop for a long time

Screenshot 2022-08-25 at 18 57 07

A server delete request is made that triggers a message about a stop already being active

Screenshot 2022-08-25 at 18 59 14

Then a hub shutdown, with complaints about leaving single-user servers running and not being able to clean up the proxy. This is when we start to see the asyncio/tornado errors that were reported in other issues.

Screenshot 2022-08-25 at 19 00 42

abkfenris

gmanuch is the user who reported the issue. There should be a request from me to try to start a server, to test whether this was the same as last time, before I triggered the hub shutdown to try to clear it.

sgibson91 (Member Author) commented Aug 25, 2022

I've scrolled through the logs from 11:30 UTC up to 17:00 UTC. Everything looks normal until this 303 GET to gmanuch's lab endpoint comes in; then we start seeing the message that that server is pending stop. As far as I can tell, it's the only call to /lab in that time frame. @abkfenris, do you know whether you have any JupyterLab modifications in your image?

Screenshot 2022-08-25 at 19 30 03

I see you have the default URL set to lab for the Python image:

https://github.com/2i2c-org/infrastructure/blob/8c8e37b012a4c42f916df767b47db1b98d91190c/config/clusters/2i2c/ohw.values.yaml#L50

abkfenris

For our Python environment (lockfile), we should be getting JupyterLab via pangeo-notebook=2022.07.12, and the only packages that I can think of that would directly change things would be jupyter_contrib_nbextensions and jupyterlab-git. We shouldn't be doing any other modifications during our build.

GeorgianaElena (Member)

Small note that I've also spotted a [W 2022-08-25 16:36:58.665 JupyterHub base:393] Invalid or expired cookie token a couple of minutes after the first "is pending stop" message, which is coming from here.

Maybe the issue is an inconsistent state because of the following:

  • Because the cookie got invalidated, the user object cannot be retrieved by user_for_cookie, and so neither can its server, which may never have received the stop request.
  • But the SpawnPendingHandler registered the stop request for that particular user and knows it has to wait for it to finish, which will never happen, since the user's access got invalidated and the hub doesn't know how to trace the request back to the correct user and server.

minrk (Member) commented Aug 26, 2022

> complaints about leaving single-user servers running

Sorry about this, it's not complaining. This is normal behavior, chosen by z2jh configuration (as opposed to shutting down user servers when the hub shuts down).

I don't have access to the logs in context, but there should at least be an event about triggering the initial stop prior to 16:34:46. Are there any logs about the username leading up to this?

I think it's very unlikely that credentials are the issue, because the SpawnPendingHandler won't do anything if the credentials aren't up to date. Pending state is stored on the Spawner object itself, not related to authentication of any given request, and doesn't get registered until the stop has actually begun here.

But is this the same as the original issue? One user being stuck definitely could be caused by the same issue as the original (reflector problems, general cluster communication problems), but it could also be a different problem related to one call to KubeSpawner.stop never returning, or taking a long time. The timeout for this method is long (at least 5 minutes, I think, depending on configuration), so a minute or two may not mean anything is wrong, just slow.
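
For what it's worth, one hedged way to distinguish those cases from outside the hub, assuming kubespawner's default pod naming of jupyter-<username> (namespace and username are placeholders): check whether the user pod still exists and whether a delete was ever issued while the hub keeps reporting "pending stop".

# Does the pod still exist, and in what phase is it?
kubectl -n <namespace> get pod jupyter-<username> -o wide
# If it exists, was a delete ever issued? Empty output means no deletionTimestamp was set.
kubectl -n <namespace> get pod jupyter-<username> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'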

GeorgianaElena (Member)

> I don't have access to the logs in context, but there should at least be an event about triggering the initial stop prior to 16:34:46. Are there any logs about the username leading up to this?

Searching for the username (starting at 14:53, the earliest I checked), there are only logs like the one below every 5 minutes:

[I 2022-08-25 14:53:33.732 JupyterHub log:189] 200 POST /hub/api/users/gmanuch/activity (gmanuch@10.0.41.10) 13.27ms

And then the 303 that's in the logs shared by @sgibson91:

[I 2022-08-25 16:34:46.423 JupyterHub log:189] 303 GET /hub/user/gmanuch/lab (gmanuch@10.128.0.29) 12.86ms
[I 2022-08-25 16:34:46.486 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:34:46.489 JupyterHub log:189] 200 GET /hub/spawn-pending/gmanuch?next=%2Fhub%2Fuser%2Fgmanuch%2Flab (gmanuch@10.128.0.29) 5.66ms
[I 2022-08-25 16:34:52.043 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:34:57.207 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:35:02.358 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:35:07.524 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:35:12.680 JupyterHub pages:401] gmanuch is pending stop
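
For reference, a hedged sketch of how such a per-user search over a time window could be run against Cloud Logging on GKE, assuming the hub container's log lines land in textPayload (the project and time bounds are placeholders):

# Pull hub log lines mentioning the user over the window of interest
gcloud logging read \
  'resource.type="k8s_container"
   resource.labels.namespace_name="ohw"
   resource.labels.container_name="hub"
   textPayload:"gmanuch"
   timestamp>="2022-08-25T14:00:00Z" AND timestamp<="2022-08-25T17:00:00Z"' \
  --project <gcp-project> --format='value(textPayload)'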

sgibson91 (Member Author)

@minrk I invited you to the project with Logs Viewer role (I'm hoping that's enough for you to do what you need to, but let me know if it's not), so this link should work for you now.

@GeorgianaElena is right though; I checked from 11:00 UTC and only found calls to the activity endpoint up until that 303 to /lab.

minrk (Member) commented Aug 26, 2022

Why this stuff is hard: the relevant error occurred at 2022-08-24 20:00 UTC, while the issue was observed at 2022-08-25 16:35 UTC, fully 20 hours later.

sgibson91 (Member Author)

Thank you @minrk 🙏

minrk (Member) commented Aug 26, 2022

Also relevant is a longstanding issue in JupyterHub: jupyterhub/jupyterhub#1677. We don't have timeouts on Spawner.stop because it's unclear what the state should be when stop times out or fails. The fact that these user.stop() calls take literally forever produces symptoms that are very hard to identify, which could have been clearer if timeouts were being logged. Plus, that would at least mean you can't be stuck in 'pending' forever (though what state you should be in is unclear).
