Resilience needed against k8s master outages #627

Closed · sgibson91 opened this issue Aug 12, 2022 · 30 comments · Fixed by #635

sgibson91 (Member) commented Aug 12, 2022

Bug description

During a Kubernetes master outage, a few failed API calls leave JupyterHub in an unrecoverable state. This causes user servers to fail to spin up.

Expected behaviour

JupyterHub correctly recovers itself after failed API calls to the k8s master.

Actual behaviour

JupyterHub is left in an unrecoverable state and user servers cannot start.

Some logs are available in the following issues:

How to reproduce

This is an intermittent issue we are seeing on cloud-based deployments.

Your personal set up

We are running z2jh on GCP clusters. We see this issue most frequently on non-regional clusters. Our regional clusters do not experience this as often, if at all, because the k8s master is highly available for those clusters.

You can find more information about how we deploy our hubs here: https://github.com/2i2c-org/infrastructure

Related issues

Versions

  • jupyterhub 2.3.1
  • kubespawner 4.1.0

Established in this comment: #627 (comment)

sgibson91 added the bug label Aug 12, 2022

consideRatio (Member) commented Aug 12, 2022

@sgibson91 can you track down what JupyterHub / KubeSpawner version was used when this was observed?

Sorry, I could have done that myself. From this and the associated z2jh hub image, with requirements.txt frozen here, I conclude that jupyterhub is at version 2.3.1 and kubespawner at 4.1.0. These are the latest stable versions of both jupyterhub and kubespawner, at least assuming that the docker image used by 2i2c was rebuilt when bumping the helm chart version.

sgibson91 (Member Author) commented Aug 15, 2022

Hence I believe it is safe to assume that the docker image was rebuilt when, and after, bumping the helm chart version.

I'm not sure if it's relevant, but the tag of the k8s-hub base image we define in our Dockerfile is behind that defined in the helm chart (I'm guessing these are the same thing based on comments?):

So maybe our assumptions are wrong because we have a version mismatch?

minrk (Member) commented Aug 17, 2022

Are you still seeing the 'halting hub...' message, but the hub does not halt? That's how kubespawner is meant to recover from things like this (when in doubt: restart). So if we're back in failing-to-stop territory (we may have reintroduced a form of #525 after the switch to kubernetes-asyncio), this may be a KubeSpawner or JupyterHub bug.

sgibson91 (Member Author) commented Aug 17, 2022

> Are you still seeing the 'halting hub...' message, but the hub does not halt?

Yes, we have to manually restart the hub.

EDIT: Actually the last time we saw this, we saw the "stopping your server" message but the server never stopped. So it might not quite be the same thing.

minrk (Member) commented Aug 17, 2022

OK, if you're seeing the 'halting hub...' message, that means kubespawner has recognized something's wrong, decided to do the right thing that will fix it (confirmed by the fact that a manual shutdown of the hub does fix it), and then it somehow didn't quite happen.

Unfortunately, when I try to provoke this manually by inserting explicit failures, I get the exact messages @yuvipanda mentions here:

[C 2022-08-17 15:12:43.350 JupyterHub spawner:2381] Pods reflector failed, halting Hub.
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending name='Task-15' coro=<shared_client.<locals>.close_client_task() running at /Users/minrk/dev/jpy/hub/kubespawner/kubespawner/clients.py:58> wait_for=<Future pending cb=[Task.task_wakeup()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x111cb6490>
RuntimeError: coroutine ignored GeneratorExit

but then the Hub exits. This is with jupyterhub 2.3.1, kubespawner f3d44e6.

sgibson91 (Member Author)

> I'm not sure if it's relevant, but the tag of the k8s-hub base image we define in our Dockerfile is behind that defined in the helm chart (I'm guessing these are the same thing based on comments?):
>
> So maybe our assumptions are wrong because we have a version mismatch?

As I mentioned here, I'm not sure we can be confident about which versions we are running, since the base image we build our hub image from has a different tag to the z2jh chart in our helm charts. I'm not sure how the two are linked, but there is a comment in our Chart.yaml that the base image tag needs bumping along with tag bumps in the helm chart.

minrk (Member) commented Aug 18, 2022

One way to find out:

kubectl exec -it hub-... -- pip list

sgibson91 (Member Author)

This is taken from the staging hub running on our main 2i2c cluster (so I'm confident it isn't using its own custom hub image). It seems we are a little ahead of the jupyterhub version we first surmised, but the kubespawner version is what we guessed.

$ k -n staging exec -it hub-7d6c86fcff-xhwh2 -c hub -- pip list
Package                          Version
-------------------------------- -----------
[CUT]
jupyterhub                       2.3.2.dev0
[CUT]
jupyterhub-kubespawner           4.1.0
[CUT]

And this is taken from the hub that last saw this issue (it's running on the same cluster, different namespace). It confirms the same versions.

$ k -n ohw exec -it hub-7cbc97f77f-br9jf -c hub -- pip list
Package                          Version
-------------------------------- -----------
[CUT]
jupyterhub                       2.3.2.dev0
[CUT]
jupyterhub-kubespawner           4.1.0
[CUT]

minrk (Member) commented Aug 18, 2022

OK, then it's got every fix I'm aware of. Is it possible to capture more detailed logs from the hub? Maybe even with debug-logging on? The tricky part is that there could be a long time between when the problem occurs and when it's observed, so a lot of logs might be needed to track it down.
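
For reference, a minimal sketch of what such a log capture could look like, assuming a standard z2jh layout where the hub runs as a Deployment named hub with a container also named hub (the namespace and time window are placeholders):

# Dump a day of hub logs, with timestamps, to a file for offline sifting
kubectl -n <namespace> logs deploy/hub -c hub --since=24h --timestamps > hub.log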

minrk (Member) commented Aug 18, 2022

My hunch is that the switch to kubernetes-asyncio away from threads for the reflectors has changed something in how these failures are observed.

But I'm pretty stumped about how the 'halting Hub' event is firing while the container is not actually exiting, even though it shows all the statements that lead up to a successful exit in my testing.

sgibson91 (Member Author)

> Is it possible to capture more detailed logs from the hub? Maybe even with debug-logging on?

I will confer with Yuvi as to whether this is possible for us.

consideRatio (Member)

This happened to the staging hub as well, right? I figure a reasonable choice is to enable debug logging for the staging hubs specifically.

You can do it via z2jh config, I think: debug=true.
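
For reference, a minimal sketch of what that could look like, assuming the hubs are deployed with helm and the z2jh chart repo is registered as jupyterhub/jupyterhub (release and namespace are placeholders):

# Turn on JupyterHub debug logging via the chart's debug.enabled value
helm upgrade <release> jupyterhub/jupyterhub \
  --namespace <namespace> --reuse-values \
  --set debug.enabled=true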

sgibson91 (Member Author)

> This happened to the staging hub as well, right?

Yeah, when it happens, usually all hubs on the cluster go down because it's caused by the k8s master outage. So I figure we enable it on staging and maybe one or two others as a fallback?

consideRatio (Member)

@sgibson91 I'm not confident about whether, and for how long, 2i2c retains logs, or how much data would accumulate, so I'd give a 👍 only for the staging hubs for now. But if you feel at ease with that aspect, then 👍 also for a production hub if you think it can be useful generally.

sgibson91 (Member Author)

I discovered we have a logs backup in GCP! So I think retention and space are good, but I will double check.

consideRatio (Member)

@sgibson91 nice!

I've been part of incidents where logs were produced in such large amounts that ingesting and storing them incurred a very significant cost. Sometimes more logging is worse: it can incur costs like that, and logs with too many lines of information can be harder to read through.

minrk (Member) commented Aug 19, 2022

It's entirely possible regular logs over an extended period are sufficient, if you still have them surrounding any recent incidents. The main thing is capturing quite a bit of time to sift through.

sgibson91 (Member Author)

We just had another incident about an hour ago, so I'm looking through the logs now.

sgibson91 (Member Author) commented Aug 25, 2022

[logs source]

So I can see a user server pending stop for a long time

Screenshot 2022-08-25 at 18 57 07

A server delete request is made that triggers a message about a stop already being active

Screenshot 2022-08-25 at 18 59 14

Then a hub shutdown, with complaints about leaving single-user servers running and not being able to clean up the proxy. This is when we start to see the asyncio/tornado errors that were reported in other issues.

Screenshot 2022-08-25 at 19 00 42

abkfenris

gmanuch is the user who reported the issue. There should be a request from me to try to start a server, to test whether this was the same as last time, before I triggered the hub shutdown to try to clear it.

sgibson91 (Member Author) commented Aug 25, 2022

I've scrolled through the logs from 11:30 UTC up to 17:00 UTC. Everything looks normal until this 303 GET to gmanuch's lab endpoint comes in; then we start seeing the message that that server is pending stop. As far as I can tell, it's the only call to /lab in that time frame. @abkfenris, do you know whether you have any JupyterLab modifications in your image?

Screenshot 2022-08-25 at 19 30 03

I see you have the default URL set to lab for the Python image:

https://github.com/2i2c-org/infrastructure/blob/8c8e37b012a4c42f916df767b47db1b98d91190c/config/clusters/2i2c/ohw.values.yaml#L50

abkfenris

For our Python environment (lockfile), we should be getting JupyterLab via pangeo-notebook=2022.07.12, and the only packages that I can think of that would directly change things would be jupyter_contrib_nbextensions and jupyterlab-git. We shouldn't be doing any other modifications during our build.

GeorgianaElena (Member)

Small note that I've also spotted a [W 2022-08-25 16:36:58.665 JupyterHub base:393] Invalid or expired cookie token a couple of minutes after the first "is pending stop" message, which is coming from here.

Maybe the issue is an inconsistent state because of the following:

  • Because the cookie got invalidated, the user object cannot be retrieved by user_for_cookie, and so neither can its server, which may never have received the stop request.
  • But the SpawnPendingHandler registered the stop request for that particular user and knows it has to wait for it to finish, which will never happen, since the user's access got invalidated and the hub doesn't know how to trace the request back to the correct user and server.

minrk (Member) commented Aug 26, 2022

> complaints about leaving single-user servers running

Sorry about this, it's not complaining. This is normal behavior, chosen by z2jh configuration (as opposed to shutting down user servers when the hub shuts down).

I don't have access to the logs in context, but there should at least be an event about triggering the initial stop prior to 16:34:46. Are there any logs about the username leading up to this?

I think it's very unlikely that credentials are the issue, because the SpawnPendingHandler won't do anything if the credentials aren't up to date. Pending state is stored on the Spawner object itself, not related to authentication of any given request, and doesn't get registered until the stop has actually begun here.

But is this the same as the original issue? One user being stuck definitely could be caused by the same issue as the original (reflector problems, general cluster communication problems), but it could also be a different problem related to one call to KubeSpawner.stop never returning, or taking a long time. The timeout for this method is long (at least 5 minutes, I think, depending on configuration), so a minute or two may not mean anything is wrong, just slow.
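
For what it's worth, one hedged way to distinguish those cases from outside the hub, assuming kubespawner's default pod naming of jupyter-<username> (namespace and username are placeholders): check whether the user pod still exists and whether a delete was ever issued while the hub keeps reporting "pending stop".

# Does the pod still exist, and in what phase is it?
kubectl -n <namespace> get pod jupyter-<username> -o wide
# If it exists, was a delete ever issued? Empty output means no deletionTimestamp was set.
kubectl -n <namespace> get pod jupyter-<username> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'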

GeorgianaElena (Member)

> I don't have access to the logs in context, but there should at least be an event about triggering the initial stop prior to 16:34:46. Are there any logs about the username leading up to this?

Searching for the username (starting at 14:53, the earliest I checked), there are only logs like the one below every 5 minutes:

[I 2022-08-25 14:53:33.732 JupyterHub log:189] 200 POST /hub/api/users/gmanuch/activity (gmanuch@10.0.41.10) 13.27ms

And then the 303 that's in the logs shared by @sgibson91:

[I 2022-08-25 16:34:46.423 JupyterHub log:189] 303 GET /hub/user/gmanuch/lab (gmanuch@10.128.0.29) 12.86ms
[I 2022-08-25 16:34:46.486 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:34:46.489 JupyterHub log:189] 200 GET /hub/spawn-pending/gmanuch?next=%2Fhub%2Fuser%2Fgmanuch%2Flab (gmanuch@10.128.0.29) 5.66ms
[I 2022-08-25 16:34:52.043 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:34:57.207 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:35:02.358 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:35:07.524 JupyterHub pages:401] gmanuch is pending stop
[I 2022-08-25 16:35:12.680 JupyterHub pages:401] gmanuch is pending stop
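
For reference, a hedged sketch of how such a per-user search over a time window could be run against Cloud Logging on GKE, assuming the hub container's log lines land in textPayload (the project and time bounds are placeholders):

# Pull hub log lines mentioning the user over the window of interest
gcloud logging read \
  'resource.type="k8s_container"
   resource.labels.namespace_name="ohw"
   resource.labels.container_name="hub"
   textPayload:"gmanuch"
   timestamp>="2022-08-25T14:00:00Z" AND timestamp<="2022-08-25T17:00:00Z"' \
  --project <gcp-project> --format='value(textPayload)'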

sgibson91 (Member Author)

@minrk I invited you to the project with Logs Viewer role (I'm hoping that's enough for you to do what you need to, but let me know if it's not), so this link should work for you now.

@GeorgianaElena is right though; I checked from 11:00 UTC and only found calls to the activity endpoint up until that 303 to /lab.

minrk (Member) commented Aug 26, 2022

Why this stuff is hard: the relevant error occurred at 2022-08-24 20:00 UTC, while the issue was observed at 2022-08-25 16:35 UTC, fully 20 hours later.

sgibson91 (Member Author)

Thank you @minrk 🙏

minrk (Member) commented Aug 26, 2022

Also relevant is a longstanding issue in JupyterHub: jupyterhub/jupyterhub#1677. We don't have timeouts on Spawner.stop because it's unclear what the state should be when stop times out or fails. The fact that these user.stop() calls take literally forever produces symptoms that are very hard to identify, which could have been clearer if timeouts were being logged. Plus, that would at least mean you can't be stuck in 'pending' forever (though what state you should be in is unclear).
