[Incident] OceanHackWeek hub cannot start server #1616

sgibson91 · 2022-08-10T14:47:16Z

Summary

OceanHackWeek hub cannot start a server. Reported in https://2i2c.freshdesk.com/a/tickets/172

Impact on users

Hub unusable as no one can start a server.

Important information

Hub URL: oceanhackweek.2i2c.cloud
Support ticket ref: https://2i2c.freshdesk.com/a/tickets/172

Tasks and updates

Discuss and address incident, leaving comments below with updates
Incident has been dealt with or is over
Copy/paste the after-action report below and fill in relevant sections
Incident title is discoverable and accurate
All actionable items in report have linked GitHub Issues

After-action report template

# After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

## Timeline

_A short list of dates / times and major updates, with links to relevant comments in the issue for more context._

All times in {{ most convenient timezone }}.

- {{ yyyy-mm-dd }} - [Summary of first update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of another update](link to comment)
- {{ yyyy-mm-dd }} - [Summary of final update](link to comment)


## What went wrong

_Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items._

- Thing one
- Thing two

## Where we got lucky

_These are good things that happened to us but not because we had planned for them._

- Thing one
- Thing two

## Follow-up actions

_Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in `infrastructure/`, they can be in other repositories._

### Process improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Documentation improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]

### Technical improvements

1. {{ summary }} [link to github issue]
2. {{ summary }} [link to github issue]


sgibson91 · 2022-08-10T14:54:51Z

Problem

Requesting a server only results in the "Server Requested" log and does not proceed. Alex reported in the ticket that this eventually changes to "Your server is stopping, you will be able to start it again once it has finished stopping."

Recreation

I logged in and tried to start a server myself and also got the "Server Requested" log and no further. I have not yet seen the "Your server is stopping..." message.

I activated kubectl access to the hub and looked at the events of my pod with kubectl describe pod jupyter-sgibson91 which revealed:

Events:
  Type    Reason     Age   From                                   Message
  ----    ------     ----  ----                                   -------
  Normal  Scheduled  17m   gke.io/optimize-utilization-scheduler  Successfully assigned ohw/jupyter-sgibson91 to gke-pilot-hubs-cluster-nb-ohw-e4ba9924-lj4s
  Normal  Pulled     17m   kubelet                                Container image "ghcr.io/oceanhackweek/python:d98b914" already present on machine
  Normal  Created    17m   kubelet                                Created container notebook
  Normal  Started    17m   kubelet                                Started container notebook

This indicates to me that the server started successfully, but none of the logs were streamed to the spawning page and the redirection to the server did not occur.

I tried stopping my server from the UI and then deleting the pod. This is when I saw the "Your server is stopping..." message, and I appeared to be stuck there.

All other pods in the namespace are running:

NAME                                           READY   STATUS    RESTARTS   AGE
api-ohw-dask-gateway-b45486c7b-62m9j           1/1     Running   0          12d
continuous-image-puller-44pjd                  1/1     Running   0          47h
controller-ohw-dask-gateway-7f5dd9b6dd-ln9m7   1/1     Running   0          48d
hub-86b885f486-45pms                           2/2     Running   1          5d1h
jupyter-abkfenris                              1/1     Running   0          4h7m
jupyter-almacarolina                           1/1     Running   0          46h
jupyter-anujjain2579                           1/1     Running   0          10h
jupyter-clairedavies                           1/1     Running   0          8h
jupyter-gmanuch                                1/1     Running   0          17h
jupyter-jinjintwice                            1/1     Running   0          9h
jupyter-leonardolaiolo                         1/1     Running   0          8h
jupyter-noraloose                              1/1     Running   0          19m
proxy-5bc58b774d-nmx9j                         1/1     Running   0          34d
traefik-ohw-dask-gateway-694d9776f6-48hxg      1/1     Running   0          40d

sgibson91 · 2022-08-10T14:57:41Z

I tried restarting the hub pod (by deleting it) and that has resolved the issue: I can now start a server. I'm not entirely sure what happened, though.

abkfenris · 2022-08-10T15:06:04Z

Do the events from the hub pod show any issues from probe failures?

sgibson91 · 2022-08-10T15:10:33Z

Unfortunately I don't know how to retrieve logs from before the restart. I tried the following with no luck:

❯ k logs -c hub hub-86b885f486-mwnqt --previous
Error from server (BadRequest): previous terminated container "hub" in pod "hub-86b885f486-mwnqt" not found
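A note on why `--previous` returned nothing here: that flag reads the log of an earlier container instance inside the same pod. Deleting the hub pod made the Deployment create a new pod (hub-86b885f486-mwnqt) that has never restarted, so there is no previous container to read from. Below is a minimal sketch of capturing diagnostics before a destructive restart, assuming kubectl is already authenticated against the cluster. The function name and output file names are hypothetical; only the kubectl subcommands and flags are real.

```python
import shlex
from typing import List, Tuple


def diagnostic_commands(namespace: str, pod: str, container: str) -> List[Tuple[str, List[str]]]:
    """Return (output file name, kubectl argv) pairs to run BEFORE deleting a pod.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    return [
        # Current container log, with timestamps so it can be lined up with user reports.
        (f"{pod}-{container}.log",
         ["kubectl", "logs", pod, "-c", container, "-n", namespace, "--timestamps"]),
        # Pod description, which includes recent events such as probe failures.
        (f"{pod}-describe.txt",
         ["kubectl", "describe", "pod", pod, "-n", namespace]),
        # Namespace-wide events, sorted by their last timestamp.
        (f"{namespace}-events.txt",
         ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"]),
    ]


if __name__ == "__main__":
    # Print shell lines that could be pasted into a terminal or an issue comment.
    for filename, argv in diagnostic_commands("ohw", "hub-86b885f486-45pms", "hub"):
        print(f"{shlex.join(argv)} > {filename}")
```

Saving the output to files first means the logs can be attached to the incident issue even after the old pod is gone.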

abkfenris · 2022-08-10T15:15:28Z

I believe k describe pod hub-86b885f486-mwnqt would show probe events, but it looks like the JupyterHub healthcheck that the probes hit is pretty shallow, so they may not be the most effective at catching when a hub pod has an issue.

https://github.com/jupyterhub/zero-to-jupyterhub-k8s/blob/548111f1b7d716b744c3abaa414b76c03b7eeed9/jupyterhub/templates/hub/deployment.yaml#L222-L229

https://github.com/jupyterhub/jupyterhub/blob/3b59c4861f155f868bcf29c00dfa78034d289950/jupyterhub/handlers/pages.py#L584-L592

Do you ship the logs anywhere?

sgibson91 · 2022-08-10T15:19:44Z

After-action report

These sections should be filled out once we've resolved the incident and know what happened.
They should focus on the knowledge we've gained and any improvements we should take.

Timeline

A short list of dates / times and major updates, with links to relevant comments in the issue for more context.

All times in BST (UTC+1).

2022-08-10 15:54 - Incident reported and initial investigation
2022-08-10 15:57 - Incident resolved

What went wrong

Things that could have gone better. Ideally these should result in concrete
action items that have GitHub issues created for them and linked to under
Action items.

- It was initially not clear what technically went wrong. Update: This was an outage of the k8s master on the 2i2c cluster
- Engineer didn't collect all relevant logs

Where we got lucky

These are good things that happened to us but not because we had planned for them.

The random decision to restart the pod worked

Follow-up actions

Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in infrastructure/, they can be in other repositories.

Process improvements

A checklist of the "top 5" things to try when faced with an incident would help a lot. It also needs steps that ensure any logs are preserved in an issue before destructive actions are taken and the logs are lost
- Create a task checklist for responding to incidents #1617

Documentation improvements

{{ summary }} [link to github issue]
{{ summary }} [link to github issue]

Technical improvements

We need to make the 2i2c cluster regional to prevent these kinds of outages
- Move pilot-hubs cluster to a regional k8s cluster for better availability #1102

sgibson91 · 2022-08-10T15:26:43Z

> Do you ship the logs anywhere?

Not to my knowledge

abkfenris · 2022-08-10T15:34:55Z

> > Do you ship the logs anywhere?
>
> Not to my knowledge

Dang. GCP or whoever may have caught the logs anyway if you poke around in the console. Loki also collects them; I'm not sure if you have that set up as part of your Grafana.

sgibson91 · 2022-08-10T16:35:26Z

Yuvi has just informed me that we do indeed have logs in the GCP console! They're a bit hard to read, but there are a lot of CancelledErrors stemming from asyncio from before the incident report.

abkfenris · 2022-08-10T16:38:35Z

Try looking back before 7 or so Eastern. That's when I first had issues.

abkfenris · 2022-08-10T16:41:23Z

Does the hub pod share the NFS mount with the users? Could it have been affected by the same space issue as we hit yesterday, and we just didn't have anyone else try to launch a server after that fix rolled out?

sgibson91 · 2022-08-10T16:43:54Z

@abkfenris We suspect this is an issue with the availability of the k8s master. We are seeing spawning processes being cancelled, which should not be related to the NFS. The 2i2c cluster is not regional, so the k8s master is not highly available; we see issues like this occasionally, and the fix is to restart the hub pod. We have an issue to move the cluster to a regional one, but it would be a destructive process, and we need to coordinate appropriate downtime with everyone who has a hub running on this cluster: #1102

yuvipanda · 2022-08-10T18:22:47Z

I found these logs in another hub that had the same symptoms at the same time:

client_session: <aiohttp.client.ClientSession object at 0x7f76cb1a4790>
[E 2022-08-10 04:55:25.413 JupyterHub reflector:351] Watching resources never recovered, giving up
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 285, in _watch_and_update
        resource_version = await self._list_and_update()
      File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 228, in _list_and_update
        initial_resources_raw = await list_method(**kwargs)
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
        response_data = await self.request(
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
        return (await self.request("GET", url,
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
        r = await self.pool_manager.request(**args)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/client.py", line 535, in _request
        conn = await self._connector.connect(
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
        proto = await self._create_connection(req, traces, timeout)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
        _, proto = await self._create_direct_connection(req, traces, timeout)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
        raise last_exc
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
        transp, proto = await self._wrap_create_connection(
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
        raise client_error(req.connection_key, exc) from exc
    aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.3.240.1:443 ssl:default [Connect call failed ('10.3.240.1', 443)]
    
[C 2022-08-10 04:55:25.414 JupyterHub spawner:2326] Pods reflector failed, halting Hub.
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<shared_client.<locals>.close_client_task() running at /usr/local/lib/python3.9/site-packages/kubespawner/clients.py:58> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f76cb113280>()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x7f76d0662dc0>
RuntimeError: coroutine ignored GeneratorExit

This is definitely due to the GKE master having a hiccup. It usually recovers shortly, but it looks like the jupyterhub process doesn't :( We should definitely report this upstream.
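For spotting this failure mode in exported logs, here is a minimal sketch that scans hub log lines for the error and critical reflector messages shown in the excerpt above. It assumes the log prefix format seen in that excerpt (`[C <date> <time> JupyterHub <module>:<line>]`). The function name and the regular expression are illustrative and are not part of kubespawner or JupyterHub.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g. "[C 2022-08-10 04:55:25.414 JupyterHub spawner:2326] message"
LOG_LINE = re.compile(
    r"^\[(?P<level>[DIWEC]) "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"JupyterHub (?P<source>[^\]]+)\] (?P<message>.*)$"
)


def find_reflector_failures(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, message) for error/critical lines that involve the reflector."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # traceback lines and other non-prefixed output
        if match["level"] not in ("E", "C"):
            continue
        # The word can appear in the source module or in the message itself.
        haystack = f"{match['source']} {match['message']}".lower()
        if "reflector" in haystack:
            hits.append((match["timestamp"], match["message"]))
    return hits


if __name__ == "__main__":
    sample = [
        "[E 2022-08-10 04:55:25.413 JupyterHub reflector:351] Watching resources never recovered, giving up",
        "    Traceback (most recent call last):",
        "[C 2022-08-10 04:55:25.414 JupyterHub spawner:2326] Pods reflector failed, halting Hub.",
    ]
    for timestamp, message in find_reflector_failures(sample):
        print(timestamp, message)
```

A match on "Pods reflector failed, halting Hub." is the signal that the hub process needs a restart rather than the user pod.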

yuvipanda · 2022-08-10T18:24:10Z

The hub also has a 'shut down' button in the admin panel that would also fix this specific problem, where you just see 'server requested' and nothing happens.

abkfenris · 2022-08-10T19:58:54Z

> The hub also has a 'shut down' button in the admin panel that would also fix this specific problem, where you just see 'server requested' and nothing happens.

Do you mean 'Shutdown Hub' or 'Stop All' or the per user 'Stop Server'? I had tried 'Stop server' on my own server when it was in that state.

sgibson91 · 2022-08-10T20:16:04Z

The "Shutdown Hub" button will restart the hub.
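For anyone who wants to trigger the same restart without the admin page, here is a minimal sketch that builds (but does not send) a request to the Hub REST API. It assumes the shutdown endpoint lives at `/hub/api/shutdown`, accepts `proxy` and `servers` flags in a JSON body, and that the token belongs to an admin; please check these against the JupyterHub REST API documentation for your version. The function name is hypothetical.

```python
import json
import urllib.request


def build_shutdown_request(hub_url: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the Hub's shutdown endpoint.

    Assumption: the endpoint path and payload fields match the JupyterHub REST API.
    """
    # Ask for the proxy and user servers to keep running so only the hub process exits.
    payload = json.dumps({"proxy": False, "servers": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{hub_url.rstrip('/')}/hub/api/shutdown",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"token {api_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_shutdown_request("https://oceanhackweek.2i2c.cloud", "<admin-api-token>")
    print(request.get_method(), request.full_url)
    # Sending is deliberately left out; urllib.request.urlopen(request) would perform the call.
```

Once the hub process exits, Kubernetes replaces the pod, which is the same effect as deleting the pod by hand.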

sgibson91 · 2022-08-11T09:04:32Z

FYI, location of the logs in GCP: https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_container%22%0Aresource.labels.container_name%3D%22hub%22%0Aresource.labels.namespace_name%3D%22ohw%22;timeRange=2022-08-10T14:22:12.000Z%2F2022-08-10T15:30:12.000Z;cursorTimestamp=2022-08-10T14:55:01.930889041Z?authuser=1&project=two-eye-two-see
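To point the same query at another hub or time window, here is a minimal sketch that rebuilds a link of this shape. The URL layout is inferred only from the link above, not from documented behaviour of the console, and the function name is hypothetical.

```python
from urllib.parse import quote


def logs_explorer_url(project: str, namespace: str, container: str, start: str, end: str) -> str:
    """Build a Logs Explorer link for one container's logs in one namespace.

    start/end are UTC timestamps such as "2022-08-10T14:22:12.000Z".
    """
    # One filter clause per line, as typed into the Logs Explorer query box.
    log_filter = "\n".join([
        'resource.type="k8s_container"',
        f'resource.labels.container_name="{container}"',
        f'resource.labels.namespace_name="{namespace}"',
    ])
    query = quote(log_filter, safe="")
    # Keep colons readable, but encode the slash separating start and end.
    time_range = quote(f"{start}/{end}", safe=":")
    return (
        "https://console.cloud.google.com/logs/query"
        f";query={query};timeRange={time_range}?project={project}"
    )


if __name__ == "__main__":
    print(logs_explorer_url(
        project="two-eye-two-see",
        namespace="ohw",
        container="hub",
        start="2022-08-10T14:22:12.000Z",
        end="2022-08-10T15:30:12.000Z",
    ))
```

Widening the time range to start well before the first user report helps, since the hub may have failed hours before anyone tried to start a server.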

sgibson91 · 2022-08-11T09:19:18Z

I am going to close this issue now because:

The incident is over
The After Incident report has been posted and updated
All items in the After Incident report have links to other issues for tracking

abkfenris · 2022-08-11T15:51:12Z

Just to clarify, the 'Shutdown Hub' button asks the hub pod to terminate itself, so that Kubernetes replaces it?

I ask, as I caused an incident in high school when I found a shutdown & restart button on the compute cluster that I wasn't supposed to have access to. Unsurprisingly I clicked it, then everything went down for hundreds of students and faculty. One screen broken from a classmate putting their fist through it later, it came back up as I thankfully hit restart, but I was hanging out with the tech crew for the rest of the class period making sure that no one else could find the same bug.

sgibson91 · 2022-08-11T17:39:04Z

> Just to clarify, the 'Shutdown Hub' button asks the hub pod to terminate itself, so that Kubernetes replaces it?

Yes, that's correct!

abkfenris · 2022-08-25T16:51:57Z

We just (~12:26 Eastern) had the hub lock up again, but things seemed to recover immediately after hitting the 'Shutdown Hub' button.

sgibson91 added type: Hub Incident labels Aug 10, 2022

damianavila added this to DEPRECATED Engineering and Product Backlog Aug 10, 2022

damianavila assigned sgibson91 and yuvipanda Aug 10, 2022

sgibson91 closed this as completed Aug 11, 2022

sgibson91 moved this to Complete in DEPRECATED Engineering and Product Backlog Aug 11, 2022

sgibson91 mentioned this issue Aug 12, 2022

Resilience needed against k8s master outages jupyterhub/kubespawner#627

Closed

sgibson91 mentioned this issue Aug 31, 2022

Upgrade our hubs to Z2JH 2 / JupyterHub 3.0 #1055

Closed

5 tasks
