[Incident] OceanHackWeek hub cannot start server #1616
Comments
Problem
Requesting a server only produces the "Server Requested" log and does not proceed. Alex reported in the ticket that this eventually changes to "Your server is stopping, you will be able to start it again once it has finished stopping."
Recreation
I logged in and tried to start a server myself, and also got the "Server Requested" log and nothing further. I have not yet seen the "Your server is stopping..." message. I activated kubectl access to the hub and looked at the events of my pod with
Which indicates to me that the server started successfully but none of the logs were streamed to the spawning page, and indeed the redirection to the server did not occur. I tried stopping my server from the UI and then deleting the pod; this is when I saw the "Your server is stopping..." message and appeared to be stuck there. All other pods in the namespace are running:
I tried restarting the hub pod (by deleting it) and that has resolved the issue; I can now start a server. Not entirely sure what happened, though.
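For anyone who needs to repeat the "delete the hub pod" workaround, here is a minimal sketch of the command involved. It only builds the `kubectl` argument list and does not run anything. The `component=hub` selector is an assumption based on the labels the zero-to-jupyterhub Helm chart normally puts on the hub pod, and the namespace `ohw` is a placeholder, not the real namespace from this incident.

```python
import shlex
from typing import List


def hub_restart_command(namespace: str, selector: str = "component=hub") -> List[str]:
    """Build the kubectl argv that deletes the hub pod.

    The hub runs under a Deployment, so Kubernetes recreates the pod
    after it is deleted. The default selector is an assumption based on
    the zero-to-jupyterhub chart's labels; check your own deployment.
    """
    if not namespace:
        raise ValueError("namespace must not be empty")
    return [
        "kubectl",
        "--namespace", namespace,
        "delete", "pod",
        "--selector", selector,
    ]


if __name__ == "__main__":
    # "ohw" is a placeholder namespace used purely for illustration.
    print(shlex.join(hub_restart_command("ohw")))
```

Printing the command rather than executing it leaves the final decision to whoever is on call, since deleting the hub pod briefly interrupts logins and spawns for everyone on that hub.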
Do the events from the hub pod show any issues from probe failures?
Unfortunately I don't know how to retrieve logs from before the restart. I tried the following with no luck:
I believe Do you ship the logs anywhere?
After-action report
These sections should be filled out once we've resolved the incident and know what happened.
Timeline
A short list of dates / times and major updates, with links to relevant comments in the issue for more context. All times in BST (UTC+1).
What went wrong
Things that could have gone better. Ideally these should result in concrete
Where we got lucky
These are good things that happened to us but not because we had planned for them.
Follow-up actions
Every action item should have a GitHub issue (even a small skeleton of one) attached to it, so these do not get forgotten. These issues don't have to be in
Process improvements
Documentation improvements
Technical improvements
Not to my knowledge
Dang. GCP or whomever may have caught the logs anyways if you poke around in the console. Loki also collects them, not sure if you have that set up as part of your Grafana.
Yuvi has just informed me that we do indeed have logs in the GCP console! They're a bit hard to read though, but there are a lot of
Try looking back before 7 or so Eastern. That's when I first had issues.
Does the hub pod share the NFS mount with the users? Could it have been affected by the same space issue as we hit yesterday, and we just didn't have anyone else try to launch a server after that fix rolled out?
@abkfenris We suspect this is an issue with the availability of the k8s master. We are seeing spawning processes being cancelled, which should not be related to the NFS. The 2i2c cluster is not regional, so it does not have a highly available k8s master; we see issues like this occasionally, and the fix is to restart the hub pod. We have an issue to move the cluster to a regional one, but it would be a destructive process and we need to coordinate appropriate downtime with everyone who has a hub running on this cluster: #1102
I found these logs in another hub that had the same symptoms at the same time:
This is definitely due to the GKE master having a hiccup. It usually recovers shortly, but it looks like the jupyterhub process doesn't :( We should definitely report this upstream.
The hub also has a 'shut down' button in the admin panel that would fix this specific problem, where you just see 'server requested' and nothing happens.
Do you mean 'Shutdown Hub' or 'Stop All' or the per user 'Stop Server'? I had tried 'Stop server' on my own server when it was in that state.
The "Shutdown hub" button will restart the hub
I am going to close this issue now because:
Just to clarify, the 'Shutdown Hub' button asks the hub pod to terminate itself, so that Kubernetes replaces it? I ask, as I caused an incident in high school when I found a shutdown & restart button on the compute cluster that I wasn't supposed to have access to. Unsurprisingly I clicked it, then everything went down for hundreds of students and faculty. One screen broken from a classmate putting their fist through it later, it came back up as I thankfully hit restart, but I was hanging out with the tech crew for the rest of the class period making sure that no one else could find the same bug.
Yes, that's correct!
We just (~12:26 Eastern) had the hub lock up again, but things seemed to recover immediately after hitting the 'Shutdown Hub' button.
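For hub admins who would rather script this than click the button, JupyterHub's REST API exposes a shutdown endpoint. The sketch below only constructs the request and does not send it. The hub URL and token are placeholders, the `/hub/api` prefix assumes the hub is served at the default base URL, and the token must belong to a user with admin rights. Setting `servers` and `proxy` to false is meant to leave user servers and the proxy running while the hub process exits and Kubernetes replaces the pod.

```python
import json
import urllib.request


def build_shutdown_request(hub_url: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to JupyterHub's shutdown endpoint.

    hub_url and api_token are supplied by the caller; the values used
    in the example below are placeholders, not real credentials.
    """
    base = hub_url.rstrip("/")
    # Ask the hub to stop only itself, not user servers or the proxy.
    body = json.dumps({"servers": False, "proxy": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base}/hub/api/shutdown",
        data=body,
        method="POST",
        headers={
            "Authorization": f"token {api_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder URL and token, for illustration only.
    req = build_shutdown_request("https://hub.example.org", "REPLACE_ME")
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Sending is left as an explicit extra step because the request takes the hub down for every user on it until the replacement pod is ready.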
Summary
OceanHackWeek hub cannot start a server. Reported in https://2i2c.freshdesk.com/a/tickets/172
Impact on users
Hub unusable as no one can start a server.
Important information
Tasks and updates
After-action report template