Incident report: U. Toronto credentials expired #637
Comments
The current hypothesis is that the original AzureAD credentials had a 1-year expiry set, and that it has now elapsed. Once we get new credentials, we should update them in https://github.com/utoronto-2i2c/jupyterhub-deploy/blob/e1fd790000734f1356a5a78357ccca0df0d21f38/deployments/utoronto/secrets/prod.yaml#L20 and the equivalent staging.yaml file, and do a deploy.
Also note that there's some conversation about this in Slack here: https://2i2c.slack.com/archives/C01DB2JRP8W/p1630332209002400
I believe that the main issue is now resolved. We just need to open up a post-mortem to discuss what happened and make sure that we have follow-ups in place. Here's some conversation about it: https://2i2c.slack.com/archives/C01DB2JRP8W/p1630426159029600
I have updated this issue to reflect the major timeline of this incident, and added some extra context about it in the top comment. @yuvipanda I believe that I've tracked the two issues we opened to improve process. There is one other item about using credentials provided by UToronto instead of our own. I am not sure what kind of issue could address this (or if it would even be possible for UToronto). Do you have any advice for what to do there? Should we open an issue about asking them to let us provision this credential?
I am going to close this one, as I think that we can discuss the broader credentials challenge in #638
Summary
The University of Toronto hub started returning 500 server errors, as reported to us via FreshDesk. Investigation discovered that this was because the credentials we were using with Azure had expired, meaning that no users could log in to the hub. We fixed this by getting renewed credentials from UToronto IT and updating the hub configuration with those values.
Initial Description
The University of Toronto hub at https://jupyter.utoronto.ca is returning 500 server error responses to all users. In addition, it appears that the Grafana page (https://grafana.utoronto.2i2c.cloud/) is not accessible.
To reproduce this, go to https://jupyter.utoronto.ca/hub/login?next=%2Fhub%2F
The page redirects and then shows a 500 error. Here's the URL of my redirect in case it's helpful:
https://jupyter.utoronto.ca/hub/oauth_callback?code=0.ARwAJsKqeAMvTUuQN7RtVsVSEJxFG96QWi9DgPrQ_jtop5sXAKA.AQABAAIAAAD--DLA3VO7QrddgJg7WevrzKJaPUkFPoaOFC7o2CxoIHQ5eDQAOmLAh7GABLchfycjtv-5fFVmfthWgWeyd_TUM7px2xFhXzdTr1ymAu7de0aW8JRgLgn33vIhHITeTi0gkg4vJTAxkZ89P_MF55GwqL7Bbq1ZaMYb3XxYpJGUFv5MZN0EOMitnvIfzOEFpB_QxRPKilqDNTaq3GEV41GcUbySjfTzc9iX0qBcURhKD9LIsjU4sIIYoKdyoj6Yjro4NZQzNe4EKvB8gcJKAJ3vPe1dxUf9JUzDvtXZfPpLDAViHw5ib0N_dSslaq2gTD_B1m6Xf5Av0vv3X--IA3dwkC9yHidjtwCrX6gXNBGwCECXLLyjYZ2lbN0V76c7oEhpYZsOQpSsNYKX9J3jshGqzUEOgIqjJ6kZbIUMFUyNnnuF76e6OVas1mPmAT7mxQP-7Dlg9HAP6o037ebGehoMGFJjrOesURpA9FU6Ew4edH0Y4vRypgWxKDtqg_H3Oq7pSj-7VhdReYjsqTylBx8snW-nCTCpZNR7k0ySAA4TDDamwvHTkivkpp6I8kGxF3ib-WatFQfMnVG-GKiLvtkolO4MBd5WLL5vC8WIllOxca5cWbN6wOJuysugZOA0o8Jn3IEZTW8OZk_So0EGo1tdfsgOk1MzidSiSk9HTnrPWlCJzRyVGHFOle3Pp6Qmh48gAA&state=eyJzdGF0ZV9pZCI6ICJjYmZhMzNhMTRjNGQ0MmUyYTE3ZGY0NGUyOTUwOWUxMSIsICJuZXh0X3VybCI6ICIvaHViLyJ9&session_state=281b083b-dbee-4829-b16e-c8b4224ff38c#
This hub's infrastructure is deployed differently from our other hubs, and lives in this repository:
https://github.com/utoronto-2i2c/jupyterhub-deploy/
I believe that @yuvipanda and @GeorgianaElena may be the only ones who have kubectl access, however. We should confirm this with them.
FreshDesk ticket thread: https://2i2c.freshdesk.com/a/tickets/20
Timeline
If it makes sense to include a timeline for this debrief, then do so below. This is usually most-useful for post-mortems.
All times in US/Pacific.
08/30 9:30am
Ticket opened in Freshdesk noting that users could not log in due to 500 errors on hub.
10:12am
Incident is reported in our Slack and investigation starts.
11:45am
Noticed that a cluster scale-up event had happened recently. However, this didn't seem to explain the problem.
12:05pm
Noted that attempting to access hub logs was showing a lot of HTTP 401: Unauthorized responses. We guessed that this might be related to an authentication problem.
1:00pm
We asked Toronto IT if the credentials for Azure Active Directory were outdated, as this would explain the 401: Unauthorized errors.
1:30pm
They note that the person that provisioned these credentials is away and they are following up.
3:07pm
They confirm that the credentials do look out of date, and ask how they can send new ones to us.
3:30pm
We suggest a few different options for sending credentials securely. By this point the person with the credentials at Toronto was no longer on the clock.
08/31 12:00pm
We receive the new credentials and update the hub. Confirm that it now works!
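For the post-mortem, the elapsed times implied by the timeline above can be worked out with a short sketch like the one below. The timeline does not state a year, so `YEAR` is an assumption used only to make the dates parse. All times are treated as naive US/Pacific wall-clock times, as in the timeline.

```python
from datetime import datetime

# Assumption: the timeline gives no year; this value only makes the dates parse.
YEAR = 2021

ticket_opened = datetime(YEAR, 8, 30, 9, 30)        # 08/30 9:30am, Freshdesk ticket opened
expiry_confirmed = datetime(YEAR, 8, 30, 15, 7)     # 08/30 3:07pm, Toronto IT confirms credentials look out of date
hub_fixed = datetime(YEAR, 8, 31, 12, 0)            # 08/31 12:00pm, new credentials received and hub updated

# Elapsed time from first report to confirmed diagnosis, and to the fix.
hours_to_diagnosis = (expiry_confirmed - ticket_opened).total_seconds() / 3600
hours_to_fix = (hub_fixed - ticket_opened).total_seconds() / 3600

print(f"Report to confirmed diagnosis: {hours_to_diagnosis:.1f} hours")
print(f"Report to fix: {hours_to_fix:.1f} hours")  # 26.5 hours
```

Most of the time between diagnosis and fix was spent waiting for the new credentials to reach us, not on debugging.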
What went wrong
Where we got lucky
We were lucky to diagnose the problem correctly relatively quickly!
Follow-ups
See the linked issues above for follow ups
In addition, we slightly altered our incident reporting label: #648
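As an illustration of one kind of follow-up, here is a minimal sketch of a check that flags a credential before it expires. It is not something we have deployed. The function names, the 30-day warning window, and the example dates are all hypothetical, and it assumes the expiry date is recorded somewhere we can read it.

```python
from datetime import date

def days_until_expiry(expires_on: date, today: date) -> int:
    """Return the number of days left before the credential expires (negative if already expired)."""
    return (expires_on - today).days

def expiry_status(expires_on: date, today: date, warn_days: int = 30) -> str:
    """Classify a credential as 'expired', 'expiring soon', or 'ok'.

    warn_days is a hypothetical warning window, not a value taken from this incident.
    """
    remaining = days_until_expiry(expires_on, today)
    if remaining < 0:
        return "expired"
    if remaining <= warn_days:
        return "expiring soon"
    return "ok"

# Hypothetical example dates, for illustration only.
example_expiry = date(2022, 3, 1)
for check_day in (date(2021, 12, 1), date(2022, 2, 10), date(2022, 3, 2)):
    print(check_day, expiry_status(example_expiry, check_day))
```

A check like this only helps if the expiry date is known to us, which ties into the broader credentials discussion in #638.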