Incident report: U. Toronto credentials expired #637

Closed
3 tasks done
choldgraf opened this issue Aug 30, 2021 · 5 comments

@choldgraf
Member

choldgraf commented Aug 30, 2021

Summary

The University of Toronto hub started returning 500 server errors, as reported to us via FreshDesk. Investigation showed that the credentials we were using with Azure had expired, meaning that no users could log in to the hub. We fixed this by obtaining renewed credentials from UToronto IT and updating the hub configuration with the new values.

Initial Description

The University of Toronto hub at https://jupyter.utoronto.ca is returning 500 server error responses to all users. In addition, it appears that the Grafana page (https://grafana.utoronto.2i2c.cloud/) is not accessible.

To reproduce this, go to https://jupyter.utoronto.ca/hub/login?next=%2Fhub%2F

it redirects, and then the page is a 500 error. Here's the URL of my redirect in case it's helpful:

https://jupyter.utoronto.ca/hub/oauth_callback?code=0.ARwAJsKqeAMvTUuQN7RtVsVSEJxFG96QWi9DgPrQ_jtop5sXAKA.AQABAAIAAAD--DLA3VO7QrddgJg7WevrzKJaPUkFPoaOFC7o2CxoIHQ5eDQAOmLAh7GABLchfycjtv-5fFVmfthWgWeyd_TUM7px2xFhXzdTr1ymAu7de0aW8JRgLgn33vIhHITeTi0gkg4vJTAxkZ89P_MF55GwqL7Bbq1ZaMYb3XxYpJGUFv5MZN0EOMitnvIfzOEFpB_QxRPKilqDNTaq3GEV41GcUbySjfTzc9iX0qBcURhKD9LIsjU4sIIYoKdyoj6Yjro4NZQzNe4EKvB8gcJKAJ3vPe1dxUf9JUzDvtXZfPpLDAViHw5ib0N_dSslaq2gTD_B1m6Xf5Av0vv3X--IA3dwkC9yHidjtwCrX6gXNBGwCECXLLyjYZ2lbN0V76c7oEhpYZsOQpSsNYKX9J3jshGqzUEOgIqjJ6kZbIUMFUyNnnuF76e6OVas1mPmAT7mxQP-7Dlg9HAP6o037ebGehoMGFJjrOesURpA9FU6Ew4edH0Y4vRypgWxKDtqg_H3Oq7pSj-7VhdReYjsqTylBx8snW-nCTCpZNR7k0ySAA4TDDamwvHTkivkpp6I8kGxF3ib-WatFQfMnVG-GKiLvtkolO4MBd5WLL5vC8WIllOxca5cWbN6wOJuysugZOA0o8Jn3IEZTW8OZk_So0EGo1tdfsgOk1MzidSiSk9HTnrPWlCJzRyVGHFOle3Pp6Qmh48gAA&state=eyJzdGF0ZV9pZCI6ICJjYmZhMzNhMTRjNGQ0MmUyYTE3ZGY0NGUyOTUwOWUxMSIsICJuZXh0X3VybCI6ICIvaHViLyJ9&session_state=281b083b-dbee-4829-b16e-c8b4224ff38c#

The hub infrastructure is deployed differently from our others, and is at this repository:

https://github.com/utoronto-2i2c/jupyterhub-deploy/

I believe that @yuvipanda and @GeorgianaElena may be the only ones that have kubectl access, however. We should confirm this with them.

FreshDesk ticket thread: https://2i2c.freshdesk.com/a/tickets/20

Timeline

If it makes sense to include a timeline for this debrief, then do so below. This is usually most-useful for post-mortems.

All times in US/Pacific.

08/30 9:30am

Ticket opened in Freshdesk noting that users could not log in due to 500 errors on hub.

10:12am

Incident is reported in our Slack and investigation starts.

11:45am

Noticed that a cluster scale-up event had happened recently. However, this didn't seem to explain the problem.

12:05pm

Noted that attempting to access hub logs was showing a lot of HTTP 401: Unauthorized responses.

We guessed that this might be related to an authentication problem.

1:00pm

We asked Toronto IT if the credentials for Azure Active Directory were outdated, as this would explain the 401: Unauthorized errors.

1:30pm

They note that the person that provisioned these credentials is away and they are following up.

3:07pm

They confirm that the credentials do look out of date, and ask how they can send new ones to us.

3:30pm

We suggest a few different options for sending credentials securely. By this point the person with the credentials at Toronto was no longer on the clock.

08/31 12:00pm

We receive the new credentials and update the hub. Confirm that it now works!
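
The 12:05pm step above (spotting a burst of 401 responses) can be made quicker with a small tally of status codes. This is only a minimal sketch: it assumes log lines contain a status in the form `HTTP 401: Unauthorized` as quoted in the timeline, and the sample lines and the `tally_http_statuses` name are hypothetical, not actual hub output.

```python
import re
from collections import Counter

# Matches strings such as "HTTP 401: Unauthorized" (the form quoted in the timeline).
STATUS_RE = re.compile(r"HTTP (\d{3})")


def tally_http_statuses(lines):
    """Count how often each HTTP status code appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "[E] HTTP 401: Unauthorized",
    "[I] 200 GET /hub/health",
    "[E] HTTP 401: Unauthorized",
    "[E] HTTP 500: Internal Server Error",
]
counts = tally_http_statuses(sample_lines)
print(counts.most_common())
```

A dominant 401 count alongside user-facing 500s would point toward authentication rather than capacity, which is the conclusion we reached by hand.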

What went wrong

  • We are relying on credentials that are provisioned and controlled by the UToronto IT team. We did not have visibility into these credentials and they aren't automatically updated. Moreover, UToronto IT didn't have a process for keeping track of the need to update. This is why nobody saw this coming.
  • Only a subset of the team has access to the UToronto deployment. This meant that only a few team members could help out or had access to the UToronto IT communication channels. Follow-up: Migrate UToronto hub to our pilot-hubs repository #638
  • We didn't have a process for easily receiving secure credentials, which led to confusion about how Toronto should send us the updated creds. Follow-up: Design a secure method to receive secret keys #639
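
One way to address the first point (no visibility into when credentials expire) would be to record each credential's expiry date somewhere we control and check it on a schedule. The following is a minimal sketch, not something we run today: the function names, the 30-day warning window, and the dates are all hypothetical, and the real expiry date would have to come from UToronto IT.

```python
from datetime import date


def days_until_expiry(expires_on, today):
    """Return the number of days from `today` until `expires_on` (negative if past)."""
    return (expires_on - today).days


def needs_renewal(expires_on, today, warn_days=30):
    """True when the credential expires within `warn_days` days or has already expired."""
    return days_until_expiry(expires_on, today) <= warn_days


# Hypothetical expiry date, for illustration only.
expires_on = date(2030, 3, 1)

print(needs_renewal(expires_on, today=date(2029, 12, 1)))  # well ahead of expiry
print(needs_renewal(expires_on, today=date(2030, 2, 10)))  # inside the warning window
```

Run daily from any scheduler, a check like this would have raised a warning weeks before the hub started failing.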

Where we got lucky

We were lucky to diagnose the problem correctly relatively quickly!

Follow-ups

See the linked issues above for follow ups

In addition, we slightly altered our incident reporting label: #648

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • All actionable items above have linked GitHub Issues
@yuvipanda
Member

Current hypothesis is that the original AzureAD credentials were issued with a 1-year expiry, and that it has now passed.

Once we get new credentials, we should update them in https://github.com/utoronto-2i2c/jupyterhub-deploy/blob/e1fd790000734f1356a5a78357ccca0df0d21f38/deployments/utoronto/secrets/prod.yaml#L20 and the equivalent staging.yaml file, and then do a deploy.

@choldgraf
Member Author

Also note that there's some conversation about this in Slack here: https://2i2c.slack.com/archives/C01DB2JRP8W/p1630332209002400

@choldgraf
Member Author

I believe that the main issue is now resolved - we just need to open up a post mortem to discuss what happened and make sure that we have follow-ups in place. Here's some conversation about it: https://2i2c.slack.com/archives/C01DB2JRP8W/p1630426159029600

@choldgraf choldgraf self-assigned this Aug 31, 2021
@choldgraf choldgraf changed the title from "U. Toronto hub is returning 500 errors for all users" to "Incident report: U. Toronto hub is returning 500 errors for all users" Aug 31, 2021
@choldgraf choldgraf changed the title from "Incident report: U. Toronto hub is returning 500 errors for all users" to "Incident report: U. Toronto credentials expired" Aug 31, 2021
@choldgraf
Member Author

I have updated this issue to reflect the major timeline of this incident, and added some extra context about it in the top comment.

@yuvipanda I believe that I've tracked the two issues we opened to improve process. There is one other item about using credentials provided by UToronto instead of our own. I am not sure what kind of issue could address this (or if it would even be possible for UToronto). Do you have any advice for what to do there? Should we open an issue about asking them to let us provision this credential?

@choldgraf
Member Author

I am going to close this one, as I think that we can discuss the broader credentials challenge in #638.
