
[Incident] OpenScapes unauthenticated users and CPU usage spike #908

Closed
8 tasks done
choldgraf opened this issue Jan 2, 2022 · 11 comments
choldgraf commented Jan 2, 2022

Summary

Over the past 2 weeks there has been a spike in usage of the OpenScapes hub. User pods have been increasing steadily over time. Because nodes scale with users at a 1:1 ratio, this has resulted in a large spike in nodes as well.

Hub information

  • Hub URL: openscapes.2i2c.org

Timeline (if relevant)

All times in US/Pacific

2022-01-02 11:00AM

OpenScapes emails 2i2c support saying that they've gotten an extremely large AWS bill for December.

They reported several users who they did not add to the hub, and who had GitHub accounts created in the last 2 weeks.

The hub was shut down as a precautionary measure.

12:01pm

A look at the Grafana logs showed that around Dec 21st the number of hub users started climbing steadily:

[screenshot: Grafana plot of active hub users]

It seems like many of these users were maxing out the CPU:

[screenshot: per-user CPU usage]

Though the plot for users with high CPU usage is broken, so this is hard to confirm:

[screenshot: broken high-CPU-usage plot]

EDIT: the users have mining software in their filesystems:

[screenshot: mining software found in user filesystems]

12:40pm

Found the reason that unauthorized users could gain access:

Any GitHub user that logs into this hub is authorized. A new GitHub account was created from scratch and used to attempt a log-in; it succeeded despite the account not being added to the hub's user list.

Noticed that the OpenScapes hub does not explicitly have an `allowed_users` config set.
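As an illustrative sketch of the misconfiguration and its fix (this is not 2i2c's actual config, and the usernames are hypothetical): with JupyterHub's GitHub authenticator, leaving `allowed_users` unset means any successfully authenticated GitHub account is admitted.

```python
# jupyterhub_config.py -- illustrative sketch only, not the 2i2c config.
# With no allowed_users set (and no org-based restriction), JupyterHub's
# GitHub authenticator admits *any* authenticated GitHub account.
c.JupyterHub.authenticator_class = "github"

# The fix: explicitly enumerate who may log in (usernames are hypothetical).
c.Authenticator.allowed_users = {"hub-user-1", "hub-user-2"}
c.Authenticator.admin_users = {"hub-admin-1"}
```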

21:00

The hub is made inaccessible via this PR: #911

21:25

Corrected the configuration to properly authorize users: 2449ab1

Also deleted all of the users on the hub, so they will need to be manually added back in.


After-action report

There are two major takeaways from this incident:

  1. We need some kind of automated cost reporting infrastructure to account for runaway costs like this.
  2. We need better testing infrastructure to make sure that basic failure modes aren't present (like anybody being able to access a hub regardless of authorization)
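A minimal sketch of what an automated authorization check (takeaway 2) could look like, assuming nothing about 2i2c's actual test suite. It probes JupyterHub's standard `/hub/api/user` REST endpoint, which should reject anonymous requests; the `hub_url` argument is hypothetical.

```python
# Sketch of an automated authorization smoke test (not 2i2c's actual
# tooling). The idea: an anonymous request to a protected hub endpoint
# must be rejected; a 200 would mean the hub is open to the world.
from urllib import request, error

def status_is_protected(status: int) -> bool:
    """An unauthenticated request should be redirected to login or rejected."""
    return status in (302, 401, 403)

def hub_requires_auth(hub_url: str) -> bool:
    """Probe the hub's REST API anonymously; hub_url is hypothetical."""
    req = request.Request(f"{hub_url}/hub/api/user")
    try:
        with request.urlopen(req) as resp:  # urlopen follows redirects
            return status_is_protected(resp.status)
    except error.HTTPError as e:
        return status_is_protected(e.code)
```

A check like this could run in CI against each hub after every deploy.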

What went wrong

  • We were not alerted to this increased usage + CPU maxing out when it happened. Because nobody was actively paying attention to the cluster, we only learned about this when the AWS bill came in.
  • We had no automated checks for whether our authentication was working properly, and so did not detect that any github user was allowed in.
  • There is no easy (or documented) way for a hub administrator to restrict access for all users of the hub, so we were constrained to manually shut down user servers until somebody with cluster access could restrict the whole hub.
  • Automatic deploys for AWS did not work when we merged in a PR to fix this

Where we got lucky

The OpenScapes team checked their bill relatively quickly, otherwise we would have incurred significant extra cloud cost. They were on track to spend $20,000 in January alone.

Action items

Process improvements

  1. [discuss] Should we change our policy for self-merging `infrastructure/` PRs? #913

Documentation improvements

  1. PR to instruct hub admins how to restrict access to their hub via the UI: ENH: Add how-to about disabling user sessions on a hub docs#119

Technical improvements

  1. Merge Automate and document setting up eksctl clusters #885
  2. We've already got an issue to build cost reporting infrastructure, we should probably focus a bit of attention on this to get an MVP: Cloud usage monitoring and alerting infrastructure and process #328
  3. Update Updated CI/CD infrastructure for hubs that is automated, tested, and parallelized #879 to mention that we should test that unauthorized user access is not possible.

Actions

  • Incident has been dealt with or is over
  • Re-enable OpenScapes auto-deploys
  • Re-enable the openscapes hub #916
  • Follow up with OpenScapes to figure out how to recoup some of their costs
  • Sections above are filled out
  • Incident title and after-action report is cleaned up
  • All actionable items above have linked GitHub Issues
  • Confirm resolved OpenScapes billing issue
@choldgraf choldgraf changed the title [Incident] OpenScapes CPU usage spike [Incident] OpenScapes unauthenticated users and CPU usage spike Jan 2, 2022
@choldgraf (Member Author)

update: new user image

@betolink just set the hub's user image to busybox:latest so that user sessions cannot start. We'll use this as a short-term fix until we can either shut down the hub entirely or patch this bug in the authentication.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jan 3, 2022
I've deleted the ingress objects to prevent external
access to the hubs.

k -n prod delete ingress jupyterhub
k -n staging delete ingress jupyterhub

Ref 2i2c-org#908
@yuvipanda (Member)

I've made the hub inaccessible (#911).

@yuvipanda (Member)

With 2449ab1, I've fixed the hole here letting anyone unauthenticated through. I also took a look at our config to see if we had other hubs missing this config, and there weren't any.

@yuvipanda (Member)

I've made a backup of the jupyterhub.sqlite file on the openscapes hub, and deleted the existing database. Upon next deploy, this will remove all existing users on the hub including the cryptominers.
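The backup-and-reset step described above could be sketched like this (paths and the helper name are illustrative; on a real hub this would run inside the hub pod or against its persistent volume):

```python
# Sketch of backing up jupyterhub.sqlite before deleting it, so the next
# deploy starts with an empty user table. Illustrative only.
import shutil
from datetime import datetime, timezone
from pathlib import Path

def backup_and_remove(db_path: str) -> Path:
    """Copy the hub database aside with a timestamp, then delete the
    live copy; JupyterHub recreates an empty database on next start."""
    src = Path(db_path)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    backup = src.with_name(f"{src.name}.{stamp}.bak")
    shutil.copy2(src, backup)  # copy2 preserves mtime/permissions
    src.unlink()               # removed; recreated on next hub start
    return backup
```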

@choldgraf (Member Author)

I've updated the top comment with some follow-up issues. @yuvipanda do you agree with the main things to follow up on? Feel free to add or edit as you wish.

@choldgraf (Member Author)

Update: report sent to OpenScapes

I've sent an email report to OpenScapes with the following text:

Email text

Hey all - here is a brief after action report to describe what happened, and the current state of things. I'm also cc'ing the Code for Science and Society team so that they have visibility.

Summary of what happened

The JupyterHub configuration for the OpenScapes hub was missing an option that was not critical for the hub to function properly, but was critical for authorization to function properly. Because of this, unauthorized users were able to access the hub. This was 2i2c's responsibility and we missed this mis-configuration.

An anonymous user found the link to the OpenScapes hub, was able to log-in without authorization, and around Dec 21st they started creating fake user accounts and spinning up crypto mining sessions on the hub. This resulted in the large spike in cloud costs.

Erin noticed this spike on the morning of January 2nd (US/Pacific time) and alerted 2i2c support. By that evening we had patched this bug and deleted the non-admin user accounts. There were around 2 weeks of heavy use related to this user's crypto-mining scripts.

Current situation

  • The hub is now secured: the incorrect configuration that made it possible for unauthorized users to access the hub has been fixed.
  • All non-admin users are deleted: we did this to ensure that nefarious users wouldn't have access, you can manually add back any users via the JupyterHub admin UI
  • The user image is still BusyBox: that will need to be changed back for user sessions to work again. I also have a PR to add this suggestion to our docs; thanks Luis for thinking of this

Next steps

  • We should file a support ticket with AWS Support telling them what happened, and asking for an invoice adjustment due to crypto mining abuse. It is common for AWS to forgive cloud costs that are the result of abuse. I will help you put together whatever materials we need to make a case. The screenshots that Luis shared, along with our Grafana logs, should help greatly.
  • If they do not forgive these costs, then 2i2c will bear the cost. We will discount the monthly hub fee until the cloud costs you incurred during this time window are covered.
  • In parallel, the 2i2c engineering team will work on process and technology improvements to avoid this in the future. We've identified a number of places to improve our practices moving forward.

You can find a full incident report and ongoing conversation here.

Erin and others - I want to extend my apologies for this problem, and the stress that it has caused. We'll take necessary precautions to avoid this in the future, and will follow up if we have clarifications we need then. We'll also do what is necessary to make sure that OpenScapes isn't the one to bear the extra cloud costs.

@choldgraf (Member Author)

I've put together a short report for the OpenScapes team to use in their appeal to reduce their cloud bill. Here's a link:

https://docs.google.com/document/d/106VbSeHDOGbsu-oLENmVJIWkK3MZu-EQN22Qgia4ybo/edit#

@erinmr

erinmr commented Jan 5, 2022

Thanks so much for this doc, @choldgraf. I have added additional details for what I was doing on the AWS side to manage and monitor. I submitted and will stay in touch about resolution.

@choldgraf (Member Author)

Hey @erinmr / @jules32 - just wanted to see if you had any updates from AWS on this one.

@jules32 (Contributor)

jules32 commented Jan 20, 2022

Thanks for checking in @choldgraf - a note today said sorry they are slow, still reviewing

@choldgraf (Member Author)

We've just heard back from @erinmr that AWS has forgiven their cloud bill for this incident. I think that we can close this one. Phew!

Thanks so much everybody for being a combination of helpful, patient, and generally awesome :-)

@choldgraf choldgraf moved this from Blocked to Complete in DEPRECATED Engineering and Product Backlog Mar 2, 2022