Incident Report: CarbonPlan outage 2021-07-17 #526

Closed · 3 tasks done
yuvipanda opened this issue Jul 18, 2021 · 5 comments

yuvipanda commented Jul 18, 2021

Summary

The cluster was out of commission because the master and core nodes
(which run the k8s control plane and the hub components) died, and
replacement nodes couldn't be brought up automatically due to CPU quota
limits on AWS. Making the core nodes even bigger, and manually
reducing the size of the dask worker instance group (497 nodes had been
requested, 320 had been provisioned), brought everything back.

Timeline (if relevant)

All times in IST

2021-07-17 2:38 PM

In the course of writing up a previous incident report,
it was observed that the cluster was having another outage - the k8s API
and the hub were both unreachable. An investigation was started.

2:41 PM

Looking at the instances page
on the AWS console in the carbonplan account showed two master nodes,
even though our k8s API wasn't able to reach them. Upon finding their
public IPs and sshing in, it was found that one of the master nodes was
maxed out on CPU, with dask-gateway responsible for most of the usage.
The gateway pods themselves didn't have a CPU limit, so they had taken
down the entire node.
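
For reference, a minimal sketch of how the master nodes and their public
IPs could be looked up programmatically instead of via the console. It
assumes boto3 credentials for the carbonplan account, a region, and that
kops tags master instances with k8s.io/role/master = 1; these are all
assumptions about the setup, not details recorded in this incident:

```python
# Sketch: list kops master nodes and their public IPs with boto3.
# The region and the kops master tag are assumptions about this cluster.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # region assumed

resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag:k8s.io/role/master", "Values": ["1"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        print(
            instance["InstanceId"],
            instance.get("PublicIpAddress", "no public IP"),
        )
```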

2:46 PM

The problematic node was rebooted via ssh, and came back up.
The kubernetes API was reachable sporadically, but the hub was not.

2:52 PM

The core node size was increased again, to m5.2xlarge. However,
EC2 didn't bring up the new nodes, since we were hitting our
CPU quota. Looking at the Activity tab
in the AWS console for the master's autoscaling group
showed the following error message:

Launching a new EC2 instance. Status Reason: You have requested more vCPU
capacity than your current vCPU limit of 1362 allows for the instance bucket
that the specified instance type belongs to. Please visit
http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this
limit. Launching EC2 instance failed.

The dask-r5-2xlarge instancegroup had been asked to scale up to 500,
and already had about 320 nodes. This had exhausted the CPU quota, causing
problems with bringing up the new master nodes.
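
The same scaling-activity errors can also be pulled without the console.
A sketch, assuming boto3 access and a hypothetical autoscaling group name
for the masters (kops derives the real name from the instance group and
cluster names):

```python
# Sketch: read recent scaling activity (including failure reasons) for an
# autoscaling group. The group name below is a hypothetical placeholder for
# whatever kops created for the master instance group; region is assumed.
import boto3

asg = boto3.client("autoscaling", region_name="us-west-2")

activities = asg.describe_scaling_activities(
    AutoScalingGroupName="master-us-west-2a.masters.carbonplanhub.k8s.local",
    MaxRecords=20,
)["Activities"]

for activity in activities:
    print(activity["StartTime"], activity["StatusCode"])
    print("  ", activity.get("StatusMessage", ""))
```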

3:00 PM

The dask-r5-2xlarge autoscaling group was manually set to 0 nodes, and
AWS started bringing these nodes down. This allowed the master to come up
and start bringing everything back.
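
A sketch of the same scale-down done with boto3 rather than the console;
the group name is again a hypothetical placeholder for whatever
autoscaling group kops created for the dask-r5-2xlarge instance group:

```python
# Sketch: force an autoscaling group down to zero instances.
# The group name is a hypothetical placeholder; region is assumed.
import boto3

asg = boto3.client("autoscaling", region_name="us-west-2")

asg.update_auto_scaling_group(
    AutoScalingGroupName="dask-r5-2xlarge.carbonplanhub.k8s.local",
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)
```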

3:11 PM

Everything was functional again.

What went wrong

  1. The dask-gateway pods had no CPU limit, so they were able to use
    enough CPU to take down the entire cluster (see the sketch after
    this list).
  2. The hub control plane and the k8s API control plane share nodes,
    so problems with the hub control plane can take down the whole
    cluster.
  3. We don't have full awareness of quotas on AWS, and how they
    might affect workflows.
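
On item 1, a minimal sketch of what adding CPU/memory requests and limits
to a dask-gateway pod could look like, written here as a one-off patch
with the kubernetes Python client. The actual fix (#525) sets these values
in the hub's configuration instead, and the deployment, container, and
namespace names below are hypothetical placeholders:

```python
# Sketch: patch CPU/memory requests and limits onto a dask-gateway deployment.
# Deployment, container, and namespace names are hypothetical placeholders;
# the real fix (#525) sets these in configuration rather than patching live.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",  # hypothetical container name
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "512Mi"},
                            "limits": {"cpu": "1", "memory": "1Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="api-dask-gateway",   # hypothetical deployment name
    namespace="dask-gateway",  # hypothetical namespace
    body=patch,
)
```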

Where we got lucky

  1. The problem occurred on a Saturday, and it was only by luck that it
    was discovered at that time.

Action items

Process improvements

  1. Look at AWS quotas for carbonplan, and request increases where
    needed (see the sketch below).
  2. Add 'look at quotas' to the process for setting up a new hub, so
    quotas can be evaluated alongside users' expected needs.

tracked in #591
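
For item 1, a sketch of how the relevant vCPU quota could be checked and
an increase requested with boto3's Service Quotas client. The quota code
below is assumed to be the "Running On-Demand Standard (A, C, D, H, I, M,
R, T, Z) instances" quota and should be verified against the account:

```python
# Sketch: check the on-demand standard-instance vCPU quota and request an
# increase. Quota code and region are assumptions; verify in the console.
import boto3

quotas = boto3.client("service-quotas", region_name="us-west-2")

quota = quotas.get_service_quota(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",  # assumed quota code for standard-instance vCPUs
)["Quota"]
print(quota["QuotaName"], quota["Value"])

# Request a bump if the current limit is too low for expected dask scale-out.
quotas.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",
    DesiredValue=4096,  # illustrative target, not a recommendation
)
```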

Documentation improvements

  1. Document how to ssh into a node in a kops cluster
  2. Document how to identify the appropriate autoscaling group for
    an instancegroup so you can look at its activity report.
  3. Document how to perform manual nodegroup scaling operations
    for each instancegroup via the AWS console.

tracked in #590

Technical improvements

  1. Explicitly set limits and requests on all hub control plane
    pods (Make carbonplan cluster more resilient #525).
  2. Set up Grafana / Prometheus for the cluster so we can diagnose
    issues better (Setup prometheus + grafana for carbonplan #533).
  3. Increase the master node redundancy of the cluster even more
    (Make carbonplan cluster more resilient #525, Make carbonplan hub
    more resilient - part 2 #532).

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • All actionable items above have linked GitHub Issues
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jul 18, 2021

jhamman commented Jul 19, 2021

Hi folks! At this time, the hub seems to have gone down again.

damianavila commented

Thanks again for the write-up, @yuvipanda.
Planning to close this one soon unless you want to keep it open for some reason.

Btw, for historical context, @jhamman's above ping was acknowledged and worked out in Slack.

choldgraf commented

Hey all - I've moved this to the Activity board so that we don't lose track of it, and also added a little checklist to make sure we follow up on creating issues before closing.

I'm not sure which of the documentation/process steps above need their own issues... @yuvipanda could you advise there?


damianavila commented Aug 5, 2021

I'm not sure which of the documentation/process steps above need their own issues... @yuvipanda could you advise there?

Ping @yuvipanda 😉

IMHO, all of those docs points belong in a new AWS debugging section, maybe?

choldgraf commented

OK, I've added tracking issues for the process + documentation items that @yuvipanda had in the top comment, so I'm going to close this.
