Incident Report: CarbonPlan outage 2021-07-17 #526

Closed · 3 tasks done
yuvipanda opened this issue Jul 18, 2021 · 5 comments

yuvipanda commented Jul 18, 2021

Summary

The cluster was out of commission because the master and core nodes
(which run the k8s control plane and the hub components) died, and
replacement nodes couldn't be brought up automatically due to CPU quota
limits on AWS. Making the core nodes even bigger, and manually
reducing the size of the dask worker instance group (497 nodes had been
requested, 320 had been provisioned), brought everything back.

Timeline (if relevant)

All times in IST

2021-07-17 2:38 PM

In the course of writing up a previous incident report,
it was observed that the cluster was having another outage - the k8s API
and the hub were both unreachable. An investigation was started.

2:41 PM

Looking at the instances page
on the AWS console in the carbonplan account showed two master nodes,
even though our k8s API wasn't able to reach them. Upon finding their
public IPs and sshing in, it was found that one of the master nodes was
maxed out on CPU, with dask-gateway responsible for most of the usage.
The gateway pods themselves didn't have a CPU limit, so they had taken
down the entire node.
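
For reference, a minimal sketch of how the master nodes and their public
IPs could be looked up programmatically instead of via the console. It
assumes boto3 credentials for the carbonplan account, a region, and that
kops tags master instances with k8s.io/role/master = 1; these are all
assumptions about the setup, not details recorded in this incident:

```python
# Sketch: list kops master nodes and their public IPs with boto3.
# The region and the kops master tag are assumptions about this cluster.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # region assumed

resp = ec2.describe_instances(
    Filters=[
        {"Name": "tag:k8s.io/role/master", "Values": ["1"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)

for reservation in resp["Reservations"]:
    for instance in reservation["Instances"]:
        print(
            instance["InstanceId"],
            instance.get("PublicIpAddress", "no public IP"),
        )
```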

2:46 PM

The problematic node was rebooted via ssh, and came back up.
The kubernetes API was reachable sporadically, but the hub was not.

2:52 PM

The core node size was increased again, to m5.2xlarge. However,
EC2 didn't bring up the new nodes, since we were hitting our
CPU quota. Looking at the Activity tab
in the AWS console for the master's autoscaling group
showed the following error message:

Launching a new EC2 instance. Status Reason: You have requested more vCPU
capacity than your current vCPU limit of 1362 allows for the instance bucket
that the specified instance type belongs to. Please visit
http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this
limit. Launching EC2 instance failed.

The dask-r5-2xlarge instancegroup had been asked to scale up to 500,
and already had about 320 nodes. This had exhausted the CPU quota, causing
problems with bringing up the new master nodes.
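
The same scaling-activity errors can also be pulled without the console.
A sketch, assuming boto3 access and a hypothetical autoscaling group name
for the masters (kops derives the real name from the instance group and
cluster names):

```python
# Sketch: read recent scaling activity (including failure reasons) for an
# autoscaling group. The group name below is a hypothetical placeholder for
# whatever kops created for the master instance group; region is assumed.
import boto3

asg = boto3.client("autoscaling", region_name="us-west-2")

activities = asg.describe_scaling_activities(
    AutoScalingGroupName="master-us-west-2a.masters.carbonplanhub.k8s.local",
    MaxRecords=20,
)["Activities"]

for activity in activities:
    print(activity["StartTime"], activity["StatusCode"])
    print("  ", activity.get("StatusMessage", ""))
```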

3:00 PM

The dask-r5-2xlarge autoscaling group was manually set to 0 nodes, and
AWS started bringing these nodes down. This allowed the master to come up
and start bringing everything back.
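
A sketch of the same scale-down done with boto3 rather than the console;
the group name is again a hypothetical placeholder for whatever
autoscaling group kops created for the dask-r5-2xlarge instance group:

```python
# Sketch: force an autoscaling group down to zero instances.
# The group name is a hypothetical placeholder; region is assumed.
import boto3

asg = boto3.client("autoscaling", region_name="us-west-2")

asg.update_auto_scaling_group(
    AutoScalingGroupName="dask-r5-2xlarge.carbonplanhub.k8s.local",
    MinSize=0,
    MaxSize=0,
    DesiredCapacity=0,
)
```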

3:11 PM

Everything was functional again.

What went wrong

  1. The dask-gateway pods had no CPU limit, so they were able to use
    enough CPU to take down the entire cluster (see the sketch after
    this list).
  2. The hub control plane and the k8s API control plane share nodes,
    so problems with the hub control plane can take down the whole
    cluster.
  3. We don't have full awareness of quotas on AWS, and how they
    might affect workflows.
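
On item 1, a minimal sketch of what adding CPU/memory requests and limits
to a dask-gateway pod could look like, written here as a one-off patch
with the kubernetes Python client. The actual fix (#525) sets these values
in the hub's configuration instead, and the deployment, container, and
namespace names below are hypothetical placeholders:

```python
# Sketch: patch CPU/memory requests and limits onto a dask-gateway deployment.
# Deployment, container, and namespace names are hypothetical placeholders;
# the real fix (#525) sets these in configuration rather than patching live.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",  # hypothetical container name
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "512Mi"},
                            "limits": {"cpu": "1", "memory": "1Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="api-dask-gateway",   # hypothetical deployment name
    namespace="dask-gateway",  # hypothetical namespace
    body=patch,
)
```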

Where we got lucky

  1. The problem occurred on a Saturday, and it was only by luck that it
    was discovered at that time.

Action items

Process improvements

  1. Look at AWS quotas for carbonplan, and request increases where
    needed (see the sketch below).
  2. Add 'look at quotas' to the process for setting up a new hub, so
    quotas can be evaluated alongside users' expected needs.

tracked in #591
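
For item 1, a sketch of how the relevant vCPU quota could be checked and
an increase requested with boto3's Service Quotas client. The quota code
below is assumed to be the "Running On-Demand Standard (A, C, D, H, I, M,
R, T, Z) instances" quota and should be verified against the account:

```python
# Sketch: check the on-demand standard-instance vCPU quota and request an
# increase. Quota code and region are assumptions; verify in the console.
import boto3

quotas = boto3.client("service-quotas", region_name="us-west-2")

quota = quotas.get_service_quota(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",  # assumed quota code for standard-instance vCPUs
)["Quota"]
print(quota["QuotaName"], quota["Value"])

# Request a bump if the current limit is too low for expected dask scale-out.
quotas.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",
    DesiredValue=4096,  # illustrative target, not a recommendation
)
```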

Documentation improvements

  1. Document how to ssh into a node in a kops cluster
  2. Document how to identify the appropriate autoscaling group for
    an instancegroup so you can look at its activity report.
  3. Document how to perform manual nodegroup scaling operations
    for each instancegroup via the AWS console.

tracked in #590

Technical improvements

  1. Explicitly set limits and requests on all hub control plane
    pods (Make carbonplan cluster more resilient #525).
  2. Set up Grafana / Prometheus for the cluster so we can diagnose
    issues better (Setup prometheus + grafana for carbonplan #533).
  3. Increase the master node redundancy of the cluster even more
    (Make carbonplan cluster more resilient #525, Make carbonplan hub
    more resilient - part 2 #532).

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • All actionable items above have linked GitHub Issues
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jul 18, 2021

jhamman commented Jul 19, 2021

Hi folks! At this time, the hub seems to have gone down again.

damianavila commented

Thanks again for the write-up, @yuvipanda.
Planning to close this one soon unless you want to keep it open for some reason.

Btw, for historical context, @jhamman's above ping was acknowledged and worked out in Slack.

choldgraf commented

Hey all - I've moved this to the Activity board so that we don't lose track of it, and also added a little checklist to make sure we follow up on creating issues before closing.

I'm not sure which of the documentation/process steps above need their own issues... @yuvipanda could you advise there?


damianavila commented Aug 5, 2021

I'm not sure which of the documentation/process steps above need their own issues... @yuvipanda could you advise there?

Ping @yuvipanda 😉

IMHO, all of those docs points belong in a new AWS debugging section, maybe?

choldgraf commented

OK, I've added tracking issues for the process + documentation items that @yuvipanda had in the top comment, so I'm going to close this.
