Incident Report: CarbonPlan hub outage July 15 2021 #524

Closed
2 of 3 tasks
yuvipanda opened this issue Jul 17, 2021 · 7 comments

yuvipanda commented Jul 17, 2021

Summary

New user and dask worker servers had stopped coming up, because the cluster autoscaler
had been evicted when kube-dns scaled up as the cluster grew. Resizing the master to a
larger instance type brought the cluster back up.

Timeline

All times are in IST

2021-07-15 09:02 PM

Hub issues are reported on slack.

09:06 PM

Hub issues are acknowledged, and an investigation is started.

09:35 PM

A large number of dask-worker pods were in Pending
state, and a large number of nodes were in a NotReady state.

Looking at kubectl describe output for a dask-worker pod, it was just Pending,
with no recent events - so this looked like an autoscaling problem.
The nodes being NotReady was a separate mystery.
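
For reference, a minimal sketch of the checks described above, assuming kubectl access
to the cluster; the namespace and pod name here are illustrative, not the actual
carbonplan values:

```bash
# Pods stuck in Pending in the hub namespace ("carbonplan" is an assumption)
kubectl get pods -n carbonplan --field-selector=status.phase=Pending

# Node health: NotReady nodes show up in the STATUS column
kubectl get nodes

# Describe one stuck pod; with a healthy cluster-autoscaler you would
# normally expect a TriggeredScaleUp event here, so its absence points
# at the autoscaler rather than at the pod itself
kubectl describe pod <some-dask-worker-pod> -n carbonplan
```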

As the node count increased, kube-dns scaled up more replicas. At some point this
evicted the cluster-autoscaler - and, somehow, some of the kube-dns pods themselves?
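
That scaling behaviour comes from a DNS horizontal autoscaler that adds kube-dns
replicas as nodes and cores are added. A hedged way to inspect it, assuming the
standard kops kube-dns-autoscaler addon names (they may differ on this cluster):

```bash
# Current kube-dns and dns-autoscaler deployments in kube-system
kubectl -n kube-system get deployments kube-dns kube-dns-autoscaler

# The linear scaling parameters (replicas grow with node/core count) that
# the cluster-proportional-autoscaler reads from its ConfigMap
kubectl -n kube-system get configmap kube-dns-autoscaler -o yaml
```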

09:42 PM

Since about 20 nodes seemed stuck in the NotReady state, and 20 is the current
per-instancegroup limit in the carbonplan kops config, that limit was raised to
500 nodes per instancegroup. This had no immediate effect.
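
For the record, this kind of limit change with kops looks roughly like the following;
the cluster and instancegroup names are illustrative, and KOPS_STATE_STORE is assumed
to already point at the carbonplan state store:

```bash
# List the instance groups to find the one that backs the dask workers
kops get instancegroups --name carbonplanhub.k8s.local

# Opens the spec in $EDITOR; raise spec.maxSize (e.g. to 500) and save
kops edit instancegroup nodes --name carbonplanhub.k8s.local

# Push the changed limit to the cloud provider
kops update cluster --name carbonplanhub.k8s.local --yes
```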

10:23 PM

kube-dns is brought back up by deleting some other pods to free capacity, which
lets the nodes move back into the Ready state. But the cluster-autoscaler is still
down, so no new nodes come up.

10:45 PM

kube-dns increases its number of replicas based on the total number of nodes, and
this seems to be what set off the sequence of events leading to this outage. The
core node was full and the autoscaler was out of commission, so increasing the
capacity of the core node should help. The plan: resize the core node from
t3.medium to m5.xlarge.

kops rolling-update fails validation, since it requires the kubernetes core pods
to be 'Running' before it will proceed. All the deployments in the staging
namespace are deleted to make room for the core pods.
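
A hedged sketch of what the resize looks like, with the same illustrative names;
the real master instancegroup is typically named after its availability zone:

```bash
# Change spec.machineType from t3.medium to m5.xlarge in the editor
kops edit instancegroup master-us-west-2a --name carbonplanhub.k8s.local

# Apply the change, then roll the master. rolling-update validates the
# cluster first, which is why room had to be made for the core pods;
# --cloudonly exists as a last resort to skip that validation.
kops update cluster --name carbonplanhub.k8s.local --yes
kops rolling-update cluster --name carbonplanhub.k8s.local \
  --instance-group master-us-west-2a --yes
```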

11:35 PM

The new master node is up, and pods are schedulable again. The cluster-autoscaler
comes back up and winds down the formerly NotReady nodes. Things are back to
working order.
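
Checks along these lines confirm this kind of recovery (illustrative names again;
the app=cluster-autoscaler label is an assumption about how the autoscaler is
deployed here):

```bash
# kops' own health check: masters and nodes up, system pods Running
kops validate cluster --name carbonplanhub.k8s.local

# No more NotReady nodes, and the autoscaler pod is back
kubectl get nodes
kubectl -n kube-system get pods -l app=cluster-autoscaler
```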

What went wrong

  1. The core node was too small for the cluster once it was scaled up.
  2. The autoscaler working depends on the cluster working, but the cluster working is
    sometimes dependent on the autoscaler. In this case we got caught in that circular
    dependency, and human intervention was needed to break it.
  3. The maximum size of the instancegroups was too small for the usage they were
    getting.
  4. There is no prometheus / grafana setup on this cluster, so we could not diagnose
    the problem in more depth.
  5. There was only one core node, so there was limited redundancy.

Where we got lucky

  1. Folks were active on slack, so the incident was noticed and responded to
    immediately.

Action items

Process improvements

  1. Set up a support + escalation pathway that notifies people appropriately,
    without relying on slack. (Team process for support using FreshDesk team-compass#167)
  2. Develop a better understanding of intended workloads for each hub, so we can make
    better tradeoffs between base cost minimization and redundancy. (Improve our
    understanding of expected workloads per hub #589)

Technical improvements

  1. Find a way to mark the cluster autoscaler as critical so it doesn't get evicted;
    a sketch of one possible approach follows this list. (Mark the cluster autoscaler
    as cluster-critical kubernetes/kops#12004)
  2. Increase the master instance size for the carbonplan cluster, and make sure there
    is more than 1 master node. (Make carbonplan cluster more resilient #525)
  3. Deploy prometheus and grafana for the carbonplan cluster so we have more
    observability into the cluster. (Setup prometheus + grafana for carbonplan #533)
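
For item 1, a sketch of one generic approach (not necessarily how kops ends up
shipping it in kubernetes/kops#12004): give the autoscaler's deployment a critical
priority class so the scheduler preempts other pods rather than the autoscaler, and
mark its pod as not safe to evict during scale-down. The deployment name and
namespace are assumptions about how the autoscaler is installed here:

```bash
# Assumed deployment name/namespace: cluster-autoscaler in kube-system
kubectl -n kube-system patch deployment cluster-autoscaler --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}},"spec":{"priorityClassName":"system-cluster-critical"}}}}'
```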

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • All actionable items above have linked GitHub Issues
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jul 17, 2021
@choldgraf
Member

This is an awesome write up, thanks @yuvipanda 🙂

@damianavila
Contributor

I could not agree more, this was a really nice report to look at and learn from.
Thanks for taking the time to write it @yuvipanda.

I guess we can close this one now or is there any other actionable item from here?

@choldgraf
Member

@damianavila good question, @yuvipanda and I were chatting about this today as well.

What would the "actions checklist" look like for these reports? I think it's good to have something like this so that we know when to close them and what to do.

@damianavila
Contributor

What would the "actions checklist" look like for these reports?

I think what @yuvipanda wrote as improvements should actually be a checklist pointing to specific issues for each of them (if they are different enough). Once those new issues are well-scoped and ready to work on, then I guess we can close the report? (I mean, we do not need to complete and close the child issues to close the report, we just need to evolve them into ready-to-work material, IMHO.)

@choldgraf
Member

OK I opened up #553 so that we have some concrete actions to work from.

I think on this issue, the only thing we don't have a clear next step for is

Find a better understanding of intended workloads for each hub, so we
can make better tradeoffs about base cost minimization and redundancy.

That's more of a longer-term goal than a specific "issue" to track, so maybe we can just close this? Or should we try to track that somehow before closing?

@damianavila
Contributor

That's more of a longer-term goal than a specific "issue" to track, so maybe we can just close this? Or should we try to track that somehow before closing?

Maybe a ticket with that content as a trigger for the discussion? Otherwise, we are going to "miss" this idea/thought, IMHO.

@choldgraf
Member

ok, I've opened up #589 and will now close this issue since we're tracking each of the "next steps"
