Incident Report: CarbonPlan hub outage July 15 2021 #524

Closed
2 of 3 tasks
yuvipanda opened this issue Jul 17, 2021 · 7 comments

yuvipanda commented Jul 17, 2021

Summary

New user and dask worker servers had stopped coming up, because the cluster autoscaler
had been evicted when kube-dns scaled up as the cluster grew. Resizing the master to a
larger instance type brought the cluster back up.

Timeline

All times are in IST

2021-07-15 09:02 PM

Hub issues are reported on slack.

09:06 PM

Hub issues are acknowledged, and an investigation is started.

09:35 PM

A large number of dask-worker pods were in Pending
state, and a large number of nodes were in a NotReady state.

Looking at kubectl describe output for a dask-worker pod, it was just Pending,
with no recent events - so this looked like an autoscaling problem.
The nodes being NotReady was a separate mystery.
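
For reference, a minimal sketch of the checks described above, assuming kubectl access
to the cluster; the namespace and pod name here are illustrative, not the actual
carbonplan values:

```bash
# Pods stuck in Pending in the hub namespace ("carbonplan" is an assumption)
kubectl get pods -n carbonplan --field-selector=status.phase=Pending

# Node health: NotReady nodes show up in the STATUS column
kubectl get nodes

# Describe one stuck pod; with a healthy cluster-autoscaler you would
# normally expect a TriggeredScaleUp event here, so its absence points
# at the autoscaler rather than at the pod itself
kubectl describe pod <some-dask-worker-pod> -n carbonplan
```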

As the node count increased, kube-dns scaled up more replicas. At some point this
evicted the cluster-autoscaler - and, somehow, some of the kube-dns pods themselves?
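
That scaling behaviour comes from a DNS horizontal autoscaler that adds kube-dns
replicas as nodes and cores are added. A hedged way to inspect it, assuming the
standard kops kube-dns-autoscaler addon names (they may differ on this cluster):

```bash
# Current kube-dns and dns-autoscaler deployments in kube-system
kubectl -n kube-system get deployments kube-dns kube-dns-autoscaler

# The linear scaling parameters (replicas grow with node/core count) that
# the cluster-proportional-autoscaler reads from its ConfigMap
kubectl -n kube-system get configmap kube-dns-autoscaler -o yaml
```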

09:42 PM

Since about 20 nodes seemed stuck in the NotReady state, and 20 is the current
per-instancegroup limit in the carbonplan kops config, that limit was raised to
500 nodes per instancegroup. This had no immediate effect.
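
For the record, this kind of limit change with kops looks roughly like the following;
the cluster and instancegroup names are illustrative, and KOPS_STATE_STORE is assumed
to already point at the carbonplan state store:

```bash
# List the instance groups to find the one that backs the dask workers
kops get instancegroups --name carbonplanhub.k8s.local

# Opens the spec in $EDITOR; raise spec.maxSize (e.g. to 500) and save
kops edit instancegroup nodes --name carbonplanhub.k8s.local

# Push the changed limit to the cloud provider
kops update cluster --name carbonplanhub.k8s.local --yes
```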

10:23 PM

kube-dns is brought back up by deleting some other pods to free capacity, which
lets the nodes move back into the Ready state. But the cluster-autoscaler is still
down, so no new nodes come up.

10:45 PM

kube-dns increases its number of replicas based on the total number of nodes, and
this seems to be what set off the sequence of events leading to this outage. The
core node was full and the autoscaler was out of commission, so increasing the
capacity of the core node should help. The plan: resize the core node from
t3.medium to m5.xlarge.

kops rolling-update fails validation, since it requires the kubernetes core pods
to be 'Running' before it will proceed. All the deployments in the staging
namespace are deleted to make room for the core pods.
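
A hedged sketch of what the resize looks like, with the same illustrative names;
the real master instancegroup is typically named after its availability zone:

```bash
# Change spec.machineType from t3.medium to m5.xlarge in the editor
kops edit instancegroup master-us-west-2a --name carbonplanhub.k8s.local

# Apply the change, then roll the master. rolling-update validates the
# cluster first, which is why room had to be made for the core pods;
# --cloudonly exists as a last resort to skip that validation.
kops update cluster --name carbonplanhub.k8s.local --yes
kops rolling-update cluster --name carbonplanhub.k8s.local \
  --instance-group master-us-west-2a --yes
```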

11:35 PM

The new master node is up, and pods are schedulable again. The cluster-autoscaler
comes back up and winds down the formerly NotReady nodes. Things are back to
working order.
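
Checks along these lines confirm this kind of recovery (illustrative names again;
the app=cluster-autoscaler label is an assumption about how the autoscaler is
deployed here):

```bash
# kops' own health check: masters and nodes up, system pods Running
kops validate cluster --name carbonplanhub.k8s.local

# No more NotReady nodes, and the autoscaler pod is back
kubectl get nodes
kubectl -n kube-system get pods -l app=cluster-autoscaler
```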

What went wrong

  1. The core node was too small for the cluster once it was scaled up.
  2. The autoscaler working depends on the cluster working, but the cluster working is
    sometimes dependent on the autoscaler. In this case we got caught in that circular
    dependency, and human intervention was needed to break it.
  3. The maximum size of the instancegroups was too small for the usage they were
    getting.
  4. There is no prometheus / grafana setup on this cluster, so we could not diagnose
    the problem in more depth.
  5. There was only one core node, so there was limited redundancy.

Where we got lucky

  1. Folks were active on slack, so the incident was noticed and responded to
    immediately.

Action items

Process improvements

  1. Set up a support + escalation pathway that notifies people appropriately,
    without relying on slack. (Team process for support using FreshDesk team-compass#167)
  2. Develop a better understanding of intended workloads for each hub, so we can make
    better tradeoffs between base cost minimization and redundancy. (Improve our
    understanding of expected workloads per hub #589)

Technical improvements

  1. Find a way to mark the cluster autoscaler as critical so it doesn't get evicted;
    a sketch of one possible approach follows this list. (Mark the cluster autoscaler
    as cluster-critical kubernetes/kops#12004)
  2. Increase the master instance size for the carbonplan cluster, and make sure there
    is more than 1 master node. (Make carbonplan cluster more resilient #525)
  3. Deploy prometheus and grafana for the carbonplan cluster so we have more
    observability into the cluster. (Setup prometheus + grafana for carbonplan #533)
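
For item 1, a sketch of one generic approach (not necessarily how kops ends up
shipping it in kubernetes/kops#12004): give the autoscaler's deployment a critical
priority class so the scheduler preempts other pods rather than the autoscaler, and
mark its pod as not safe to evict during scale-down. The deployment name and
namespace are assumptions about how the autoscaler is installed here:

```bash
# Assumed deployment name/namespace: cluster-autoscaler in kube-system
kubectl -n kube-system patch deployment cluster-autoscaler --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict":"false"}},"spec":{"priorityClassName":"system-cluster-critical"}}}}'
```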

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • All actionable items above have linked GitHub Issues
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jul 17, 2021
@choldgraf
Member

This is an awesome write up, thanks @yuvipanda 🙂

@damianavila
Contributor

I could not agree more, this was a really nice report to look at and learn from.
Thanks for taking the time to write it @yuvipanda.

I guess we can close this one now or is there any other actionable item from here?

@choldgraf
Member

@damianavila good question, @yuvipanda and I were chatting about this today as well.

What would the "actions checklist" look like for these reports? I think it's good to have something like this so that we know when to close them and what to do.

@damianavila
Contributor

What would the "actions checklist" look like for these reports?

I think what @yuvipanda wrote as improvements should actually be a checklist pointing to specific issues for each of them (if they are different enough). Once those new issues are well-scoped and ready to work on, then I guess we can close the report? (I mean, we do not need to complete and close the child issues to close the report, we just need to evolve them into ready-to-work material, IMHO.)

@choldgraf
Member

OK I opened up #553 so that we have some concrete actions to work from.

I think on this issue, the only thing we don't have a clear next step for is

Find a better understanding of intended workloads for each hub, so we
can make better tradeoffs about base cost minimization and redundancy.

That's more of a longer-term goal than a specific "issue" to track, so maybe we can just close this? Or should we try to track that somehow before closing?

@damianavila
Contributor

That's more of a longer-term goal than a specific "issue" to track, so maybe we can just close this? Or should we try to track that somehow before closing?

Maybe a ticket with that content as a trigger for the discussion? Otherwise, we are going to "miss" this idea/thought, IMHO.

@choldgraf
Member

ok, I've opened up #589 and will now close this issue since we're tracking each of the "next steps"
