Incident Report: CarbonPlan hub outage July 15 2021 #524
Comments
This is an awesome write-up, thanks @yuvipanda 🙂
I could not agree more, this was a really nice report to look at and learn from. I guess we can close this one now, or is there any other actionable item from here?
@damianavila good question, @yuvipanda and I were chatting about this today as well. What would the "actions checklist" look like for these reports? I think it's good to have something like this so that we know when to close them and what to do.
I think what @yuvipanda wrote as improvements should actually be a checklist pointing to specific issues for each of them (if they are different enough). Once those new issues are well-scoped and ready to work on, then I guess we can close the report? (I mean, we do not need to complete and close the child issues to close the report, we just need to evolve them into ready-to-work material, IMHO.)
OK, I opened up #553 so that we have some concrete actions to work from. I think on this issue, the only thing we don't have a clear next step for is the improvement about understanding our expected workloads so we can make better cost and redundancy tradeoffs.
That's more of a longer-term goal than a specific "issue" to track, so maybe we can just close this? Or should we try to track that somehow before closing?
Maybe a ticket with that content as a trigger for the discussion? Otherwise, we are going to "miss" this idea/thought, IMHO.
OK, I've opened up #589 and will now close this issue, since we're tracking each of the "next steps".
Summary
New user and dask worker servers had stopped working because the cluster autoscaler was evicted when kube-dns pods scaled up as the cluster grew. Resizing the master node to a larger instance type brought the cluster back up.
Timeline
All times are in IST
2021-07-15 09:02 PM
Hub issues are reported on Slack.
09:06 PM
Hub issues are acknowledged, and an investigation is started.
09:35 PM
A large number of dask-worker pods were in a Pending state, and a large number of nodes were in a NotReady state. Looking at describe output for a dask-worker pod, it was just Pending without any newer events, so this seemed like an autoscaling problem. The nodes being NotReady was a different mystery. As the node count increased, kube-dns pods scaled up. At some point this kicked out the cluster-autoscaler, but also, somehow, the kube-dns pods themselves?
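A minimal sketch of the kubectl triage this kind of investigation typically involves; the namespace and pod names below are placeholders, not taken from the incident:

```bash
# Namespace and pod names are assumptions / placeholders.

# List pods and their phases; stuck dask-worker pods show up as Pending
kubectl -n carbonplan get pods -o wide | grep dask-worker

# Inspect a stuck pod's events; a Pending pod with no recent scale-up
# event usually points at the cluster autoscaler
kubectl -n carbonplan describe pod <dask-worker-pod-name>

# See which nodes are NotReady
kubectl get nodes

# Check the core pods (kube-dns, cluster-autoscaler) in kube-system
kubectl -n kube-system get pods
```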
09:42 PM
Since there seemed to be about 20 nodes stuck in the NotReady
state, and that's the current per-instancegroup limit in the carbonplan
kops config, that limit was increased to 500 nodes per instancegroup.
This didn't have any immediate effect.
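For reference, raising an instancegroup size limit in kops looks roughly like the following sketch; the instancegroup name is an assumption, not the actual carbonplan config, and it assumes the kops state store is already configured:

```bash
# Open the instancegroup spec in an editor and raise maxSize
# (instancegroup name "nodes" is an assumption)
kops edit instancegroup nodes
#   spec:
#     maxSize: 500   # previously 20

# Apply the change to the cluster
kops update cluster --yes
```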
10:23 PM
kube-dns is brought back up by deleting some other pods, and this helps make the nodes go into Ready states. But the cluster-autoscaler is still dead, so no new nodes come back up.
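A hedged sketch of what "deleting some other pods" to free room on the core node can look like; node, namespace, and pod names are placeholders:

```bash
# See what is running on the (full) core node
kubectl get pods --all-namespaces -o wide \
  --field-selector spec.nodeName=<core-node-name>

# Delete a non-critical pod to free capacity so kube-dns can be scheduled
kubectl -n <some-namespace> delete pod <non-critical-pod-name>

# Confirm the kube-dns pods come back up
kubectl -n kube-system get pods | grep kube-dns
```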
10:45 PM
kube-dns increases the number of replicas based on the total number of nodes, and this seemed to set off the sequence of events leading to this outage. The core node was full, and the autoscaler was out of commission, so increasing the capacity of the core node should help. So let's try increasing the size of the core node from t3.medium to m5.xlarge. kops rolling-update fails validation, since it needs the kubernetes core pods to be 'Running' for it to work. All the deployments in the staging namespace are deleted to make space for the core pods.
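As an illustration rather than the exact commands run, resizing the master instancegroup with kops looks roughly like this; the instancegroup name is an assumption (it can be checked with kops get instancegroups):

```bash
# Edit the master instancegroup and bump the machine type
# (instancegroup name is an assumption)
kops edit instancegroup master-us-west-2a
#   spec:
#     machineType: m5.xlarge   # previously t3.medium

# Apply the config change
kops update cluster --yes

# Free the core node so the kubernetes core pods can reach 'Running'
# and rolling-update validation can pass
kubectl -n staging delete deployments --all

# Roll the master so the new instance type takes effect
kops rolling-update cluster --yes
```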
11:35 PM
New master node is up, and pods are now schedulable. cluster-autoscaler kicks back up again and winds down the formerly NotReady nodes. Things are back to working again now.
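A quick sanity check for this kind of recovery might look like the following; this is illustrative, not taken from the incident logs:

```bash
# All nodes should report Ready again
kubectl get nodes

# cluster-autoscaler and kube-dns should be Running in kube-system
kubectl -n kube-system get pods | grep -E 'cluster-autoscaler|kube-dns'
```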
What went wrong
- Recovering cluster capacity is sometimes dependent on the autoscaler working. In this case, we got caught up in that circular dependency, and human intervention was needed to unbreak it.
- We had no easy visibility into how full the core node was getting.
- More observability in the cluster would have helped us understand and diagnose these problems better.
Where we got lucky
- The outage was reported on Slack and an engineer was available to acknowledge it and start investigating almost immediately.
Action items
Process improvements
- Establish a support process so that hub users can report problems and get help without reliance on Slack (Team process for support using FreshDesk, team-compass#167).
- Improve our understanding of expected workloads per hub so we can make better tradeoffs about base cost minimization and redundancy (#589).
Technical improvements
- Mark the cluster autoscaler as cluster-critical so it doesn't get evicted (kubernetes/kops#12004); see the sketch after this list.
- Make sure there is more than 1 master node, so the carbonplan cluster is more resilient (#525).
- Set up more observability in the cluster, starting with prometheus + grafana for carbonplan (#533).
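As a hedged illustration of the cluster-critical item above (this is the general Kubernetes mechanism, not necessarily the change made in kubernetes/kops#12004), giving the autoscaler the built-in system-cluster-critical priority class makes the scheduler evict lower-priority pods before it:

```bash
# Patch the cluster-autoscaler deployment to use the built-in
# system-cluster-critical priority class (deployment name and namespace
# are assumptions about a typical setup)
kubectl -n kube-system patch deployment cluster-autoscaler --type=json \
  -p='[{"op": "add",
        "path": "/spec/template/spec/priorityClassName",
        "value": "system-cluster-critical"}]'
```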