Incident Report: CarbonPlan outage 2021-07-17 #526
Comments
Hi folks! At this time, the hub seems to have gone down again.

Thanks again for the write-up, @yuvipanda. Btw, for historical context, @jhamman's above ping was acknowledged and worked out in Slack.

Hey all - I've moved this to the Activity board so that we don't lose track of it, and also added a little checklist to make sure we follow up on creating issues before closing. I'm not sure which of the documentation/process steps above need their own issues... @yuvipanda could you advise there?

Ping @yuvipanda 😉 IMHO, all of those docs points belong to a new AWS debugging section, maybe???

OK I've added tracking issues for the process + documentation items that @yuvipanda had in the top comment, so going to close this
Summary
The cluster was out of commission because the master and core nodes
(containing the k8s control plane + hub components) died, and new
replacements couldn't automatically be brought up due to CPU quota
limits on AWS. Making the core nodes even bigger, and manually
reducing the size of the dask worker instance group (497 nodes had been
requested, about 320 had been provisioned), brought everything back.
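For a rough sense of the numbers: both r5.2xlarge (dask workers) and m5.2xlarge (core nodes) have 8 vCPUs, so a request for ~500 workers amounts to nearly 4,000 vCPUs of quota. A minimal back-of-the-envelope sketch (the account's actual vCPU quota isn't recorded here):

```python
# Back-of-the-envelope: how the dask worker scale-up could exhaust the vCPU quota.
VCPUS_PER_NODE = 8  # r5.2xlarge (dask workers) and m5.2xlarge (core) both have 8 vCPUs

requested_workers = 497
provisioned_workers = 320

print("vCPUs requested by dask workers:", requested_workers * VCPUS_PER_NODE)   # 3976
print("vCPUs actually provisioned:", provisioned_workers * VCPUS_PER_NODE)      # 2560
# Provisioning stalled around 320 workers, which suggests the account's
# on-demand vCPU quota sat near 2560 vCPUs plus whatever the existing core
# nodes used -- leaving no headroom for an 8-vCPU replacement master node.
```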
Timeline (if relevant)
All times in IST
2021-07-17 2:38 PM
In the course of writing up a previous incident report,
it was observed that the cluster was having another outage - the k8s API
and the hub were both unreachable. Investigation is started.
2:41 PM
Looking at the instances page
on the AWS console in the carbonplan account showed two master nodes,
even though our k8s API wasn't able to reach them. Upon finding their
public IP and sshing in, it was found that one of the master nodes was
maxed out on CPU, with dask-gateway responsible for most of the usage.
The gateway pods themselves didn't have a CPU limit, so they had taken
down the entire node.
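As an illustration of the kind of check involved (not the exact commands used during the incident), here's a sketch using the kubernetes Python client to find pods on a node whose containers have no CPU limit; the node name is a placeholder:

```python
# Sketch: list pods scheduled on a given node whose containers have no CPU limit.
# The node name is a placeholder for the affected master node.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE = "ip-172-20-xx-xx.compute.internal"  # hypothetical master node name

pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    for container in pod.spec.containers:
        limits = container.resources.limits or {}
        if "cpu" not in limits:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                  f"({container.name}) has no CPU limit")
```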
2:46 PM
The problematic node is rebooted via ssh, and comes back up.
The kubernetes API is reachable sporadically, but the hub is not.
2:52 PM
The core node size is increased again, to m5.2xlarge. However, EC2
doesn't bring up the new nodes, since we were hitting our CPU quotas.
Looking at the Activity tab in the AWS console for the autoscaling group
for the master shows an error message about the CPU quota being exceeded.
The dask-r5-2xlarge instancegroup had been asked to scale up to 500, and
already had about 320 nodes. This had exhausted the CPU quota, causing
problems with bringing up the new master nodes.
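The same diagnosis can be done programmatically; the sketch below uses boto3 to pull recent scaling activity for the master's autoscaling group and the account's On-Demand vCPU quota. The group name is hypothetical, and this is illustrative rather than what was actually run:

```python
# Sketch: inspect recent scaling activity for the master's autoscaling group and
# the account's On-Demand vCPU quota. The group name is a placeholder; the quota
# code is AWS's "Running On-Demand Standard instances" vCPU quota.
import boto3

MASTER_ASG = "master-us-west-2a.masters.carbonplanhub.k8s.local"  # hypothetical

autoscaling = boto3.client("autoscaling")
activities = autoscaling.describe_scaling_activities(
    AutoScalingGroupName=MASTER_ASG, MaxRecords=10
)["Activities"]
for activity in activities:
    print(activity["StatusCode"], "-", activity.get("StatusMessage", activity["Description"]))

quotas = boto3.client("service-quotas")
quota = quotas.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
print("On-Demand Standard instances vCPU quota:", quota["Quota"]["Value"])
```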
3:00 PM
The dask-r5-2xlarge autoscaling group is manually set to 0 nodes, and
AWS starts bringing these nodes down. This allows the master to come up
and start bringing everything back.
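The report doesn't record exactly how the scale-down was performed; purely as an illustration, doing it with boto3 could look roughly like this (the autoscaling group name is hypothetical):

```python
# Sketch: manually scale the dask worker autoscaling group to zero so the freed
# vCPU quota lets the replacement master node launch. The name is hypothetical.
import boto3

DASK_ASG = "dask-r5-2xlarge.carbonplanhub.k8s.local"  # hypothetical ASG name

autoscaling = boto3.client("autoscaling")
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=DASK_ASG,
    MinSize=0,
    MaxSize=0,          # also pin max so the group can't immediately scale back up
    DesiredCapacity=0,
)
```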
3:11 PM
Everything is functional again.
What went wrong
- The dask-gateway pods had no CPU limits, so they were able to use enough CPU to take down the entire cluster.
- The k8s control plane and the hub components share the same master / core nodes, so problems with the hub control plane can take down the whole cluster.
- Recovery required manually scaling the dask worker instance group down to 0, which might affect workflows.
Where we got lucky
- A previous incident report was being written up, so someone was already looking at the cluster at that time and the outage was noticed quickly.
Action items
Process improvements
- Establish a process for requesting AWS quota increases when they are needed.
- Work out appropriate CPU quotas for this hub, to evaluate along with users' expected needs.

Tracked in #591
Documentation improvements
- How to ssh into a node in a kops cluster.
- How to find the AWS autoscaling group for an instancegroup so you can look at its activity report (see the sketch after this list).
- How to manually change the number of nodes for each instancegroup via the AWS console.

Tracked in #590
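For the second item, a hedged sketch of mapping a kops instancegroup to its AWS autoscaling group with boto3, assuming kops's usual `<ig-name>.<cluster-name>` naming; both names below are placeholders:

```python
# Sketch: find the AWS autoscaling group behind a kops instancegroup and print
# its current sizing. Assumes kops's "<ig-name>.<cluster-name>" naming; both
# names are placeholders. (Only the first page of groups is fetched.)
import boto3

IG_NAME = "dask-r5-2xlarge"            # kops instancegroup of interest
CLUSTER = "carbonplanhub.k8s.local"    # hypothetical kops cluster name

autoscaling = boto3.client("autoscaling")
groups = autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]
for group in groups:
    name = group["AutoScalingGroupName"]
    if name.startswith(IG_NAME) and name.endswith(CLUSTER):
        print(name,
              "min:", group["MinSize"],
              "desired:", group["DesiredCapacity"],
              "max:", group["MaxSize"])
```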
Technical improvements
- Add CPU limits to the dask-gateway pods (see the sketch after this list). Make carbonplan cluster more resilient #525
- Set up monitoring so we can debug these kinds of issues better. Setup prometheus + grafana for carbonplan #533
- Broader resilience work: Make carbonplan cluster more resilient #525, Make carbonplan hub more resilient - part 2 #532
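For the first item, a rough sketch of what adding CPU limits to the dask-gateway pods could look like with the kubernetes Python client. The namespace, deployment, and container names are assumptions, and in practice the limits would be set through the hub's helm configuration rather than an ad-hoc patch:

```python
# Sketch: patch the dask-gateway deployment with CPU requests/limits so a runaway
# gateway can't starve its node. Namespace, deployment, and container names are
# assumptions; the real fix belongs in the hub's helm chart values.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

NAMESPACE = "carbonplan-prod"                 # hypothetical namespace
DEPLOYMENT = "api-dask-gateway"               # hypothetical deployment name
CONTAINER = "dask-gateway"                    # hypothetical container name

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": CONTAINER,
                        "resources": {
                            "requests": {"cpu": "250m", "memory": "256Mi"},
                            "limits": {"cpu": "1", "memory": "1Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name=DEPLOYMENT, namespace=NAMESPACE, body=patch)
```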