Define escalation practices when there are hub outages #1118

Open
choldgraf opened this issue Mar 15, 2022 · 15 comments

choldgraf commented Mar 15, 2022

Context

Hubs will experience outages of different magnitudes, and these should trigger varying degrees of response from our team. We want to find a balance between sustainable practices for our team and ensuring that our communities don't feel too much pain from outages.

We have an Incident Commander-style process for handling roles, communication, etc. during incidents. However, we have not yet defined a process for escalating alerts and notifications to specific people when two conditions are true:

  • We have an incident that requires immediate attention
  • A key expert is not available to resolve the incident

Proposal

We should define some kind of Pager-style mechanism that can actively ping certain team members during incidents where their time is needed. We should define this process in a way that:

  • Is efficient and quickly gets the right information to the right person
  • Spreads the load across team members in an equitable way
  • Is realistic about our capacity and our promises around uptime/SLAs (i.e., we shouldn't be too hard on ourselves)

A rough approach is to define an on-call engineer who makes themselves available to be actively pinged in the event that an incident is declared. This role would then cycle through our engineering team over time, so that no single team member must respond to incidents too often.
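
For illustration only, here is a minimal sketch of how such a rotation could be computed. The roster, anchor date, and one-week rotation period are all assumptions for the example, not a decided policy:

```python
from datetime import date

# Hypothetical roster and rotation settings (not a decided policy).
ENGINEERS = ["engineer-a", "engineer-b", "engineer-c"]
ROTATION_START = date(2022, 3, 14)  # arbitrary anchor date for the cycle
ROTATION_DAYS = 7                   # each engineer is on call for one week


def on_call(today: date) -> str:
    """Return the engineer on call for the given date, cycling round-robin."""
    periods_elapsed = (today - ROTATION_START).days // ROTATION_DAYS
    return ENGINEERS[periods_elapsed % len(ENGINEERS)]


print(on_call(date.today()))
```

In practice a tool like PagerDuty or OpsGenie would manage the schedule itself (including overrides for vacations), but the underlying rotation logic is this simple.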

References

@sgibson91

I think the Slack discussion @yuvipanda and I had around https://www.pagerduty.com/ and https://www.atlassian.com/software/opsgenie will be relevant here. The idea is that for major outages, there should be an on-call engineer to respond, so that the support steward isn't the only one who shoulders responsibility for updating the client and exploring the issue.

@damianavila

> support steward isn't the only one who shoulders responsibility for updating the client and exploring the issue.

IMHO, the support steward should not "feel" responsible for exploring the issue (although they might be involved in updating the client). I guess this distinction is actually part of the discussion in #1068.

@sgibson91

@damianavila sure, but I've certainly been posting in the Slack channel about outages and just had to do my best until someone came online. I think having someone to page will help that feeling of "I can't do anything more right now".

@choldgraf

Given that we have a PR open to define an incident commander and a more complex response process:

Should we re-scope this issue to explicitly be about "pager-style" escalation practices? E.g., some system to ping a specific person via a non-Slack method if a particular problem emerges?

Or should we consider 2i2c-org/team-compass#422 to be enough, close this, iterate with that system for a bit, and decide whether we need something like a dedicated Pager?

@yuvipanda

> Should we re-scope this issue to explicitly be about "pager-style" escalation practices?

Yes! I think this is important as otherwise the 'currently awake' people can feel pretty overwhelmed sometimes.

I think specifically for outages, as a short-term non-scalable measure, I'm always happy to be alerted via non-Slack methods (I think most people have my phone number). I think that's an important senior engineer responsibility.

@damianavila

> I think that's an important senior engineer responsibility.

I totally appreciate that @yuvipanda, BUT we need to find a way/process so we do not need to ping you at your personal phone number. So +1 on repurposing this issue to be about the "pager-style" tool.

@choldgraf

OK, I've re-worked the top comment in this one to focus more on Pager-style updates. Also added some links!

@yuvipanda

@damianavila totally agree this isn't sustainable long-term! I just wanted to volunteer for that right now, since outages and escalations will continue to happen while we figure out the process.


yuvipanda commented Sep 8, 2022

After #1687, @jmunroe and I are investigating using PagerDuty primarily for incident response (not using any of its automated alerting features).

Stage 1: Incident Response

Stage 2: Escalation

  • Describe the process for escalating issues when needed, including a separate escalation team
  • Describe how our engineers should have local notifications set up (PagerDuty app? SMS? calls?) and the specific cases where they might get notified - I think if you aren't part of the escalation team you should never actually get paged via PagerDuty.

@yuvipanda

I'm also looking at OpsGenie - in particular, they have a 'quiet hours' feature that seems more intuitive than PagerDuty's scheduling. However, the OpsGenie Slack integration doesn't let you trigger a new incident from Slack, which is boo :(

@yuvipanda

It's absolutely important that engineers can control how and when they are notified - we don't want this to become a traditional 'on-call' situation.

@yuvipanda

@jmunroe and I spent a bunch of time talking about this, in particular role-playing how it should have played out with today's UToronto outage.

A proposed workflow is that after an incident is created, PagerDuty evaluates the current local timezone and stated preferences of all engineers, and then sends everyone who wants to be notified at that point a notification via methods of their own choosing (SMS, app, phone call). Engineers can then acknowledge the alert if they are able to provide assistance, or do nothing if they can't. After an hour (or a configurable time period!), if nobody has acknowledged the alert, it'll automatically escalate to an (opt-in) group of second-tier folks who can respond.

We're going to play with PagerDuty rules to try to make this possible.
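
To make the intended behaviour concrete, here is a rough sketch of that logic in Python. This is not PagerDuty's API; the preference model, overnight quiet hours, and the acknowledgement flag are all assumptions drawn from the description above:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


@dataclass
class Engineer:
    """Hypothetical per-engineer notification preferences."""
    name: str
    tz: str                        # IANA timezone, e.g. "America/Toronto"
    quiet_start: int = 22          # local hour when quiet hours begin
    quiet_end: int = 8             # local hour when quiet hours end
    contact_methods: list = field(default_factory=lambda: ["app"])


def wants_page_now(engineer: Engineer, now_utc: datetime) -> bool:
    """Only page an engineer outside their quiet hours (assumed to span overnight)."""
    local_hour = now_utc.astimezone(ZoneInfo(engineer.tz)).hour
    in_quiet_hours = local_hour >= engineer.quiet_start or local_hour < engineer.quiet_end
    return not in_quiet_hours


def page_for_incident(first_tier, second_tier, now_utc, acknowledged: bool):
    """Page opted-in engineers; escalate to the second tier if nobody acknowledges."""
    for engineer in (e for e in first_tier if wants_page_now(e, now_utc)):
        print(f"page {engineer.name} via {', '.join(engineer.contact_methods)}")
    if not acknowledged:
        # In the real workflow a one-hour (configurable) timer would drive this step.
        for engineer in second_tier:
            print(f"escalate to {engineer.name}")


# Example usage with made-up engineers:
team = [Engineer("engineer-a", "America/Toronto", contact_methods=["sms", "app"]),
        Engineer("engineer-b", "Europe/London")]
escalation_team = [Engineer("engineer-c", "America/Los_Angeles")]
page_for_incident(team, escalation_team, datetime.now(timezone.utc), acknowledged=False)
```

The key point of the design is that the first wave is filtered by each engineer's own stated preferences, so nobody is paged by default, and escalation to the second tier only happens when no one acknowledges.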

However, even without any of this, it still provides value by streamlining the 'incident response' process and removing a medium (GitHub) that is currently a bit of a bottleneck, moving it all to Slack instead.

@damianavila damianavila moved this from Needs Shaping / Refinement to In progress in DEPRECATED Engineering and Product Backlog Sep 12, 2022
@yuvipanda

Removing myself as I'm not currently working on it. #1804 is related though, and I am working on that.

@yuvipanda

However, to prevent the perfect from being the enemy of the good, please do consider that anyone on the team can always reach out to me at any time to escalate an outage.

@yuvipanda yuvipanda removed their assignment Nov 9, 2022
@damianavila damianavila moved this from In progress to Waiting in DEPRECATED Engineering and Product Backlog Nov 23, 2022
@damianavila

Related: 2i2c-org/team-compass#763
