Define escalation practices when there are hub outages #1118

Open
choldgraf opened this issue Mar 15, 2022 · 15 comments

choldgraf commented Mar 15, 2022

Context

Hubs will experience outages of different magnitudes, and these should trigger varying degrees of response from our team. We want to find a balance between sustainable practices for our team and ensuring that our communities don't feel too much pain from outages.

We have an Incident Commander-style process for handling roles, communication, etc. during incidents. However, we have not yet defined a process for escalating alerts and notifications to specific people when two conditions are true:

  • We have an incident that requires immediate attention
  • A key expert is not available to resolve the incident

Proposal

We should define some kind of Pager-style mechanism that can actively ping certain team members during incidents where their time is needed. We should define this process in a way that:

  • Is efficient and quickly gets the right information to the right person
  • Spreads the load across team members in an equitable way
  • Is realistic about our capacity and our promises around uptime/SLAs (i.e., we shouldn't be too hard on ourselves)

A rough approach is to define an on-call engineer who makes themselves available to be actively pinged in the event that an incident is declared. This role would then cycle through our engineering team over time, so that no single team member must respond to incidents too often.
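
For illustration only, here is a minimal sketch of how such a rotation could be computed. The roster, anchor date, and one-week rotation period are all assumptions for the example, not a decided policy:

```python
from datetime import date

# Hypothetical roster and rotation settings (not a decided policy).
ENGINEERS = ["engineer-a", "engineer-b", "engineer-c"]
ROTATION_START = date(2022, 3, 14)  # arbitrary anchor date for the cycle
ROTATION_DAYS = 7                   # each engineer is on call for one week


def on_call(today: date) -> str:
    """Return the engineer on call for the given date, cycling round-robin."""
    periods_elapsed = (today - ROTATION_START).days // ROTATION_DAYS
    return ENGINEERS[periods_elapsed % len(ENGINEERS)]


print(on_call(date.today()))
```

In practice a tool like PagerDuty or OpsGenie would manage the schedule itself (including overrides for vacations), but the underlying rotation logic is this simple.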

References

@sgibson91

I think the Slack discussion @yuvipanda and I had around https://www.pagerduty.com/ and https://www.atlassian.com/software/opsgenie will be relevant here. The idea is that for major outages, there should be an on-call engineer to respond, so that the support steward isn't the only one who shoulders responsibility for updating the client and exploring the issue.

@damianavila

> support steward isn't the only one who shoulders responsibility for updating the client and exploring the issue.

IMHO, the support steward should not "feel" responsible for exploring the issue (although they might be involved in updating the client). I guess this distinction is actually part of the discussion in #1068.

@sgibson91

@damianavila sure, but I've certainly been posting in the Slack channel about outages and just had to do my best until someone came online. I think having someone to page will help that feeling of "I can't do anything more right now".

@choldgraf

Given that we have a PR open to define an incident commander and a more complex response process:

Should we re-scope this issue to explicitly be about "pager-style" escalation practices? E.g., some system to ping a specific person via a non-Slack method if a particular problem emerges?

Or should we consider 2i2c-org/team-compass#422 to be enough, close this, iterate with that system for a bit, and decide whether we need something like a dedicated Pager?

@yuvipanda

> Should we re-scope this issue to explicitly be about "pager-style" escalation practices?

Yes! I think this is important as otherwise the 'currently awake' people can feel pretty overwhelmed sometimes.

I think specifically for outages, as a short-term non-scalable measure, I'm always happy to be alerted via non-Slack methods (I think most people have my phone number). I think that's an important senior engineer responsibility.

@damianavila

> I think that's an important senior engineer responsibility.

I totally appreciate that @yuvipanda, BUT we need to find a way/process so we do not need to ping you at your personal phone number. So +1 on repurposing this issue to be about the "pager-style" tool.

@choldgraf

OK, I've re-worked the top comment in this one to focus more on Pager-style updates. Also added some links!

@yuvipanda

@damianavila totally agree this isn't sustainable long-term! I just wanted to volunteer for that right now, since outages and escalations will continue to happen while we figure out the process.


yuvipanda commented Sep 8, 2022

After #1687, @jmunroe and I are investigating using PagerDuty primarily for incident response (not using any of its automated alerting features).

Stage 1: Incident Response

Stage 2: Escalation

  • Describe the process for escalating issues when needed, including a separate escalation team
  • Describe how our engineers should have local notifications set up (PagerDuty app? SMS? calls?) and the specific cases where they might get notified - I think if you aren't part of the escalation team you should never actually get paged via PagerDuty.

@yuvipanda

I'm also looking at OpsGenie - in particular, they have a 'quiet hours' feature that seems more intuitive than PagerDuty's scheduling. However, the OpsGenie Slack integration doesn't let you trigger a new incident from Slack, which is boo :(

@yuvipanda

It's absolutely important that engineers can control how and when they are notified - we don't want this to become a traditional 'on-call' situation.

@yuvipanda

@jmunroe and I spent a bunch of time talking about this, in particular role-playing how it should have played out with today's UToronto outage.

A proposed workflow is that after an incident is created, PagerDuty evaluates the current local timezone and stated preferences of all engineers, and then sends everyone who wants to be notified at that point a notification via methods of their own choosing (SMS, app, phone call). Engineers can then acknowledge the alert if they are able to provide assistance, or do nothing if they can't. After an hour (or a configurable time period!), if nobody has acknowledged the alert, it'll automatically escalate to an (opt-in) group of second-tier folks who can respond.

We're going to play with PagerDuty rules to try to make this possible.
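
To make the intended behaviour concrete, here is a rough sketch of that logic in Python. This is not PagerDuty's API; the preference model, overnight quiet hours, and the acknowledgement flag are all assumptions drawn from the description above:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


@dataclass
class Engineer:
    """Hypothetical per-engineer notification preferences."""
    name: str
    tz: str                        # IANA timezone, e.g. "America/Toronto"
    quiet_start: int = 22          # local hour when quiet hours begin
    quiet_end: int = 8             # local hour when quiet hours end
    contact_methods: list = field(default_factory=lambda: ["app"])


def wants_page_now(engineer: Engineer, now_utc: datetime) -> bool:
    """Only page an engineer outside their quiet hours (assumed to span overnight)."""
    local_hour = now_utc.astimezone(ZoneInfo(engineer.tz)).hour
    in_quiet_hours = local_hour >= engineer.quiet_start or local_hour < engineer.quiet_end
    return not in_quiet_hours


def page_for_incident(first_tier, second_tier, now_utc, acknowledged: bool):
    """Page opted-in engineers; escalate to the second tier if nobody acknowledges."""
    for engineer in (e for e in first_tier if wants_page_now(e, now_utc)):
        print(f"page {engineer.name} via {', '.join(engineer.contact_methods)}")
    if not acknowledged:
        # In the real workflow a one-hour (configurable) timer would drive this step.
        for engineer in second_tier:
            print(f"escalate to {engineer.name}")


# Example usage with made-up engineers:
team = [Engineer("engineer-a", "America/Toronto", contact_methods=["sms", "app"]),
        Engineer("engineer-b", "Europe/London")]
escalation_team = [Engineer("engineer-c", "America/Los_Angeles")]
page_for_incident(team, escalation_team, datetime.now(timezone.utc), acknowledged=False)
```

The key point of the design is that the first wave is filtered by each engineer's own stated preferences, so nobody is paged by default, and escalation to the second tier only happens when no one acknowledges.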

However, even without any of this, it still provides value by streamlining the 'incident response' process and removing a medium (GitHub) that is currently a bit of a bottleneck, moving it all to Slack instead.

@damianavila damianavila moved this from Needs Shaping / Refinement to In progress in DEPRECATED Engineering and Product Backlog Sep 12, 2022
@yuvipanda

Removing myself as I'm not currently working on it. #1804 is related though, and I am working on that.

@yuvipanda

However, to prevent the perfect from being the enemy of the good, please do consider that anyone on the team can always reach out to me at any time to escalate an outage.

@yuvipanda yuvipanda removed their assignment Nov 9, 2022
@damianavila damianavila moved this from In progress to Waiting in DEPRECATED Engineering and Product Backlog Nov 23, 2022
@damianavila

Related: 2i2c-org/team-compass#763
