
Incident Management Protocol

Dani Donisa edited this page Feb 7, 2024 · 31 revisions

Important

To quote an SRE book: Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible.

About our incident management...

Limiting the disruption to everyone involved:

That means the people operating production, the other developers, and the users affected by the problem. Managing an incident means documenting and resolving a service disruption in an orderly and stress-free fashion, despite the serious problem going on. It's the equivalent of the emergency evacuation plan for a building: the plan you can rely on to get out of a bad situation without having to think too much about how.

There are 3 things needed for effective incident management:

  • Everybody managing the incident knows their roles
  • The Incident State Document is kept up to date during the ongoing incident
  • The Post Mortem Report is written after the incident is resolved

You can probably guess from the amount of work: this is seldom something you can do alone. Incident management is a team sport!

Roles

To be effective, everybody working on the incident resolution must know their role. Separating the roles makes clear what each role should and should not do, which avoids confusion and chaos about who is responsible for what. Here is who needs to do what when the service is down or there is another major disruption.

The Incident Manager

By default the Production-Squad is the incident manager.

However, anybody can declare themselves incident manager, especially if they notice that the Production-Squad is not responding for some reason.

The duties of the Incident Manager are:

  • Create an incident state document
  • Declare the incident to our team channel (:warning: We have an incident going on, follow it here: https://etherpad.opensuse.org/p/....)
  • Fulfill all the other roles (Communications & OPS) OR delegate roles to anyone in the team
  • Declare the incident as resolved
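The announcement step above could be scripted so that the wording stays consistent under pressure. The following is a minimal sketch, not part of our tooling: the function name `incident_announcement` and the pad name `example` are hypothetical, and the only things taken from this page are the message wording and the Etherpad host.

```python
# Hypothetical helper: builds the team-channel announcement quoted in the duties list.
PAD_PREFIX = "https://etherpad.opensuse.org/p/"


def incident_announcement(pad_url: str) -> str:
    """Return the announcement text for the given incident state document URL."""
    # Guard against announcing an incident without a usable document link.
    if not pad_url.startswith(PAD_PREFIX) or pad_url == PAD_PREFIX:
        raise ValueError("expected a pad URL on etherpad.opensuse.org")
    return f":warning: We have an incident going on, follow it here: {pad_url}"


# "example" is a made-up pad name used only for illustration.
print(incident_announcement(PAD_PREFIX + "example"))
```

The check on the URL is there so that nobody announces an incident before the incident state document exists.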

The Communications Role

The duties of the communications role are:

The Operator Role

The duties of the operator role are:

  • Stop the bleeding and restore the service
  • Find the root-cause

Deliverables

The Incident State Document

While the incident is ongoing, we keep an Incident State Document (template) up to date on https://etherpad.opensuse.org. We do this to build a timeline and to keep the people affected by the incident informed about what is going on.

The Post Mortem Report

To institutionalize improvement, once the incident is under control and we understand what happened, we write a Post Mortem Report on our blog. These reports are based on our Post Mortem Template and the Incident State Document.

We do this to ensure we...

  • investigate the root cause of the failure
  • determine and track follow-up action items
  • create a continuous, transparent feedback loop for ourselves, our users/customers and all people in the wider community

FAQ

When to Declare an Incident?

It's better to declare an incident early and call it off later than to spin up an incident response team when everything is already messed up by unorganized tinkering.

When to Close an Incident?

The incident is closed when the involved services are back to normal operation. This does not include those long-term tasks created during the incident response.

What to do with alerts that are false positives?

Please reply to them (in a Slack thread) explaining why they are false.

What to communicate to people?

Coming up with sentences to use in communication is hard during an incident. Find some templates below; you can find more on the internet.

Service Disruption

Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution. 
All build.opensuse.org users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.

General Unresponsiveness

Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may cause pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.
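The `**ADD_GENERAL_IMPACT**` marker in the General Unresponsiveness template has to be replaced before the notice goes out. As a rough sketch of how that could be checked automatically, assuming a hypothetical helper `fill_impact` and using only a two-line excerpt of the template for illustration:

```python
IMPACT_PLACEHOLDER = "**ADD_GENERAL_IMPACT**"  # marker used in the template above


def fill_impact(template: str, impact: str) -> str:
    """Replace the impact marker, rejecting an empty impact or a template without the marker."""
    impact = impact.strip()
    if not impact:
        raise ValueError("impact must name the affected users")
    if IMPACT_PLACEHOLDER not in template:
        raise ValueError("template has no impact placeholder")
    return template.replace(IMPACT_PLACEHOLDER, impact)


# Two-line excerpt of the General Unresponsiveness template, for illustration only.
excerpt = (
    "Title: Open Build Service Page Unresponsiveness\n"
    "**ADD_GENERAL_IMPACT** users may be affected.\n"
)
print(fill_impact(excerpt, "All build.opensuse.org"))
```

Failing loudly on an empty impact is a deliberate choice here: a notice that still reads "users may be affected" without saying which users is easy to send by accident during an incident.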

Where to communicate with people?

Our communication channels include
