Incident Management Protocol
Important
To quote an SRE book: Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible.
About our incident management...
An incident affects everyone involved: the people operating production, the other developers, and the users affected by the problem. Managing an incident means documenting and resolving a service disruption in an orderly and stress-free fashion despite the serious problem going on. It's the equivalent of the emergency evacuation plan for real-life buildings: the plan you can rely on to get out of a bad situation without having to think too much about how.
There are 3 things needed for effective incident management:
- Everybody managing the incident knows their roles
- The Incident State Document is kept up to date during the ongoing incident
- The Post Mortem Report is written after the incident is resolved
You can probably guess from the amount of work: this is seldom something you can do alone. Incident management is a team sport!
To be effective, it's very important that everybody working on the incident resolution knows their role. Separating the roles makes clear what each role should and should not do, which avoids confusion and chaos around who is responsible for what. Here is who needs to do what when the service is down or there are other major disruptions.
By default the Production-Squad is the incident manager.
However, anybody can declare themselves incident manager, especially if they notice that the Production-Squad is not responding for some reason.
The duties of the Incident Manager are:
- Create an incident state document
- Declare the incident to our team channel (:warning: We have an incident going on, follow it here: https://etherpad.opensuse.org/p/....)
- Fulfill all the other roles (Communications & OPS) OR delegate roles to anyone in the team
- Declare the incident as resolved
The duties of the communications role are:
- Continuous update of the incident state document
- Continuous update of stakeholders through the communication channels
- Write the Post Mortem Report
The duties of the operator role are:
- Stop the bleeding and restore the service
- Find the root cause
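The role rules above can be sketched as a small data structure. This is only an illustration, not tooling we actually use: the `Incident` class, the `Role` enum, and the names `alice` and `bob` are all hypothetical. It shows the one rule that is easy to forget under stress: any role that has not been delegated stays with the incident manager.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    COMMUNICATIONS = "communications"
    OPERATOR = "operator"


@dataclass
class Incident:
    """Tracks who holds which role during an incident (hypothetical helper)."""

    # By default the Production-Squad is the incident manager.
    manager: str = "Production-Squad"
    # Maps a delegated Role to the person who took it over.
    delegates: dict = field(default_factory=dict)
    resolved: bool = False

    def delegate(self, role: Role, person: str) -> None:
        """The incident manager hands a role to someone in the team."""
        self.delegates[role] = person

    def holder(self, role: Role) -> str:
        """Roles that were never delegated stay with the incident manager."""
        return self.delegates.get(role, self.manager)

    def resolve(self) -> None:
        """Declaring the incident resolved is the incident manager's duty."""
        self.resolved = True


# alice declares herself incident manager and delegates the operator role.
incident = Incident(manager="alice")
incident.delegate(Role.OPERATOR, "bob")
print(incident.holder(Role.OPERATOR))        # bob
print(incident.holder(Role.COMMUNICATIONS))  # alice
```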
While the incident is ongoing, we update an Incident State Document (template) on https://etherpad.opensuse.org. We do this to build a timeline and to keep the people affected by the incident updated on what is going on.
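Timeline entries are easiest to reuse later if they all look the same. Here is a minimal sketch of a helper that formats one entry; the `timeline_entry` function and the line layout are assumptions for illustration and are not taken from the Incident State Document template.

```python
from datetime import datetime, timezone
from typing import Optional


def timeline_entry(author: str, message: str, when: Optional[datetime] = None) -> str:
    """Format one timeline line (the layout is an assumption, not the template's)."""
    # Default to the current time. Timezone-aware values are converted to UTC
    # so entries from people in different timezones sort consistently.
    moment = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"{moment.strftime('%Y-%m-%d %H:%M')} UTC - {author}: {message}"


# Hypothetical example entry with an explicit timestamp.
print(timeline_entry("alice", "Web UI returns 503",
                     datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)))
# 2024-01-01 12:30 UTC - alice: Web UI returns 503
```

Paste the resulting line into the etherpad as-is; when writing the Post Mortem Report, the entries can then be copied straight into the timeline.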
After the incident is under control and we have understood what has happened, we write a Root Cause Analysis/Post Mortem Report on our blog to institutionalize improvement.
We do this to ensure we...
- investigate the root cause of the failure
- determine and track follow-up action items
- create a continuous, transparent feedback loop for ourselves, our users/customers and all people in the wider community
To write up these reports we use our Post Mortem Template. We usually start with building the timeline, then derive the rest from this conversation. Check out the already published reports for inspiration.
It's better to declare an incident early and call it off later, than to spin up an incident response team when everything is messed up by unorganized tinkering.
The incident is closed when the involved services are back to normal operation. This does not include those long-term tasks created during the incident response.
Please add a reply to them (in a Slack thread) explaining why they are false.
Having to come up with sentences to use in communication is hard during an incident. Find some templates below; you can find more on the internet.
Title: Open Build Service Service Disruption
We are currently experiencing a service disruption.
Our team is working to identify the root cause and implement a solution.
All build.opensuse.org users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.
Title: Open Build Service Page Unresponsiveness
The site is currently experiencing a higher than normal amount of load, which may cause pages to be slow or unresponsive.
**ADD_GENERAL_IMPACT** users may be affected.
You can follow the current state on our incident document: https://etherpad.opensuse.org/p/KaqRIWahiQOthDdf1OeR
We will come back to you here once we have resolved the incident.
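If you would rather not edit the placeholders by hand, the templates can be filled in programmatically. Below is a minimal sketch using Python's standard `string.Template`; the `$impact` and `$incident_url` placeholder names, the `render` helper, and the example pad URL are assumptions made for illustration, and the wording is adapted from the template above.

```python
from string import Template

# Wording adapted from the "Page Unresponsiveness" template above.
UNRESPONSIVE = Template(
    "Title: Open Build Service Page Unresponsiveness\n"
    "The site is currently experiencing a higher than normal amount of load, "
    "which may cause pages to be slow or unresponsive.\n"
    "$impact users may be affected.\n"
    "You can follow the current state on our incident document: $incident_url\n"
    "We will come back to you here once we have resolved the incident."
)


def render(template: Template, impact: str, incident_url: str) -> str:
    """Fill in a message template; raises KeyError if a placeholder is left unset."""
    return template.substitute(impact=impact, incident_url=incident_url)


# Hypothetical values; use the real pad URL of the ongoing incident.
message = render(
    UNRESPONSIVE,
    impact="All build.opensuse.org",
    incident_url="https://etherpad.opensuse.org/p/example",
)
print(message)
```

`substitute` is used instead of `safe_substitute` on purpose: it fails loudly when a placeholder is missing, so a message with an unfilled placeholder cannot be sent by accident.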
Our communication channels include
- Our mailing list opensuse-buildservice@opensuse.org
- IRC (irc://irc.libera.chat/openSUSE-buildservice)
- OBS Status Messages
- Slack (#help-obs & #team-build-solutions)