-
Notifications
You must be signed in to change notification settings - Fork 38
/
MARKDOWN_TEMPLATE
executable file
·85 lines (66 loc) · 3.3 KB
/
MARKDOWN_TEMPLATE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
---
layout: post
title: "Service degradation of OBS $FEATURE"
category: deployments
author: Hans Franz <hans@opensuse.org>
image: path/to/main/image.extension
---
<!--
Classify the severity of this problem. We usually say:
- service degradation: if only a few customers were impacted
- severe service degradation: if nearly every customer was impacted
- downtime: if every visit to the OBS ended up on some error page
-->
There was a service degradation of our reference server.
<!--
What happened, for how long and who was impacted by it?
For customers to be able to identify if their problem is related to this post-mortem or not.
-->
After a deployment of build.opensuse.org on Friday, December 23rd at 17:30 UTC,
$FEATURE was not accessible for twelve minutes for all users with the name Fred.
We want to give you some insight into what happened and what we are doing to
avoid similar problems in the future.
## Detection
<!--
How did we get alerted that the problem happened?
To demonstrate to customers how we are (hopefully automatically) alerted about problems.
-->
We received an automatic alert from our monitoring about $FEATURE throughput.
## Root Cause
<!--
What was it and what triggered it?
For customers and community to understand what happened technically.
-->
The root cause was a data migration that wrongly deleted all $FEATURE for users
with the name Fred. The data migration was triggered by the deployment.
## Resolution
<!--
How did you resolve or work around this problem?
For customers and community to understand what happened technically.
-->
We manually re-created the $FEATURE for all users with the name Fred from the backup.
### Action Items
<!--
Are there any actions we are going to do that are not done yet?
For customers and community to be able to follow up on this.
-->
Additionally, we have created some follow-up action items:
| Action Item | Owner |
|--- |--- |
| [Follow-up Issue #12345 to add a spec for the data migration](https://github.com/openSUSE/open-build-service/issues/12345]) | Developer Team |
| [Follow-up card in our backlog to add automated data migration tests](https://trello.com/c/blahblah) | Developer Team |
## Lessons Learned
<!--
Describe what went well, what went wrong and where we go lucky during the resolution of this problem.
-->
- Small deployments are easy to debug: As the code delta of the deployment was pretty small we were able to quickly identify the problem.
- Incremental backup of the database tables: As the $FEATURE table was not changing often in our database, restoring the backup was easy.
On a high-traffic table, this would have been a nightmare and probably would have led to data loss.
- Test your data migrations: Even the simplest data migration should have a corresponding spec.
- If the data migration is not tested, then code review *has to* include running the migration.
## Timeline (All Time in UTC)
- *17:35* We got an automatic alert about $FEATURE throughput from our monitoring
- *17:36* We declared the incident
- *17:40* After reviewing the deployment we noticed an erroneous data migration deleted all $FEATURE from users with the name Fred
- *17:42* We fetched the $FEATURES table from the database backup and restored it which made $FEATURE available again
- *17:45* We declared the incident as resolved