Merge pull request #5935 from ministryofjustice/update-logging-incident-log

feat: add incident log for 25-07-24
mikebell authored Jul 30, 2024
2 parents c70ba36 + 96f70bd commit 15580f2
Showing 1 changed file with 52 additions and 0 deletions: runbooks/source/incident-log.html.md.erb

> Use the [mean-time-to-repair.rb] script to view performance metrics

## Q3 2024 (July-September)

- **Mean Time to Repair**: 3h 8m

- **Mean Time to Resolve**: 4h 9m

### Incident on 2024-07-25

- **Key events**
- First detected: 2024-07-25 12:10
- Incident declared: 2024-07-25 14:54
- Repaired: 2024-07-25 15:18
- Resolved: 2024-07-25 16:19

- **Time to repair**: 3h 8m

- **Time to resolve**: 4h 9m

- **Identified**: User reported that Elasticsearch was no longer receiving logs

- **Impact**: Elasticsearch and Opensearch did not receive logs, which meant that users' logs were lost for the duration of the incident. These logs have not been recovered.

- **Context**:
- 2024-07-25 12:10: cp-live-app-logs - ClusterIndexWritesBlocked starts
- 2024-07-25 12:30: cp-live-app-logs - ClusterIndexWritesBlocked recovers
- 2024-07-25 12:35: cp-live-app-logs - ClusterIndexWritesBlocked starts
- 2024-07-25 12:50: cp-live-app-logs - ClusterIndexWritesBlocked recovers
- 2024-07-25 12:55: cp-live-app-logs - ClusterIndexWritesBlocked starts
- 2024-07-25 13:15: cp-live-app-logs - ClusterIndexWritesBlocked recovers and starts
- 2024-07-25 13:40: cp-live-app-logs - ClusterIndexWritesBlocked recovers and starts
- 2024-07-25 13:45: Kibana no longer receiving any logs
- 2024-07-25 14:27: User notifies team via #ask-cloud-platform that Kibana has not been receiving logs since 13:45.
- 2024-07-25 14:32: Initial investigation shows no problems in the live monitoring namespace
- 2024-07-25 14:42: Google Meet call started to triage
- 2024-07-25 14:54: Incident declared
- 2024-07-25 14:55: Logs from fluent-bit containers show “could not enqueue into the ring buffer”
- 2024-07-25 14:59: Rollout restart of all fluent-bit containers; logs partially start flowing again, but after a few minutes the same error message returns (restart commands are sketched after this list)
- 2024-07-25 15:18: It is noted that Opensearch is out of disk space; the volume is increased from 8000 to 12000
- 2024-07-25 15:58: Disk space increase completes and fluent-bit starts processing logs again
- 2024-07-25 16:15: Remediation tasks are defined and work on them begins
- 2024-07-25 16:19: Incident declared resolved
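
The mid-incident mitigation at 14:59 was a rolling restart of the log shippers. A minimal sketch of the commands involved, assuming fluent-bit runs as a DaemonSet named `fluent-bit` in a `logging` namespace (both names are illustrative rather than confirmed by this log):

```bash
# Tail logs from one fluent-bit pod to confirm the ring buffer error
# (namespace and DaemonSet name are assumptions, not taken from this log)
kubectl logs -n logging daemonset/fluent-bit --since=10m | grep "could not enqueue"

# Restart all fluent-bit pods; the DaemonSet controller replaces them in turn
kubectl rollout restart daemonset/fluent-bit -n logging

# Wait for the rollout to complete
kubectl rollout status daemonset/fluent-bit -n logging
```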

- **Resolution**:
- Opensearch disk space is increased from 8000 to 12000 (see the sketch after this list)
- Fluent-bit is temporarily configured not to send logs to Opensearch while follow-up investigation into the root cause is carried out.
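
A minimal sketch of the storage increase, assuming the cluster is an AWS-managed Opensearch domain with EBS storage and that the 8000 and 12000 figures are volume sizes in GiB; the domain name `cp-live-app-logs` is borrowed from the alert prefix in the context section and may not be the exact AWS identifier. (The temporary fluent-bit change would be a separate edit to its output configuration, not shown here.)

```bash
# Inspect the domain's current EBS settings
# (domain name and GiB units are assumptions, not confirmed by this log)
aws opensearch describe-domain-config \
  --domain-name cp-live-app-logs \
  --query "DomainConfig.EBSOptions"

# Grow the EBS volume from 8000 to 12000 GiB
aws opensearch update-domain-config \
  --domain-name cp-live-app-logs \
  --ebs-options EBSEnabled=true,VolumeSize=12000
```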

- **Review actions**:
- [Opensearch and Elasticsearch index dating issues](https://github.com/ministryofjustice/cloud-platform/issues/5931)
- [High priority alerts for Elasticsearch and Opensearch](https://github.com/ministryofjustice/cloud-platform/issues/5928)
- [Re-introduce Opensearch in to Live logging](https://github.com/ministryofjustice/cloud-platform/issues/5929)
- [Investigate fluent-bit "failed to flush chunk"](https://github.com/ministryofjustice/cloud-platform/issues/5930)

## Q1 2024 (January-March)

- **Mean Time to Repair**: 3h 21m