Description
There is a failure scenario that can be very disruptive for our customers.
If we have an outage in our ingestion of cron check-ins, specifically one where we are dropping check-ins, we may incorrectly mark customers' cron monitors as having missed check-ins. This only happens if we drop check-ins. If we are merely delayed in consuming check-ins, the clock ticks that drive missed and time-out detection will slow down to match the consumption of check-ins in our topic.
This is highly problematic, as it means customers cannot trust that cron alerts are accurate. It is, however, a difficult problem: if check-ins never make it into the check-ins topic, how can we differentiate between a customer's job failing and not sending a check-in, and us failing to ingest their check-in?
In most of our ingestion failure scenarios we have had a significant drop in check-ins. That typically looks something like this:
Improved behavior
If we were able to detect this extreme drop in volume, we could create clock ticks that are marked as being `unknown` ticks, meaning we are moving the clock forward, but it is highly likely that we have lost many check-ins. When this happens, instead of creating `missed` and `timed-out` check-ins that trigger alerts, we can create missed check-ins that have an "unknown" status and mark in-progress check-ins that are past their latest check-in time as "unknown", again not alerting customers. Once we are certain that we have recovered from our incident, the clock will resume producing regular ticks that are not marked as unknown.
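To make the intended behavior concrete, here is a minimal sketch of how a tick's marker could decide between alerting and non-alerting outcomes. All names here (`TickVolumeState`, `CheckInStatus`, `ClockTick`, `Outcome`, `resolve_missed`, `resolve_timeout`) are hypothetical and only illustrate the idea; they are not taken from the Sentry codebase.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class TickVolumeState(Enum):
    """Hypothetical marker carried by each clock tick."""
    NORMAL = "normal"
    UNKNOWN = "unknown"


class CheckInStatus(Enum):
    """Hypothetical subset of check-in statuses used in this sketch."""
    MISSED = "missed"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ClockTick:
    ts: datetime
    volume_state: TickVolumeState


@dataclass(frozen=True)
class Outcome:
    status: CheckInStatus
    should_alert: bool


def resolve_missed(tick: ClockTick) -> Outcome:
    """Outcome for a monitor whose expected check-in never arrived."""
    if tick.volume_state is TickVolumeState.UNKNOWN:
        # We may have dropped the check-in ourselves, so do not alert.
        return Outcome(CheckInStatus.UNKNOWN, should_alert=False)
    return Outcome(CheckInStatus.MISSED, should_alert=True)


def resolve_timeout(tick: ClockTick) -> Outcome:
    """Outcome for an in-progress check-in past its latest check-in time."""
    if tick.volume_state is TickVolumeState.UNKNOWN:
        # The closing check-in may have been lost during the incident.
        return Outcome(CheckInStatus.UNKNOWN, should_alert=False)
    return Outcome(CheckInStatus.TIMEOUT, should_alert=True)


now = datetime(2024, 11, 4, 0, 0, tzinfo=timezone.utc)
normal_tick = ClockTick(now, TickVolumeState.NORMAL)
unknown_tick = ClockTick(now, TickVolumeState.UNKNOWN)
```

The point of the sketch is that the decision lives on the tick rather than on the individual monitor: every monitor evaluated for an unknown tick gets the same non-alerting treatment.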
Detecting ingestion volume drops
The tricky part here is deciding whether we are in an incident state. Ideally we do not rely on an external service telling us that we may be in an incident, since that service itself may be part of the incident (e.g., if we had relay report that it was having problems, there's no guarantee that it won't also fail to report to us when it's having problems).
My proposed detection solution is rather simple. As we consume check-ins, we keep a bucket for each minute's worth of check-ins; each bucket is a counter of how many check-ins were consumed during that minute. We will keep these buckets for 7 days' worth of data, which is 10080 buckets.
Each time the clock ticks across a minute, we will look at that same minute on each of the last 7 days, take some type of average of those 7 counts, and compare it with the count of the minute we just ticked past. If the count differs from that average by some percentage, we will produce our clock tick with an "unknown" marker, meaning we are unsure whether we collected enough data for this minute and are likely in an incident. In that case we will create misses and time-outs as "unknown".
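As a rough illustration of the per-minute buckets, the sketch below keeps an in-memory counter keyed by the minute of each check-in and prunes anything older than 7 days. The `VolumeBuckets` class and its method names are hypothetical, and an in-memory dict is only an assumption for the sake of the example; a real deployment would presumably need shared storage.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)
# 7 days * 24 hours * 60 minutes = 10080 one-minute buckets.
MAX_BUCKETS = int(RETENTION.total_seconds() // 60)


class VolumeBuckets:
    """Hypothetical in-memory store of per-minute check-in counts."""

    def __init__(self) -> None:
        self._counts: dict[int, int] = {}

    @staticmethod
    def bucket_key(ts: datetime) -> int:
        # Minutes since the Unix epoch identify a one-minute bucket.
        return int(ts.timestamp()) // 60

    def record(self, ts: datetime) -> None:
        """Count one consumed check-in in the bucket for its minute."""
        key = self.bucket_key(ts)
        self._counts[key] = self._counts.get(key, 0) + 1

    def count(self, ts: datetime) -> int:
        return self._counts.get(self.bucket_key(ts), 0)

    def prune(self, now: datetime) -> None:
        """Drop buckets that fall outside the 7-day retention window."""
        cutoff = self.bucket_key(now - RETENTION)
        for key in [k for k in self._counts if k <= cutoff]:
            del self._counts[key]

    def __len__(self) -> int:
        return len(self._counts)


buckets = VolumeBuckets()
start = datetime(2024, 11, 4, 0, 0, tzinfo=timezone.utc)
for offset in (0, 10, 59):  # three check-ins within the same minute
    buckets.record(start + timedelta(seconds=offset))
buckets.record(start + timedelta(seconds=60))  # one in the next minute
```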
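The comparison could be sketched roughly as follows. The issue leaves both the "type of average" and the "percentage" open, so this example assumes an arithmetic mean and a 50% drop threshold purely as placeholders. The function names are hypothetical, and returning `False` when there is no history is also just a placeholder choice.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

HISTORY_DAYS = 7


def historic_minutes(ts: datetime) -> list[datetime]:
    """The same minute on each of the previous 7 days."""
    minute = ts.replace(second=0, microsecond=0)
    return [minute - timedelta(days=d) for d in range(1, HISTORY_DAYS + 1)]


def is_volume_anomalous(
    current_count: int,
    historic_counts: list[int],
    max_drop: float = 0.5,  # assumed threshold, not from the proposal
) -> bool:
    """Return True when the current minute is far below its usual volume."""
    if not historic_counts:
        # Placeholder: without history we cannot judge, so treat as normal.
        return False
    baseline = mean(historic_counts)
    if baseline == 0:
        # Nothing is normally received in this minute, so nothing can drop.
        return False
    drop = (baseline - current_count) / baseline
    return drop > max_drop


tick_ts = datetime(2024, 11, 11, 0, 5, 30, tzinfo=timezone.utc)
previous = historic_minutes(tick_ts)
```

This only flags drops, not spikes, on the assumption that an ingestion outage shows up as missing volume; a symmetric comparison would be a small change.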
Ignoring previous incidents
When a minute is detected as having an abnormally low volume, we should reset its count to some sentinel value like `-1` so that when we pick up this minute over the next 7 days, we know to ignore the data, since it will not be an accurate representation of typical volume.
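A small sketch of the sentinel idea, assuming the buckets are a plain mapping from a minute key to a count. `IGNORED`, `mark_ignored`, and `usable_history` are hypothetical names used only for illustration.

```python
IGNORED = -1  # sentinel for "this minute happened during an incident"


def mark_ignored(buckets: dict[int, int], minute_key: int) -> None:
    """Overwrite an anomalous minute so it never feeds a later baseline."""
    buckets[minute_key] = IGNORED


def usable_history(counts: list[int]) -> list[int]:
    """Filter out minutes that were previously flagged as anomalous."""
    return [c for c in counts if c != IGNORED]


# Seven days of the same minute; day 3 fell inside an earlier incident.
buckets = {day: 100 for day in range(7)}
buckets[3] = 4  # abnormally low count recorded during the incident
mark_ignored(buckets, 3)
history = usable_history(list(buckets.values()))
```

With the sentinel filtered out, the baseline is computed from six days instead of seven, which is where the question of how much history is enough comes in.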
Not enough data
Warning
What should we do if we don't have enough data to determine if the past minute is within the expected volume?
Implementation
We should start by implementing this as a simple metric that we track, so we can understand what our typical difference looks like each day. It's possible some times have many more check-ins than others, such as Mondays at midnight, so we may need a different way to evaluate anomalies.
Warning
The implementation described above has changed. See the comment later in this issue for a description of the new approach.
Implementation
### PRs needed for both approaches
- [ ] https://github.com/getsentry/sentry/pull/79448
- [ ] https://github.com/getsentry/sentry/pull/79574
- [ ] https://github.com/getsentry/sentry/pull/80348
- [ ] https://github.com/getsentry/sentry-kafka-schemas/pull/340
- [ ] https://github.com/getsentry/sentry/pull/79735
- [ ] https://github.com/getsentry/sentry/pull/79785
- [ ] https://github.com/getsentry/sentry/pull/79783
### Outdated PRs for previous approach
- [ ] https://github.com/getsentry/sentry-kafka-schemas/pull/339
- [ ] https://github.com/getsentry/sentry-kafka-schemas/pull/341
- [ ] https://github.com/getsentry/sentry/pull/79729
- [ ] https://github.com/getsentry/sentry/pull/80347
- [ ] https://github.com/getsentry/sentry/pull/80355
### PRs needed for new approach
- [ ] https://github.com/getsentry/sentry-kafka-schemas/pull/348
- [ ] https://github.com/getsentry/sentry/pull/80527
- [ ] https://github.com/getsentry/sentry/pull/80528
- [ ] https://github.com/getsentry/sentry/pull/80605
- [ ] https://github.com/getsentry/sentry/pull/80640
- [ ] https://github.com/getsentry/sentry/pull/80703
- [ ] https://github.com/getsentry/getsentry/pull/15718
- [ ] https://github.com/getsentry/ops/pull/12854
- [ ] https://github.com/getsentry/sentry/pull/80773
- [ ] https://github.com/getsentry/sentry/pull/80842
- [ ] https://github.com/getsentry/sentry/pull/80844
- [ ] https://github.com/getsentry/sentry/pull/80845
- [ ] https://github.com/getsentry/sentry/pull/80911
- [ ] https://github.com/getsentry/sentry/pull/80930
- [ ] https://github.com/getsentry/sentry/pull/80864
- [ ] https://github.com/getsentry/sentry/pull/80916
- [ ] https://github.com/getsentry/sentry/pull/80949
- [ ] https://github.com/getsentry/sentry/pull/80955
- [ ] https://github.com/getsentry/sentry/pull/80985
- [ ] https://github.com/getsentry/sentry/pull/80954
- [ ] https://github.com/getsentry/sentry-options-automator/pull/2691
- [ ] https://github.com/getsentry/sentry-options-automator/pull/2692
- [ ] https://github.com/getsentry/sentry-options-automator/pull/2698
- [ ] https://github.com/getsentry/sentry-options-automator/pull/2699
- [ ] https://github.com/getsentry/sentry/pull/81012
- [ ] https://github.com/getsentry/sentry/pull/81016
- [ ] https://github.com/getsentry/sentry/pull/81018
- [x] Killswitch documentation / runbook / systems diagram
- [x] Alert in slack when entering incident
- [x] Cleanup old changes no longer required (old anomaly things)