Crons: Detect ingestion outages during clock ticks #79328
This will be used to inform the clock tick tasks that the tick detected an abnormal amount of check-in volume for the previous minute. Part of getsentry/sentry#79328
This task will be triggered when we detect an anomaly in check-in volume during a clock tick. When this happens we are unable to know that all check-ins before that tick have been received and will need to mark all in-progress monitors as resulting in an 'unknown' state. Part of getsentry/sentry#79328
When a clock tick is marked as having an abnormal volume we may have lost check-ins that should have been processed during this minute. In this scenario we do not want to notify on misses, and instead should create them as unknown misses. Part of getsentry/sentry#79328
This adds a function `_evaluate_tick_decision` which looks back at the last MONITOR_VOLUME_RETENTION days' worth of history and compares the minute we just ticked past to that data. We record 3 metrics from this comparison:

- `z_value`: measured as a ratio of standard deviations from the mean value
- `pct_deviation`: the percentage we've deviated from the mean
- `count`: the number of historic data points we're considering

The z_value and pct_deviation will be most helpful in making our decision as to whether we've entered an "incident" state or not. Part of #79328
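For illustration, a minimal sketch of that comparison (not Sentry's `_evaluate_tick_decision` itself; the helper name and lookback constant below are assumptions):

```python
# Minimal sketch of the tick comparison described above. The function name,
# retention constant, and return shape are illustrative assumptions.
from statistics import mean, stdev

MONITOR_VOLUME_RETENTION_DAYS = 30  # assumed lookback window


def evaluate_tick_metrics(past_counts: list[int], current_count: int):
    """Compare the minute we just ticked past against its history.

    past_counts: check-in counts for this same minute-of-day over the
    retention window. current_count: the count for the minute just passed.
    """
    count = len(past_counts)
    if count < 2:
        # Not enough history for a meaningful comparison.
        return None, None, count

    mu = mean(past_counts)
    sigma = stdev(past_counts)

    # z_value: distance from the historic mean, measured in standard deviations.
    z_value = (current_count - mu) / sigma if sigma else 0.0

    # pct_deviation: deviation from the historic mean, as a percentage.
    pct_deviation = abs(current_count - mu) / mu * 100 if mu else 0.0

    return z_value, pct_deviation, count
```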
This will be used to pass around the result of anomaly detection during clock ticks. Part of #79328
Produces an `IncidentOccurrence` message into the monitors-incident-occurrences topic for later issue occurrence dispatching. Part of GH-79328
…0864) When transitioning into and out of an incident we want to know the _true_ start time and true recovery time of the incident. We can do this with the following logic:

- When starting an incident we backfill the ABNORMAL decisions to INCIDENT. Record the timestamp of the very first abnormal decision and report that back with the DecisionResult that is transition=INCIDENT_STARTED.
- When recovering from an incident we backfill the RECOVERING decisions as NORMAL. Record the timestamp of the very first of these now-normal decisions and report that back with the DecisionResult that is transition=INCIDENT_RECOVERED.

For all other decisions (including abnormal recovery and incident recovery failed) the timestamp is always just the tick that we're making the decision for, since these do not involve a backfill. The resulting time range is an inclusive start and exclusive end to determine the datetime range of the incident. Part of GH-79328
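A minimal sketch of that backfill logic, assuming a simple in-memory list of (timestamp, decision) pairs rather than whatever storage the real implementation uses; the enum and function names here are illustrative:

```python
# Rough sketch of the backfill described above, assuming an in-memory list of
# (timestamp, decision) pairs ordered oldest to newest. Names are illustrative,
# not Sentry's actual code.
from datetime import datetime
from enum import Enum


class TickDecision(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    INCIDENT = "incident"
    RECOVERING = "recovering"


def backfill_incident_start(history: list[tuple[datetime, TickDecision]]) -> datetime:
    """Promote the trailing run of ABNORMAL decisions to INCIDENT and return
    the timestamp of the earliest one: the true incident start."""
    start = history[-1][0]
    for i in range(len(history) - 1, -1, -1):
        ts, decision = history[i]
        if decision is not TickDecision.ABNORMAL:
            break
        history[i] = (ts, TickDecision.INCIDENT)
        start = ts
    return start


def backfill_incident_recovery(history: list[tuple[datetime, TickDecision]]) -> datetime:
    """Mark the trailing run of RECOVERING decisions as NORMAL and return the
    timestamp of the earliest one: the true recovery time."""
    recovered_at = history[-1][0]
    for i in range(len(history) - 1, -1, -1):
        ts, decision = history[i]
        if decision is not TickDecision.RECOVERING:
            break
        history[i] = (ts, TickDecision.NORMAL)
        recovered_at = ts
    return recovered_at
```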
When a system incident recovers we need to prune volume data recorded during the incident, since this data will be a large outlier and will skew future metric calculations for these minutes. Part of GH-79328
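As an illustration only, pruning could look something like the following, assuming the per-minute Redis counters sketched later in this issue; the client and key layout are assumptions, not Sentry's actual storage:

```python
# Hypothetical sketch: delete the volume buckets recorded during a system
# incident so they don't skew future per-minute averages.
from datetime import datetime, timedelta

import redis

client = redis.Redis()


def prune_incident_volume(start: datetime, end: datetime) -> None:
    """Delete the per-minute volume counters for every minute in [start, end)."""
    minute = start.replace(second=0, microsecond=0)
    keys = []
    while minute < end:
        keys.append(f"monitors:volume:{minute:%Y%m%d%H%M}")
        minute += timedelta(minutes=1)
    if keys:
        client.delete(*keys)
```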
When we are in an incident it's possible that we'll never receive the closing check-in for in-progress check-ins. Even when some check-ins make it through we cannot know that the terminating check-in will be sent. For this reason we MUST continually mark in-progress check-ins as unknown. There is no reason to check for time-outs during an incident, since all in-progress check-ins will have been marked as unknown. Part of GH-79328
I spent some time writing a description for the final bit of how this system works. It's in an internal Notion Run Book document, but I'll duplicate it here for posterity.

Configuring the System

Incident detections

The crons system is resilient to upstream outages (such as partial relay outages) that would typically cause erroneously created missed check-ins. To understand this system let's look at the DataDog dashboard we have configured.

[Internal DataDog dashboard]

The "Mean Deviation %" graph shows us the percentage by which each clock tick's volume deviates from the average volume history for that particular minute over the last 30 days. This data is used to generate "tick decisions" for each tick. Decisions can be observed in the "Tick Decisions" graph.

Decision transitions

In the previous graph we see all decisions made for each clock tick. Should the metric deviate past the configured thresholds, the "Tick Transitions" graph gives us insight into the transition events.

The architecture overview includes some examples of how this works:

[Internal architecture diagram]

Delayed Incident Occurrences (delaying alerts)

Miss and Time-out failures cause the resulting occurrences to be routed through a delayed-dispatch consumer. Again we expect this consumer to backlog, which may look something like this. Once the tick decision has resolved, the consumer will burn down its queue of occurrences.

Decision option configuration

There are a number of options that control this system.
Closing this issue. This is now deployed and operating as expected. Customers should no longer receive false positive notifications.
There is a failure scenario that can be very disruptive for our customers.
If we have an outage in our ingestion of cron check-ins, specifically one where we are dropping check-ins, then we may incorrectly mark customers' cron monitors as having missed check-ins. This only happens if we drop check-ins; if we are merely delayed, the clock ticks which drive missed and time-out detections will slow down to match the consumption of check-ins in our topic.
This is highly problematic, as it means customers are unable to trust that cron alerts are accurate. It is, however, a difficult problem: if check-ins never make it into the check-ins topic, how can we differentiate between a customer's job failing to send a check-in and us failing to ingest their check-in?
In most of our ingestion failure scenarios we have had a significant drop in check-ins. That typically looks something like this:
Improved behavior
If we were able to detect this extreme drop in volume, we could create clock ticks that are marked as being `unknown` ticks, meaning we are moving the clock forward, but we have a strong suspicion that we may have lost many check-ins. When this happens, instead of creating `missed` and `timed-out` check-ins that trigger alerts, we can create missed check-ins that have an "unknown" status and mark in-progress check-ins that are past their latest check-in time as "unknown", again not alerting customers. Once we are certain that we have recovered from our incident, the clock will resume producing regular ticks that are not marked as unknown.

Detecting ingestion volume drops
The tricky part here is deciding whether we are in an incident state. Ideally we are not relying on an external service telling us that we may be in an incident, since that service itself may be part of the incident (e.g., if we relied on Relay reporting that it was having problems, there is no guarantee that it could still report to us while it is having those problems).
My proposed detection solution is rather simple. As we consume check-ins, we keep a bucket for each minute's worth of check-ins; that bucket is a counter of how many check-ins were consumed for that minute. We will keep these buckets for 7 days' worth of data, which is 10080 buckets.
Each time the clock ticks across a minute, we will look at the last 7 days of that particular minute we ticked over, take some type of average of those 7 counts, and compare that with the count of the minute we just ticked past. If we find that this count differs from the previous 7 days of that minute by some percentage, we will produce our clock tick with an "unknown" marker, meaning we are unsure if we collected enough data for this minute and are likely in an incident. In that case we will create misses and time-outs as "unknown".
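To make the proposal concrete, here is a minimal sketch of both halves: counting consumed check-ins into per-minute buckets, and comparing the minute we just ticked past against the same minute over the previous 7 days. Redis, the key layout, the TTL, and the 50% threshold are all assumptions for illustration:

```python
# Illustrative sketch of the per-minute volume buckets and the tick-time
# comparison. Key layout, TTL, and threshold are assumptions.
from datetime import datetime, timedelta

import redis

RETENTION = timedelta(days=7)  # 7 days of one-minute buckets = 10080 buckets
client = redis.Redis()


def bucket_key(minute: datetime) -> str:
    return f"monitors:volume:{minute:%Y%m%d%H%M}"


def record_check_in(ts: datetime) -> None:
    """Called as we consume each check-in: bump the counter for its minute."""
    minute = ts.replace(second=0, microsecond=0)
    pipe = client.pipeline()
    pipe.incr(bucket_key(minute))
    pipe.expire(bucket_key(minute), int(RETENTION.total_seconds()))
    pipe.execute()


def tick_is_unknown(tick: datetime, threshold_pct: float = 50.0) -> bool:
    """Called on each clock tick: compare the minute we just ticked past
    against the same minute over the previous 7 days."""
    minute = tick.replace(second=0, microsecond=0) - timedelta(minutes=1)
    past_keys = [bucket_key(minute - timedelta(days=d)) for d in range(1, 8)]
    # Skip missing minutes and the -1 sentinel described below (minutes that
    # were themselves part of an earlier incident).
    past = [int(v) for v in client.mget(past_keys) if v is not None and int(v) >= 0]
    current = int(client.get(bucket_key(minute)) or 0)

    if not past:
        # Not enough history to decide; treat the tick as normal for now.
        return False

    average = sum(past) / len(past)
    if average == 0:
        return False
    pct_deviation = abs(current - average) / average * 100
    return pct_deviation > threshold_pct
```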
Ignoring previous incidents
When a minute was detected as having an abnormally low volume, we should reset its count to some sentinel value like `-1`, so that when we pick up this minute over the next 7 days we know to ignore the data, since it will not be an accurate representation of typical volume.
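Continuing that sketch, resetting an anomalous minute to the sentinel might look like this (reusing the assumed `client` and `bucket_key` helpers from the snippet above):

```python
def mark_minute_anomalous(minute: datetime) -> None:
    """Overwrite an abnormal minute's bucket with the -1 sentinel so the
    7-day comparison above skips it when this minute comes around again."""
    client.set(bucket_key(minute.replace(second=0, microsecond=0)), -1)
```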
Not enough data

Warning
What should we do if we don't have enough data to determine if the past minute is within the expected volume?
Implementation
We should start by implementing this as a simple metric that we track, so we can understand what our typical difference looks like each day. It's possible some days may have many more check-ins, such as Mondays at midnight, so we may need a different way to evaluate anomalies.
Warning
The implementation described above has changed. See the comment later in this issue for a description of the new approach.
Implementation
PRs needed for both approaches
- `mark_unknown` clock task sentry-kafka-schemas#340
- `mark_unknown` clock tick task #79735
- `mark_unknown` #79785

Outdated PRs for previous approach
- `volume_anomaly_result` to mark_missing task sentry-kafka-schemas#341
- `volume_anomaly_result` #79729

PRs needed for new approach
- `make_clock_tick_decision` #80640