Crons: Detect ingestion outages during clock ticks #79328

Closed
42 tasks done
evanpurkhiser opened this issue Oct 17, 2024 · 8 comments
evanpurkhiser commented Oct 17, 2024

There is a failure scenario that can be very disruptive for our customers.

If we have an outage in our ingestion of cron check-ins, specifically one where we are dropping check-ins, then we may incorrectly mark customers' cron monitors as having missed check-ins. This only happens when we drop check-ins; if check-ins are merely delayed, the clock ticks that drive missed and time-out detection will slow down to match the consumption of check-ins in our topic.

This is highly problematic, as it means customers are unable to trust that cron alerts are accurate. It is, however, a difficult problem: if check-ins never make it into the check-ins topic, how can we differentiate between a customer's job failing to send a check-in and us failing to ingest their check-in?

In most of our ingestion failure scenarios we have seen a significant drop in check-ins. That typically looks something like this:

Image

Improved behavior

If we were able to detect this extreme drop in volume, we could produce clock ticks that are marked as unknown ticks, meaning we are moving the clock forward but strongly suspect we may have lost many check-ins. When this happens, instead of creating missed and timed-out check-ins that trigger alerts, we can create missed check-ins with an "unknown" status and mark in-progress check-ins that are past their latest check-in time as "unknown", again without alerting customers. Once we are certain that we have recovered from the incident, the clock will resume producing regular ticks that are not marked as unknown.
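As a rough illustration of that behavior (all of the names below are hypothetical, not the actual Sentry models), an unknown tick changes only the status we record, never whether the clock advances:

```python
from enum import Enum


class CheckInStatus(Enum):
    MISSED = "missed"
    TIMEOUT = "timeout"
    UNKNOWN = "unknown"  # recorded, but never alerted on


def resolve_tick(monitors, tick_is_unknown: bool) -> None:
    # `monitors` is a hypothetical iterable with `missed_window()`,
    # `past_latest_check_in()`, `create_check_in()` and `resolve_in_progress()`
    # helpers; none of these names come from the real codebase.
    for monitor in monitors:
        if monitor.missed_window():
            status = CheckInStatus.UNKNOWN if tick_is_unknown else CheckInStatus.MISSED
            monitor.create_check_in(status)
        elif monitor.past_latest_check_in():
            status = CheckInStatus.UNKNOWN if tick_is_unknown else CheckInStatus.TIMEOUT
            monitor.resolve_in_progress(status)
```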

Detecting ingestion volume drops

The tricky part here is deciding whether we are in an incident state. Ideally we are not relying on an external service to tell us that we may be in an incident, since that service itself may be part of the incident (e.g., if we relied on Relay reporting that it was having problems, there's no guarantee that while it is having problems it would be able to report anything to us at all).

My proposed detection solution is rather simple. As we consume check-ins, we keep a bucket for each minute's worth of check-ins; that bucket is a counter of how many check-ins were consumed during that minute. We keep these buckets for 7 days' worth of data, which is 10,080 buckets.
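A minimal sketch of the bucketing, assuming a Redis-style counter store (the client and key format here are illustrative, not the actual implementation):

```python
from datetime import datetime, timezone

import redis

client = redis.StrictRedis()  # placeholder client, not the real storage layer

RETENTION_DAYS = 7
BUCKET_TTL = RETENTION_DAYS * 24 * 60 * 60  # keeps ~10,080 one-minute buckets


def volume_key(minute: datetime) -> str:
    return f"crons:volume:{minute.strftime('%Y%m%d%H%M')}"


def record_check_in(ts: datetime) -> None:
    """Increment the counter for the minute this check-in belongs to."""
    minute = ts.astimezone(timezone.utc).replace(second=0, microsecond=0)
    pipe = client.pipeline()
    pipe.incr(volume_key(minute))
    pipe.expire(volume_key(minute), BUCKET_TTL)
    pipe.execute()
```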

Each time the clock ticks across a minute, we look at the last 7 days of that particular minute, take some type of average of those 7 counts, and compare it with the count of the minute we just ticked past. If the count is some percentage different from the previous 7 days of that minute, we produce our clock tick with an "unknown" marker, meaning we are unsure whether we collected enough data for this minute and are likely in an incident. In that case we create misses and time-outs as "unknown".
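Continuing the sketch above, the per-tick comparison might look like the following; the `get_count` lookup and the 50% threshold are placeholder assumptions:

```python
from datetime import datetime, timedelta

PCT_THRESHOLD = 0.5  # placeholder: flag a drop of more than 50%


def evaluate_minute(tick: datetime, get_count) -> str:
    """Return "unknown" when the minute we just ticked past looks abnormally
    low compared to the same minute over the previous 7 days.

    `get_count(minute)` is a placeholder lookup returning the stored bucket
    count, or None when no data exists.
    """
    history = [
        count
        for days_back in range(1, 8)
        if (count := get_count(tick - timedelta(days=days_back))) is not None
        and count >= 0  # skip sentinel values left by prior incidents
    ]
    if not history:
        return "normal"  # not enough data; see the open question below

    average = sum(history) / len(history)
    current = get_count(tick) or 0

    # Flag the tick when volume dropped more than PCT_THRESHOLD below average.
    if average > 0 and (average - current) / average > PCT_THRESHOLD:
        return "unknown"
    return "normal"
```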

Ignoring previous incidents

When a minute is detected as having abnormally low volume, we should reset its count to some sentinel value like -1 so that when we pick up this minute over the next 7 days, we know to ignore the data, since it will not be an accurate representation of typical volume.
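In the same hypothetical Redis-backed sketch, that reset is a single overwrite; the lookback above already skips negative counts:

```python
SENTINEL = -1  # marks a minute whose volume data should be ignored


def ignore_minute(client, minute_key: str) -> None:
    # Overwrite the bucket so future lookbacks (which skip negative counts,
    # as in the earlier sketch) will not treat this minute as typical volume.
    client.set(minute_key, SENTINEL)
```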

Not enough data

Warning

What should we do if we don't have enough data to determine if the past minute is within the expected volume?

Implementation

We should start by implementing this as a simple metric that we track, so we can understand what our typical difference looks like each day. It's possible some days will have many more check-ins, such as Mondays at midnight, so we may need a different way to evaluate anomalies.

Warning

The implementation described above has changed. See the comment later in this issue for a description of the new approach

Implementation

PRs needed for both approaches

  1. Scope: Backend
  2. Scope: Backend
  3. Scope: Backend
  4. Scope: Backend
  5. Scope: Backend
  6. Scope: Backend

Outdated PRs for previous approach

  1. Scope: Backend
  2. Scope: Backend
  3. Scope: Backend

PRs needed for new approach

  1. Scope: Backend
  2. Scope: Backend
  3. Scope: Backend
  4. Scope: Backend
  5. Scope: Backend
  6. Scope: Backend
  7. Scope: Backend
  8. Scope: Backend
  9. Scope: Backend
  10. Scope: Backend
  11. Scope: Backend
  12. Scope: Backend
  13. Scope: Backend
  14. Scope: Backend
  15. Scope: Frontend
  16. Scope: Frontend
  17. Scope: Backend
  18. Scope: Frontend
  19. Scope: Backend
  20. Scope: Backend
evanpurkhiser added a commit that referenced this issue Oct 22, 2024
Part of GH-79328

---------

Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
cmanallen pushed a commit that referenced this issue Oct 23, 2024
Part of GH-79328

---------

Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 23, 2024
This will be used to inform the clock tick tasks that the tick detected
an abnormal amount of check-in volume for the previous minute.

Part of getsentry/sentry#79328
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 23, 2024
This task will be triggered when we detect an anomaly in check-in
volume during a clock tick. When this happens we are unable to know that
all check-ins before that tick have been received and will need to mark
all in-progress monitors as resulting in a 'unknown' state.

Part of getsentry/sentry#79328
evanpurkhiser added a commit to getsentry/sentry-kafka-schemas that referenced this issue Oct 24, 2024
When a clock tick is marked as having an abnormal volume we may have
lost check-ins that should have been processed during this minute. In
this scenario we do not want to notify on misses, and instead should
create them as unknown misses.

Part of getsentry/sentry#79328
evanpurkhiser added a commit that referenced this issue Oct 24, 2024
This adds a function `_evaluate_tick_decision` which looks back at the
last MONITOR_VOLUME_RETENTION days worth of history and compares the
minute we just ticked past to that data.

We record 3 metrics from this comparison

- `z_value`: This is measured as a ratio of standard deviations from the
mean value
- `pct_deviation`: This is the percentage we've deviated from the mean
- `count`: This is the number of historic data points we're considering

The z_value and pct_deviation will be most helpful in making our
decision as to whether we've entered an "incident" state or not.

Part of #79328
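An illustrative version of that comparison (not the exact code from the referenced PR) could compute the three metrics like so:

```python
import statistics


def evaluate_tick_metrics(history: list[int], current: int) -> dict:
    # `history` holds the counts for this minute over the lookback window,
    # `current` is the count for the minute we just ticked past.
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)

    # z_value: how many standard deviations the current count sits from the mean.
    z_value = (current - mean) / stdev if stdev else 0.0

    # pct_deviation: how far, as a percentage, the current count is from the mean.
    pct_deviation = abs(current - mean) / mean * 100 if mean else 0.0

    return {"z_value": z_value, "pct_deviation": pct_deviation, "count": len(history)}
```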
evanpurkhiser added a commit that referenced this issue Oct 24, 2024
This will be used to pass around the result of anomaly detection during
clock ticks.

Part of #79328
evanpurkhiser added a commit that referenced this issue Nov 15, 2024
Produces an `IncidentOccurrence` message into the
monitors-incident-occurrences topic for later issue occurrence
dispatching.

Part of GH-79328
evanpurkhiser added a commit that referenced this issue Nov 15, 2024
…80845)

This adds the logic to delay dispatching an incident occurrence based on
if we're in an abnormal or recovering state.

Part of GH-79328
evanpurkhiser added a commit that referenced this issue Nov 15, 2024
…0864)

When transitioning into and out of an incident we want to know the _true_
start time and true recovery time of the incident. We can do this with
the following logic:

- When starting an incident we backfill the ABNORMAL decisions to
INCIDENT. Record the timestamp of the very first abnormal and report
that back with the DecisionResult that is transition=INCIDENT_STARTED

- When recovering from an incident we backfill the RECOVERING decisions
as NORMAL. Record the timestamp of the very first normal decision and
report that back with the DecisionResult that is
transition=INCIDENT_RECOVERED

For all other decisions (including abnormal recovery and incident
recovery failed) the timestamp is always just the tick that we're making
the decision for, since these do not backfill earlier decisions.

The resulting time range is an inclusive start and exclusive end to
determine the datetime range of the incident.

Part of GH-79328
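A toy sketch of the backfill described above, using a plain dict of per-tick decisions (the real store is keyed and persisted differently):

```python
from datetime import datetime


def start_incident(decisions: dict[datetime, str]) -> datetime:
    """Backfill ABNORMAL decisions to INCIDENT; the earliest abnormal tick is
    reported as the true incident start time."""
    abnormal = sorted(ts for ts, d in decisions.items() if d == "abnormal")
    for ts in abnormal:
        decisions[ts] = "incident"
    return abnormal[0] if abnormal else max(decisions)


def recover_incident(decisions: dict[datetime, str]) -> datetime:
    """Backfill RECOVERING decisions to NORMAL; the first backfilled (now
    NORMAL) tick is reported as the true recovery time."""
    recovering = sorted(ts for ts, d in decisions.items() if d == "recovering")
    for ts in recovering:
        decisions[ts] = "normal"
    return recovering[0] if recovering else max(decisions)
```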
evanpurkhiser added a commit that referenced this issue Nov 18, 2024
When a system incident recovers we need to prune volume data recorded
during the incident, since this data will be a large outlier and will
skew future metric calculations for these minutes.

Part of GH-79328
evanpurkhiser added a commit that referenced this issue Nov 18, 2024
When we are in an incident it's possible that we'll never receive the
closing check-in for in-progress check-ins. Even when some check-ins
make it through we cannot know that the terminating check-in will be
sent.

For this reason we MUST continually mark in-progress check-ins as
unknown.

There is no reason to check for time-outs during an incident, since all
in-progress check-ins will have been marked as unknown.

Part of GH-79328
evanpurkhiser (Member, Author) commented:

I spent some time writing a description of the final bit of how this system works. It lives in an internal Notion runbook document, but I'll duplicate it here for posterity.

[Internal runbook document]


Configuring the System Incident detections

The crons system is resilient to upstream outages (such as partial Relay outages) that would otherwise cause erroneously created MISSED or TIMEOUT check-ins. It does this by monitoring the volume of check-ins it consumes each minute. If the volume drops below the average for that minute over the last 30 days, the system delays sending "incident occurrences" (which produce issue occurrences). We delay until we can determine whether we are truly experiencing an outage; once the system determines there is an outage, it drops those incident occurrences and any future occurrences until we recover, and marks affected check-ins as UNKNOWN.

To understand this system let’s look at the DataDog dashboard we have configured. [Internal DataDog dashboard]

The "Mean Deviation %" graph shows the percentage by which each clock tick deviates from the average volume for that particular minute over the last 30 days.

Image

This data is used to generate "tick decisions" for each tick. Decisions can be observed in the "Tick Decisions" graph.

Image

Decision transitions

In the previous graph we see that all decisions made for each clock tick are NORMAL. However, if the system detects a deviation in the ABNORMAL range, the tick will be decided as ABNORMAL. All subsequent ticks will be ABNORMAL until we see 5 ticks in a row that are below the abnormal threshold, at which point the decisions transition back to NORMAL.

When ABNORMAL transitions back to NORMAL, all ABNORMAL decisions are backfilled to NORMAL. This is to say that ABNORMAL is an intermediate state.

Should the metric deviate past the INCIDENT threshold, the system will mark those ticks as INCIDENT. Similar to ABNORMAL, once a tick is detected below the abnormality threshold the system will begin making RECOVERING decisions. There must be 5 RECOVERING decisions in a row below the abnormality threshold, at which point the system will begin making NORMAL decisions again and the RECOVERING decisions will be backfilled as NORMAL.
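The transition rules can be summarised as a small state machine. This is an illustrative sketch rather than the production code (boundary handling around the 5-tick window is simplified, and the window corresponds to the crons.system_incidents.tick_decision_window option described below):

```python
RECOVERY_WINDOW = 5  # consecutive below-threshold ticks required to return to NORMAL


class TickDecider:
    """Toy state machine for the transitions described above; the state names
    follow the runbook, the structure does not."""

    def __init__(self, anomaly_threshold: float, incident_threshold: float) -> None:
        self.anomaly_threshold = anomaly_threshold
        self.incident_threshold = incident_threshold
        self.state = "normal"
        self.quiet_ticks = 0  # consecutive ticks below the anomaly threshold

    def decide(self, pct_deviation: float) -> str:
        below_anomaly = pct_deviation < self.anomaly_threshold

        if pct_deviation >= self.incident_threshold:
            # Any tick past the incident threshold (re)enters INCIDENT.
            self.state, self.quiet_ticks = "incident", 0
        elif self.state == "normal":
            if not below_anomaly:
                self.state, self.quiet_ticks = "abnormal", 0
        elif not below_anomaly:
            # Still above the anomaly threshold: stay abnormal, or fall back
            # from RECOVERING to INCIDENT (recovery failed).
            self.quiet_ticks = 0
            if self.state == "recovering":
                self.state = "incident"
        else:
            # Below the anomaly threshold while abnormal/incident/recovering.
            self.quiet_ticks += 1
            if self.state == "incident":
                self.state = "recovering"
            if self.quiet_ticks >= RECOVERY_WINDOW:
                # The real system also backfills the intermediate decisions
                # to NORMAL at this point.
                self.state, self.quiet_ticks = "normal", 0

        return self.state
```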

The "Tick Transitions" graph gives us insight into the transition events.

Image

The architecture overview includes some examples of how this works: Internal architecture diagram

Delayed Incident Occurrences (delaying alerts)

Miss and time-out failures cause the mark_failed logic to dispatch "incident occurrences" into a queue. The consumer of that queue checks the tick decision made for the clock tick that dispatched these check-in failures.

  • NORMAL Ticks - The consumer will immediately dispatch an issue occurrence to notify customers of their cron failure.
  • ABNORMAL and RECOVERING Ticks - The consumer will not process the occurrence and will wait for the tick decision to resolve. During this time it is expected that this consumer will backlog, since we are waiting to determine whether we have entered an incident, recovered from an abnormality, or recovered from an incident.
  • INCIDENT Ticks - The consumer will drop occurrences (they will never be sent). In this scenario these occurrences may be false negatives. The affected check-ins will be marked as UNKNOWN.

Again, we expect this consumer to backlog, which may look something like this. Once the tick decision has resolved, the consumer will burn down its queue of occurrences.

Image
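The consumer behaviour in the list above roughly amounts to the following; all helper functions here are stand-ins, not real Sentry APIs:

```python
class RetryLater(Exception):
    """Signal that the occurrence should stay on the queue for now."""


def dispatch_issue_occurrence(occurrence) -> None:
    print(f"dispatching issue occurrence for {occurrence!r}")


def mark_check_in_unknown(occurrence) -> None:
    print(f"marking the check-in behind {occurrence!r} as UNKNOWN")


def handle_incident_occurrence(occurrence, tick_decision: str) -> None:
    if tick_decision == "normal":
        # Alert immediately.
        dispatch_issue_occurrence(occurrence)
    elif tick_decision in ("abnormal", "recovering"):
        # Wait for the decision to resolve; the consumer backlogs here.
        raise RetryLater()
    elif tick_decision == "incident":
        # Never alert; the affected check-in is marked UNKNOWN instead.
        mark_check_in_unknown(occurrence)
```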

Decision option configuration

There are a number of options that control this system:

  • crons.system_incidents.collect_metrics: boolean
    Enables recording volume data and metrics. Without this enabled, the decision detection system will do nothing.
  • crons.system_incidents.use_decisions: boolean
    Enables using tick decisions to delay / drop incident occurrences (intents to create issue occurrences). Also enables marking in-progress check-ins as unknown during incidents. This may be disabled should we incorrectly detect an abnormality or incident.
  • crons.system_incidents.pct_deviation_anomaly_threshold: int
    The percentage deviation the metric must exceed for a tick to be considered anomalous. The closer we can get this to the typical deviation the better, since that means we catch incidents earlier. This value is also used to determine when we have recovered. Should be adjusted if we get stuck in long-running anomaly states.
  • crons.system_incidents.pct_deviation_incident_threshold: int
    Similar to the anomaly threshold, this is the percentage deviation that must be surpassed for a tick to be considered part of an incident. Should be adjusted if we fail to detect an incident because its volume drop-off is too small.
  • crons.system_incidents.tick_decision_window: int
    The number of ticks we look back at when recovering from an abnormality or incident. This may be adjusted if we recover too soon or too late from incidents or abnormalities.
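For illustration only, here is roughly how those options might gate the behaviour; the option names come from the list above, but the sample values and lookup style are placeholders rather than the production defaults:

```python
# Placeholder option values; the real defaults live in the options registry.
OPTIONS = {
    "crons.system_incidents.collect_metrics": True,
    "crons.system_incidents.use_decisions": True,
    "crons.system_incidents.pct_deviation_anomaly_threshold": 25,
    "crons.system_incidents.pct_deviation_incident_threshold": 50,
    "crons.system_incidents.tick_decision_window": 5,
}


def classify_tick(pct_deviation: float) -> str | None:
    """Return a raw classification for a tick, or None when detection or
    decision-making is disabled."""
    if not OPTIONS["crons.system_incidents.collect_metrics"]:
        return None  # nothing is recorded, so no decisions are possible
    if pct_deviation >= OPTIONS["crons.system_incidents.pct_deviation_incident_threshold"]:
        decision = "incident"
    elif pct_deviation >= OPTIONS["crons.system_incidents.pct_deviation_anomaly_threshold"]:
        decision = "abnormal"
    else:
        decision = "normal"
    if not OPTIONS["crons.system_incidents.use_decisions"]:
        return None  # metrics only; decisions are not acted upon
    return decision
```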

evanpurkhiser (Member, Author) commented:

Closing this issue. This is now deployed and operating as expected.

Customers should no longer receive false positive notifications.
