Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(crons): Refactor to record_clock_tick_volume_metric #80605

Conversation

evanpurkhiser
Copy link
Member

@evanpurkhiser evanpurkhiser commented Nov 12, 2024

This changes how we're using the volume history. Previously we were intending to use the volume history to make a decision for a specific tick, we cannot do that since we'll actually need to look at historic volume metrics to determine if we've entered an incident or if we just had an abnormality in mean deviation.

  • The clock_dispatch no longer includes a volume_anomaly_result. I will remove this from the sentry-kafka-schema package in an upcoming PR (ref(crons): Remove unused volume_anomaly_result from clock_tick sentry-kafka-schemas#349).

  • Instead of evaluating a tick decision during dispatch, we now record the metrics for the timestamp we just ticked past into redis. This is done during the processing of the clock tick in the clock_tick_consumer.

  • The clock_tick_consumer no longer reads the volume_anomaly_result into a TickVolumeAnomolyResult. We'll still do something with this since in the future we'll be evaluating a tick result decision based on the tick metrics and will need to dispatch mark_unknown when entering an incident. But for now I've removed this logic.

  • I've also updated the pct_deviation metric (which is the one recorded into the redis key) to not be an absolute value, since we want to know which direction we've deviated in, we do not want to produce an incident in the scenario that we've increased in volume.

  • I've removed the safe_evaluate_tick_decision instead of creating a safe_record_clock_tick_volume_metric since we're now running this logic in a consumer which can backlog if we do have some kind of issue. This wrapper only existed since it was in a hot path that could fail in an unrecoverable way. We've also had this code running for a while now with no problems, so it's safe to not be overly cautious.

Part of GH-79328

@evanpurkhiser evanpurkhiser requested a review from a team as a code owner November 12, 2024 17:24
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Nov 12, 2024
Copy link

codecov bot commented Nov 12, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 1 line in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/sentry/monitors/system_incidents.py 93.75% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #80605      +/-   ##
==========================================
- Coverage   78.36%   78.36%   -0.01%     
==========================================
  Files        7207     7207              
  Lines      318744   318737       -7     
  Branches    43909    43908       -1     
==========================================
- Hits       249782   249770      -12     
- Misses      62600    62603       +3     
- Partials     6362     6364       +2     

This changes how we're using the volume history. Previously we were
intending to use the volume history to make a decision for a specific
tick, we cannot do that since we'll actually need to look at historic
volume metrics to determine if we've entered an incident or if we just
had an abnormality in mean deviation.

- The clock_dispatch no longer includes a volume_anomaly_result. I will
  remove this from the sentry-kafka-schema in an upcoming PR.

- Instead of evaluating a tick decision during dispatch, we now record
  the metrics for the timestamp we just ticked past into redis. This is
  done during the processing of the clock tick in the
  clock_tick_consumer.

- The clock_tick_consumer no longer reads the volume_anomaly_result into
  a TickVolumeAnomolyResult. We'll still do something with this since in
  the future we'll be evaluating a tick result decision based on the
  tick metrics and will need to dispatch mark_unknown when entering an
  incident. But for now I've removed this logic.

- I've also updated the pct_deviation metric (which is the one recorded
  into the redis key) to not be an absolute value, since we want to know
  which direction we've deviated in, we do not want to produce an
  incident in the scenario that we've increased in volume.

- I've removed the safe_evaluate_tick_decision instead of creating a
  safe_record_clock_tick_volume_metric since we're now running this
  logic in a consumer which can backlog if we do have some kind of
  issue. This wrapper only existed since it was in a hot path that could
  fail in an unrecoverable way. We've also had this code running for a
  while now with no problems, so it's safe to not be overly cautious.
@evanpurkhiser evanpurkhiser force-pushed the evanpurkhiser/feat-crons-refactor-to-record-clock-tick-volume-metric branch from 777ce1d to e5cd6b5 Compare November 12, 2024 18:03
@evanpurkhiser evanpurkhiser enabled auto-merge (squash) November 12, 2024 18:59
Comment on lines -126 to +128
pct_deviation = (abs(past_minute_volume - historic_mean) / historic_mean) * 100
pct_deviation = (past_minute_volume - historic_mean) / historic_mean * 100
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's where we adjust the pct mean deviation to allow negative values

@evanpurkhiser evanpurkhiser merged commit de77156 into master Nov 13, 2024
50 checks passed
@evanpurkhiser evanpurkhiser deleted the evanpurkhiser/feat-crons-refactor-to-record-clock-tick-volume-metric branch November 13, 2024 01:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants