Alerting: Add setting to distribute rule group evaluations over time #80766
Conversation
Didn't have time to test it, but otherwise looks great! I will test later this afternoon, and until then just a couple of comments! Nice work 👍
// JitterStrategyFromToggles returns the JitterStrategy indicated by the current Grafana feature toggles.
func JitterStrategyFromToggles(toggles featuremgmt.FeatureToggles) JitterStrategy {
	strategy := JitterNever
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRules) {
Is it possible to set both toggles at once? Should we return an error if this happens?
It is; jitter by rule will win, and I think this is desirable, especially given how we are positioning jitter by rule as an additional escape hatch.
Given Cloud's feature toggle admin page (which we might make this togglable in), I don't want to provide a path where someone can create an invalid config.
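For reference, a minimal sketch of how that precedence could look, assuming it lives alongside the quoted JitterStrategyFromToggles code; the second flag name (FlagJitterAlertRulesWithinGroups) and the flag-to-strategy mapping are assumptions, not taken from the diff:

// Sketch only: the by-rule check runs last, so enabling both toggles resolves
// to JitterByRule, matching the "jitter by rule wins" behaviour described above.
func jitterStrategySketch(toggles featuremgmt.FeatureToggles) JitterStrategy {
	strategy := JitterNever
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRules) {
		strategy = JitterByGroup // assumed mapping: base flag -> jitter by group
	}
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRulesWithinGroups) { // assumed flag name
		strategy = JitterByRule
	}
	return strategy
}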
}

func jitterHash(r *ngmodels.AlertRule, strategy JitterStrategy) uint64 {
	l := labels.New(
Do we need to create a slice of labels for this? Could we just create an fnv64 hasher and write to it instead?
Yeah, we could do that; I simply ripped the basic hashing logic off of Prometheus 😆
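For illustration, a small sketch of the suggestion: write the fields straight into an FNV-64a hasher instead of building a label slice first. The specific fields and the separator byte are illustrative, not the PR's actual hash key.

package jitter

import (
	"hash/fnv"
	"strconv"
)

// jitterHashDirect sketches hashing without an intermediate labels slice:
// each field is written directly into an FNV-64a hasher, with a separator
// byte so adjacent fields cannot run together.
func jitterHashDirect(orgID int64, folderUID, group, ruleUID string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(strconv.FormatInt(orgID, 10)))
	h.Write([]byte{0xff})
	h.Write([]byte(folderUID))
	h.Write([]byte{0xff})
	h.Write([]byte(group))
	h.Write([]byte{0xff})
	h.Write([]byte(ruleUID))
	return h.Sum64()
}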
for _, r := range rules2 {
	offset := jitterOffsetInTicks(r, baseInterval, JitterByGroup)
	require.Equal(t, group2Offset, offset)
}
Perhaps just add a third assertion here where you take a rule from rules1 and another rule from rules2 and assert that their offsets are different?
This assertion isn't always true, though - there are a relatively small number of buckets. We are not guaranteed that each group hashes into a different bucket; there is a small chance they land in the same one. Writing a test that expects the hashes to be different seems misleading, since they aren't always different by design.
I had to step very carefully in these tests to avoid flakes, so I avoided asserting on the hashes actually being different for this reason.
I was thinking about writing some test that asserted on the distribution of a lot of rules to somewhat cover this case, but the result was pretty complex and still would flake sometimes since it relies on statistics. It felt like more trouble than it was worth.
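For context, this is roughly the shape such a test would take (assuming the usual testing and testify imports) - offsetFor here is a hypothetical helper standing in for the real offset computation, and even with a deliberately loose threshold the check stays statistical rather than deterministic:

// Sketch of a loose distribution check (offsetFor is a hypothetical helper,
// not part of the PR): hash many distinct group keys into a small number of
// buckets and require that most buckets get used. Even with a generous bound
// this can flake in rare cases, which is why the PR avoids it.
func TestOffsetSpreadSketch(t *testing.T) {
	const buckets = 30 // e.g. a 5m interval with a 10s baseInterval
	used := map[int64]struct{}{}
	for i := 0; i < 1000; i++ {
		key := fmt.Sprintf("org-1/folder-a/group-%d", i)
		used[offsetFor(key, buckets)] = struct{}{}
	}
	require.GreaterOrEqual(t, len(used), buckets*2/3)
}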
You're right! Let's keep it as it is
		upperLimit := r.IntervalSeconds / int64(baseInterval.Seconds())
		require.Less(t, offset, upperLimit, "offset cannot be equal to or greater than interval/baseInterval of %d", upperLimit)
	}
})
Same here, perhaps just add another test that shows the offset is different for different rules?
Same as the previous comment, I think we can ignore this. I wasn't thinking when I wrote that.
LGTM - very clean and clear. Great job! My comments are nits and I don't have any expectations of having them addressed here. I don't need to see this again.
		offset := jitterOffsetInTicks(rule, baseInterval, JitterByRule)
		require.Equal(t, original, offset, "jitterOffsetInTicks should return the same value for the same rule")
	}
})
nit:
A very helpful test for me would be to get an understanding of the spread of the distribution with the current jittering strategy.
For example, I'd love to see both by rule and by rule group -- what is the spread with 10 groups or 10 rules that have the exact same evaluation interval?
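Something like the following would give that picture - offsetForGroup is a hypothetical stand-in for the PR's jitterOffsetInTicks with the JitterByGroup strategy, and the numbers assume the 10s baseInterval:

// Sketch: print the computed offsets for 10 groups that all share a 60s
// evaluation interval. With a 10s baseInterval there are 6 possible offsets,
// so this shows how evenly the groups land across those 6 ticks.
func printSpread() {
	const intervalTicks = 6 // 60s interval / 10s baseInterval
	for i := 0; i < 10; i++ {
		group := fmt.Sprintf("group-%d", i)
		fmt.Printf("%s -> offset %d of %d ticks\n", group, offsetForGroup("org-1", "folder-a", group, intervalTicks), intervalTicks)
	}
}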
I've just started testing with the feature flag. I have 5 alerts spread across 4 evaluation groups, all in the same folder. These are their computed offsets:
If rules Test 1 and Test 2 have the same offset (2 ticks), wouldn't I expect those two rules to evaluate at the same time? What I'm seeing instead is that Test 1 evaluates 5 seconds before Test 2. I would expect that with the
Here are the additional log lines I added:

diff --git a/pkg/services/ngalert/schedule/schedule.go b/pkg/services/ngalert/schedule/schedule.go
index a428b6d1bda..5c73694ca6b 100644
--- a/pkg/services/ngalert/schedule/schedule.go
+++ b/pkg/services/ngalert/schedule/schedule.go
@@ -297,6 +297,7 @@ func (sch *schedule) processTick(ctx context.Context, dispatcherGroup *errgroup.
 		itemFrequency := item.IntervalSeconds / int64(sch.baseInterval.Seconds())
 		offset := jitterOffsetInTicks(item, sch.baseInterval, sch.jitterEvaluations)
+		sch.log.Info("offset in ticks", "ticks", offset, "rule", item.Title, "group", item.RuleGroup)
 		isReadyToRun := item.IntervalSeconds != 0 && (tickNum%itemFrequency)-offset == 0
 		var folderTitle string
@@ -405,7 +406,7 @@ func (sch *schedule) ruleRoutine(grafanaCtx context.Context, key ngmodels.AlertR
 	evaluate := func(ctx context.Context, f fingerprint, attempt int64, e *evaluation, span trace.Span, retry bool) error {
 		logger := logger.New("version", e.rule.Version, "fingerprint", f, "attempt", attempt, "now", e.scheduledAt).FromContext(ctx)
 		start := sch.clock.Now()
-
+		sch.log.Info("evaluating rule", "rule", e.rule.Title)
 		evalCtx := eval.NewContextWithPreviousResults(ctx, SchedulerUserFor(e.rule.OrgID), sch.newLoadedMetricsReader(e.rule))
 		if sch.evaluatorFactory == nil {
 			panic("evalfactory nil")
@grobinson-grafana This is expected - 5 seconds is less than one baseInterval (10s), so they are still scheduled on the same tick. Check this snippet: grafana/pkg/services/ngalert/schedule/schedule.go, lines 338 to 346 at db83eb3.
If N rules are scheduled on a given tick, the scheduler spreads them out evenly over the baseInterval. So, if you get two rules on a tick, they will run 5 seconds apart. This PR only changes which tick a rule lands on; the existing spreading within a tick is unchanged.
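To illustrate the behaviour being described (this mirrors what the linked scheduler code does conceptually, not its exact implementation):

// Sketch: N rules that land on the same tick are dispatched evenly across the
// baseInterval, so with a 10s baseInterval two rules end up 5s apart.
func intraTickDelays(baseInterval time.Duration, n int) []time.Duration {
	step := baseInterval / time.Duration(n)
	delays := make([]time.Duration, n)
	for i := range delays {
		delays[i] = time.Duration(i) * step
	}
	return delays
}

// intraTickDelays(10*time.Second, 2) returns [0s 5s]: the second rule runs
// 5 seconds after the first, which matches the observation above.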
OK cool! I tested the other feature flags and those worked as I expected, so LGTM 👍
Just see the comments from Josh and me, but LGTM!
LGTM. Great job!
…80766)

* Simple, per-base-interval jitter
* Add log just for test purposes
* Add strategy approach, allow choosing between group or rule
* Add flag to jitter rules
* Add second toggle for jittering within a group
* Wire up toggles to strategy
* Slightly improve comment ordering
* Add tests for offset generation
* Rename JitterStrategyFrom
* Improve debug log message
* Use grafana SDK labels rather than prometheus labels
…over time (#81404)

* Alerting: Add setting to distribute rule group evaluations over time (#80766)
* Simple, per-base-interval jitter
* Add log just for test purposes
* Add strategy approach, allow choosing between group or rule
* Add flag to jitter rules
* Add second toggle for jittering within a group
* Wire up toggles to strategy
* Slightly improve comment ordering
* Add tests for offset generation
* Rename JitterStrategyFrom
* Improve debug log message
* Use grafana SDK labels rather than prometheus labels
* Fix API change in registry.go
* empty commit to kick build
What is this feature?
Currently, rule evaluations always run at the beginning of their interval. A rule running every minute will always execute near the start of the minute, and a rule with an interval of 5m will always execute near 1:00, 1:05, 1:10, and so on. This is true for all Grafana instances with rules, meaning that queries are aligned across instances, resulting in spiky network traffic and database load.
There is already some minor jitter within each baseInterval, but the baseInterval is hardcoded to 10 seconds, so evaluations all happen within roughly 10 seconds of each other.
This PR jitters rule evaluations over their entire interval. Given a rule scheduled every minute, which would originally run on ticks 0, 6, 12, 18..., the evaluation tick is now modified with an offset based on a hash value. So, with offset of 1, the rule might run on ticks 1, 7, 13, 19..., or with an offset of 5, the rule might run on ticks 5, 11, 17, 23...
For example, a rule with an offset of 1 tick that previously evaluated at 1:00:00 now evaluates at 1:00:10, using 1:00:10 as the query time. This approach works together with the minor jitter mentioned above: the offset distributes evaluations across the baseInterval buckets, and the existing jitter inside each baseInterval provides additional smoothing within those buckets. The result is that rule evaluations are evenly distributed across the evaluation window, down to the nanosecond.
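As a sketch of the arithmetic (illustrative, not the PR's exact code):

// Sketch: the rule's hash is reduced modulo the number of baseInterval ticks
// in its evaluation interval. The rule then only runs on ticks where
// (tick % intervalTicks) == offset, which is what shifts 0, 6, 12, ... to
// 1, 7, 13, ... for an offset of 1.
func offsetInTicksSketch(hash uint64, intervalSeconds int64, baseInterval time.Duration) int64 {
	intervalTicks := intervalSeconds / int64(baseInterval.Seconds())
	if intervalTicks <= 0 {
		return 0
	}
	return int64(hash % uint64(intervalTicks))
}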
This PR provides two feature toggles, allowing a choice between distributing by group or by rule. When distributing by group, all rules in the same group share the same offset, so groups are scheduled together and the distribution is not perfectly smooth. This seems to be the preferred choice as we move toward a group-based concept in Grafana Alerting. For users with excessively large groups, distribution by rule is provided as an escape hatch.
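A sketch of what the two strategies effectively hash over (the key layout here is illustrative, not the PR's exact code):

// Sketch: by-group hashes only the group identity, so every rule in a group
// shares one offset; by-rule also mixes in the rule UID, so rules spread out
// individually even within a single group.
func jitterKeySketch(orgID int64, folderUID, group, ruleUID string, strategy JitterStrategy) string {
	key := fmt.Sprintf("%d;%s;%s", orgID, folderUID, group)
	if strategy == JitterByRule {
		key += ";" + ruleUID
	}
	return key
}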
Why do we need this feature?
Reduces database query load, processing spikes, and network traffic spikes.
On average, this seems to reduce resource overhead by around 5%. Rule evaluations, aggregated across a mixture of sources, complete 5-10% faster because of the absence of load spikes.
Who is this feature for?
Operators of systems that Grafana is querying via Alert rules.
Which issue(s) does this PR fix?:
Fixes https://github.com/grafana/alerting-squad/issues/625
Fixes #53744
Special notes for your reviewer:
Please check that: