
Alerting: Add setting to distribute rule group evaluations over time #80766

Merged
merged 11 commits into main from alexweav/simple-jitter
Jan 18, 2024

Conversation

alexweav
Contributor

What is this feature?

Currently, rule evaluations always run at the beginning of their interval. A rule running every minute will always execute near the start of the minute; a rule with a 5m interval will always execute near 1:00, 1:05, 1:10, and so on. This is true for all Grafana instances with rules, meaning that queries are aligned across instances, resulting in spiky network traffic and database load.

Some minor jitter already exists within each baseInterval, but the baseInterval is hardcoded to 10 seconds, so evaluations all happen within roughly 10 seconds of each other.

This PR jitters rule evaluations over their entire interval. Given a rule scheduled every minute, which would originally run on ticks 0, 6, 12, 18, and so on, the evaluation tick is now shifted by an offset derived from a hash value. With an offset of 1, the rule might run on ticks 1, 7, 13, 19, ...; with an offset of 5, it might run on ticks 5, 11, 17, 23, ...

  • The distribution considers the interval of the rule in question, so rules with a long interval have more buckets.
  • We use a stable hash to calculate the offset, so rules loaded onto multiple HA replicas still evaluate together.
  • All queries use a shifted timestamp that respects the calculated offset. For example, a query running each minute with an offset of 1 tick will use 1:00:10 as the query time.

This approach works together with the minor jitter described above: the offset distributes evaluations across baseInterval-sized buckets, and the existing jitter inside each baseInterval provides additional smoothing within those buckets. The result is that rule evaluations are evenly distributed across the evaluation window, down to the nanosecond.
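For illustration, here is a minimal sketch of that offset calculation, reusing names that appear elsewhere in this PR (jitterOffsetInTicks, jitterHash, JitterStrategy); the actual implementation may differ in its details:

// Sketch only: the number of buckets is the rule's interval divided by the
// scheduler's baseInterval, and a stable hash of the rule (or its group)
// picks one of those buckets as the offset, in ticks.
func jitterOffsetInTicks(r *ngmodels.AlertRule, baseInterval time.Duration, strategy JitterStrategy) int64 {
	if strategy == JitterNever {
		return 0
	}
	buckets := r.IntervalSeconds / int64(baseInterval.Seconds()) // e.g. 60s / 10s = 6 buckets
	if buckets <= 0 {
		return 0
	}
	return int64(jitterHash(r, strategy) % uint64(buckets))
}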

This PR provides two toggles, allowing a choice between distributing by group or by rule. When distributing by group, all rules in the same group share the same offset, so groups are scheduled together and the distribution is not perfectly smooth. This seems to be the preferred choice as we move toward a group-based concept in Grafana Alerting. For users with excessively large groups, distribution by rule is provided as an escape hatch.
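To make the choice between the toggles concrete, here is a minimal sketch of the strategy type they imply, using names taken from the PR's test code (the real definition may differ):

// Sketch only: JitterNever is the default when neither toggle is set,
// JitterByGroup gives all rules in a group the same offset, and
// JitterByRule offsets each rule independently.
type JitterStrategy int

const (
	JitterNever JitterStrategy = iota
	JitterByGroup
	JitterByRule
)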

Why do we need this feature?

Reduces database query load, processing spikes, and bursts of network traffic.
On average, this appears to reduce resource overhead by around 5%. Aggregated across a mixture of sources, rule evaluations complete 5–10% faster because of the absence of load spikes.

Who is this feature for?

Operators of systems that Grafana queries via alert rules.

Which issue(s) does this PR fix?:

Fixes https://github.com/grafana/alerting-squad/issues/625
Fixes #53744

Special notes for your reviewer:

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@alexweav alexweav added this to the 10.4.x milestone Jan 17, 2024
@alexweav alexweav requested review from grafanabot and a team as code owners January 17, 2024 22:03
@alexweav alexweav requested review from a team, rwwiv, JacobsonMT, yuri-tceretian and grobinson-grafana and removed request for a team January 17, 2024 22:03
@grafana-pr-automation grafana-pr-automation bot added the type/docs and area/frontend labels Jan 17, 2024
@alexweav alexweav force-pushed the alexweav/simple-jitter branch from c861e42 to d3ed74f January 17, 2024 22:07
@grafana-pr-automation grafana-pr-automation bot requested review from a team and oshirohugo and removed request for a team January 17, 2024 22:11
Contributor

@grobinson-grafana left a comment


I didn't have time to test it, but otherwise it looks great! I will test later this afternoon; until then, just a couple of comments. Nice work 👍

// JitterStrategyFromToggles returns the JitterStrategy indicated by the current Grafana feature toggles.
func JitterStrategyFromToggles(toggles featuremgmt.FeatureToggles) JitterStrategy {
	strategy := JitterNever
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRules) {
Contributor


Is it possible to set both toggles at once? Should we return an error if this happens?

Contributor Author


It is possible; jitter by rule will win, and I think this is desirable, especially with how we are positioning jitter by rule as an additional escape hatch.

Given Cloud's feature toggle admin page (where we might make this togglable), I don't want to provide a path where someone can create an invalid config.
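As a sketch of what that precedence could look like inside JitterStrategyFromToggles (my assumption, not the PR's exact code; the second constant name is a placeholder guess for the within-group toggle):

	strategy := JitterNever
	// Assumed mapping: the base jitter flag selects group-level offsets...
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRules) {
		strategy = JitterByGroup
	}
	// ...and checking the by-rule flag last means it wins when both are set.
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRulesWithinGroups) { // placeholder constant name
		strategy = JitterByRule
	}
	return strategy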

}

func jitterHash(r *ngmodels.AlertRule, strategy JitterStrategy) uint64 {
	l := labels.New(
Contributor


Do we need to create a slice of labels for this? Could we create a fnv64 and just write to it instead?

Contributor Author


Yeah, we could do that; I simply ripped the basic hashing logic off of Prometheus 😆
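For reference, a minimal sketch of the reviewer's suggestion (not part of this PR): write the grouping key straight into an FNV-64a hash instead of building a label set first. The key components used here (org ID, folder UID, group name) are assumptions about what identifies a group:

import (
	"hash/fnv"
	"strconv"
)

// jitterHashDirect is a hypothetical alternative to jitterHash that avoids
// allocating a label slice. Separator bytes keep ("ab", "c") and ("a", "bc")
// from colliding.
func jitterHashDirect(orgID int64, folderUID, ruleGroup string) uint64 {
	h := fnv.New64a()
	_, _ = h.Write([]byte(strconv.FormatInt(orgID, 10)))
	_, _ = h.Write([]byte{0xff})
	_, _ = h.Write([]byte(folderUID))
	_, _ = h.Write([]byte{0xff})
	_, _ = h.Write([]byte(ruleGroup))
	return h.Sum64()
}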

for _, r := range rules2 {
	offset := jitterOffsetInTicks(r, baseInterval, JitterByGroup)
	require.Equal(t, group2Offset, offset)
}
Contributor


Perhaps just add a third assertion here where you take a rule from rules1 and another rule from rules2 and assert that their offsets are different?

Contributor Author


This assertion isn't always true, though: there is a relatively small number of buckets, and we are not guaranteed that each group hashes into a different bucket; there is a small chance they land in the same one. Writing a test that expects the hashes to be different seems misleading, since they aren't always different by design.

I had to step very carefully in these tests to avoid flakes; for this reason, I avoided asserting that the hashes actually differ.

Contributor Author


I was thinking about writing a test that asserts on the distribution of a large number of rules to somewhat cover this case, but the result was pretty complex and would still flake sometimes, since it relies on statistics. It felt like more trouble than it was worth.

Contributor


You're right! Let's keep it as it is

		upperLimit := r.IntervalSeconds / int64(baseInterval.Seconds())
		require.Less(t, offset, upperLimit, "offset cannot be equal to or greater than interval/baseInterval of %d", upperLimit)
	}
})
Contributor


Same here, perhaps just add another test that shows the offset is different for different rules?

Contributor


Same as the previous comment, I think we can ignore this. I wasn't thinking when I wrote that.

pkg/services/ngalert/schedule/jitter_test.go — resolved
pkg/services/ngalert/schedule/schedule.go — outdated, resolved
Contributor

@gotjosh left a comment


LGTM - very clean and clear. Great job! My comments are nits and I don't expect them to be addressed here. I don't need to see this again.

pkg/services/ngalert/schedule/jitter.go — outdated, resolved
pkg/services/ngalert/schedule/jitter.go — resolved
		offset := jitterOffsetInTicks(rule, baseInterval, JitterByRule)
		require.Equal(t, original, offset, "jitterOffsetInTicks should return the same value for the same rule")
	}
})
Contributor


nit:

A very helpful test for me would be to get an understanding of the spread of the distribution with the current jittering strategy.

For example, I'd love to see both by rule and by rule group -- what is the spread with 10 groups or 10 rules that have the exact same evaluation interval?
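For what it's worth, here is a sketch of what such a spread check could look like (not part of this PR; the AlertRule fields used are assumptions): it creates 10 single-rule groups with the same 60-second interval and logs how their offsets land across the 6 available ticks.

// (assumes the usual testing, time, and fmt imports in this test file)
func TestJitterSpreadExample(t *testing.T) {
	baseInterval := 10 * time.Second
	counts := map[int64]int{}
	for i := 0; i < 10; i++ {
		r := &ngmodels.AlertRule{
			OrgID:           1,
			NamespaceUID:    "folder-1",
			RuleGroup:       fmt.Sprintf("group-%d", i),
			IntervalSeconds: 60,
		}
		counts[jitterOffsetInTicks(r, baseInterval, JitterByGroup)]++
	}
	// With 10 groups and only 6 buckets, some collisions are expected, so this
	// reports the spread rather than asserting an exact distribution.
	t.Logf("offset distribution across ticks: %v", counts)
}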

@grobinson-grafana
Contributor

I've just started testing with the feature flag jitterAlertRules and there is something that doesn't quite make sense to me.

I have 5 alerts spread across 4 evaluation groups, all in the same folder. These are their computed offsets:

INFO [01-18|14:43:30] offset in ticks                          logger=ngalert.scheduler ticks=2 rule="Test 1" group="Group 1"
INFO [01-18|14:43:30] offset in ticks                          logger=ngalert.scheduler ticks=2 rule="Test 2" group="Group 1"
INFO [01-18|14:43:30] offset in ticks                          logger=ngalert.scheduler ticks=1 rule="Test 3" group="Group 2"
INFO [01-18|14:43:30] offset in ticks                          logger=ngalert.scheduler ticks=3 rule="Test 4" group="Group 3"

If rules Test 1 and Test 2 have the same offset (2 ticks), wouldn't I expect to see those two rules evaluate at the same time? What I'm seeing instead is that Test 1 evaluates 5 seconds before Test 2. I would expect that with the jitterAlertRulesWithGroups feature flag, but not with jitterAlertRules.

INFO [01-18|14:45:20] evaluating rule                          logger=ngalert.scheduler rule="Test 1"
INFO [01-18|14:45:20] Sending alerts to local notifier         logger=ngalert.sender.router rule_uid=f065f544-dd24-48a1-af8e-f1b139f76fa8 org_id=1 count=1
INFO [01-18|14:45:25] evaluating rule                          logger=ngalert.scheduler rule="Test 2"
INFO [01-18|14:45:25] Sending alerts to local notifier         logger=ngalert.sender.router rule_uid=d8d68bc8-50da-4acf-9fde-66b41008ce53 org_id=1 count=1

Here are the additional log lines I added:

diff --git a/pkg/services/ngalert/schedule/schedule.go b/pkg/services/ngalert/schedule/schedule.go
index a428b6d1bda..5c73694ca6b 100644
--- a/pkg/services/ngalert/schedule/schedule.go
+++ b/pkg/services/ngalert/schedule/schedule.go
@@ -297,6 +297,7 @@ func (sch *schedule) processTick(ctx context.Context, dispatcherGroup *errgroup.

                itemFrequency := item.IntervalSeconds / int64(sch.baseInterval.Seconds())
                offset := jitterOffsetInTicks(item, sch.baseInterval, sch.jitterEvaluations)
+               sch.log.Info("offset in ticks", "ticks", offset, "rule", item.Title, "group", item.RuleGroup)
                isReadyToRun := item.IntervalSeconds != 0 && (tickNum%itemFrequency)-offset == 0

                var folderTitle string
@@ -405,7 +406,7 @@ func (sch *schedule) ruleRoutine(grafanaCtx context.Context, key ngmodels.AlertR
        evaluate := func(ctx context.Context, f fingerprint, attempt int64, e *evaluation, span trace.Span, retry bool) error {
                logger := logger.New("version", e.rule.Version, "fingerprint", f, "attempt", attempt, "now", e.scheduledAt).FromContext(ctx)
                start := sch.clock.Now()
-
+               sch.log.Info("evaluating rule", "rule", e.rule.Title)
                evalCtx := eval.NewContextWithPreviousResults(ctx, SchedulerUserFor(e.rule.OrgID), sch.newLoadedMetricsReader(e.rule))
                if sch.evaluatorFactory == nil {
                        panic("evalfactory nil")

@alexweav
Contributor Author

@grobinson-grafana This is expected: 5 seconds is less than one baseInterval (10s), so they are still scheduled on the same tick.

Check this snippet:

var step int64 = 0
if len(readyToRun) > 0 {
	step = sch.baseInterval.Nanoseconds() / int64(len(readyToRun))
}
for i := range readyToRun {
	item := readyToRun[i]
	time.AfterFunc(time.Duration(int64(i)*step), func() {
		// ... the evaluation of item is kicked off here, staggered by i*step ...
	})
}

If N rules are scheduled on a given tick, the scheduler spreads them out evenly over the baseInterval. So, if you get two rules on a tick, they will run 5 seconds apart.

This PR:

  • Only affects the per-baseInterval bucketing of rules, and does not touch this finer spreading logic.
  • Works in conjunction with that finer spreading logic to get a smooth distribution over time.
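As a worked example (reading the logs above, and assuming ticks align with the start of the minute): both Test 1 and Test 2 became ready on the same tick, 2 ticks (20 seconds) into the minute. With two rules ready on that tick, the step is 10s / 2 = 5s, which is why Test 1 evaluates at 14:45:20 and Test 2 at 14:45:25.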

@grobinson-grafana
Contributor

OK cool! I tested the other feature flags and those worked as I expected so LGTM 👍

Contributor

@grobinson-grafana left a comment


Just see the comments from Josh and me, but LGTM!

@grafana-pr-automation grafana-pr-automation bot requested review from a team and removed request for a team January 18, 2024 15:55
Contributor

@yuri-tceretian left a comment


LGTM. Great job!

@grafana-pr-automation grafana-pr-automation bot requested review from a team and removed request for a team January 18, 2024 17:42
@alexweav alexweav merged commit 00a260e into main Jan 18, 2024
19 checks passed
@alexweav alexweav deleted the alexweav/simple-jitter branch January 18, 2024 18:48
alexweav added a commit that referenced this pull request Jan 26, 2024
…80766)

* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels
alexweav added a commit that referenced this pull request Jan 26, 2024
…80766)

* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels
tolzhabayev pushed a commit that referenced this pull request Feb 28, 2024
…over time (#81404)

* Alerting: Add setting to distribute rule group evaluations over time (#80766)

* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels

* Fix API change in registry.go

* empty commit to kick build
@aangelisc aangelisc modified the milestones: 10.4.x, 10.4.0 Mar 6, 2024
Labels
add to changelog, add to what's new, area/alerting, area/backend, area/frontend, no-backport, type/docs