
Alerting: Add setting to distribute rule group evaluations over time #80766

Merged
merged 11 commits into main from alexweav/simple-jitter
Jan 18, 2024

Conversation

alexweav
Contributor

What is this feature?

Currently, rule evaluations always run at the beginning of their interval. A rule running every minute will always execute near the start of the minute; a rule with a 5m interval will always execute near 1:00, 1:05, 1:10, and so on. This is true for all Grafana instances with rules, meaning that queries are aligned across instances, resulting in spiky network traffic and database load.

Some minor jitter already exists within each baseInterval, but the baseInterval is hardcoded to 10 seconds, so evaluations all happen within roughly 10 seconds of each other.

This PR jitters rule evaluations over their entire interval. Given a rule scheduled every minute, which would originally run on ticks 0, 6, 12, 18, and so on, the evaluation tick is now shifted by an offset derived from a hash value. With an offset of 1, the rule might run on ticks 1, 7, 13, 19, ...; with an offset of 5, it might run on ticks 5, 11, 17, 23, ...

  • The distribution considers the interval of the rule in question, so rules with a long interval have more buckets.
  • We use a stable hash to calculate the offset, so rules loaded onto multiple HA replicas still evaluate together.
  • All queries use a shifted timestamp that respects the calculated offset. For example, a query running each minute with an offset of 1 tick will use 1:00:10 as the query time.

This approach works together with the minor jitter described above: the offset distributes evaluations across baseInterval-sized buckets, and the existing jitter inside each baseInterval provides additional smoothing within those buckets. The result is that rule evaluations are evenly distributed across the evaluation window, down to the nanosecond.
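For illustration, here is a minimal sketch of that offset calculation, reusing names that appear elsewhere in this PR (jitterOffsetInTicks, jitterHash, JitterStrategy); the actual implementation may differ in its details:

// Sketch only: the number of buckets is the rule's interval divided by the
// scheduler's baseInterval, and a stable hash of the rule (or its group)
// picks one of those buckets as the offset, in ticks.
func jitterOffsetInTicks(r *ngmodels.AlertRule, baseInterval time.Duration, strategy JitterStrategy) int64 {
	if strategy == JitterNever {
		return 0
	}
	buckets := r.IntervalSeconds / int64(baseInterval.Seconds()) // e.g. 60s / 10s = 6 buckets
	if buckets <= 0 {
		return 0
	}
	return int64(jitterHash(r, strategy) % uint64(buckets))
}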

This PR provides two toggles, allowing a choice between distributing by group or by rule. When distributing by group, all rules in the same group share the same offset, so groups are scheduled together and the distribution is not perfectly smooth. This seems to be the preferred choice as we move toward a group-based concept in Grafana Alerting. For users with excessively large groups, distribution by rule is provided as an escape hatch.
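To make the choice between the toggles concrete, here is a minimal sketch of the strategy type they imply, using names taken from the PR's test code (the real definition may differ):

// Sketch only: JitterNever is the default when neither toggle is set,
// JitterByGroup gives all rules in a group the same offset, and
// JitterByRule offsets each rule independently.
type JitterStrategy int

const (
	JitterNever JitterStrategy = iota
	JitterByGroup
	JitterByRule
)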

Why do we need this feature?

Reduces database query load, processing spikes, and bursts of network traffic.
On average, this appears to reduce resource overhead by around 5%. Aggregated across a mixture of sources, rule evaluations complete 5–10% faster because of the absence of load spikes.

Who is this feature for?

Operators of systems that Grafana queries via alert rules.

Which issue(s) does this PR fix?:

Fixes https://github.com/grafana/alerting-squad/issues/625
Fixes #53744

Special notes for your reviewer:

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@alexweav alexweav added this to the 10.4.x milestone Jan 17, 2024
@alexweav alexweav requested review from grafanabot and a team as code owners January 17, 2024 22:03
@alexweav alexweav requested review from a team, rwwiv, JacobsonMT, yuri-tceretian and grobinson-grafana and removed request for a team January 17, 2024 22:03
@grafana-pr-automation grafana-pr-automation bot added the type/docs and area/frontend labels Jan 17, 2024
@alexweav alexweav force-pushed the alexweav/simple-jitter branch from c861e42 to d3ed74f January 17, 2024 22:07
@grafana-pr-automation grafana-pr-automation bot requested review from a team and oshirohugo and removed request for a team January 17, 2024 22:11
Contributor

@grobinson-grafana left a comment


I didn't have time to test it, but otherwise it looks great! I will test later this afternoon; until then, just a couple of comments. Nice work 👍

// JitterStrategyFromToggles returns the JitterStrategy indicated by the current Grafana feature toggles.
func JitterStrategyFromToggles(toggles featuremgmt.FeatureToggles) JitterStrategy {
	strategy := JitterNever
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRules) {
Contributor


Is it possible to set both toggles at once? Should we return an error if this happens?

Contributor Author


It is possible; jitter by rule will win, and I think this is desirable, especially with how we are positioning jitter by rule as an additional escape hatch.

Given Cloud's feature toggle admin page (where we might make this togglable), I don't want to provide a path where someone can create an invalid config.
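As a sketch of what that precedence could look like inside JitterStrategyFromToggles (my assumption, not the PR's exact code; the second constant name is a placeholder guess for the within-group toggle):

	strategy := JitterNever
	// Assumed mapping: the base jitter flag selects group-level offsets...
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRules) {
		strategy = JitterByGroup
	}
	// ...and checking the by-rule flag last means it wins when both are set.
	if toggles.IsEnabledGlobally(featuremgmt.FlagJitterAlertRulesWithinGroups) { // placeholder constant name
		strategy = JitterByRule
	}
	return strategy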

}

func jitterHash(r *ngmodels.AlertRule, strategy JitterStrategy) uint64 {
	l := labels.New(
Contributor


Do we need to create a slice of labels for this? Could we create a fnv64 and just write to it instead?

Contributor Author


Yeah, we could do that; I simply ripped the basic hashing logic off of Prometheus 😆
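For reference, a minimal sketch of the reviewer's suggestion (not part of this PR): write the grouping key straight into an FNV-64a hash instead of building a label set first. The key components used here (org ID, folder UID, group name) are assumptions about what identifies a group:

import (
	"hash/fnv"
	"strconv"
)

// jitterHashDirect is a hypothetical alternative to jitterHash that avoids
// allocating a label slice. Separator bytes keep ("ab", "c") and ("a", "bc")
// from colliding.
func jitterHashDirect(orgID int64, folderUID, ruleGroup string) uint64 {
	h := fnv.New64a()
	_, _ = h.Write([]byte(strconv.FormatInt(orgID, 10)))
	_, _ = h.Write([]byte{0xff})
	_, _ = h.Write([]byte(folderUID))
	_, _ = h.Write([]byte{0xff})
	_, _ = h.Write([]byte(ruleGroup))
	return h.Sum64()
}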

for _, r := range rules2 {
	offset := jitterOffsetInTicks(r, baseInterval, JitterByGroup)
	require.Equal(t, group2Offset, offset)
}
Contributor


Perhaps just add a third assertion here where you take a rule from rules1 and another rule from rules2 and assert that their offsets are different?

Contributor Author


This assertion isn't always true, though: there is a relatively small number of buckets, and we are not guaranteed that each group hashes into a different bucket; there is a small chance they land in the same one. Writing a test that expects the hashes to be different seems misleading, since they aren't always different by design.

I had to step very carefully in these tests to avoid flakes; for this reason, I avoided asserting that the hashes actually differ.

Contributor Author


I was thinking about writing a test that asserts on the distribution of a large number of rules to somewhat cover this case, but the result was pretty complex and would still flake sometimes, since it relies on statistics. It felt like more trouble than it was worth.

Contributor


You're right! Let's keep it as it is

		upperLimit := r.IntervalSeconds / int64(baseInterval.Seconds())
		require.Less(t, offset, upperLimit, "offset cannot be equal to or greater than interval/baseInterval of %d", upperLimit)
	}
})
Contributor


Same here, perhaps just add another test that shows the offset is different for different rules?

Contributor


Same as the previous comment, I think we can ignore this. I wasn't thinking when I wrote that.

pkg/services/ngalert/schedule/jitter_test.go — resolved
pkg/services/ngalert/schedule/schedule.go — outdated, resolved
Contributor

@gotjosh left a comment


LGTM - very clean and clear. Great job! My comments are nits and I don't expect them to be addressed here. I don't need to see this again.

pkg/services/ngalert/schedule/jitter.go — outdated, resolved
pkg/services/ngalert/schedule/jitter.go — resolved
		offset := jitterOffsetInTicks(rule, baseInterval, JitterByRule)
		require.Equal(t, original, offset, "jitterOffsetInTicks should return the same value for the same rule")
	}
})
Contributor


nit:

A very helpful test for me would be to get an understanding of the spread of the distribution with the current jittering strategy.

For example, I'd love to see both by rule and by rule group -- what is the spread with 10 groups or 10 rules that have the exact same evaluation interval?
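For what it's worth, here is a sketch of what such a spread check could look like (not part of this PR; the AlertRule fields used are assumptions): it creates 10 single-rule groups with the same 60-second interval and logs how their offsets land across the 6 available ticks.

// (assumes the usual testing, time, and fmt imports in this test file)
func TestJitterSpreadExample(t *testing.T) {
	baseInterval := 10 * time.Second
	counts := map[int64]int{}
	for i := 0; i < 10; i++ {
		r := &ngmodels.AlertRule{
			OrgID:           1,
			NamespaceUID:    "folder-1",
			RuleGroup:       fmt.Sprintf("group-%d", i),
			IntervalSeconds: 60,
		}
		counts[jitterOffsetInTicks(r, baseInterval, JitterByGroup)]++
	}
	// With 10 groups and only 6 buckets, some collisions are expected, so this
	// reports the spread rather than asserting an exact distribution.
	t.Logf("offset distribution across ticks: %v", counts)
}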

@grobinson-grafana
Contributor

I've just started testing with the feature flag jitterAlertRules and there is something that doesn't quite make sense to me.

I have 5 alerts spread across 4 evaluation groups, all in the same folder. These are their computed offsets:

INFO [01-18|14:43:30] offset in ticks                          logger=ngalert.scheduler ticks=2 rule="Test 1" group="Group 1"
INFO [01-18|14:43:30] offset in ticks                          logger=ngalert.scheduler ticks=2 rule="Test 2" group="Group 1"
INFO [01-18|14:43:30] offset in ticks                          logger=ngalert.scheduler ticks=1 rule="Test 3" group="Group 2"
INFO [01-18|14:43:30] offset in ticks                          logger=ngalert.scheduler ticks=3 rule="Test 4" group="Group 3"

If rules Test 1 and Test 2 have the same offset (2 ticks), wouldn't I expect to see those two rules evaluate at the same time? What I'm seeing instead is that Test 1 evaluates 5 seconds before Test 2. I would expect that with the jitterAlertRulesWithGroups feature flag, but not with jitterAlertRules.

INFO [01-18|14:45:20] evaluating rule                          logger=ngalert.scheduler rule="Test 1"
INFO [01-18|14:45:20] Sending alerts to local notifier         logger=ngalert.sender.router rule_uid=f065f544-dd24-48a1-af8e-f1b139f76fa8 org_id=1 count=1
INFO [01-18|14:45:25] evaluating rule                          logger=ngalert.scheduler rule="Test 2"
INFO [01-18|14:45:25] Sending alerts to local notifier         logger=ngalert.sender.router rule_uid=d8d68bc8-50da-4acf-9fde-66b41008ce53 org_id=1 count=1

Here are the additional log lines I added:

diff --git a/pkg/services/ngalert/schedule/schedule.go b/pkg/services/ngalert/schedule/schedule.go
index a428b6d1bda..5c73694ca6b 100644
--- a/pkg/services/ngalert/schedule/schedule.go
+++ b/pkg/services/ngalert/schedule/schedule.go
@@ -297,6 +297,7 @@ func (sch *schedule) processTick(ctx context.Context, dispatcherGroup *errgroup.

                itemFrequency := item.IntervalSeconds / int64(sch.baseInterval.Seconds())
                offset := jitterOffsetInTicks(item, sch.baseInterval, sch.jitterEvaluations)
+               sch.log.Info("offset in ticks", "ticks", offset, "rule", item.Title, "group", item.RuleGroup)
                isReadyToRun := item.IntervalSeconds != 0 && (tickNum%itemFrequency)-offset == 0

                var folderTitle string
@@ -405,7 +406,7 @@ func (sch *schedule) ruleRoutine(grafanaCtx context.Context, key ngmodels.AlertR
        evaluate := func(ctx context.Context, f fingerprint, attempt int64, e *evaluation, span trace.Span, retry bool) error {
                logger := logger.New("version", e.rule.Version, "fingerprint", f, "attempt", attempt, "now", e.scheduledAt).FromContext(ctx)
                start := sch.clock.Now()
-
+               sch.log.Info("evaluating rule", "rule", e.rule.Title)
                evalCtx := eval.NewContextWithPreviousResults(ctx, SchedulerUserFor(e.rule.OrgID), sch.newLoadedMetricsReader(e.rule))
                if sch.evaluatorFactory == nil {
                        panic("evalfactory nil")

@alexweav
Contributor Author

@grobinson-grafana This is expected: 5 seconds is less than one baseInterval (10s), so they are still scheduled on the same tick.

Check this snippet:

var step int64 = 0
if len(readyToRun) > 0 {
	step = sch.baseInterval.Nanoseconds() / int64(len(readyToRun))
}
for i := range readyToRun {
	item := readyToRun[i]
	time.AfterFunc(time.Duration(int64(i)*step), func() {
		// ... the evaluation of item is kicked off here, staggered by i*step ...
	})
}

If N rules are scheduled on a given tick, the scheduler spreads them out evenly over the baseInterval. So, if you get two rules on a tick, they will run 5 seconds apart.

This PR:

  • Only affects the per-baseInterval bucketing of rules, and does not touch this finer spreading logic.
  • Works in conjunction with that finer spreading logic to get a smooth distribution over time.
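As a worked example (reading the logs above, and assuming ticks align with the start of the minute): both Test 1 and Test 2 became ready on the same tick, 2 ticks (20 seconds) into the minute. With two rules ready on that tick, the step is 10s / 2 = 5s, which is why Test 1 evaluates at 14:45:20 and Test 2 at 14:45:25.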

@grobinson-grafana
Contributor

OK cool! I tested the other feature flags and those worked as I expected so LGTM 👍

Contributor

@grobinson-grafana left a comment


Just see the comments from Josh and me, but LGTM!

@grafana-pr-automation grafana-pr-automation bot requested review from a team and removed request for a team January 18, 2024 15:55
Contributor

@yuri-tceretian left a comment


LGTM. Great job!

@grafana-pr-automation grafana-pr-automation bot requested review from a team and removed request for a team January 18, 2024 17:42
@alexweav alexweav merged commit 00a260e into main Jan 18, 2024
19 checks passed
@alexweav alexweav deleted the alexweav/simple-jitter branch January 18, 2024 18:48
alexweav added a commit that referenced this pull request Jan 26, 2024
…80766)

* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels
alexweav added a commit that referenced this pull request Jan 26, 2024
…80766)

* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels
tolzhabayev pushed a commit that referenced this pull request Feb 28, 2024
…over time (#81404)

* Alerting: Add setting to distribute rule group evaluations over time (#80766)

* Simple, per-base-interval jitter

* Add log just for test purposes

* Add strategy approach, allow choosing between group or rule

* Add flag to jitter rules

* Add second toggle for jittering within a group

* Wire up toggles to strategy

* Slightly improve comment ordering

* Add tests for offset generation

* Rename JitterStrategyFrom

* Improve debug log message

* Use grafana SDK labels rather than prometheus labels

* Fix API change in registry.go

* empty commit to kick build
@aangelisc aangelisc modified the milestones: 10.4.x, 10.4.0 Mar 6, 2024
Labels
add to changelog, add to what's new, area/alerting, area/backend, area/frontend, no-backport, type/docs