-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stateless ruler restores alert state #5230
Conversation
3e2375c
to
cdc28e3
Compare
2481793
to
0a7e3a2
Compare
It would be great to fix this issue. Is this ready for review? |
e9c86ae
to
b1bce0f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yeya24 is anything still missing from this PR? It's looking good to me overall. Would be also great to run it against the alert compliance test to make sure alerting will now also properly work with stateless ruler.
test/e2e/rule_test.go
Outdated
testutil.Ok(t, e2e.StartAndWaitReady(rulers[1])) | ||
|
||
// Wait for 4 * rule evaluation interval to make sure the alert state is restored. | ||
time.Sleep(time.Second * 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way we could use metrics for synchronization instead of hard-coding time duration? My concern is that this will be flaky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... I am still looking for which metrics to use to indicate that the alerts states are restored but failed to really find one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to wait for 4 times of rule evaluation intervals. Let's see if it is better this time...
test/e2e/rule_test.go
Outdated
// Wait for 4 * rule evaluation interval to make sure the alert state is restored. | ||
time.Sleep(time.Second * 8) | ||
// Wait until the alert is firing on the second ruler. | ||
for { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest using runutil.Repeat
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
logger log.Logger | ||
promClients []*promclient.Client | ||
queryClients []*httpconfig.Client | ||
restoreIgnoreLabels []string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could make this a bit more generic and rename it to ignoredLabelNames
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
@@ -189,6 +190,8 @@ func (qc *queryConfig) registerFlag(cmd extkingpin.FlagClause) *queryConfig { | |||
Default("POST").EnumVar(&qc.httpMethod, "GET", "POST") | |||
cmd.Flag("query.sd-dns-resolver", "Resolver to use. Possible options: [golang, miekgdns]"). | |||
Default("golang").Hidden().StringVar(&qc.dnsSDResolver) | |||
cmd.Flag("query.default-step", "Default range query step to use. This is only used in stateless Ruler and alert state restoration."). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this so small by default? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether we want to expose this to users or not. The step can be calculated simply by max((maxt - mint) / 250, 1s)
. But in E2E test we might still want to modify it.
For the default value 1s
, I just use the same default query step set in query.go
. Do you have any suggested value?
Thanks for the review Matej. I am still struggling fixing the E2E test and the results are flaky. Besides, unit tests are still missing for the new queryable. |
FWIW, I deployed this branch and verified that |
Is there anything left to do here? |
24859f5
to
a41ea98
Compare
Sorry for the delay. I will focus on this pr next week to fix the broken E2E test. |
@yeya24 It'll be great if we can release v0.28.2 to include this important fix! |
1f86600
to
d865211
Compare
@GiedriusS @matej-g This is ready for another review now. E2E test passed finally. |
@yeya24 have you also tried to run it through the compliance test? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, thanks @yeya24! It would be great to also see results of the compliance test, but this has been opened already long enough and if we still find any discrepancy with compliance, we can iterate on this.
Yeah probably let's iterate on this. I tried the compliance test on both stateful ruler and stateless ruler but all got errors like below:
Need more investigation on this. |
Can we merge this PR now? @yeya24 @GiedriusS @matej-g |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just fix the changelog conflict and get this in, cc @yeya24
Signed-off-by: Ben Ye <benye@amazon.com>
Signed-off-by: Ben Ye <benye@amazon.com>
Signed-off-by: Ben Ye <benye@amazon.com>
3ee2810
to
6d3d6fb
Compare
Signed-off-by: Ben Ye <benye@amazon.com>
6d3d6fb
to
b3b97d7
Compare
I will merge it then and follow up later. |
* stateless ruler restores alert state Signed-off-by: Ben Ye <benye@amazon.com> * update e2e Signed-off-by: Ben Ye <benye@amazon.com> * update compatibility test Signed-off-by: Ben Ye <benye@amazon.com> * update changelog Signed-off-by: Ben Ye <benye@amazon.com> Signed-off-by: Ben Ye <benye@amazon.com> Co-authored-by: Ben Ye <ben.ye@bytedance.com>
Hello, I am not able to restore alert state from my setup. After looking at e2e tests from the PR I suspect that the test Does anyone succeeded to restore their alert state? logs from e2e:
|
Edit, it works! |
* stateless ruler restores alert state Signed-off-by: Ben Ye <benye@amazon.com> * update e2e Signed-off-by: Ben Ye <benye@amazon.com> * update compatibility test Signed-off-by: Ben Ye <benye@amazon.com> * update changelog Signed-off-by: Ben Ye <benye@amazon.com> Signed-off-by: Ben Ye <benye@amazon.com> Co-authored-by: Ben Ye <ben.ye@bytedance.com>
Signed-off-by: Ben Ye ben.ye@bytedance.com
Changes
The idea here is still reusing the upstream code and implement a queryable for thanos queriers.
That way the prometheus rule manager can get the
ALERTS_FOR_STATE
series from thanos queriers and restore the state.Verification
E2E test added.