Update mixin for TempoIngesterFlushes thresholds #1354
Force-pushed from 668a33d to cc59827.
    severity: 'warning',
  },
  annotations: {
    message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
I think it would be good to tweak the message so the warning and critical alerts read differently. I don't have a recommendation, though... any ideas?
Good idea, I'll wordsmith around a little.
Updated a little. How's that read to you now?
Force-pushed from cc59827 to 2df00cf.
expr: |||
  sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s and
  sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
  sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s and
What do we think about keying the "unhealthy" warning alert on the old metric and the "failing" critical alert on the retries metric? Then, for consistency, keep the for: 5m clause on both?
That could be reasonable. The goal being to always know, at warning level, when a flush has failed, but to get paged only when a retry fails for the 5m duration?
Yeah, that's what I was thinking. 👍
Okay, I've made the update.
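For readers following the thread, the arrangement agreed on above might look roughly like the Jsonnet sketch below. It is an illustration, not the merged change. The alert names, the `config` object, the threshold value, the grouping labels, the critical message wording, and the second line of the critical expression are all assumptions. Only the two metric names, the warning message text, and the for: 5m clause come from the diff and comments.

```jsonnet
// Hypothetical sketch of the two alerts discussed in the review.
// All names and values in `config` are assumptions for illustration.
local config = {
  group_by: 'cluster, namespace',  // assumed aggregation labels
  flushes_per_hour_failed: 2,      // assumed threshold
};

{
  alerts: [
    {
      // Warning: keyed on the original failed-flushes metric.
      alert: 'TempoIngesterFlushesUnhealthy',  // assumed name
      expr: |||
        sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s and
        sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
      ||| % [config.group_by, config.flushes_per_hour_failed, config.group_by],
      // `for` is a Jsonnet keyword, so the field name must be quoted.
      'for': '5m',
      labels: { severity: 'warning' },
      annotations: {
        message: 'Greater than %s flushes have failed in the past hour.' % config.flushes_per_hour_failed,
      },
    },
    {
      // Critical: keyed on the failed-retries metric, so it pages only
      // when retried flushes are also failing.
      alert: 'TempoIngesterFlushesFailing',  // assumed name
      expr: |||
        sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s and
        sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[5m])) > 0
      ||| % [config.group_by, config.flushes_per_hour_failed, config.group_by],
      'for': '5m',
      labels: { severity: 'critical' },
      annotations: {
        // Assumed wording, chosen only to differ from the warning text.
        message: 'Greater than %s flush retries have failed in the past hour.' % config.flushes_per_hour_failed,
      },
    },
  ],
}
```

Keeping the same for: 5m on both rules means the two alerts differ only in which metric they watch and in severity, which matches the consistency point raised in the review.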
Force-pushed from 70b2613 to 0ca1ccd.
What this PR does:
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]