#3513: Add GroupMarker interface #3792
types/types.go
Outdated
// Muted returns true if the group is muted, otherwise false. If the group
// is muted then it also returns the names of the time intervals that muted
// it.
Muted(groupKey string) ([]string, bool)
In the original version of `GroupMarker` I had:
type GroupMarker interface {
// Muted returns true if the group is muted, otherwise false. If the group
// is muted then it also returns the names of the time intervals that muted
// it.
Muted(groupKey string, fingerprint model.Fingerprint) ([]string, bool)
...
}
but then I realized we didn't need to store the fingerprint at all.
The reason here is that active and mute timings work against routes. That means either all alerts in an aggregation group are suppressed because of time intervals, or none of them are. It is not possible to have two alerts A and B in the same aggregation group, where A is muted from a mute time interval and B is not.
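To illustrate the reasoning above, here is a minimal sketch. It is not the Alertmanager implementation: the `route` and `aggrGroup` types and the `mutedBy` helper are hypothetical names made up for this example. The point is that the decision depends only on the route's time intervals, so it is computed once per group and every alert in that group shares the result.

```go
package main

import "fmt"

// route is a hypothetical, simplified stand-in for a node in the routing tree.
type route struct {
	muteTimeIntervals []string
}

// aggrGroup is a hypothetical aggregation group: alerts that share a route.
type aggrGroup struct {
	route  *route
	alerts []string
}

// mutedBy returns the names of the route's mute time intervals that are
// currently in effect. It never inspects an individual alert.
func mutedBy(r *route, inEffect map[string]bool) []string {
	var names []string
	for _, name := range r.muteTimeIntervals {
		if inEffect[name] {
			names = append(names, name)
		}
	}
	return names
}

func main() {
	g := aggrGroup{
		route:  &route{muteTimeIntervals: []string{"weekends"}},
		alerts: []string{"A", "B"},
	}
	// One decision for the whole group, reused for every alert in it.
	names := mutedBy(g.route, map[string]bool{"weekends": true})
	for _, a := range g.alerts {
		fmt.Printf("alert %s muted=%t by %v\n", a, len(names) > 0, names)
	}
}
```

Since no per-alert input exists, a fingerprint argument would add nothing to the lookup.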
We have found an issue: because `groupKey` is not guaranteed to be unique, it's possible for two (or more) different groups to have the same `groupKey`. In such cases, where one group is muted and another group is not, both will be marked as muted.
The following YAML shows a configuration containing two routes with the same `groupKey`. This happens because the `groupKey` is calculated using:
- The matchers in the route
- The `group_by` labels (default is empty)
And both routes have the same matchers and `group_by` labels.
receivers:
  - name: test1
  - name: test2
route:
  receiver: test1
  routes:
    - receiver: test1
      matchers:
        - foo=bar
    - receiver: test2
      matchers:
        - foo=bar
mute_time_intervals:
- name: weekends
The reason a user might have such a configuration is to mute notifications on the weekends, but still send webhooks to an issue tracker.
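The collision can be sketched as follows. This is an illustration under stated assumptions, not Alertmanager's code: the `groupKey` function below is a simplified, hypothetical stand-in that uses only the matchers and the `group_by` labels, and the `route0`/`route1` identifiers are invented for the example.

```go
package main

import "fmt"

// groupKey is a simplified, hypothetical key function. It uses only the
// route's matchers and the group_by labels, as described above.
func groupKey(matchers, groupLabels string) string {
	return fmt.Sprintf("{%s}:{%s}", matchers, groupLabels)
}

func main() {
	// Both routes match foo=bar and use the default (empty) group_by.
	key1 := groupKey(`foo="bar"`, "")
	key2 := groupKey(`foo="bar"`, "")

	// Keyed by group key alone, the two routes share a single entry, so the
	// second write replaces the first.
	byKey := map[string][]string{}
	byKey[key1] = nil                  // first route (test1): not muted
	byKey[key2] = []string{"weekends"} // second route (test2): muted
	fmt.Println(len(byKey), byKey[key1])

	// Keyed by a route identifier plus the group key, the entries stay apart.
	byRouteAndKey := map[string][]string{}
	byRouteAndKey["route0"+key1] = nil
	byRouteAndKey["route1"+key2] = []string{"weekends"}
	fmt.Println(len(byRouteAndKey), byRouteAndKey["route0"+key1])
}
```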
7fb35af to f3b9659
e5b9050 to 4b094e9
// GroupMarker helps to mark groups as active or muted.
// All methods are goroutine-safe.
//
// TODO(grobinson): routeID is used in Muted and SetMuted because groupKey
I added a TODO in the source code to make it clear that this is not how I want the interface to look. Fixing this is blocked by #3817, which we will fix in another PR.
4b094e9 to 0d47e10
  // All methods are goroutine-safe.
- type Marker interface {
+ type AlertMarker interface {
I renamed this interface to `AlertMarker` so it could be better differentiated from `GroupMarker`.
func (m *MemMarker) Muted(routeID, groupKey string) ([]string, bool) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	status, ok := m.groups[routeID+groupKey]
Since `routeID` will be removed in the future, I have chosen to just concatenate the two strings together.
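A minimal sketch of a marker keyed by the concatenated strings might look like the following. It is assembled from the fragments quoted in this review, so treat it as an approximation: the `marker` type and `newMarker` constructor are hypothetical stand-ins for `MemMarker`, and the `SetMuted` signature is an assumption inferred from the snippets rather than copied from the source.

```go
package main

import (
	"fmt"
	"sync"
)

// groupStatus mirrors the struct discussed in this PR: it only records which
// time intervals are muting the group.
type groupStatus struct {
	mutedBy []string
}

// marker is a hypothetical, cut-down stand-in for MemMarker.
type marker struct {
	mtx    sync.Mutex
	groups map[string]*groupStatus
}

func newMarker() *marker {
	return &marker{groups: map[string]*groupStatus{}}
}

// SetMuted overwrites the time interval names recorded for the group.
// Passing nil or an empty slice clears the muted state.
func (m *marker) SetMuted(routeID, groupKey string, timeIntervalNames []string) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	status, ok := m.groups[routeID+groupKey]
	if !ok {
		status = &groupStatus{}
		m.groups[routeID+groupKey] = status
	}
	status.mutedBy = timeIntervalNames
}

// Muted reports whether the group is muted and by which time intervals.
func (m *marker) Muted(routeID, groupKey string) ([]string, bool) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	status, ok := m.groups[routeID+groupKey]
	if !ok {
		return nil, false
	}
	return status.mutedBy, len(status.mutedBy) > 0
}

func main() {
	m := newMarker()
	m.SetMuted("route1", "group1", []string{"weekends"})
	names, muted := m.Muted("route1", "group1")
	fmt.Println(names, muted)

	m.SetMuted("route1", "group1", nil) // no names: the group is no longer muted
	names, muted = m.Muted("route1", "group1")
	fmt.Println(names, muted)
}
```

Plain concatenation keeps the change small while `routeID` is temporary. If the composite key were to stay, a struct key with separate fields would rule out any ambiguity between the two parts.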
This commit adds a new GroupMarker interface that marks the status of groups. For example, whether an alert is muted because of one or more active or mute time intervals. It renames the existing Marker interface to AlertMarker to avoid confusion. Signed-off-by: George Robinson <george.robinson@grafana.com>
This commit changes memMarker to MemMarker as it now implements both the AlertMarker and GroupMarker interfaces. We can return *memMarker, but it causes lint to fail. Signed-off-by: George Robinson <george.robinson@grafana.com>
This commit fixes a bug in SetMuted where a marker could not be removed. The method now works as documented in the interface. Signed-off-by: George Robinson <george.robinson@grafana.com>
I realized that since active and mute timings are applied to the whole group rather than to individual alerts within a group, we can remove fingerprints from the GroupMarker interface. This will make the code much simpler and also reduce the amount of data that needs to be tracked. Signed-off-by: George Robinson <george.robinson@grafana.com>
0d47e10 to 4a2781c
// Muted returns true if the group is muted, otherwise false. If the group
// is muted then it also returns the names of the time intervals that muted
// it.
Muted(routeID, groupKey string) ([]string, bool)
I added `routeID` to prevent non-unique group keys from causing groups to be incorrectly marked as muted. See #3817 for more information.
Overall looks good, but please see my comment.
// groupStatus stores the state of the group, and, as applicable, the names
// of all active and mute time intervals that are muting it.
This claims to store the state of the group, but its only attribute is `mutedBy`. Is this correct?
Yes! I copied it from here:
// AlertStatus stores the state of an alert and, as applicable, the IDs of
// silences silencing the alert and of other alerts inhibiting the alert.
	status = &groupStatus{}
	m.groups[routeID+groupKey] = status
}
status.mutedBy = timeIntervalNames
Don't we need to aggregate all the time-intervals that this has been muted by?
We don't. The reason is that the `Mutes` method we just merged returns them all:
alertmanager/timeinterval/timeinterval.go
Lines 37 to 53 in dacbf00
func (i *Intervener) Mutes(names []string, now time.Time) (bool, []string, error) {
	var in []string
	for _, name := range names {
		interval, ok := i.intervals[name]
		if !ok {
			return false, nil, fmt.Errorf("time interval %s doesn't exist in config", name)
		}
		for _, ti := range interval {
			if ti.ContainsTime(now.UTC()) {
				in = append(in, name)
			}
		}
	}
	return len(in) > 0, in, nil
}
func NewMarker(r prometheus.Registerer) *MemMarker {
	m := &MemMarker{
		alerts: map[model.Fingerprint]*AlertStatus{},
		groups: map[string]*groupStatus{},
How is this garbage collected? As in, how does the marker know that it no longer needs to store the status of an alert?
See #3793 (comment).
Thanks.
LGTM
Thanks for clarifying my doubts!
* Add GroupMarker interface. This commit adds a new GroupMarker interface that marks the status of groups. For example, whether an alert is muted because of one or more active or mute time intervals. It renames the existing Marker interface to AlertMarker to avoid confusion. Signed-off-by: George Robinson <george.robinson@grafana.com>
* Release v0.28.0-rc.0 * [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. 
This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 --------- Signed-off-by: SuperQ <superq@gmail.com> Signed-off-by: gotjosh <josue.abreu@gmail.com> Co-authored-by: gotjosh <josue.abreu@gmail.com>
* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [CHANGE] Adopt log/slog, drop go-kit/log #4089 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. 
#3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [ENHANCEMENT] Add timeout option for webhook notifier. #4137 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 * [BUGFIX] Fix wechat api link #4084 * [BUGFIX] Fix build info metric #4166 Signed-off-by: SuperQ <superq@gmail.com>
This commit adds a new GroupMarker interface that marks the status of groups. For example, whether a group is muted because of one or more active or mute time intervals. It renames the existing Marker interface to AlertMarker to avoid confusion.
It is based on #3791.