-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bugfix: only increment Silences version if a silence is added #3961
Conversation
11408f7
to
680057c
Compare
Can you share some numbers so we can get a sense of how big an improvement this is? |
Just to learn more about how you are using Alertmanager, as it helps us improve the software, it sounds like from what you've written you do a lot of frequent create/updates to silences, is that correct? There is nothing wrong with that, it's more just an observation as you are the first user I've seen that has reported performance issues due to high churn rate of silences. |
Yep, that's correct. We have something similar to https://github.com/prymitive/kthxbye that updates silences using data from external sources. I imagine that anyone using kthxbye or something like it has a similar amount of churn, but I couldn't guess how common that is. I suspect that a part of why we're hitting these performance problems is that we're running at (what I think is) a larger scale than the average Alertmanager deployment. For example, we have tens of thousands of alerts and tens of thousands of silences. For what it's worth, I believe that this bugfix (and the other improvements we've made to the silence query logic) are beneficial to everyone even if they have the most impact for users who have similar workloads to us. |
Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>
680057c
to
eb7b9f8
Compare
I read through both the silences and marker code again, and I think this change is safe. At least – I couldn't think of any situations where re-using the version would break existing behavior. @beorn7 as the original author of this code, is it possible I've missed something? |
This is so far ago that I don't remember any details like this. Let's go with it and see if it works. 🦆 |
Is there anything else I can provide here to help this one move forward? |
I think the ball is in the court of the maintainers. @gotjosh @simonpasquier could you make a call here if you want to merge this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* Release v0.28.0-rc.0 * [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 --------- Signed-off-by: SuperQ <superq@gmail.com> Signed-off-by: gotjosh <josue.abreu@gmail.com> Co-authored-by: gotjosh <josue.abreu@gmail.com>
* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879 * [CHANGE] Adopt log/slog, drop go-kit/log #4089 * [FEATURE] Add a new Microsoft Teams integration based on Flows #4024 * [FEATURE] Add a new Rocket.Chat integration #3600 * [FEATURE] Add a new Jira integration #3590 #3931 * [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895 * [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837 * [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877 * [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792 * [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007 * [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961 * [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062 * [ENHANCEMENT] Build using go 1.23 #4071 * [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732 * [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801 * [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638 * [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863 * [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812 * [ENHANCEMENT] Latency metrics now support native histograms. #3737 * [ENHANCEMENT] Add timeout option for webhook notifier. #4137 * [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006 * [BUGFIX] The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027 * [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930 * [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887 * [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648 * [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826 * [BUGFIX] Fix version in APIv1 deprecation notice. #3815 * [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800 * [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803 * [BUGFIX] Fix deadlock on the alerts memory store. #3715 * [BUGFIX] Fix `amtool template render` when using the default values. #3725 * [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745 * [BUGFIX] Fix wechat api link #4084 * [BUGFIX] Fix build info metric #4166 Signed-off-by: SuperQ <superq@gmail.com>
This is a small change which adds a second return value to
state.merge
to differentiate between cases where a silence is updated in place and where a silence is added. This is used by callers to decide if theSilences.version
ought to be incremented. It fixes a bug where theSilences.Version()
returns a incremented value for operations which don't need to increment the value. In our environment this change results in a significant performance improvement for production workloads because many fewerQMatches
queries are issued bySilencer.Mutes
.I believe this is a just a bug for three reasons:
Silences
reads "Increments whenever silences are added" (and other operations that mutate the silence state, such as the GC, do not increment the version) (https://github.com/prometheus/alertmanager/blob/main/silence/silence.go#L200)pb.Silence
could've changed even if theSilences.Version()
hasn't changed. For example, inSilencer.Mutes
, we issue aQIDs
query to recheck every silence even if the marker version matches theSilences
version (https://github.com/prometheus/alertmanager/blob/main/silence/silence.go#L139-L142).Silences
object, so no future querier can ever assume that an unchanged version means that the results of a query will not change. Instead, they can only assume that no new silences have been added. This extends further than just the Silence's state since theSilences
API does not guarantee that an expired silence will be retained between calls toQuery
.All existing tests pass without modification after this change. I believe this exercises some querying logic which further indicates that there's no change to guarantees of the
Silences
API.Additionally, I've added conditions to
TestSielnceSet
which check the result ofSilences.Version()
before and after each operation. I could not find any existing tests which validate the behavior ofSilences.Version()
, but this test is the place where it makes the most sense to me. I've limited the guarentee from the API to just "the version changes if and only if a silence with a new ID is added." The test does not enforce that the version increases or changes by only one. The modified test fails without this change, but passes with it:We've also been running this patch in our environment for a while without any issues.