Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: only increment Silences version if a silence is added #3961

Merged
merged 1 commit into from
Oct 23, 2024

Conversation

Spaceman1701
Copy link
Contributor

@Spaceman1701 Spaceman1701 commented Aug 16, 2024

This is a small change which adds a second return value to state.merge to differentiate between cases where a silence is updated in place and where a silence is added. This is used by callers to decide if the Silences.version ought to be incremented. It fixes a bug where the Silences.Version() returns a incremented value for operations which don't need to increment the value. In our environment this change results in a significant performance improvement for production workloads because many fewer QMatches queries are issued by Silencer.Mutes.

I believe this is a just a bug for three reasons:

  1. Silences reads "Increments whenever silences are added" (and other operations that mutate the silence state, such as the GC, do not increment the version) (https://github.com/prometheus/alertmanager/blob/main/silence/silence.go#L200)
  2. All application code assumes that the underlying data in a pb.Silence could've changed even if the Silences.Version() hasn't changed. For example, in Silencer.Mutes, we issue a QIDs query to recheck every silence even if the marker version matches the Silences version (https://github.com/prometheus/alertmanager/blob/main/silence/silence.go#L139-L142).
  3. Logically, an individual Silence's state can change without any modification to the Silences object, so no future querier can ever assume that an unchanged version means that the results of a query will not change. Instead, they can only assume that no new silences have been added. This extends further than just the Silence's state since the Silences API does not guarantee that an expired silence will be retained between calls to Query.

All existing tests pass without modification after this change. I believe this exercises some querying logic which further indicates that there's no change to guarantees of the Silences API.

Additionally, I've added conditions to TestSielnceSet which check the result of Silences.Version() before and after each operation. I could not find any existing tests which validate the behavior of Silences.Version(), but this test is the place where it makes the most sense to me. I've limited the guarentee from the API to just "the version changes if and only if a silence with a new ID is added." The test does not enforce that the version increases or changes by only one. The modified test fails without this change, but passes with it:

go test ./silence/
--- FAIL: TestSilenceSet (0.00s)
    silence_test.go:413:
        	Error Trace:	/home/ehunter/nfs-de/oss/forks/alertmanager/silence/silence_test.go:413
        	Error:      	Not equal:
        	            	expected: 3
        	            	actual  : 2
        	Test:       	TestSilenceSet
FAIL
FAIL	github.com/prometheus/alertmanager/silence	0.030s
FAIL

We've also been running this patch in our environment for a while without any issues.

@grobinson-grafana
Copy link
Contributor

In our environment this change results in a significant performance improvement for production workloads because many fewer QMatches queries are issued by Silencer.Mutes.

Can you share some numbers so we can get a sense of how big an improvement this is?

@Spaceman1701
Copy link
Contributor Author

Can you share some numbers so we can get a sense of how big an improvement this is?

Yes, I've attached some graphs which show the performance improvements we've measured from this change. Unfortunately, I'm not able to include the time axis labels, but these graphs all show a period of about a week. The line I've added to mark the deployment is approximate - we run Alertmanger in an HA cluster and not all nodes in are deployed simultaneously. The workload stayed very consistent over this period and we are confident that almost all of the performance changes shown are a result of this patch.

This change had the most impact on our P95 observations. As an example 95th percentile GET /alerts latency would frequently spike up to 5s-10s pre-patch. After the patch, P95 latency was more stable around 1s with occasional spikes between 2s-5s. I think this is because lots of simultaneous silence updates and queries cause contention around the Silences lock. The improvements from this patch are big, but limited to specific workloads which have frequent in-place silence updates. We actually have some more patches that I'm working to get untangled which provide more general silence query performance improvements, but this bugfix is a per-requisite for those.

95thPercentilePostAlerts 95thPercentileGetAlerts 95thPercentileSilencesQueryDuration AveragePostAlertsDuration

@grobinson-grafana
Copy link
Contributor

frequent in-place silence updates. We actually have some more patches that I'm working to get untangled which provide more general silence query performance improvements, but this bugfix is a per-requisite for those.

Just to learn more about how you are using Alertmanager, as it helps us improve the software, it sounds like from what you've written you do a lot of frequent create/updates to silences, is that correct?

There is nothing wrong with that, it's more just an observation as you are the first user I've seen that has reported performance issues due to high churn rate of silences.

@Spaceman1701
Copy link
Contributor Author

Just to learn more about how you are using Alertmanager, as it helps us improve the software, it sounds like from what you've written you do a lot of frequent create/updates to silences, is that correct?

There is nothing wrong with that, it's more just an observation as you are the first user I've seen that has reported performance issues due to high churn rate of silences.

Yep, that's correct. We have something similar to https://github.com/prymitive/kthxbye that updates silences using data from external sources. I imagine that anyone using kthxbye or something like it has a similar amount of churn, but I couldn't guess how common that is. I suspect that a part of why we're hitting these performance problems is that we're running at (what I think is) a larger scale than the average Alertmanager deployment. For example, we have tens of thousands of alerts and tens of thousands of silences.

For what it's worth, I believe that this bugfix (and the other improvements we've made to the silence query logic) are beneficial to everyone even if they have the most impact for users who have similar workloads to us.

Signed-off-by: Ethan Hunter <ehunter@hudson-trading.com>
@grobinson-grafana
Copy link
Contributor

I read through both the silences and marker code again, and I think this change is safe. At least – I couldn't think of any situations where re-using the version would break existing behavior. @beorn7 as the original author of this code, is it possible I've missed something?

@beorn7
Copy link
Member

beorn7 commented Aug 27, 2024

This is so far ago that I don't remember any details like this. Let's go with it and see if it works. 🦆

@Spaceman1701
Copy link
Contributor Author

Is there anything else I can provide here to help this one move forward?

@beorn7
Copy link
Member

beorn7 commented Sep 10, 2024

I think the ball is in the court of the maintainers. @gotjosh @simonpasquier could you make a call here if you want to merge this?

Copy link
Member

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gotjosh gotjosh merged commit 69fe3f8 into prometheus:main Oct 23, 2024
11 checks passed
gotjosh added a commit that referenced this pull request Oct 24, 2024
* Release v0.28.0-rc.0

* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879
* [FEATURE] Add a new Microsoft Teams integration based on Flows #4024
* [FEATURE] Add a new Rocket.Chat integration #3600
* [FEATURE] Add a new Jira integration #3590 #3931
* [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895
* [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837
* [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877
* [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792
* [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007
* [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961
* [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062
* [ENHANCEMENT] Build using go 1.23 #4071
* [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732
* [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801
* [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638
* [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863
* [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812
* [ENHANCEMENT] Latency metrics now support native histograms. #3737
* [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006
* [BUGFIX]  The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027
* [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930
* [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887
* [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648
* [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826
* [BUGFIX] Fix version in APIv1 deprecation notice. #3815
* [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800
* [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803
* [BUGFIX] Fix deadlock on the alerts memory store. #3715
* [BUGFIX] Fix `amtool template render` when using the default values. #3725
* [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745

---------

Signed-off-by: SuperQ <superq@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Co-authored-by: gotjosh <josue.abreu@gmail.com>
SuperQ added a commit that referenced this pull request Dec 19, 2024
* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879
* [CHANGE] Adopt log/slog, drop go-kit/log #4089
* [FEATURE] Add a new Microsoft Teams integration based on Flows #4024
* [FEATURE] Add a new Rocket.Chat integration #3600
* [FEATURE] Add a new Jira integration #3590 #3931
* [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895
* [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837
* [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877
* [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792
* [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007
* [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961
* [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062
* [ENHANCEMENT] Build using go 1.23 #4071
* [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732
* [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801
* [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638
* [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863
* [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812
* [ENHANCEMENT] Latency metrics now support native histograms. #3737
* [ENHANCEMENT] Add timeout option for webhook notifier. #4137
* [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006
* [BUGFIX]  The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027
* [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930
* [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887
* [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648
* [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826
* [BUGFIX] Fix version in APIv1 deprecation notice. #3815
* [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800
* [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803
* [BUGFIX] Fix deadlock on the alerts memory store. #3715
* [BUGFIX] Fix `amtool template render` when using the default values. #3725
* [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745
* [BUGFIX] Fix wechat api link #4084
* [BUGFIX] Fix build info metric #4166

Signed-off-by: SuperQ <superq@gmail.com>
@SuperQ SuperQ mentioned this pull request Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants