WIP: bindata/alerts/slo: improve burnrate calculation #1744
The problem that I recently noticed with the existing expression is that when we compute the overall burnrate from write and read requests, we take the ratio for read requests and add it to the ratio for write requests. But both of these ratios are calculated against the number of requests of their own type, not the total number of requests. This is only correct when the proportion of write and read requests is equal.
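To illustrate the difference, here is a minimal Python sketch. It assumes each per-verb burn rate is the number of failed requests of that verb divided by the total requests of that verb. The function names and the request counts are hypothetical and are not taken from the recording rules.

```python
from fractions import Fraction


def summed_ratio(failed_write, total_write, failed_read, total_read):
    """Add the per-verb error ratios, each measured against its own verb's total."""
    return Fraction(failed_write, total_write) + Fraction(failed_read, total_read)


def overall_ratio(failed_write, total_write, failed_read, total_read):
    """Measure all failed requests against the total number of requests."""
    return Fraction(failed_write + failed_read, total_write + total_read)


# Hypothetical counts: 100 writes (10 failed) and 900 reads (9 failed).
counts = (10, 100, 9, 900)
print(summed_ratio(*counts))   # 11/100
print(overall_ratio(*counts))  # 19/1000
```

With these counts, adding the per-verb ratios reports an error ratio several times larger than the one measured against all requests.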
For example, let's imagine a scenario with 10 requests, of which 4 (40%) are writes and 6 are reads. During a disruption, 2 of the writes fail (50% success) and 1 of the reads fails (about 83% success).

`apiserver_request:burnrate1h{verb="write"}` would be equal to `2/4` and `apiserver_request:burnrate1h{verb="read"}` would be `1/6`. The sum of these, as computed by the alert today, would be equal to `2/4 + 1/6 = 2/3`, when in reality the ratio of failed requests should be `2/10 + 1/10 = 3/10`. So there is quite a large difference today because we don't account for the total number of requests.

The only problem we will face with this change is that we won't be able to use the recording rules to set up different SLOs depending on the type of request.
But this could always be addressed by changing the burn rate alert expression to the following instead of modifying the recording rules: