
slo: include all request cancellations within SLO #4355

Open

wants to merge 5 commits into main
Conversation

@electron0zero (Member) commented Nov 20, 2024

What this PR does:

We count all request cancellations outside SLO, but that's not right because request cancellations are done by users and should fall within SLO.

This PR changes how request cancellations are counted, so that they are now correctly counted within SLO.

I also made some changes to send back a codes.Canceled gRPC error from collectors.
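To illustrate the counting rule this PR aims for, here is a minimal, hypothetical sketch (the function name `withinSLO` and its signature are illustrative, not Tempo's actual code): a user-initiated cancellation (`context.Canceled`) counts within SLO, while other errors or over-budget durations fall outside it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// withinSLO sketches the proposed rule: cancellations count within SLO,
// otherwise a request must succeed within the SLO duration budget.
func withinSLO(err error, elapsed, slo time.Duration) bool {
	if errors.Is(err, context.Canceled) {
		return true // user-initiated cancellation: count within SLO
	}
	return err == nil && elapsed <= slo
}

func main() {
	slo := 100 * time.Millisecond
	fmt.Println(withinSLO(context.Canceled, 5*time.Second, slo)) // true: canceled by user
	fmt.Println(withinSLO(nil, 50*time.Millisecond, slo))        // true: fast success
	fmt.Println(withinSLO(nil, 2*time.Second, slo))              // false: too slow
}
```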

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@mdisibio (Contributor)

Tested this out locally and I think there are still edge cases not handled.

(1) What works: if you cancel quickly, within 1-2s, it works. The query is counted in both _total and _within_slo_total, and I confirmed the new logic catches it.

(2) What doesn't work: if you cancel further into the query, at ~4s, the query continues running and eventually returns in the frontend. But err is nil, so it is counted in _total and not necessarily in _within_slo_total; it still has a chance to be considered outside of SLO.

Thoughts
(1) Here is a debugger screenshot from case (2): even though err == nil, req.Context().Err() == context.Canceled. Maybe we could catch this case by checking the request context here, in addition to err. Or maybe it belongs somewhere else, within the roundtripper, and bubbled up.

[debugger screenshot: err == nil while req.Context().Err() == context.Canceled]

(2) In the case above the breakpoint is hit long after pressing the cancel button, so it seems to be the difference between canceling while the frontend is still farming out jobs versus canceling after all jobs have already been issued. This means the querier may also be missing a context check.
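The "late cancel" check described above could be sketched as follows. This is a hypothetical illustration, not Tempo's code: the function name `canceledByUser` is made up, and the key idea is checking `req.Context().Err()` in addition to the returned error, so that a handler returning err == nil after a cancel is still recognized as canceled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// canceledByUser reports whether a request was canceled, checking the
// request context in addition to the returned error. This covers the
// late-cancel case where the handler returns err == nil even though the
// request context was already canceled.
func canceledByUser(req *http.Request, err error) bool {
	if errors.Is(err, context.Canceled) {
		return true
	}
	return errors.Is(req.Context().Err(), context.Canceled)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://example.com", nil)
	cancel() // simulate the user pressing cancel mid-query

	// handler "succeeded" (err == nil) despite the earlier cancel
	fmt.Println(canceledByUser(req, nil)) // true
}
```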

Test setup
To test this I am using metrics.duration_slo=100ms and a slow query: {} | rate() by (resource.service.name, name, span:id), which takes about 9s, so there is time to test canceling at different points.

@electron0zero (Member, Author)

For case 2, I think the handlers should pass the correct error to postSLOHook: they should surface the error from the request context, so that postSLOHook doesn't need to be aware of the request itself.

Let me update the handlers to handle these late cancellations.

@mdisibio (Contributor)

@electron0zero Adding some more thoughts about the default behavior of counting canceled requests within SLO: I think we could also support an operator that wants to exclude them from SLO by adding an additional label that indicates if the request was completed or canceled like {result="completed|canceled"}. Then the operator can exclude canceled requests from SLO calculations, or leave them in. The default behavior sum(rate(_within_slo))/sum(rate(_total)) would leave them in, which fits our current direction.
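The proposed label could be derived from the error in one place. A minimal sketch, assuming the hook sees the final error; the function name `resultLabel` and the label values are the proposal's, not existing Tempo metrics:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// resultLabel maps a request's final error to the proposed metric label:
// "canceled" for user-canceled requests, "completed" otherwise. Operators
// could then exclude canceled requests with a selector such as
// {result="completed"}, or leave them in for the default SLO calculation.
func resultLabel(err error) string {
	if errors.Is(err, context.Canceled) {
		return "canceled"
	}
	return "completed"
}

func main() {
	fmt.Println(resultLabel(context.Canceled)) // canceled
	fmt.Println(resultLabel(nil))              // completed
}
```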

An additional benefit is that this would give us metrics on canceled requests in general, which I don't think we have today. Could be useful.

I think this should be feasible and localized to the postSLOHook method, where it is checking for cancellations already. But let me know if there are considerations I missed.
