Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix request cancellation when frontend worker is busy #740

Closed
colega opened this issue Jan 13, 2022 · 0 comments · Fixed by #741
Closed

Fix request cancellation when frontend worker is busy #740

colega opened this issue Jan 13, 2022 · 0 comments · Fixed by #741
Assignees

Comments

@colega
Copy link
Contributor

colega commented Jan 13, 2022

Describe the bug

When we try to cancel an enqueued request:

select {
case <-ctx.Done():
if cancelCh != nil {
select {
case cancelCh <- freq.queryID:
// cancellation sent.
default:
// failed to cancel, ignore.
}
}

We have a default clause which we will follow if worker is busy sending or cancelling another request:

http://github.com/grafana/mimir/blob/e2e2e100ec14e7707d1733908e61c83c8b87ef40/pkg/frontend/v2/frontend_scheduler_worker.go#L285-L286

So sometimes the requests are processed in the queriers even when the client has canceled the request.

This is the same as fixed in loki in grafana/loki#5113

We should apply a similar fix.

To Reproduce

See test in grafana/loki#5113

Expected behavior

All upstream requests should be canceled when downstream cancels the originating request.

@colega colega self-assigned this Jan 13, 2022
colega added a commit that referenced this issue Jan 13, 2022
With previous implementation, if worker was busy talking to scheduler,
we didn't push the cancellation, keeping that query running.

When cancelling a query, all its subqueries are cancelled at the same
time, so this was most likely happening all the time (first subquery
scheduled on this worker was canceled, the rest were not because worker
was busy cancelling the first one).

Also removed the `<-ctx.Done()` escape point when waiting for the
enqueueing ACK and modified the enqueueing method to ensure that it
always responds something.

Fixes: #740
Inspired by: grafana/loki#5113

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
pstibrany added a commit that referenced this issue Jan 14, 2022
* Increase scheduler worker cancellation chan cap

With previous implementation, if worker was busy talking to scheduler,
we didn't push the cancellation, keeping that query running.

When cancelling a query, all its subqueries are cancelled at the same
time, so this was most likely happening all the time (first subquery
scheduled on this worker was canceled, the rest were not because worker
was busy cancelling the first one).

Also removed the `<-ctx.Done()` escape point when waiting for the
enqueueing ACK and modified the enqueueing method to ensure that it
always responds something.

Fixes: #740
Inspired by: grafana/loki#5113

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Update CHANGELOG.md

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Remove comment about chan memory usage

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>

* Update test comment

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>

* Add resp.Error to the log when response is unknown

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

* Log the entire uknown response

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>

Co-authored-by: Peter Štibraný <peter.stibrany@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant