[bugfix] scheduler: Gracefully shutdown querier when using query-scheduler #7735
Conversation
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
This is great work, but I need to understand the synchronization of inflightQuery a little better. How do you feel about using a WaitGroup instead? That would also avoid the busy loop, and a mutex would be easier to reason about as well.
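For reference, here is a minimal, self-contained sketch of what the suggested WaitGroup approach could look like (hypothetical, not code from this PR; handleQuery is a placeholder for executing a query received from the scheduler):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// handleQuery stands in for executing a query received from the scheduler
// (hypothetical placeholder, not Loki's actual handler).
func handleQuery(id int) {
	time.Sleep(50 * time.Millisecond)
	fmt.Println("finished query", id)
}

func main() {
	var inflight sync.WaitGroup

	// Querier loop: register each query before handling it.
	for id := 0; id < 3; id++ {
		inflight.Add(1)
		go func(id int) {
			defer inflight.Done()
			handleQuery(id)
		}(id)
	}

	// Shutdown path: instead of polling an atomic flag in a busy loop,
	// block until every inflight query has completed.
	inflight.Wait()
	fmt.Println("no inflight queries, querier can shut down")
}
```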
sp.metrics.inflightRequests.Dec()
Where did this go?
Hi, this was my mistake. I simply copied Mimir's scheduler code into Loki and overlooked the inflightRequests metric. I have submitted a commit which fixes this issue.
case <-workerCtx.Done():
	level.Debug(logger).Log("msg", "querier worker context has been canceled, waiting until there's no inflight query")
	for inflightQuery.Load() {
What happens when the query is never processed? Also, isn't there a potential race condition between testing the flag and setting it in the querier loop? It could be false here but then the next query is received.
https://github.com/grafana/mimir/blob/main/pkg/querier/worker/util.go
This util.go is copied verbatim from Mimir. I will deploy this PR to my Loki cluster and run it for a while to verify whether it introduces any unexpected race conditions.
Hm. We somehow need to document this. I'll try to find the original author.
> Also, isn't there a potential race condition between testing the flag and setting it in the querier loop? It could be false here but then the next query is received.

When the querier shuts down it's expected to cancel the context, so the call to request, err := c.Recv() (done in schedulerProcessor.querierLoop()) returns an error because of the canceled context (I mean the querier context, not the query execution context).
Is there a race? Yes, there's a race between the call to c.Recv() and the subsequent call to inflightQuery.Store(true), but the time window is very short and we ignored it in Mimir (all in all we want to gracefully handle 99.9% of cases).
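To make the window described above concrete, here is a simplified, self-contained sketch of the pattern (not a verbatim copy of util.go; recvRequest stands in for c.Recv() on the scheduler stream, and the standard library's atomic.Bool is used here instead of go.uber.org/atomic to keep the example dependency-free):

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// recvRequest stands in for c.Recv(): it blocks until a request arrives or
// the querier worker context is canceled (hypothetical placeholder).
func recvRequest(ctx context.Context) (string, error) {
	select {
	case <-ctx.Done():
		return "", ctx.Err()
	case <-time.After(10 * time.Millisecond):
		return "query", nil
	}
}

func main() {
	var inflightQuery atomic.Bool
	workerCtx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})

	// Querier loop: once workerCtx is canceled, recvRequest returns an error
	// and the loop exits instead of picking up new work. The race window is
	// between a successful Recv and the Store(true) that follows it.
	go func() {
		defer close(done)
		for {
			req, err := recvRequest(workerCtx)
			if err != nil {
				return
			}
			inflightQuery.Store(true)
			fmt.Println("executing", req)
			time.Sleep(20 * time.Millisecond) // simulate query execution
			inflightQuery.Store(false)
		}
	}()

	time.Sleep(15 * time.Millisecond)
	cancel() // shutdown: cancel the querier worker context

	// Shutdown side: wait until there is no inflight query before tearing down.
	for inflightQuery.Load() {
		time.Sleep(5 * time.Millisecond)
	}
	<-done
	fmt.Println("no inflight query, safe to shut down")
}
```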
> What happens when the query is never processed?

Can you elaborate on this?
I was wondering if we can end up in a state where the query is inflight but we shut down. I guess it times out.
I think that race condition still exists (I found it very hard to guarantee it never happens), but in practice it should be very unlikely.
@liguozhong would you mind adding a small comment summarizing Marco's answer?
Co-authored-by: Karsten Jeschkies <k@jeschkies.xyz>
… into scheduler-deadlock
Hi, thanks for your timely review. This PR is really important to me; I've been trying to fix #7722 for 18 days. I prefer to keep the current code, which keeps Loki and Mimir using the same scheduler code, so even if there is a problem with it, it can be fixed together with the Mimir community.
@jeschkies can you take another pass at this please?
Good news: I deployed this PR to my Loki cluster and it fixes #7722. The recording rule has been running stably for 1 day now; this PR seems to be working.
Thanks for your hard work and patience. Could you add a comment on the possible race condition?
done
What this PR does / why we need it:
Gracefully shutdown querier when using query-scheduler
This PR is an attempt to fix a bug where my Loki cluster becomes unavailable for LogQL queries. The source code and ideas come from Mimir (part of the LGTM stack). Thanks to Mimir and the author of the original PR, @pracucci.
Which issue(s) this PR fixes:
Fixes #7722
Special notes for your reviewer:
Mimir PRs:
grafana/mimir#1756
grafana/mimir#1767
Checklist
- Reviewed the CONTRIBUTING.md guide
- CHANGELOG.md updated
- docs/sources/upgrading/_index.md updated