
[bugfix] scheduler: Gracefully shutdown querier when using query-scheduler #7735

Merged
7 commits merged into grafana:main on Dec 1, 2022

Conversation

@liguozhong (Contributor)

What this PR does / why we need it:
Gracefully shut down the querier when using the query-scheduler.
This PR is an attempt to fix a bug that made my Loki cluster unavailable for LogQL queries. The source code and ideas come from Mimir (part of the LGTM stack). Thanks to the Mimir project and PR author @pracucci.

Which issue(s) this PR fixes:
Fixes #7722

Special notes for your reviewer:
Related Mimir PRs:
grafana/mimir#1756
grafana/mimir#1767

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@liguozhong liguozhong requested a review from a team as a code owner November 21, 2022 12:58
@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
- querier/queryrange	-0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@jeschkies (Contributor) left a comment

This is great work, but I need to understand the synchronization of inflightQuery a little better.

How do you feel about using a WaitGroup instead? That would also avoid the busy loop. A mutex would be easier to reason about as well.
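
For readers following along, here is a minimal, self-contained sketch of the two approaches being compared; inflightQuery and workerCtx mirror names from the diff, while the handler, the sleep interval, and the use of the standard library's atomic.Bool are illustrative rather than the PR's actual code:

    package main

    import (
        "context"
        "sync"
        "sync/atomic"
        "time"
    )

    // busyWaitShutdown is roughly the shape used in this PR: once the worker
    // context is canceled, poll an atomic flag until the inflight query (if
    // any) has finished.
    func busyWaitShutdown(workerCtx context.Context, inflightQuery *atomic.Bool) {
        <-workerCtx.Done()
        for inflightQuery.Load() {
            time.Sleep(100 * time.Millisecond)
        }
    }

    // waitGroupShutdown is the alternative suggested in the review: the
    // querier loop calls wg.Add(1) before handling a request and wg.Done()
    // when it finishes, so shutdown can block without polling.
    func waitGroupShutdown(workerCtx context.Context, wg *sync.WaitGroup) {
        <-workerCtx.Done()
        wg.Wait()
    }

    func main() {
        ctx, cancel := context.WithCancel(context.Background())

        var wg sync.WaitGroup
        wg.Add(1)
        go func() {
            defer wg.Done()
            time.Sleep(200 * time.Millisecond) // simulate an inflight query
        }()

        cancel()
        waitGroupShutdown(ctx, &wg) // blocks until the simulated query ends

        var inflight atomic.Bool
        busyWaitShutdown(ctx, &inflight) // returns immediately: no inflight query
    }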

pkg/querier/worker/scheduler_processor.go (outdated review thread, resolved)

sp.metrics.inflightRequests.Dec()
Contributor

Where did this go?

Contributor Author

Hi, this was my mistake: I simply copied Mimir's scheduler code into Loki and overlooked the inflightRequests metric.
I have pushed a commit that fixes this.
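
For reference, a minimal sketch (not the actual commit) of the usual pattern for keeping such a gauge balanced around request handling; the helper name and the bare prometheus.Gauge parameter are illustrative stand-ins for the sp.metrics.inflightRequests field in the diff:

    package example

    import "github.com/prometheus/client_golang/prometheus"

    // handleWithInflightGauge increments the gauge before handling a request
    // and decrements it on every exit path, so the inflight-requests metric
    // stays accurate during graceful shutdown.
    func handleWithInflightGauge(inflightRequests prometheus.Gauge, handle func()) {
        inflightRequests.Inc()
        defer inflightRequests.Dec()
        handle()
    }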

    case <-workerCtx.Done():
        level.Debug(logger).Log("msg", "querier worker context has been canceled, waiting until there's no inflight query")

        for inflightQuery.Load() {
Contributor

What happens when the query is never processed? Also, isn't there a potential race condition between testing the flag and setting it in the querier loop? It could be false here but then the next query is received.

Contributor Author

https://github.com/grafana/mimir/blob/main/pkg/querier/worker/util.go

This util.go is copied verbatim from Mimir. I will deploy this PR to my Loki cluster and run it for a while to verify that it does not introduce unexpected race conditions.

Contributor

Hm. We somehow need to document this. I'll try to find the original author.

Contributor

Also, isn't there a potential race condition between testing the flag and setting it in the querier loop? It could be false here but then the next query is received.

When the querier shuts down, it's expected to cancel the context, so the call to request, err := c.Recv() (done in schedulerProcessor.querierLoop()) returns an error because of the canceled context (I mean the querier context, not the query execution context).

Is there a race? Yes, there's a race between the call to c.Recv() and the subsequent call to inflightQuery.Store(true), but the time window is very short and we chose to ignore it in Mimir (all in all, we want to gracefully handle 99.9% of cases).
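
To make the window concrete, here is a simplified sketch of the loop shape described above; the stream and request types are stand-ins, not the real Loki/Mimir definitions, and this is not the actual schedulerProcessor.querierLoop implementation:

    package example

    import (
        "context"
        "sync/atomic"
    )

    type request struct{ query string }

    // schedulerStream stands in for the gRPC stream from the scheduler;
    // Recv unblocks with an error once the querier (worker) context is canceled.
    type schedulerStream interface {
        Recv() (*request, error)
    }

    func querierLoop(ctx context.Context, c schedulerStream, inflightQuery *atomic.Bool, handle func(context.Context, *request)) error {
        for {
            req, err := c.Recv()
            if err != nil {
                // Normal shutdown path: the canceled querier context makes
                // Recv return an error and the loop exits.
                return err
            }

            // Race window: between Recv() returning and the Store(true) below,
            // the shutdown code could observe inflightQuery == false and stop
            // waiting even though a query was just received. The window is
            // very short and is accepted as a trade-off.
            inflightQuery.Store(true)
            handle(ctx, req)
            inflightQuery.Store(false)
        }
    }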

Contributor

What happens when the query is never processed?

Can you elaborate on this?

Contributor

I was wondering if we can end up in a state where the query is inflight but we shut down. I guess it times out.

Contributor

I think that race condition still exists (I found it very hard to guarantee that it never happens), but in practice it should be very unlikely.

Contributor

@liguozhong would you mind adding a small comment summarizing Marco's answer?

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@liguozhong (Contributor Author)

liguozhong commented Nov 23, 2022

How do you feel about using a WaitGroup instead? That would also avoid the busy loop. A mutex would be easier to reason about as well.

Hi, thanks for your timely review. This PR is really important to me; I've been trying to fix #7722 for 18 days.

I prefer to keep the current code so that Loki and Mimir share the same scheduler code; even if a problem turns up in it, it can be fixed together with the Mimir community.

@dannykopping (Contributor)

@jeschkies can you take another pass at this, please?

@liguozhong (Contributor Author)

Good news: I deployed this PR to my Loki cluster and it fixes #7722.

The recording rule has now been running stably for 1 day, so this PR appears to work.


@jeschkies (Contributor) left a comment

Thanks for your hard work and patience. Could you add a comment on the possible race condition?

    case <-workerCtx.Done():
        level.Debug(logger).Log("msg", "querier worker context has been canceled, waiting until there's no inflight query")

        for inflightQuery.Load() {
Contributor

@liguozhong would you mind adding a small comment summarizing Marco's answer?

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@liguozhong (Contributor Author)

Thanks for your hard work and patience. Could you add a comment on the possible race condition?

done

@grafanabot (Collaborator)

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@jeschkies merged commit 63a57c7 into grafana:main on Dec 1, 2022
Successfully merging this pull request may close these issues.

[deadlock] scheduler: if querier OOM restart
5 participants