Receiving failed to enqueue request 500s #8178
Comments
Hi, to add a bit of context: when upgrading from Mimir 2.10 to Mimir 2.12 we started to see increased latency and a small error rate on the read path. At the same time, we noticed the number of TCP connections to the query-scheduler went from stable to fluctuating up and down. This seems to have been caused by this change: https://github.com/grafana/mimir/pull/7269/files#diff-7fd5824797e825650064e35cfdea31cf25162114e24bc754f648de77cff4ff06L53
These flags were previously added as part of https://github.com/grafana/mimir/pull/3262/files. Looking at sample traces where requests ended with HTTP status code 500, it seems retries were exhausted before a new connection was established. Another example is queries that take roughly 1s to enqueue, succeeding but adding latency even to light queries. As mentioned on Slack in this thread https://grafana.slack.com/archives/C039863E8P7/p1715625953274669?thread_ts=1714333917.446309&cid=C039863E8P7, adding these args back to the query-scheduler seems to fix / minimize the issue.
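For reference, a minimal sketch of what re-adding those flags to the query-scheduler could look like on a Kubernetes Deployment. The flag names are Mimir's standard server gRPC keepalive options; the manifest layout and the value shown (2400000h, i.e. the 100,000 days discussed later in the thread) are assumptions, so adapt them to your own jsonnet / Helm values.

```yaml
# Hypothetical excerpt of a query-scheduler Deployment: re-adding the gRPC
# keepalive flags that #7269 stopped setting, so that server-side connection
# rotation is effectively disabled again. Values are illustrative.
spec:
  template:
    spec:
      containers:
        - name: query-scheduler
          args:
            - -target=query-scheduler
            - -config.file=/etc/mimir/mimir.yaml
            # ~100,000 days, i.e. effectively "never rotate" in practice
            - -server.grpc.keepalive.max-connection-age=2400000h
            - -server.grpc.keepalive.max-connection-age-grace=2400000h
            - -server.grpc.keepalive.max-connection-idle=2400000h
```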
We made the same change today; for now it seems to work, but I will update tomorrow / the day after on whether the issue truly went away for us too.
I can confirm now, after having the change deployed in production for a few days, that it fully fixed the issue for us.
This comment claims that the change in #7269 is effectively a no-op for the scheduler: it changes the connection age limit on the query-scheduler from 100,000 days to 106,751.9911673006 days (MaxInt64 nanoseconds). The usage of these two settings suggests that MaxInt64 should behave the same as 100K days. Can you try modifying only the grace period or only the connection age so we can narrow down the offending setting? Can you also share some of the error logs from the query-frontend when it starts returning 500s?
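To isolate the offending setting, the two overrides could be applied one at a time; a sketch, using the same hypothetical flag names and illustrative value as above:

```yaml
# Variant A: override only the connection age on the query-scheduler
args:
  - -server.grpc.keepalive.max-connection-age=2400000h
---
# Variant B: override only the grace period
args:
  - -server.grpc.keepalive.max-connection-age-grace=2400000h
```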
Hi, thank you for looking into this; I will test overriding only one of the settings today. Also, here are examples of the errors / warnings we would see in the logs when the issue is happening:
These errors look a lot like they are caused by regular connection terminations.
Can you find a pattern in the errors @jmichalek132 @j-sokol? I'm curious how often they occur for each query-scheduler (something like this LogQL query):
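For illustration, a query along these lines would count the errors per emitting pod; the label selectors (namespace, container) and the matched error string are assumptions about the deployment, and splitting it per query-scheduler would need the scheduler address extracted from the log line:

```logql
sum by (pod) (
  count_over_time(
    {namespace="mimir", container="query-frontend"} |= "failed to enqueue request" [5m]
  )
)
```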
So, an update: removing this setting from
but leaving in
and the issue doesn't seem to happen. Screenshot of the LogQL query. The thing to keep in mind with our setup is that we are not running the ruler in remote mode, so alerts don't hit the query pipeline. We also don't have that many users right now, so we get small sporadic rates of queries, and the errors happen when queries are executed. So the actual pattern might be different with a constant rate of queries.
We've also run into this issue when we upgraded to 2.12. For us, we saw it via notifications from Grafana-managed alerts, as those returned a 500.
Describe the bug
Raising this issue regarding #8067 (comment).
In Grafana we see "failed to enqueue request" errors; after a moment the query retries and succeeds.
Details taken from the browser's network inspector
REQ
RESP
Logs
To Reproduce
Steps to reproduce the behavior:
Expected behavior
500s should not happen.
Environment
Additional Context
Let me know if you need any additional details.