Receiving failed to enqueue request 500s #8178
Comments
Hi, to add a bit of context: when upgrading from Mimir 2.10 to Mimir 2.12 we started to see increased latency and a small error rate on the read path. At the same time, we noticed the number of TCP connections to the query-scheduler went from stable to fluctuating up and down. This seems to have been caused by this change: https://github.com/grafana/mimir/pull/7269/files#diff-7fd5824797e825650064e35cfdea31cf25162114e24bc754f648de77cff4ff06L53
These flags were previously added as part of https://github.com/grafana/mimir/pull/3262/files. Looking at sample traces where requests ended with HTTP status code 500, it seems retries were exhausted before a new connection was established. Another example is queries that take roughly 1s to enqueue, succeeding but adding latency even to light queries. As mentioned on Slack in this thread https://grafana.slack.com/archives/C039863E8P7/p1715625953274669?thread_ts=1714333917.446309&cid=C039863E8P7, adding these args back to the query-scheduler seems to fix / minimize the issue.
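For reference, a minimal sketch of what re-adding those flags to the query-scheduler could look like on a Kubernetes Deployment. The flag names are Mimir's standard server gRPC keepalive options; the manifest layout and the value shown (2400000h, i.e. the 100,000 days discussed later in the thread) are assumptions, so adapt them to your own jsonnet / Helm values.

```yaml
# Hypothetical excerpt of a query-scheduler Deployment: re-adding the gRPC
# keepalive flags that #7269 stopped setting, so that server-side connection
# rotation is effectively disabled again. Values are illustrative.
spec:
  template:
    spec:
      containers:
        - name: query-scheduler
          args:
            - -target=query-scheduler
            - -config.file=/etc/mimir/mimir.yaml
            # ~100,000 days, i.e. effectively "never rotate" in practice
            - -server.grpc.keepalive.max-connection-age=2400000h
            - -server.grpc.keepalive.max-connection-age-grace=2400000h
            - -server.grpc.keepalive.max-connection-idle=2400000h
```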
We made the same change today; for now it seems to work, but I will update tomorrow / the day after on whether the issue truly went away for us too.
I can confirm now, after having the change deployed in production for a few days, that it fully fixed the issue for us.
This comment claims that the change in #7269 is effectively a no-op for the scheduler: it changes the connection age limit on the query-scheduler from 100,000 days to 106,751.9911673006 days (MaxInt64 nanoseconds). The usage of these two settings suggests that MaxInt64 should behave the same as 100K days. Can you try modifying only the grace period or only the connection age so we can narrow down the offending setting? Can you also share some of the error logs from the query-frontend when it starts returning 500s?
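To isolate the offending setting, the two overrides could be applied one at a time; a sketch, using the same hypothetical flag names and illustrative value as above:

```yaml
# Variant A: override only the connection age on the query-scheduler
args:
  - -server.grpc.keepalive.max-connection-age=2400000h
---
# Variant B: override only the grace period
args:
  - -server.grpc.keepalive.max-connection-age-grace=2400000h
```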
Hi, thank you for looking into this; I will test overriding only one of the settings today. Also, here are examples of the errors / warnings we would see in the logs when the issue is happening:
These errors look a lot like they are caused by regular connection terminations.
Can you find a pattern in the errors @jmichalek132 @j-sokol? I'm curious how often they occur for each query-scheduler (something like this LogQL query):
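For illustration, a query along these lines would count the errors per emitting pod; the label selectors (namespace, container) and the matched error string are assumptions about the deployment, and splitting it per query-scheduler would need the scheduler address extracted from the log line:

```logql
sum by (pod) (
  count_over_time(
    {namespace="mimir", container="query-frontend"} |= "failed to enqueue request" [5m]
  )
)
```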
So, an update: removing this setting from
but leaving in
and the issue doesn't seem to happen. Screenshot of the LogQL query. The thing to keep in mind with our setup is that we are not running the ruler in remote mode, so alerts don't hit the query pipeline. We also don't have that many users right now, so we get small sporadic rates of queries, and the errors happen when queries are executed. So the actual pattern might be different with a constant rate of queries.
We've also run into this issue when we upgraded to 2.12. For us, we saw it via notifications from Grafana-managed alerts, as those returned a 500.
Describe the bug
Raising this issue regarding #8067 (comment).
In Grafana we see "failed to enqueue request" errors; after a moment the query retries and succeeds.
Details taken from the browser's network inspector
REQ
RESP
Logs
To Reproduce
Steps to reproduce the behavior:
Expected behavior
500s should not happen.
Environment
Additional Context
Let me know if you need any additional details.