Receiving failed to enqueue request 500s #8178

Open
j-sokol opened this issue May 24, 2024 · 8 comments

@j-sokol

j-sokol commented May 24, 2024

Describe the bug

Raising an issue regarding #8067 (comment).

In Grafana we see "failed to enqueue request" errors; after a moment the query retries and succeeds.

[screenshot: failed to enqueue request error shown in Grafana]

Details taken from the browser's network inspector:

REQ

GET /api/datasources/uid/mimir/resources/api/v1/label/cluster/values?match%5B%5D=cortex_build_info&start=1716465547&end=1716469147 HTTP/2

RESP

HTTP/2 500 Internal Server Error
date: Thu, 23 May 2024 12:59:06 GMT
content-type: text/plain; charset=utf-8
...
failed to enqueue request

Logs

ts=2024-04-28T18:53:10.959228333Z caller=logging.go:126 level=warn traceID=7d41f545d9deae29 msg="GET /prometheus/api/v1/label/scrape_job/values?start=1714330091&end=1714330391 (500) 1.149977ms Response: \"failed to enqueue request\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate; Accept-Language: en-US,en;q=0.9; Connection: close; User-Agent: Grafana/10.4.2; X-Forwarded-For: 10.226.5.1; X-Grafana-Org-Id: 1; X-Grafana-Referer: http://example.com:3000/d/OZ6xeJqVk/controller-metrics?orgId=1; X-Scope-Orgid: anonymous; "
ts=2024-04-28T18:53:11.069380654Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=anonymous level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-04-28T18:53:11.069471535Z caller=frontend_scheduler_worker.go:291 level=error msg="error sending requests to scheduler" err=EOF addr=10.226.4.53:9095
ts=2024-04-28T18:53:11.069546419Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=anonymous level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-04-28T18:53:11.069652937Z caller=frontend_scheduler_worker.go:291 level=error msg="error sending requests to scheduler" err=EOF addr=10.226.4.53:9095
ts=2024-04-28T18:53:11.069658194Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=anonymous level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-04-28T18:53:11.069707439Z caller=frontend_scheduler_worker.go:291 level=error msg="error sending requests to scheduler" err=EOF addr=10.226.6.52:9095
ts=2024-04-28T18:53:11.069750502Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=anonymous level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-04-28T18:53:11.069805732Z caller=frontend_scheduler_worker.go:291 level=error msg="error sending requests to scheduler" err=EOF addr=10.226.4.53:9095

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir distributed (mimir:2.12.0)
  2. Add a datasource, e.g.:
{
	"2": {
		"id": 1,
		"uid": "mimir",
		"orgId": 1,
		"name": "Mimir",
		"type": "prometheus",
		"typeName": "Prometheus",
		"access": "proxy",
		"url": "http://metrics-mimir-gateway.metrics.svc.cluster.local/prometheus",
		"jsonData": {
			"httpHeaderName1": "X-Scope-OrgID",
			"prometheusType": "Mimir",
			"prometheusVersion": "2.4.0",
			"timeInterval": "60s"
		}
	}
}
  3. Open any dashboard (or poll a label-values endpoint directly; see the sketch below)
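For a scripted reproduction (instead of clicking through a dashboard), a small polling helper can surface the sporadic 500s. This is a hypothetical sketch, not part of the original report: the gateway URL is assembled from the datasource config above plus the API path seen in the logs, and the tenant ID is the one visible in the logs; adjust both for your environment.

```go
// Hypothetical repro helper: repeatedly hit a label-values endpoint through
// the Mimir gateway and log any non-200 responses.
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// Assumption: gateway URL from the datasource config above + API path from the logs.
	url := "http://metrics-mimir-gateway.metrics.svc.cluster.local/prometheus/api/v1/label/cluster/values"

	for {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("X-Scope-OrgID", "anonymous") // tenant seen in the query-frontend logs

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Printf("request error: %v", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				log.Printf("HTTP %d: %s", resp.StatusCode, body) // e.g. "failed to enqueue request"
			}
		}

		time.Sleep(5 * time.Second)
	}
}
```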

Expected behavior

500s should not happen

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: helm

Additional Context

let me know if you need any additional details

@jmichalek132
Contributor

jmichalek132 commented Jun 5, 2024

Hi, to add a bit of context: when upgrading from Mimir 2.10 to Mimir 2.12, we started to see increased latency and a small error rate on the read path. At the same time we noticed that the number of TCP connections to the query-scheduler went from stable to fluctuating up and down.

[screenshot: query-scheduler TCP connection count fluctuating after the upgrade]

This seems to have been caused by this change, https://github.com/grafana/mimir/pull/7269/files#diff-7fd5824797e825650064e35cfdea31cf25162114e24bc754f648de77cff4ff06L53, which removed these extra args from the query-scheduler:

- "-server.grpc.keepalive.max-connection-age=2562047h" # 100000 days, effectively infinity
- "-server.grpc.keepalive.max-connection-age-grace=2562047h" # 100000 days, effectively infinity

These were previously added as part of https://github.com/grafana/mimir/pull/3262/files.
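For context on what these two flags control: they map onto grpc-go's server-side keepalive parameters. Below is a minimal standalone sketch (an assumption-laden illustration, not Mimir's actual server wiring, and the listen address is made up). Once MaxConnectionAge elapses, the server sends GOAWAY and, after the grace period, closes the connection, which is consistent with the EOF errors shown in the query-frontend logs above.

```go
// Minimal sketch of the gRPC server options behind the two flags
// (standalone example, not Mimir's actual wiring).
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":9095")
	if err != nil {
		log.Fatal(err)
	}

	// 2562047h is roughly MaxInt64 nanoseconds expressed in whole hours,
	// i.e. age-based connection recycling is effectively disabled.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      2562047 * time.Hour,
		MaxConnectionAgeGrace: 2562047 * time.Hour,
	}))

	// Register services here. With a short MaxConnectionAge instead, clients
	// holding long-lived streams see GOAWAY and then a closed connection.
	log.Fatal(srv.Serve(lis))
}
```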

Looking at sample traces where the requests ended with HTTP status code 500, it seems like retries were exhausted before a new connection was established.

[screenshot: trace of a request that ended with HTTP 500 after retries were exhausted]

Another example is queries that take roughly 1s to enqueue: they succeed, but latency increases even for light queries.

[screenshots: traces of queries taking roughly 1s to enqueue]

As mentioned on Slack in this thread https://grafana.slack.com/archives/C039863E8P7/p1715625953274669?thread_ts=1714333917.446309&cid=C039863E8P7, adding these args back to the query-scheduler seems to fix / minimize the issue:

- "-server.grpc.keepalive.max-connection-age=2562047h" # 100000 days, effectively infinity
- "-server.grpc.keepalive.max-connection-age-grace=2562047h" # 100000 days, effectively infinity

fyi @dimitarvdimitrov

We made the same change today; for now it seems to work, but I will update tomorrow / the day after on whether the issue truly went away for us too.

@jmichalek132
Contributor

I can confirm now, after having the change deployed in production for a few days, that it fully fixed the issue for us.

@dimitarvdimitrov
Contributor

dimitarvdimitrov commented Jun 11, 2024

This comment claims that the change in #7269 is effectively a no-op for the scheduler - it changes the connection age limit on the query-scheduler from 100,000 days to 106,751.9911673006 days (MaxInt64 nanoseconds). The usage of these two settings suggests that MaxInt64 should do the same as 100K days.
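For reference, a quick standalone arithmetic check of those numbers (plain Go, not Mimir code): the two limits differ by well under an hour, which supports the no-op reading.

```go
// Compare the flag value 2562047h with the MaxInt64-nanoseconds limit
// mentioned above (standalone check, not Mimir code).
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	flagValue := 2562047 * time.Hour         // the "100000 days, effectively infinity" flag value
	maxInt64 := time.Duration(math.MaxInt64) // the MaxInt64-nanoseconds limit mentioned above

	fmt.Printf("2562047h    = %.4f days\n", flagValue.Hours()/24) // ~106751.9583 days
	fmt.Printf("MaxInt64 ns = %.4f days\n", maxInt64.Hours()/24)  // ~106751.9912 days
	fmt.Printf("difference  = %s\n", maxInt64-flagValue)          // ~47 minutes
}
```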

Can you try to modify only the grace period or only the connection age so we can narrow down the offending setting?

Can you share some of the error logs in the query-frontend when it starts returning failed to enqueue request?

@jmichalek132
Contributor

> This comment claims that the change in #7269 is effectively a no-op for the scheduler - it changes the connection age limit on the query-scheduler from 100,000 days to 106,751.9911673006 days (MaxInt64 nanoseconds). The usage of these two settings suggests that MaxInt64 should do the same as 100K days.
>
> Can you try to modify only the grace period or only the connection age so we can narrow down the offending setting?
>
> Can you share some of the error logs in the query-frontend when it starts returning failed to enqueue request?

Hi, thank you for looking into this; I will test overriding only one of the settings today.

Also, here are examples of the errors / warnings we would see in the logs when the issue is happening:

ts=2024-05-31T14:09:02.428253747Z caller=spanlogger.go:109 user=****** trace_id=df988ccccaa9076b29df4396d447ab19 level=error user=****** traceID=df988ccccaa9076b29df4396d447ab19 msg="error processing request" try=0 err="rpc error: code = Unavailable desc = transport is closing"
ts=2024-05-31T14:09:02.427997645Z caller=spanlogger.go:109 user=****** trace_id=df988ccccaa9076b29df4396d447ab19 level=error user=****** traceID=df988ccccaa9076b29df4396d447ab19 msg="error processing request" try=0 err="rpc error: code = Unavailable desc = transport is closing"
ts=2024-05-31T14:08:03.86889945Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.152.4:9095
ts=2024-05-31T14:08:03.86887235Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=****** trace_id=7a7f58c6d7d8e7a1bc1fe6a8cad95b19 level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-05-31T14:08:03.868775149Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.152.4:9095
ts=2024-05-31T14:08:03.868636447Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=****** trace_id=7a7f58c6d7d8e7a1bc1fe6a8cad95b19 level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-05-31T14:08:03.868543146Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.152.4:9095
ts=2024-05-31T14:08:03.868517546Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=****** trace_id=7a7f58c6d7d8e7a1bc1fe6a8cad95b19 level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-05-31T14:08:03.868429644Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.152.4:9095
ts=2024-05-31T14:08:03.868387344Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=****** trace_id=7a7f58c6d7d8e7a1bc1fe6a8cad95b19 level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-05-31T14:08:03.868312443Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.152.4:9095
ts=2024-05-31T14:08:03.868285643Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=****** trace_id=7a7f58c6d7d8e7a1bc1fe6a8cad95b19 level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-05-31T14:08:03.868210642Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.154.35:9095
ts=2024-05-31T14:08:03.868184742Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=****** trace_id=7a7f58c6d7d8e7a1bc1fe6a8cad95b19 level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-05-31T14:08:03.868104841Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.154.35:9095
ts=2024-05-31T14:08:03.86808114Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=****** trace_id=7a7f58c6d7d8e7a1bc1fe6a8cad95b19 level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-05-31T14:08:03.868004739Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.154.35:9095
ts=2024-05-31T14:08:03.867977539Z caller=spanlogger.go:109 method=frontendSchedulerWorker.enqueueRequest user=****** trace_id=7a7f58c6d7d8e7a1bc1fe6a8cad95b19 level=warn msg="received error while sending request to scheduler" err=EOF
ts=2024-05-31T14:08:03.867898238Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.154.35:9095
ts=2024-05-31T14:08:03.867883138Z caller=frontend_scheduler_worker.go:325 level=error msg="error sending requests to scheduler" err=EOF addr=10.97.154.35:9095

@dimitarvdimitrov
Contributor

These errors look a lot like they are caused by regular connection terminations.

@dimitarvdimitrov
Contributor

Can you find a pattern in the errors @jmichalek132 @j-sokol? I'm curious how often they occur for each query-scheduler (something like this LogQL query: sum by (addr) (rate({}[$__auto])))

@jmichalek132
Contributor

jmichalek132 commented Jun 17, 2024

> Can you find a pattern in the errors @jmichalek132 @j-sokol? I'm curious how often they occur for each query-scheduler (something like this LogQL query: sum by (addr) (rate({}[$__auto])))

So, an update: after removing this setting from the query-frontend

- "-server.grpc.keepalive.max-connection-age-grace=2562047h" # 100000 days, effectively infinity

but leaving in

- "-server.grpc.keepalive.max-connection-age=2562047h" # 100000 days, effectively infinity

the issue doesn't seem to happen anymore.

Screenshot of the LogQL query:

[screenshot: per-scheduler error rate from the LogQL query]

The thing to keep in mind with our setup is that we are not running the ruler in remote mode, so alerts don't hit the query pipeline, and we don't have that many users right now, so we see small, sporadic rates of queries and the errors happen when queries are executed. The actual pattern might be different with a constant rate of queries.

[screenshot]

@lukas-unity
Contributor

We've also run into this issue when we upgraded to 2.12. For us, we saw it via notifications from Grafana managed alerts, as those returned a [no value] error with the message Error = failed to execute query A: canceled: query was canceled in query execution. Setting max-connection-age and max-connection-age-grace to 2562047h seems to have helped, as we no longer see warnings in the scheduler/query-frontend logs.
