
Grafana dashboard shows "too many outstanding requests" after upgrade to v2.4.2 #5123

Open
itsnotv opened this issue Jan 12, 2022 · 51 comments

Comments

@itsnotv

itsnotv commented Jan 12, 2022

Describe the bug

After upgrading to v2.4.2 from v2.4.1, none of the panels using Loki show any data. I have a dashboard with 4 panels that load data from Loki. I am able to see data being ingested correctly with a Grafana Explore datasource query.

Environment

Using Loki with docker-compose and shipping Docker logs with the Loki driver.

loki.yml

auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

Error on the Grafana panel:

status:429
statusText:""
data:Object
message:"too many outstanding requests"
error:""
response:"too many outstanding requests"
config:Object
headers:Object
url:"api/datasources/proxy/9/loki/api/v1/query_range?direction=BACKWARD&limit=240&query=sum(count_over_time(%7Bcontainer_name%3D%22nginx%22%2Csource%3D%22stdout%22%7D%5B2m%5D))&start=1641993484871000000&end=1642015084871000000&step=120"
retry:0
hideFromInspector:false
message:"too many outstanding requests"
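
(For reference, decoding that request URL: it is a /loki/api/v1/query_range call for sum(count_over_time({container_name="nginx",source="stdout"}[2m])) over a 6-hour window with step=120, i.e. about 180 evaluation steps for this one panel.)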
@zaibakker

Hi, I resolved the problem on my side by increasing two default values:

querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048

It's not perfect, but this error helped me understand the architecture better and better.
Next step: use the query frontend (not mandatory, but it becomes active if you add anything for it in the config) to do the queueing, and of course lower these values as much as possible for my home Docker service.

@zaibakker

Hi,
I'm back after trying many, many settings.

I solved my problem with:

  • activating the query frontend
  • reducing query splitting with split_queries_by_interval: 24h
  • max_outstanding_per_tenant: 1024

My dashboard now completes in 5s :)
Without the splitting parameter, I always got 429 errors for 1 to 3 graphs and a render took 3 minutes.

It works for me because I had a lot of small requests, too many for my Docker Loki process. Reducing them was the solution.
Increasing workers, frontend, parallelism or timeouts was a bad idea.

@dfoxg

dfoxg commented Jan 22, 2022

see #5204

@yakob-aleksandrovich

For completeness, here's the needed config

query_range:
  split_queries_by_interval: 24h

frontend:
  max_outstanding_per_tenant: 1024

@itsnotv
Author

itsnotv commented Jan 29, 2022

For completeness, here's the needed config

query_range:
  split_queries_by_interval: 24h

frontend:
  max_outstanding_per_tenant: 1024

This helped partially, I still see the error every now and then.

@yakob-aleksandrovich

yakob-aleksandrovich commented Jan 31, 2022

You can raise max_outstanding_per_tenant even higher. I've set mine to 4096 now.
But I'm afraid you can never avoid 'too many requests' completely. As far as I understand (still learning...), the more data you try to load, the more often you will hit this limit.
In my case, 'loading more data' happens because in Grafana I want to view the whole 721 hours (30 days), or because I've crammed too many queries into one graph.

I'm still working on finding the right trade-off between memory-usage and speed. Below, you'll see my current partial configuration, relevant to this specific issue.

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  # Read timeout for HTTP server
  http_server_read_timeout: 3m
  # Write timeout for HTTP server
  http_server_write_timeout: 3m

query_range:
  split_queries_by_interval: 0
  parallelise_shardable_queries: false

querier:
  max_concurrent: 2048

frontend:
  max_outstanding_per_tenant: 4096
  compress_responses: true

@itsnotv
Author

itsnotv commented Jan 31, 2022

query_range:
  split_queries_by_interval: 0

This part seems to help.

I never ran into this issue with 2.4.1. Something changed in 2.4.2; I hope they restore the default values to what they were before.

@dotdc

dotdc commented Feb 21, 2022

For completeness, here's the needed config

query_range:
  split_queries_by_interval: 24h

frontend:
  max_outstanding_per_tenant: 1024

This worked for my setup, thanks!

@whoamiUNIX

I can also confirm that on v2.4.2 you will face this issue if you keep the new default value.

Switching the value back to the old default from v2.4.1 solved my problem.

query_range:
  split_queries_by_interval: 0

@benisai

benisai commented Apr 11, 2022

Bump, this is a serious issue. Please fix it, Loki team.

@DanielVenturini

I'm not able to solve my problem using any of the above values/options on version 2.4.2. We rolled our Loki back to version 2.4.1 and this solved our issue. Let's wait for a fix from the Loki team.

@step-baby

2.5.0 also has this problem.

@123shang60

queue.go#L105-L107

	select {
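	// Note: upstream there is a default branch here that increments
	// discardedRequests and returns ErrTooManyRequests ("too many outstanding
	// requests") when the per-tenant queue is full; with it commented out
	// below, the send blocks until the queue has room.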
	case queue <- req:
		q.queueLength.WithLabelValues(userID).Inc()
		q.cond.Broadcast()
		// Call this function while holding a lock. This guarantees that no querier can fetch the request before function returns.
		if successFn != nil {
			successFn()
		}
		return nil
	//default:
	//	q.discardedRequests.WithLabelValues(userID).Inc()
	//	return ErrTooManyRequests
	}

After commenting out the default branch shown above, the problem was alleviated.

@LibiKorol

We got the same error with v2.5.0.
None of the above options solved the issue so we rolled back to v2.4.1.

@LibiKorol

Is there an ETA for a fix?

@Alcatros

I can confirm this issue exists after an upgrade to the newest version. I can't even roll back to 2.4.1; I should note that 2.4.1 uses v1beta tags and will not be available on GCP for much longer.

@wuestkamp

We also had a lot of "429 too many outstanding requests" errors on Loki 2.5.0 and 2.4.2.
We moved back to Loki 2.4.1 and the problem is gone.

@Alcatros

@wuestkamp the issue is really that 2.4.1 has security issues and will soon be deprecated by new k8s cluster versions.

@benisai

benisai commented Jun 15, 2022

So why is Grafana Labs not fixing this issue? I don't understand. Why is it so hard?

@Alcatros

@benisai I wish I knew. Make sure you are only using this in an isolated network; the CVEs could lead to break-ins, and Grafana is a data pod with potentially lots of customer logs etc. Don't endanger your company by running old versions.

@benisai

benisai commented Jun 15, 2022

@benisai I wish I knew. Make sure you are only using this in an isolated network; the CVEs could lead to break-ins, and Grafana is a data pod with potentially lots of customer logs etc. Don't endanger your company by running old versions.

Homelab only. But the issue still persists without a fix. Or is there a fix?

@clouedoc

I'm too lazy to set up a configuration file, so I just downgraded to 2.4.1 (homelab).
I wish there was a way to configure Loki with environment variables. Configuration files are a pain.

@onovaes

onovaes commented Jun 25, 2022

Hi, I resolved the problem on my side by increasing two default values:

querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048

It's not perfect, but this error helped me understand the architecture better and better. Next step: use the query frontend (not mandatory, but it becomes active if you add anything for it in the config) to do the queueing, and of course lower these values as much as possible for my home Docker service.

That works for me with Ansible:

- name: Create loki service
  tags: grafana
  docker_container:
    name: loki
    restart_policy: always
    image: "grafana/loki:2.5.0"
    log_driver: syslog
    log_options:
      tag: lokilog
    networks:
      - name: "loki"
    command: "-config.file=/etc/loki/local-config.yaml -querier.max-outstanding-requests-per-tenant=2048 -querier.max-concurrent=2048"

@LinTechSo
Contributor

Hi, any updates?
Thanks for the info, but the problem still persists on Loki version 2.5.0.

@stefan-fast

stefan-fast commented Jun 27, 2022

I increased both the values frontend.max_outstanding_per_tenant and query_scheduler.max_outstanding_requests_per_tenant to 4096. I do not get any "too many outstanding requests" errors anymore (Loki v2.4.2, tested in a test cluster as well as a production cluster).

query_scheduler:
  max_outstanding_requests_per_tenant: 4096
frontend:
  max_outstanding_per_tenant: 4096
query_range:
  parallelise_shardable_queries: true
limits_config:
  split_queries_by_interval: 15m
  max_query_parallelism: 32

The default values for frontend.max_outstanding_per_tenant and query_scheduler.max_outstanding_requests_per_tenant are too low if you are using dashboards with multiple queries (multiple panels or multiple queries in one panel) over a longer time range because the queries will be split and will result in a lot of smaller sub-queries. Having multiple users using the same dashboard at the same time (or even only one user quickly refreshing the dashboard multiple times in a row) will further increase the count and you'll reach the limit even quicker.
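
As a rough illustration (the numbers here are purely an example): a dashboard with 6 Loki panels over a 7-day range, split at 15m, fans out into 6 × 7 × 24 × 4 = 4032 sub-queries per refresh, which easily exceeds a per-tenant queue limit in the low hundreds.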
This write-up really helped me understand the query splitting and why there are so many queries:
https://taisho6339.gitbook.io/grafana-loki-deep-dive/query-process/split-a-query-into-someones
and
https://taisho6339.gitbook.io/grafana-loki-deep-dive/query-process/schedule-queries-to-queriers

@LinTechSo
Contributor

LinTechSo commented Jun 27, 2022

@stefan-fast Thank you so much for your help.
With these configurations, I can confirm that the issue is fixed on Loki versions 2.5.0 and 2.4.2.

@Alcatros

Alcatros commented Jul 1, 2022

This doesn't work with 2.6.2.

@clouedoc

clouedoc commented Sep 15, 2022

@reyyzzy

The cause of the issue is that parallelism has been enabled by default, while the number of queries that you can queue at the same time is capped at a low default. The best solution right now is to raise the maximum number of outstanding requests.
In the future, we can hope for saner defaults.

For simple deployments (single-binary or SSD mode), add the following configuration:

query_scheduler:
  max_outstanding_requests_per_tenant: 10000

If you deployed in microservices mode, use this config:

frontend:
  max_outstanding_per_tenant: 10000

lxwzy pushed a commit to lxwzy/loki that referenced this issue Nov 7, 2022
…rrect explanation for how to disable. (grafana#6715)

**What this PR does / why we need it**:
I noticed when responding to grafana#5123 the docs did not correctly explain how to disable splitting queries by time. I searched through the code and confirmed `0` is the correct value to disable this feature.
@aignatovich

Issue present in loki 2.7.0 while using default values.

@mbrav

mbrav commented Mar 14, 2023

Issue present in loki 2.7.0 while using default values.

On 2.7.4 as well

@thiagomorales

Same in 2.8.0

@sven0219

sven0219 commented Apr 25, 2023

None of the above configurations worked for me, version 2.8.0

@hardaker

hardaker commented May 5, 2023

For me the following were the only lines that I changed from the base config in the docker container and seem to work (so far):

querier:
  max_concurrent: 100

frontend:
  max_outstanding_per_tenant: 1024
  scheduler_worker_concurrency: 20

I'm not throwing a huge amount at the server at the moment, but at least multiple panels in a dashboard load in a separate Grafana instance that's pointing at it.

@sven0219

sven0219 commented May 8, 2023

Hi, I resolved the problem on my side by increasing two default values:

querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048

It's not perfect, but this error helped me understand the architecture better and better. Next step: use the query frontend (not mandatory, but it becomes active if you add anything for it in the config) to do the queueing, and of course lower these values as much as possible for my home Docker service.

It works for me.

@AlejandroVindel

@reyyzzy

The cause of the issue is that parallelism has been enabled by default, while the number of queries that you can queue at the same time is capped at a low default. The best solution right now is to raise the maximum number of outstanding requests. In the future, we can hope for saner defaults.

For simple deployments (single-binary or SSD mode), add the following configuration:

query_scheduler:
  max_outstanding_requests_per_tenant: 10000

If you deployed in microservices mode, use this config:

frontend:
  max_outstanding_per_tenant: 10000

The first one works for me in Loki v2.8.2 with the single-binary deployment.

@jhodysetiawansekardono

Hi, I resolved the problem on my side by increasing two default values:
querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048
It's not perfect, but this error helped me understand the architecture better and better. Next step: use the query frontend (not mandatory, but it becomes active if you add anything for it in the config) to do the queueing, and of course lower these values as much as possible for my home Docker service.

That works for me with Ansible:

- name: Create loki service
  tags: grafana
  docker_container:
    name: loki
    restart_policy: always
    image: "grafana/loki:2.5.0"
    log_driver: syslog
    log_options:
      tag: lokilog
    networks:
      - name: "loki"
    command: "-config.file=/etc/loki/local-config.yaml -querier.max-outstanding-requests-per-tenant=2048 -querier.max-concurrent=2048"

Adding the configuration via command-line flags works for me in Loki v2.6.1 installed using Helm:

  extraArgs:
    querier.max-outstanding-requests-per-tenant: "2048"
    querier.max-concurrent: "2048"

@litvinav

litvinav commented Jun 8, 2023

The following seems to work for chart version 5.6.2 and app version 2.8.2 of the grafana/loki Helm chart.

loki:
  limits_config:
    split_queries_by_interval: 24h
    max_query_parallelism: 100
  query_scheduler:
    max_outstanding_requests_per_tenant: 4096
  frontend:
    max_outstanding_per_tenant: 4096
# other stuff...

If you are not deploying Loki via Helm, I believe you have to set these values not under the "loki:" key but directly at the top level of the config.
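
A sketch of that top-level placement (values copied from the Helm snippet above):

limits_config:
  split_queries_by_interval: 24h
  max_query_parallelism: 100

query_scheduler:
  max_outstanding_requests_per_tenant: 4096

frontend:
  max_outstanding_per_tenant: 4096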

It seems that the "outstanding requests" settings are needed to allow many requests to Loki, while the settings in "limits_config" are the upper ceiling on what Grafana sends out to the datasource. Either "query_scheduler" or "frontend" is required depending on your setup, but just set both.

Depending on how complex your dashboards are, you might run into these limits and have to extend them.

If the defaults won't be changed, I guess this issue can be closed.

@Alcatros

Alcatros commented Jun 8, 2023

@litvinav I'm highly against closing this; sensible defaults are something every piece of software should have. If you install an ingress it also works out of the box, and you can configure it further on top. That makes adoption easier for beginners.

To break the default you only need to select about 5 datasets and you will get a 429 (this isn't a complex screen or anything).

@alexey-sh

So what is the final decision?
IMHO, a config generator that takes the server hardware as input would be great.

akunzai added a commit to akunzai/containers-lab that referenced this issue Nov 16, 2023
akunzai added a commit to akunzai/containers-lab that referenced this issue Nov 16, 2023
@jleach

jleach commented Jan 16, 2024

querier:
  max_concurrent: 2048

That's a big number relative to the default of 10. Might need to watch resource consumption.

# The maximum number of concurrent queries allowed.
# CLI flag: -querier.max-concurrent
[max_concurrent: <int> | default = 10]

@msveshnikov

None of the above solutions works; only reverting to 2.4.1 finally fixed this dreaded issue.

@OliverStutz

@msveshnikov Don't run old versions, you put your clusters at risk!

@joey-zwaan

joey-zwaan commented Jun 7, 2024

I have a good solution from another issue that was closed. Apparently the problem is with the parallel queries going on.
The following configuration worked for me for the latest build.

The solution was found here: #4613 (comment)

config:
  query_scheduler:
    max_outstanding_requests_per_tenant: 2048

  query_range:
    parallelise_shardable_queries: false
    split_queries_by_interval: 0

