
loki returned 500 and "rpc error: code = Canceled desc = context canceled" when handling large data query #3244

Closed
aricamf opened this issue Jan 27, 2021 · 6 comments
Labels
stale A stale issue or PR that will automatically be closed.

Comments

aricamf commented Jan 27, 2021

Describe the bug
Loki returned a 500 and logged "rpc error: code = Canceled desc = context canceled" when the query fetched more than 1.8 GB of data and took longer than 1 minute.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki 2.1
  2. Started Promtail to tail '...'
  3. Query: see the full query in the Loki logs at the bottom

Expected behavior
Grafana can fetch and visualize a large amount of data (10+ GB?) from Loki.

Environment:

  • Infrastructure: Loki and Grafana are deployed in Azure Kubernetes Service, with SSD used for data storage

  • Deployment tool: Helm

Screenshots, Promtail config, or terminal output

Loki logs:

ts=2021-01-27T04:50:13.279487527Z caller=spanlogger.go:53 org_id=fake traceID=6d3a3617a9d10b65 method=query.Exec level=debug Ingester.TotalReached=2 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Store.TotalChunksRef=1490 Store.TotalChunksDownloaded=800 Store.ChunksDownloadTime=137.986708ms Store.HeadChunkBytes="0 B" Store.HeadChunkLines=0 Store.DecompressedBytes="1.8 GB" Store.DecompressedLines=1349743 Store.CompressedBytes="174 MB" Store.TotalDuplicates=0
ts=2021-01-27T04:50:13.279600928Z caller=spanlogger.go:53 org_id=fake traceID=6d3a3617a9d10b65 method=query.Exec level=debug Summary.BytesProcessedPerSecond="26 MB" Summary.LinesProcessedPerSecond=19645 Summary.TotalBytesProcessed="1.8 GB" Summary.TotalLinesProcessed=1349743 Summary.ExecTime=1m8.703279889s
level=info ts=2021-01-27T04:50:13.279953831Z caller=metrics.go:83 org_id=fake traceID=6d3a3617a9d10b65 latency=slow query="(sum by (path) (sum_over_time({filename=~"/var/log/nginx/json_access.log",host=~"10.0.2.", tenant_id=~"sre-tj06"} |json | status = 200 | line_format "{{.request_uri}}" |regexp "(?P<path>^\\/([^/])\\/)" | line_format "{{.path}} {{.upstream_response_time}}" | unwrap upstream_response_time | __error__="" [5m]))) / sum by (path)(count_over_time({filename=~"/var/log/nginx/json_access.log",host=~"10.0.2.", tenant_id=~"sre-tj06"} | json | line_format "{{.request_uri}}" |regexp "(?P<path>^\\/([^/])\\/)"| path != "" | path!= "\\" | __error__="" [5m]))" query_type=metric range_type=range length=48h0m1s step=2m0s duration=1m8.703279889s status=500 throughput=26MB total_bytes=1.8GB
level=warn ts=2021-01-27T04:50:13.280160032Z caller=logging.go:71 traceID=6d3a3617a9d10b65 msg="GET /loki/api/v1/query_range?direction=BACKWARD&limit=1359&query=(sum%20by%20(path)%20(sum_over_time(%7Bfilename%3D~%22%2Fvar%2Flog%2Fnginx%2Fjson_access.log%22%2Chost%3D~%2210.0.2.%22%2C%20tenant_id%3D~%22sre-tj06%22%7D%20%7Cjson%20%7C%20status%20%3D%20200%20%7C%20line_format%20%22%7B%7B.request_uri%7D%7D%22%20%7Cregexp%20%22(%3FP%3Cpath%3E%5E%5C%5C%2F(%5B%5E%2F%5D)%5C%5C%2F)%22%20%7C%20line_format%20%22%7B%7B.path%7D%7D%20%7B%7B.upstream_response_time%7D%7D%22%20%20%7C%20unwrap%20upstream_response_time%20%7C%20__error__%3D%22%22%20%5B5m%5D)))%20%2F%20sum%20by%20(path)(count_over_time(%7Bfilename%3D~%22%2Fvar%2Flog%2Fnginx%2Fjson_access.log%22%2Chost%3D~%2210.0.2.%22%2C%20tenant_id%3D~%22sre-tj06%22%7D%20%7C%20json%20%7C%20line_format%20%22%7B%7B.request_uri%7D%7D%22%20%7Cregexp%20%22(%3FP%3Cpath%3E%5E%5C%5C%2F(%5B%5E%2F%5D)%5C%5C%2F)%22%7C%20path%20!%3D%20%22%22%20%7C%20path!%3D%20%22%5C%5C%22%20%7C%20__error__%3D%22%22%20%5B5m%5D))&start=1611550083000000000&end=1611722884000000000&step=120 (500) 1m8.704357997s Response: "rpc error: code = Canceled desc = context canceled\n" ws: false; Accept: application/json, text/plain, /; Accept-Encoding: gzip, deflate, br; Accept-Language: en-US,en;q=0.5; User-Agent: Grafana/7.3.5; X-Forwarded-For: 222.126.245.76, 10.244.0.110, 10.244.0.110; X-Grafana-Org-Id: 1; X-Real-Ip: 222.126.245.76; X-Request-Id: e87774979c4ef03dfd79c99361bcf97f; X-Scheme: https; "

aricamf commented Jan 27, 2021

I updated the query from Grafana. It seems that even though the total number of bytes returned dropped, it still reports the same error as long as it is a "long query".

ts=2021-01-27T05:13:41.095467378Z caller=spanlogger.go:53 org_id=fake traceID=1312e304681820a1 method=query.Exec level=debug Ingester.TotalReached=1 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Store.TotalChunksRef=414 Store.TotalChunksDownloaded=200 Store.ChunksDownloadTime=1.008968582s Store.HeadChunkBytes="0 B" Store.HeadChunkLines=0 Store.DecompressedBytes="389 MB" Store.DecompressedLines=287565 Store.CompressedBytes="37 MB" Store.TotalDuplicates=0
ts=2021-01-27T05:13:41.095564779Z caller=spanlogger.go:53 org_id=fake traceID=1312e304681820a1 method=query.Exec level=debug Summary.BytesProcessedPerSecond="4.9 MB" Summary.LinesProcessedPerSecond=3589 Summary.TotalBytesProcessed="389 MB" Summary.TotalLinesProcessed=287565 Summary.ExecTime=1m20.110152353s
level=info ts=2021-01-27T05:13:41.09580058Z caller=metrics.go:83 org_id=fake traceID=1312e304681820a1 latency=slow query="avg by (path) (sum_over_time({filename=~"/var/log/nginx/json_access.log",host=~"10.0.2.", tenant_id=~"sre-tj06"} |json | status = 200 | line_format "{{.request_uri}}" |regexp "(?P<path>^\\/([^/])\\/)" | line_format "{{.path}}" | unwrap upstream_response_time | __error__="" [1h])) > 1" query_type=metric range_type=range length=24h0m1s step=1m0s duration=1m20.110152353s status=500 throughput=4.9MB total_bytes=389MB
level=warn ts=2021-01-27T05:13:41.095917581Z caller=logging.go:71 traceID=1312e304681820a1 msg="GET /loki/api/v1/query_range?direction=BACKWARD&limit=1359&query=avg%20by%20(path)%20(sum_over_time(%7Bfilename%3D~%22%2Fvar%2Flog%2Fnginx%2Fjson_access.log%22%2Chost%3D~%2210.0.2.%22%2C%20tenant_id%3D~%22sre-tj06%22%7D%20%7Cjson%20%7C%20status%20%3D%20200%20%7C%20line_format%20%22%7B%7B.request_uri%7D%7D%22%20%7Cregexp%20%22(%3FP%3Cpath%3E%5E%5C%5C%2F(%5B%5E%2F%5D)%5C%5C%2F)%22%20%7C%20line_format%20%22%7B%7B.path%7D%7D%22%20%7C%20unwrap%20upstream_response_time%20%7C%20__error__%3D%22%22%20%5B1h%5D))%20%3E%201&start=1611637940000000000&end=1611724341000000000&step=60 (500) 1m20.110906958s Response: "rpc error: code = Canceled desc = context canceled\n" ws: false; Accept: application/json, text/plain, /; Accept-Encoding: gzip, deflate, br; Accept-Language: en-US,en;q=0.5; User-Agent: Grafana/7.3.5; X-Forwarded-For: 222.126.245.76, 10.244.0.110, 10.244.0.110; X-Grafana-Org-Id: 1; X-Real-Ip: 222.126.245.76; X-Request-Id: 0a74bc01e25f4501ebd02231eda224cc; X-Scheme: https; "

aricamf commented Jan 27, 2021

One more strange thing: I found this error in the Loki logs:
2021-01-27T11:52:11.390997942Z level=debug ts=2021-01-27T11:52:11.39081924Z caller=grpc_logging.go:53 method=/logproto.Querier/QuerySample duration=59.981648565s err="rpc error: code = Canceled desc = context canceled" msg="gRPC\n"

It seems that I get this error every time, with a duration close to 1m. Is it configurable? I've already updated the gRPC settings and the querier, but nothing changes this magic number.

Much appreciated if anyone can shed some light here.

Thanks

OrdinaryFan commented Mar 4, 2021

It seems that I get this error every time, with a duration close to 1m. Is it configurable?

The query_timeout setting can configure it.
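
For reference, a minimal sketch of where this is set in a Loki 2.x config (values are illustrative, not recommendations; in newer Loki releases query_timeout has moved under limits_config, so check the configuration docs for your version):

```yaml
# Illustrative Loki 2.x snippet: raise the ~1m default that matches the
# cancellations in the logs above. Values are assumptions, not tuning advice.
server:
  http_server_read_timeout: 5m    # the HTTP server timeouts must also cover long-running queries
  http_server_write_timeout: 5m

querier:
  query_timeout: 5m               # default is 1m in Loki 2.x
  engine:
    timeout: 5m                   # LogQL engine timeout
```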

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jun 3, 2021
stale bot closed this as completed on Jun 11, 2021

LinTechSo (Contributor) commented

Hi, any updates?

carobaldino commented

Hi! I was having the same problem and opted to change the way queries were being made. By default, parallel queries are active, so one big query is split into smaller ones that all run in parallel. At first I was getting a "too many outstanding requests" error; after raising the maximum number of parallel queries, dashboards took a while to load but still loaded, although sometimes it still failed with the "context canceled" error from this issue.
Reading the issue linked below more carefully, I opted to disable parallel queries, because for my setup they were not justified and were degrading performance (I only have a single-instance Loki, not the Simple Scalable Deployment (SSD) nor microservices). A config sketch of the relevant options follows the link.
#5123 (comment)
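
A minimal sketch of the options mentioned above, assuming a Loki 2.x single-binary deployment (option names and placement have moved between releases, e.g. split_queries_by_interval later moved under limits_config, so verify against the docs for your version):

```yaml
# Illustrative snippet only: either disable query splitting/sharding, or keep it
# and raise the queue/parallelism limits. Values are assumptions, not tuning advice.
query_range:
  parallelise_shardable_queries: false   # turn off query sharding
  split_queries_by_interval: 0           # 0 disables time-based splitting of range queries

frontend:
  max_outstanding_per_tenant: 2048       # raise this instead if you keep splitting and
                                         # hit "too many outstanding requests"

limits_config:
  max_query_parallelism: 32              # caps how many split/sharded subqueries run at once
```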
