
loki returned 500 and "rpc error: code = Canceled desc = context canceled" when handling large data query #3244

Closed
aricamf opened this issue Jan 27, 2021 · 6 comments
Labels
stale A stale issue or PR that will automatically be closed.

Comments

aricamf commented Jan 27, 2021

Describe the bug
Loki returned a 500 and logged "rpc error: code = Canceled desc = context canceled" when the query fetched more than 1.8 GB of data and took longer than 1 minute.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki 2.1
  2. Started Promtail to tail '...'
  3. Query: see the full query in the Loki logs at the bottom

Expected behavior
Grafana can fetch and visualize a large amount of data (10+ GB?) from Loki.

Environment:

  • Infrastructure: Loki and Grafana are deployed in Azure Kubernetes Service, with SSD used for data storage

  • Deployment tool: Helm

Screenshots, Promtail config, or terminal output

Loki logs:

ts=2021-01-27T04:50:13.279487527Z caller=spanlogger.go:53 org_id=fake traceID=6d3a3617a9d10b65 method=query.Exec level=debug Ingester.TotalReached=2 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Store.TotalChunksRef=1490 Store.TotalChunksDownloaded=800 Store.ChunksDownloadTime=137.986708ms Store.HeadChunkBytes="0 B" Store.HeadChunkLines=0 Store.DecompressedBytes="1.8 GB" Store.DecompressedLines=1349743 Store.CompressedBytes="174 MB" Store.TotalDuplicates=0
ts=2021-01-27T04:50:13.279600928Z caller=spanlogger.go:53 org_id=fake traceID=6d3a3617a9d10b65 method=query.Exec level=debug Summary.BytesProcessedPerSecond="26 MB" Summary.LinesProcessedPerSecond=19645 Summary.TotalBytesProcessed="1.8 GB" Summary.TotalLinesProcessed=1349743 Summary.ExecTime=1m8.703279889s
level=info ts=2021-01-27T04:50:13.279953831Z caller=metrics.go:83 org_id=fake traceID=6d3a3617a9d10b65 latency=slow query="(sum by (path) (sum_over_time({filename=~"/var/log/nginx/json_access.log",host=~"10.0.2.", tenant_id=~"sre-tj06"} |json | status = 200 | line_format "{{.request_uri}}" |regexp "(?P<path>^\\/([^/])\\/)" | line_format "{{.path}} {{.upstream_response_time}}" | unwrap upstream_response_time | __error__="" [5m]))) / sum by (path)(count_over_time({filename=~"/var/log/nginx/json_access.log",host=~"10.0.2.", tenant_id=~"sre-tj06"} | json | line_format "{{.request_uri}}" |regexp "(?P<path>^\\/([^/])\\/)"| path != "" | path!= "\\" | __error__="" [5m]))" query_type=metric range_type=range length=48h0m1s step=2m0s duration=1m8.703279889s status=500 throughput=26MB total_bytes=1.8GB
level=warn ts=2021-01-27T04:50:13.280160032Z caller=logging.go:71 traceID=6d3a3617a9d10b65 msg="GET /loki/api/v1/query_range?direction=BACKWARD&limit=1359&query=(sum%20by%20(path)%20(sum_over_time(%7Bfilename%3D~%22%2Fvar%2Flog%2Fnginx%2Fjson_access.log%22%2Chost%3D~%2210.0.2.%22%2C%20tenant_id%3D~%22sre-tj06%22%7D%20%7Cjson%20%7C%20status%20%3D%20200%20%7C%20line_format%20%22%7B%7B.request_uri%7D%7D%22%20%7Cregexp%20%22(%3FP%3Cpath%3E%5E%5C%5C%2F(%5B%5E%2F%5D)%5C%5C%2F)%22%20%7C%20line_format%20%22%7B%7B.path%7D%7D%20%7B%7B.upstream_response_time%7D%7D%22%20%20%7C%20unwrap%20upstream_response_time%20%7C%20__error__%3D%22%22%20%5B5m%5D)))%20%2F%20sum%20by%20(path)(count_over_time(%7Bfilename%3D~%22%2Fvar%2Flog%2Fnginx%2Fjson_access.log%22%2Chost%3D~%2210.0.2.%22%2C%20tenant_id%3D~%22sre-tj06%22%7D%20%7C%20json%20%7C%20line_format%20%22%7B%7B.request_uri%7D%7D%22%20%7Cregexp%20%22(%3FP%3Cpath%3E%5E%5C%5C%2F(%5B%5E%2F%5D)%5C%5C%2F)%22%7C%20path%20!%3D%20%22%22%20%7C%20path!%3D%20%22%5C%5C%22%20%7C%20__error__%3D%22%22%20%5B5m%5D))&start=1611550083000000000&end=1611722884000000000&step=120 (500) 1m8.704357997s Response: "rpc error: code = Canceled desc = context canceled\n" ws: false; Accept: application/json, text/plain, /; Accept-Encoding: gzip, deflate, br; Accept-Language: en-US,en;q=0.5; User-Agent: Grafana/7.3.5; X-Forwarded-For: 222.126.245.76, 10.244.0.110, 10.244.0.110; X-Grafana-Org-Id: 1; X-Real-Ip: 222.126.245.76; X-Request-Id: e87774979c4ef03dfd79c99361bcf97f; X-Scheme: https; "

aricamf commented Jan 27, 2021

I updated the query from Grafana. It seems that even though the total number of bytes returned dropped, it still reports the same error as long as it is a "long query".

ts=2021-01-27T05:13:41.095467378Z caller=spanlogger.go:53 org_id=fake traceID=1312e304681820a1 method=query.Exec level=debug Ingester.TotalReached=1 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Store.TotalChunksRef=414 Store.TotalChunksDownloaded=200 Store.ChunksDownloadTime=1.008968582s Store.HeadChunkBytes="0 B" Store.HeadChunkLines=0 Store.DecompressedBytes="389 MB" Store.DecompressedLines=287565 Store.CompressedBytes="37 MB" Store.TotalDuplicates=0
ts=2021-01-27T05:13:41.095564779Z caller=spanlogger.go:53 org_id=fake traceID=1312e304681820a1 method=query.Exec level=debug Summary.BytesProcessedPerSecond="4.9 MB" Summary.LinesProcessedPerSecond=3589 Summary.TotalBytesProcessed="389 MB" Summary.TotalLinesProcessed=287565 Summary.ExecTime=1m20.110152353s
level=info ts=2021-01-27T05:13:41.09580058Z caller=metrics.go:83 org_id=fake traceID=1312e304681820a1 latency=slow query="avg by (path) (sum_over_time({filename=~"/var/log/nginx/json_access.log",host=~"10.0.2.", tenant_id=~"sre-tj06"} |json | status = 200 | line_format "{{.request_uri}}" |regexp "(?P<path>^\\/([^/])\\/)" | line_format "{{.path}}" | unwrap upstream_response_time | __error__="" [1h])) > 1" query_type=metric range_type=range length=24h0m1s step=1m0s duration=1m20.110152353s status=500 throughput=4.9MB total_bytes=389MB
level=warn ts=2021-01-27T05:13:41.095917581Z caller=logging.go:71 traceID=1312e304681820a1 msg="GET /loki/api/v1/query_range?direction=BACKWARD&limit=1359&query=avg%20by%20(path)%20(sum_over_time(%7Bfilename%3D~%22%2Fvar%2Flog%2Fnginx%2Fjson_access.log%22%2Chost%3D~%2210.0.2.%22%2C%20tenant_id%3D~%22sre-tj06%22%7D%20%7Cjson%20%7C%20status%20%3D%20200%20%7C%20line_format%20%22%7B%7B.request_uri%7D%7D%22%20%7Cregexp%20%22(%3FP%3Cpath%3E%5E%5C%5C%2F(%5B%5E%2F%5D)%5C%5C%2F)%22%20%7C%20line_format%20%22%7B%7B.path%7D%7D%22%20%7C%20unwrap%20upstream_response_time%20%7C%20__error__%3D%22%22%20%5B1h%5D))%20%3E%201&start=1611637940000000000&end=1611724341000000000&step=60 (500) 1m20.110906958s Response: "rpc error: code = Canceled desc = context canceled\n" ws: false; Accept: application/json, text/plain, /; Accept-Encoding: gzip, deflate, br; Accept-Language: en-US,en;q=0.5; User-Agent: Grafana/7.3.5; X-Forwarded-For: 222.126.245.76, 10.244.0.110, 10.244.0.110; X-Grafana-Org-Id: 1; X-Real-Ip: 222.126.245.76; X-Request-Id: 0a74bc01e25f4501ebd02231eda224cc; X-Scheme: https; "

aricamf commented Jan 27, 2021

One more strange thing: I found this error in the Loki logs:
2021-01-27T11:52:11.390997942Z level=debug ts=2021-01-27T11:52:11.39081924Z caller=grpc_logging.go:53 method=/logproto.Querier/QuerySample duration=59.981648565s err="rpc error: code = Canceled desc = context canceled" msg="gRPC\n"

It seems that I get this error every time, with a duration close to 1m. Is it configurable? I've already updated the gRPC settings and the querier, but nothing changes this magic number.

Much appreciated if anyone can shed some light here.

Thanks

OrdinaryFan commented Mar 4, 2021

It seems that I get this error every time, with a duration close to 1m. Is it configurable?

The query_timeout setting can configure it.
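
For reference, a minimal sketch of where this is set in a Loki 2.x config (values are illustrative, not recommendations; in newer Loki releases query_timeout has moved under limits_config, so check the configuration docs for your version):

```yaml
# Illustrative Loki 2.x snippet: raise the ~1m default that matches the
# cancellations in the logs above. Values are assumptions, not tuning advice.
server:
  http_server_read_timeout: 5m    # the HTTP server timeouts must also cover long-running queries
  http_server_write_timeout: 5m

querier:
  query_timeout: 5m               # default is 1m in Loki 2.x
  engine:
    timeout: 5m                   # LogQL engine timeout
```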

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Jun 3, 2021
stale bot closed this as completed on Jun 11, 2021

LinTechSo (Contributor) commented

Hi, any updates?

carobaldino commented

Hi! I was having the same problem and opted to change the way queries were being made. By default, parallel queries are active, so one big query is split into smaller ones that all run in parallel. At first I was getting a "too many outstanding requests" error; after raising the maximum number of parallel queries, dashboards took a while to load but still loaded, although sometimes it still failed with the "context canceled" error from this issue.
Reading the issue linked below more carefully, I opted to disable parallel queries, because for my setup they were not justified and were degrading performance (I only have a single-instance Loki, not the Simple Scalable Deployment (SSD) nor microservices). A config sketch of the relevant options follows the link.
#5123 (comment)
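
A minimal sketch of the options mentioned above, assuming a Loki 2.x single-binary deployment (option names and placement have moved between releases, e.g. split_queries_by_interval later moved under limits_config, so verify against the docs for your version):

```yaml
# Illustrative snippet only: either disable query splitting/sharding, or keep it
# and raise the queue/parallelism limits. Values are assumptions, not tuning advice.
query_range:
  parallelise_shardable_queries: false   # turn off query sharding
  split_queries_by_interval: 0           # 0 disables time-based splitting of range queries

frontend:
  max_outstanding_per_tenant: 2048       # raise this instead if you keep splitting and
                                         # hit "too many outstanding requests"

limits_config:
  max_query_parallelism: 32              # caps how many split/sharded subqueries run at once
```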
