Request cancel issue between loki QueryFrontend and QueryFrontendWorker #5132

Closed
kavirajk opened this issue Jan 13, 2022 · 0 comments · Fixed by #5113
LogQL request cancellation is not propagated.

Problem.

When a LogQL request is cancelled by the client (LogCLI or Grafana), the cancellation is received by the Query Frontend but is not propagated to some of the downstream requests the Query Frontend starts (mainly to the queriers).

This leads to higher resource consumption in Loki because, even after the original request is cancelled, the downstream queries started for it keep running on the queriers.

NOTE: This can happen for any LogQL query.
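
For illustration, here is a minimal Go sketch of the expected pattern (not Loki's actual code; the function names are made up): the downstream work has to receive the incoming request's context, so that cancelling the client request also cancels that work. If the work is started with a detached context such as context.Background(), the cancellation never reaches it, which is the behaviour described above.

package main

import (
	"context"
	"fmt"
	"time"
)

// downstreamQuery stands in for the work a querier does on behalf of the
// query frontend. It stops as soon as ctx is cancelled.
func downstreamQuery(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second): // pretend the query needs 30s
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.Canceled / DeadlineExceeded once the client gives up
	}
}

func main() {
	// The client (LogCLI wrapped in `timeout 10s`) effectively cancels after 10s.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Propagated correctly: this returns after ~10s with a cancellation error.
	fmt.Println(downstreamQuery(ctx))

	// Bug class described in this issue: starting the downstream work with a
	// fresh context detaches it from the client's cancellation, so it would
	// keep running for the full 30s.
	// downstreamQuery(context.Background())
}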

Steps to reproduce the issue.

NOTE: Here I use LogCLI, for two reasons:

  1. Grafana sends two queries (the actual log query and, if enabled, the log volume query), so narrowing down the resource consumption in that time interval is a bit hard. With LogCLI we can send only the single metric query that causes the issue.
  2. It is easy to change the request timeout value for the investigation.

1. Make a count_over_time query via LogCLI

We use timeout 10s, which closes the client connection after 10s, cancelling the request.

Here we use date | md5sum | cut -d\- -f1 to embed a random ID in the query (this makes it easy to find the exact query later).

QID=$(date | md5sum | cut -d\- -f1) && \
	echo $QID && \
	timeout 10s \
	logcli query 'sum by (level) (count_over_time({cluster="<cluster-name>"} |= "'"$QID"'"[1m]))'\
	--since=48h \
	-q

The above command prints the random ID. Copy it so you can search for that query and find its traces.

2. Search for that query in Loki.

Search for the query we made, in Grafana Explore or via LogCLI, using the following query.
Fill in <random-id> copied from the previous step.

{cluster="<cluster>", namespace="<namespace>", job="<namespace>/query-frontend"} |= "caller=metrics.go" |= "<random-id>"

3. Tempo traces.

Optionally, inspect the traces of the request.

[screenshot: tempo-traces]

You can see that the request takes more than 10s (something was still running after the cancellation).

Also from the traces, you can see that the Query Frontend received the cancellation and correctly returned a 499 status (at 10s).

So the problem is: even though the Query Frontend received the cancellation and responded correctly with 499 at 10s, the complete request cycle ran for more than 15s (it went up to 45s in some clusters, depending on the traffic).
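
For reference, a minimal, hypothetical sketch (not the actual change in #5113) of how long-running downstream work can observe a propagated cancellation and stop early, so the request cycle ends close to the 10s mark instead of running to completion:

package main

import (
	"context"
	"fmt"
	"time"
)

// processBatches stands in for a querier working through data for a query.
// Checking ctx.Err() between batches lets it stop as soon as cancellation
// reaches it, instead of finishing all of the remaining work.
func processBatches(ctx context.Context, batches int) error {
	for i := 0; i < batches; i++ {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("stopped after %d batches: %w", i, err)
		}
		time.Sleep(100 * time.Millisecond) // simulate work on one batch
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	// With 100 batches this would take ~10s; it stops around the 1s deadline.
	fmt.Println(processBatches(ctx, 100))
}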
