Request cancel issue between loki QueryFrontend and QueryFrontendWorker #5132

Closed
kavirajk opened this issue Jan 13, 2022 · 0 comments · Fixed by #5113
LogQL request cancellation is not propagated.

Problem.

When a LogQL request is cancelled by the client (LogCLI or Grafana), the cancellation is received by the Query Frontend but is not propagated to some of the downstream requests the Query Frontend starts (mainly to the queriers).

This leads to higher resource consumption in Loki because, even after the original request is cancelled, the downstream queries started for it keep running on the queriers.

NOTE: This can happen for any LogQL query.
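
For illustration, here is a minimal Go sketch of the expected pattern (not Loki's actual code; the function names are made up): the downstream work has to receive the incoming request's context, so that cancelling the client request also cancels that work. If the work is started with a detached context such as context.Background(), the cancellation never reaches it, which is the behaviour described above.

package main

import (
	"context"
	"fmt"
	"time"
)

// downstreamQuery stands in for the work a querier does on behalf of the
// query frontend. It stops as soon as ctx is cancelled.
func downstreamQuery(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second): // pretend the query needs 30s
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.Canceled / DeadlineExceeded once the client gives up
	}
}

func main() {
	// The client (LogCLI wrapped in `timeout 10s`) effectively cancels after 10s.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Propagated correctly: this returns after ~10s with a cancellation error.
	fmt.Println(downstreamQuery(ctx))

	// Bug class described in this issue: starting the downstream work with a
	// fresh context detaches it from the client's cancellation, so it would
	// keep running for the full 30s.
	// downstreamQuery(context.Background())
}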

Steps to reproduce the issue.

NOTE: Here I use LogCLI, for two reasons:

  1. Grafana sends two queries (the actual log query and, if enabled, the log volume query), so narrowing down the resource consumption in that time interval is a bit hard. With LogCLI we can send only the single metric query that causes the issue.
  2. It is easy to change the request timeout value for the investigation.

1. Make a count_over_time query via LogCLI

We use timeout 10s, which closes the client connection after 10s, cancelling the request.

Here we use date | md5sum | cut -d\- -f1 to embed a random ID in the query (this makes it easy to find the exact query later).

QID=$(date | md5sum | cut -d\- -f1) && \
	echo $QID && \
	timeout 10s \
	logcli query 'sum by (level) (count_over_time({cluster="<cluster-name>"} |= "'"$QID"'"[1m]))'\
	--since=48h \
	-q

The above command prints the random ID. Copy it so you can search for that query and find its traces.

2. Search for that query in Loki.

Search for the query we made, in Grafana Explore or via LogCLI, using the following query.
Fill in <random-id> copied from the previous step.

{cluster="<cluster>", namespace="<namespace>", job="<namespace>/query-frontend"} |= "caller=metrics.go" |= "<random-id>"

3. Tempo traces.

Optionally, inspect the traces of the request.

[screenshot: tempo-traces]

You can see that the request takes more than 10s (something was still running after the cancellation).

Also from the traces, you can see that the Query Frontend received the cancellation and correctly returned a 499 status (at 10s).

So the problem is: even though the Query Frontend received the cancellation and responded correctly with 499 at 10s, the complete request cycle ran for more than 15s (it went up to 45s in some clusters, depending on the traffic).
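
For reference, a minimal, hypothetical sketch (not the actual change in #5113) of how long-running downstream work can observe a propagated cancellation and stop early, so the request cycle ends close to the 10s mark instead of running to completion:

package main

import (
	"context"
	"fmt"
	"time"
)

// processBatches stands in for a querier working through data for a query.
// Checking ctx.Err() between batches lets it stop as soon as cancellation
// reaches it, instead of finishing all of the remaining work.
func processBatches(ctx context.Context, batches int) error {
	for i := 0; i < batches; i++ {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("stopped after %d batches: %w", i, err)
		}
		time.Sleep(100 * time.Millisecond) // simulate work on one batch
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	// With 100 batches this would take ~10s; it stops around the 1s deadline.
	fmt.Println(processBatches(ctx, 100))
}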
