This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Return 499 HTTP code instead of 502 when the client disconnects during a render query with graphite proxying #1821

Merged
merged 1 commit into master from 499_when_client_disconnects on May 14, 2020

Conversation

fkaleo
Contributor

@fkaleo fkaleo commented May 6, 2020

If the client disconnects while we are processing a query that needs to be forwarded to graphite, return HTTP code 499 instead of the default 502 by implementing a custom proxy error handler.

Fix #1820
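
For context, a minimal sketch of the approach described above (illustrative only, not the actual metrictank code; the upstream address and handler wiring here are made up): httputil.ReverseProxy lets us replace its default ErrorHandler, which logs the error and answers 502 Bad Gateway, with one that answers 499 when the proxy error is a canceled request context, i.e. the client went away.

package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// StatusClientClosedRequest is nginx's non-standard 499 "client closed request" code.
const StatusClientClosedRequest = 499

func main() {
	// Hypothetical graphite address; metrictank proxies queries it cannot
	// handle natively (such as movingAverage) to graphite-web.
	graphiteURL, err := url.Parse("http://localhost:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(graphiteURL)

	// By default a ReverseProxy logs the proxy error and replies 502.
	// If the error is a canceled context - the client disconnected while
	// the proxied request was in flight - reply 499 instead.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		if errors.Is(err, context.Canceled) {
			w.WriteHeader(StatusClientClosedRequest)
			return
		}
		log.Printf("http: proxy error: %v", err)
		w.WriteHeader(http.StatusBadGateway)
	}

	http.Handle("/render", proxy)
	log.Fatal(http.ListenAndServe(":6060", nil))
}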

@fkaleo fkaleo requested review from replay and Dieterbe May 6, 2020 01:08
Contributor

@replay replay left a comment


LGTM

@fkaleo fkaleo merged commit b494854 into master May 14, 2020
@fkaleo fkaleo deleted the 499_when_client_disconnects branch May 14, 2020 18:21
@fkaleo
Contributor Author

fkaleo commented May 14, 2020

I was thinking, is it possible that tsdb-gw itself was disconnecting and not the actual client?

@replay
Contributor

replay commented May 14, 2020

That's definitely possible. At least that would explain why we have so many of these disconnects on various different instances; it would be weird if all our users behaved that way.

@Dieterbe
Contributor

I have just verified this PR using this simple test program:

package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	// Open a raw TCP connection to metrictank.
	conn, err := net.DialTimeout("tcp", "localhost:6060", 5*time.Second)
	if err != nil {
		log.Fatalf("could not dial: %s: %s", "localhost:6060", err)
	}
	// Send a render request that metrictank will proxy to graphite
	// (movingAverage is not handled natively), ...
	conn.Write([]byte("GET /render?target=movingAverage(metrictank.stats.*.*.version.*.gauge1%2C%20'5min')&from=-600h&until=now&format=json&meta=true HTTP/1.1\n"))
	conn.Write([]byte("Host: localhost:6060\n\n"))
	// ... then immediately close the connection to simulate a client disconnect.
	err = conn.Close()
	fmt.Println(err)
}

Before this PR

MT logs

2020-05-15 05:24:04.921 [INFO] Proxying to Graphite because of error: unknown function "movingAverage"
2020/05/15 05:24:04 http: proxy error: context canceled
2020-05-15 05:24:04.921 [INFO] ts=2020-05-15T05:24:04.921863879Z traceID=413e090cbb7c9b2a, sampled=true msg="GET /render?format=json&from=-600h&meta=true&target=movingAverage%28metrictank.stats.%2A.%2A.version.%2A.gauge1%2C+%275min%27%29&until=now (502) 250.214µs" orgID=1 sourceIP="172.24.0.1" error=""

ngrep sniff

#
T 172.24.0.1:60406 -> 172.24.0.10:6060 [AP] #22
GET /render?target=movingAverage(metrictank.stats.*.*.version.*.gauge1%2C%20'5min')&from=-600h&until=now&format=json&meta=true HTTP/1.1
Host: localhost:6060


#####
T 172.24.0.10:6060 -> 172.24.0.1:60406 [AP] #27
HTTP/1.1 502 Bad Gateway.
Trace-Id: 413e090cbb7c9b2a.
Vary: Origin.
Date: Fri, 15 May 2020 05:24:04 GMT.
Content-Length: 0.
.

With this PR applied

MT output

2020-05-15 05:26:55.602 [INFO] Proxying to Graphite because of error: unknown function "movingAverage"
2020-05-15 05:26:55.602 [INFO] http: proxy error: context canceled
2020-05-15 05:26:55.602 [INFO] ts=2020-05-15T05:26:55.602393309Z traceID=2fdd1afcda77f28a, sampled=true msg="GET /render?format=json&from=-600h&meta=true&target=movingAverage%28metrictank.stats.%2A.%2A.version.%2A.gauge1%2C+%275min%27%29&until=now (499) 237.018µs" orgID=1 sourceIP="172.24.0.1" error=""

ngrep sniff

T 172.24.0.1:33008 -> 172.24.0.10:6060 [AP] #52
GET /render?target=movingAverage(metrictank.stats.*.*.version.*.gauge1%2C%20'5min')&from=-600h&until=now&format=json&meta=true HTTP/1.1
Host: localhost:6060


#####
T 172.24.0.10:6060 -> 172.24.0.1:33008 [AP] #57
HTTP/1.1 499 status code 499.
Trace-Id: 2fdd1afcda77f28a.
Vary: Origin.
Date: Fri, 15 May 2020 05:26:55 GMT.
Content-Length: 0.
.

So, it seems to work as expected. The "status code 499" status message isn't the most useful, and it's redundant, but given that no client will actually see it, it's fine.
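
(The "status code 499" phrase is the Go standard library's fallback: 499 is not an IANA-registered status, so http.StatusText returns an empty string and the server writes "status code 499" as the reason phrase, as seen in the sniff above. A quick check, using nothing beyond the standard library:)

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// 499 is nginx's unofficial "client closed request" code; there is no
	// registered reason phrase for it, so the status line falls back to
	// "status code 499".
	fmt.Printf("%q\n", http.StatusText(499)) // ""
	fmt.Printf("%q\n", http.StatusText(502)) // "Bad Gateway"
}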

@Dieterbe
Contributor

PS: nice work @fkaleo!

@fkaleo
Contributor Author

fkaleo commented May 15, 2020

@Dieterbe your thoughts on #1821 (comment)?

@shanson7
Collaborator

I was thinking, is it possible that tsdb-gw itself was disconnecting and not the actual client?

FWIW, we see these too and don't use tsdb-gw. I've traced it down to auto-refreshing Grafana dashboards: if a dashboard is set to refresh every 5s (for example) but some of the queries take longer, Grafana cancels the request and we get a 502.
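
(To make that concrete, a small sketch with a plain Go client; the URL and the 5-second budget are made up. When the deadline passes, the client gives up and closes the connection, the server sees the request context canceled on the in-flight proxied query, and before this PR that surfaced as a 502.)

package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Give the request a 5s budget, like a dashboard that refreshes every 5s.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, "GET",
		"http://localhost:6060/render?target=some.slow.query&from=-600h", nil)
	if err != nil {
		log.Fatal(err)
	}

	// If the server takes longer than 5s, the client abandons the request here.
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println(err) // e.g. "context deadline exceeded"
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}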

@Dieterbe
Contributor

Dieterbe commented May 15, 2020

It seems tsdb-gw uses http.DefaultTransport when it creates the proxy to MT; it has these settings:

// DefaultTransport is the default implementation of Transport and is
// used by DefaultClient. It establishes network connections as needed
// and caches them for reuse by subsequent calls. It uses HTTP proxies
// as directed by the $HTTP_PROXY and $NO_PROXY (or $http_proxy and
// $no_proxy) environment variables.
var DefaultTransport RoundTripper = &Transport{
	Proxy: ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		DualStack: true,
	}).DialContext,
	ForceAttemptHTTP2:     true,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
}

Timeouts in Go are quite complicated - it's hard to even figure out which setting controls exactly what; see e.g. https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/, which I'm not sure corresponds to our Go version.
I'm not convinced it's worthwhile for us to investigate this very deeply. If a response is so slow that tsdb-gw times it out, it wasn't a very useful response anyway, and people would already have seen an error prior to this change; we just use a different code now.
I suggest we just watch whether any reports come in of unjustified HTTP 499 responses, which, by the way, I would expect to show up on ops if this is an issue.
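
If we ever do want to pin it down, the relevant knobs live on the transport the proxy uses; a sketch (addresses, port and timeout values are illustrative, not what tsdb-gw actually does) of giving the reverse proxy its own transport with explicit per-phase timeouts instead of relying on http.DefaultTransport:

package main

import (
	"log"
	"net"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Hypothetical metrictank address the gateway proxies to.
	upstream, err := url.Parse("http://localhost:6060")
	if err != nil {
		log.Fatal(err)
	}

	// Explicit per-phase timeouts instead of http.DefaultTransport's defaults.
	transport := &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second, // TCP connect
			KeepAlive: 30 * time.Second,
		}).DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 60 * time.Second, // how long the upstream may take to start responding
		IdleConnTimeout:       90 * time.Second,
	}

	proxy := httputil.NewSingleHostReverseProxy(upstream)
	proxy.Transport = transport
	log.Fatal(http.ListenAndServe(":8081", proxy)) // hypothetical gateway listen port
}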

@fkaleo
Contributor Author

fkaleo commented May 15, 2020

I was thinking, is it possible that tsdb-gw itself was disconnecting and not the actual client?

FWIW, we see these too and don't use tsdb-gw. I've traced it down to auto-refreshing Grafana dashboards: if a dashboard is set to refresh every 5s (for example) but some of the queries take longer, Grafana cancels the request and we get a 502.

Thanks! That's what I suspected :)


Successfully merging this pull request may close these issues.

client disconnecting while a query is being processed that requires proxying to graphite results in 502