net/http, x/net/http2: TCP connection is not closed after http client request times out after missing RST (TCP) or GOAWAY (HTTP/2) frame #36026
Got a repro?
@bradfitz I understand my problem better now, and I can reproduce this easily, but there's not a straightforward free-standing reproducer. I think this is actually primarily a problem with Azure load balancers being badly behaved, but there's also a difference in Go client behaviour between HTTP/1.1 and HTTP/2. I do wonder whether there is scope for Go to be more defensive here.

The principal issue is that Azure load balancers by default don't send a TCP RST for incoming packets on a connection they don't recognise. In my case, I've got a long-standing connection to Azure CosmosDB (which sits behind an Azure load balancer), and I log onto a VPN which changes the default route on my laptop. The same thing can also happen if there's a network break that lasts longer than the Azure LB idle timeout, or, similarly, if I suspend my laptop for longer than the idle timeout. When any of these things happens, the open TCP connection gets into a nasty state. The connection can't make progress because Azure doesn't recognise it or has timed it out, but it also doesn't die quickly because Azure blackholes it instead of sending a RST back to the client. Linux keeps retransmitting packets to Azure hoping for a response until it hits the kernel's retransmission limit (`tcp_retries2`, roughly 15 minutes by default).

In the case of a Go HTTP client with a Timeout set, using HTTP/2, when the underlying TCP connection goes bad, individual requests time out, but Go does not collect the connection. All future requests that are multiplexed onto that connection also time out, because the underlying connection is basically dead: it cannot make progress.

If I force the Go client to HTTP/1.1 (with keep-alives enabled), then when the Timeout hits, the connection is automatically collected, which seems to be much better behaviour. I think in my case, for safety, it makes sense to disable the HTTP/2 client for now (see the sketch below). I wonder whether, as a safety feature, it would make sense for the Go HTTP/2 client to abandon a connection for future requests if it has had a timeout, or whether this would be too intrusive? Conjecturing, it would also be a useful safety mechanism for any client sitting behind a stupid firewall that drops RSTs, if that brokenness still happens these days.
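A minimal sketch of one way to disable the HTTP/2 client (an assumption on my part, not the original snippet: per the net/http documentation, a non-nil, empty TLSNextProto map disables the bundled HTTP/2 support; the 30s timeout and URL are placeholders):

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"time"
)

func newHTTP1Client() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			// A non-nil, empty TLSNextProto map tells net/http not to
			// negotiate HTTP/2, so the client speaks HTTP/1.1 only.
			TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
		},
	}
}

func main() {
	resp, err := newHTTP1Client().Get("https://example.com/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Proto) // expect "HTTP/1.1"
}
```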
Thank you for the report @jim-minter! To help get us to the last mile and make this issue actionable for Brad and myself, here is some seed code to help with the repro:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"os"
	"os/exec"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	fmt.Printf("Our PID: %d\n", os.Getpid())

	stallResp := make(chan bool)
	cst := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Printf("Request proto: %s\n", r.Proto)
		<-stallResp
		w.Write([]byte("Done here"))
	}))
	if err := http2.ConfigureServer(cst.Config, new(http2.Server)); err != nil {
		log.Fatalf("Failed to configure HTTP/2 server: %v", err)
	}
	cst.TLS = cst.Config.TLSConfig
	cst.StartTLS()
	defer cst.Close()

	tr := &http.Transport{TLSClientConfig: cst.Config.TLSConfig}
	if err := http2.ConfigureTransport(tr); err != nil {
		log.Fatalf("Failed to configure HTTP/2 transport: %v", err)
	}
	tr.TLSClientConfig.InsecureSkipVerify = true
	client := &http.Client{
		Timeout:   400 * time.Millisecond,
		Transport: tr,
	}

	req, _ := http.NewRequest("GET", cst.URL, nil)
	res, err := client.Do(req)
	if err == nil {
		res.Body.Close()
		log.Fatal("Unexpectedly didn't get an error")
	}
	ne := err.(net.Error)
	if !ne.Timeout() {
		log.Fatal("Failed to get a timeout")
	}
	close(stallResp)

	// Now finally print the summary from lsof every 3 seconds.
	for {
		output, _ := lsofUs()
		fmt.Printf("%s\n\n", output)
		<-time.After(3 * time.Second)
	}
}

func lsofUs() ([]byte, error) {
	return exec.Command("lsof", "-p", fmt.Sprintf("%d", os.Getpid())).CombinedOutput()
}
```

It contains a server and a client, both in HTTP/2 mode. Please tweak it to make it representative of the failures you reported; every 3 seconds it'll report something like:

```
main 34481 emmanuelodeke 8u IPv4 0xc0636412fd530a8f 0t0 TCP localhost:49402->localhost:49401 (ESTABLISHED)
main 34481 emmanuelodeke 9u IPv4 0xc06364130796210f 0t0 TCP localhost:49401->localhost:49402 (ESTABLISHED)
```

which you can then use to verify your report and which we can also use to diagnose the issue. Thank you!
@odeke-em, thanks for spurring me into action. Here's a reproducer that works on Linux (note: it uses `sudo iptables`):

```go
package main

import (
	"crypto/tls"
	"log"
	"net"
	"net/http"
	"net/http/httptest"
	"os/exec"
	"time"
)

// configure returns a TLS test server and corresponding client. If http2 is
// set, the server is configured to enable HTTP/2; otherwise HTTP/1.x will be
// negotiated.
func configure(dialer func(network, addr string) (net.Conn, error), http2 bool) (s *httptest.Server, c *http.Client) {
	s = httptest.NewUnstartedServer(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {}))
	if http2 {
		s.TLS = &tls.Config{
			NextProtos: []string{"h2"},
		}
	}
	s.StartTLS()

	c = s.Client()
	c.Timeout = 500 * time.Millisecond
	c.Transport.(*http.Transport).Dial = dialer
	c.Transport.(*http.Transport).ForceAttemptHTTP2 = true

	return
}

// doRequest makes a GET request from the client to the server and logs the
// first line of the response.
func doRequest(s *httptest.Server, c *http.Client) {
	resp, err := c.Get(s.URL)
	if err != nil {
		log.Printf("doRequest: %v", err)
		return
	}
	log.Printf("doRequest: %s %s", resp.Proto, resp.Status)

	err = resp.Body.Close()
	if err != nil {
		log.Printf("doRequest: %v", err)
	}
}

// iptables manages a local iptables rule blackholing localhost traffic on a
// given TCP port. It simulates bad behaviour by Azure load balancers, which by
// default don't send a TCP RST for incoming packets on a connection they don't
// recognise. If command is "-I", the rule is inserted; if it is "-D", it is
// deleted. For more details, see:
// https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-reset
func iptables(command, port string) error {
	log.Printf("iptables: %s %s", command, port)
	return exec.Command("sudo", "iptables", command, "INPUT", "-s", "127.0.0.1", "-p", "tcp", "--sport", port, "-j", "DROP").Run()
}

func test(http2 bool) error {
	var localPort string

	// dialer wraps net.Dial and leaks the local port of the connection it
	// has just dialed into localPort.
	dialer := func(network, addr string) (net.Conn, error) {
		log.Printf("dialing %s", addr)
		conn, err := net.Dial(network, addr)
		if err != nil {
			return nil, err
		}
		_, localPort, err = net.SplitHostPort(conn.(*net.TCPConn).LocalAddr().String())
		if err != nil {
			return nil, err
		}
		return conn, nil
	}

	s, c := configure(dialer, http2)

	// first request is expected to succeed (no reason for it to fail)
	doRequest(s, c)

	// now simulate the underlying connection being broken
	defer iptables("-D", localPort)
	err := iptables("-I", localPort)
	if err != nil {
		return err
	}

	// second request is expected to time out because the underlying
	// connection is busted
	doRequest(s, c)

	// with HTTP/1.x, the third and subsequent requests succeed because the
	// broken underlying connection is discarded and a new one is dialed.
	// with HTTP/2, the third and subsequent requests fail because the
	// underlying connection is *not* discarded. Very approximately, I think
	// that HTTP/2 should not make new requests on an underlying connection
	// if it has seen a request time out on that connection.
	doRequest(s, c)
	doRequest(s, c)

	return nil
}

func main() {
	log.Print("testing HTTP/1.x")
	err := test(false)
	if err != nil {
		panic(err)
	}

	log.Print("testing HTTP/2")
	err = test(true)
	if err != nil {
		panic(err)
	}
}
```

The output shows the behaviour described in the comments: with HTTP/1.x, the third and subsequent requests succeed after the blackhole is inserted, while with HTTP/2 they keep timing out.
Thanks!
Awesome, thank you for the complete repro @jim-minter! I'll update this issue to match the problem and also kindly ask our HTTP/2 experts. As reported in this issue, the HTTP/1 stack already closes connections on such errors, but it assumes that the HTTP/2 stack will do the same, as per lines 2576 to 2582 in c170b14.
#30702 seems related. I've observed a similar situation in which a dead connection keeps being tried for new requests when using HTTP/2, but this is not the case if I disable HTTP/2; for example, if I block the server IP with iptables after establishing the connection. This means my requests can fail for minutes at a time with HTTP/2, because the cached dead connection continues to be used. For my application I'm thinking of disabling HTTP/2 for the time being because of this.
cc @fraenkel |
My best recommendation is to enable the PingTimeout to detect connections that have been improperly torn apart. |
Cool, thank you! Yes, enabling that does help this situation. I set ReadIdleTimeout to enable that functionality. |
Yes. That will enable the health checks. |
Configure the HTTP2 transport to handle broken/idle connections. Two timeout values were added, ReadIdleTimeout and PingTimeout, which are interrelated. ReadIdleTimeout enables a ping health check on the configured interval whenever no frame has been received on the HTTP/2 connection. PingTimeout configures the total amount of time to wait for a ping response before closing the connection. Both of these timeout values help clean up broken or idle connections, forcing the client to reconnect. This is the current known workaround for the following Go issue: golang/go#36026
In go1.14.x we were experiencing the same TCP connection leak. We use a shared http.Client with a rate limiter to make hundreds of simultaneous HTTP/1.1 requests. Servers accepting requests have a hard timeout of 5m, as does the shared http.Client Transport config. (Accepting servers are also behind multiple proxy layers.)
@ibejohn818 my recommendation would be to upgrade to a supported version of Go, 1.18.1 being the recommended version at the time of writing. If that doesn’t address your issue I suggest creating a new issue with details of how to reproduce the problem you are seeing. |
I believe we encountered this issue last week, and it caused an outage. We "fixed" the outage by restarting the Go processes, which obviously forced them to reconnect. We have a workaround in place for now. I haven't started trying to reproduce it yet. We were using Go 1.18.x and instrumented the code with Datadog and the Go httptrace package. Thanks to the tracing, we can see that for each failed HTTP request to the ALB, the round tripper recorded several pairs of GetConn and GotConn traces. On each of the GotConn traces, the connection was reported as reused.
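For reference, a minimal sketch of that kind of instrumentation with net/http/httptrace (the URL and log format here are placeholders; GotConnInfo.Reused reports whether the transport handed back a cached connection):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httptrace"
)

func main() {
	trace := &httptrace.ClientTrace{
		GetConn: func(hostPort string) {
			log.Printf("GetConn: %s", hostPort)
		},
		GotConn: func(info httptrace.GotConnInfo) {
			// Reused is true when the transport reused a cached
			// connection instead of dialing a new one.
			log.Printf("GotConn: reused=%v wasIdle=%v", info.Reused, info.WasIdle)
		},
	}
	req, _ := http.NewRequest("GET", "https://example.com/", nil) // placeholder URL
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```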
I have noticed that HTTP/2 tends to be unreliable in certain circumstances due to bugs in Go's HTTP/2 library. See the following upstream issues: golang/go#36026 golang/go#59690 Let's add a configuration option here, so that HTTP/2 can be disabled on an individual basis if needed.
For whoever is facing the same issue of the HTTP/2 TCP connection not getting closed properly, this is how you can enable the ping health check to close the bad TCP connection:
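A minimal sketch (assuming `golang.org/x/net/http2`; `ConfigureTransports` upgrades a standard transport and returns the underlying HTTP/2 transport, and the timeout values and URL here are arbitrary):

```go
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func newClient() *http.Client {
	t1 := &http.Transport{}
	t2, err := http2.ConfigureTransports(t1) // t1 now speaks HTTP/2; t2 exposes its knobs
	if err != nil {
		log.Fatal(err)
	}
	// Send a ping frame if no frame has been received on a connection
	// for this long...
	t2.ReadIdleTimeout = 10 * time.Second
	// ...and close the connection if the ping isn't answered in time,
	// so the next request dials a fresh connection.
	t2.PingTimeout = 5 * time.Second
	return &http.Client{Transport: t1}
}

func main() {
	resp, err := newClient().Get("https://example.com/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Proto, resp.Status)
}
```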
**Description:** This PR introduces options to enable the `http/2` health check by exposing `HTTP2ReadIdleTimeout` and `HTTP2PingTimeout`. The golang issues are: golang/go#59690 golang/go#36026

In summary, if due to an environmental issue the underlying TCP connection used by the http/2 client in the exporter becomes unstable/unusable/unreachable, then unlike http/1, the http/2 client does not forcibly close the connection and redial a new one; instead it keeps using it for ~15 minutes (the default value of the OS `tcp_retries2`) until the OS cleans it up and a new TCP connection gets established. From the OTEL user's perspective, one will see a spike in export failures/timeouts for ~15 minutes; this will happen for every connection that got into a bad state, and after 15 minutes things will recover until the next time a TCP connection gets into a bad state.

**Testing:**
- Run OTEL with one of the exporters that uses the HTTP/2 client, for example the `signalfx` exporter
- For simplicity, use a single pipeline/exporter
- In a different shell, run this to watch the TCP state of the established connection:
```
while (true); do date; sudo netstat -anp | grep -E '<endpoint_ip_address(es)>' | sort -k 5; sleep 2; done
```
- From the netstat output, take note of the source port and the source IP address
- Replace the placeholders from the previous step: `sudo iptables -A OUTPUT -s <source_IP> -p tcp --sport <source_Port> -j DROP`
- Note how the OTEL exporter's exports start timing out

Expected Result:
- A new connection should be established, similarly to http/1, and exports should succeed

Actual Result:
- The exports keep failing for ~15 minutes, or for whatever the OS `tcp_retries2` is configured to
- After 15 minutes, a new TCP connection is created and exports start working

**Documentation:** Readme is updated with new settings

Signed-off-by: Dani Louca <dlouca@splunk.com>
If a request times out when using Azure/SLB, the connection does not get cleaned up and subsequent calls can get stuck re-using this connection. This adds options to disable HTTP2 to work better with Azure SLB. See golang/go#36026
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

I believe so.

What operating system and processor architecture are you using (`go env`)?

What did you do?
With a simple client like this, and using it to make a dumb GET request to some server:

```go
c := &http.Client{
	Timeout: 30 * time.Second,
}
```
If the server doesn't respond within 30 seconds, the client gets a timeout error as expected. But Go does not close the underlying TCP connection to the server, so if the client keeps retrying, its subsequent requests unfortunately get queued on the same TCP connection.

If the server doesn't close the connection, or perhaps if there's an underlying TCP (e.g. connectivity) break which means the client and server aren't actually communicating, the client doesn't redial to establish a new connection, at least perhaps until the local kernel socket send buffer is full, or there's a TCP keepalive timeout, or something (conjecture).
You can see this when it happens because netstat continues to show the connection, and you can see its local Send-Q figure incrementing on each retry.
Setting DisableKeepAlives works around this problem, but it's a pretty big hammer.
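For illustration, the client from above with the hammer applied (a sketch; DisableKeepAlives is the standard http.Transport field, everything else mirrors the snippet above):

```go
c := &http.Client{
	Timeout: 30 * time.Second,
	Transport: &http.Transport{
		// Each request dials a fresh connection, so a blackholed
		// connection cannot poison subsequent requests.
		DisableKeepAlives: true,
	},
}
```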
IMO it would make more sense for Go to abandon and close the underlying TCP connection if an HTTP timeout has occurred on it.
What did you expect to see?
I'd like to see Go net/http close underlying TCP connections if the HTTP request they are running times out.
What did you see instead?
It keeps the underlying TCP connection open.