Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed #36700

Closed
josharian opened this issue Jan 22, 2020 · 20 comments
Labels
Documentation Issues describing a change to documentation. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@josharian
Copy link
Contributor

What version of Go are you using (go version)?

Go 1.14 beta 1

Does this issue reproduce with the latest release?

Not sure; I can't compile my program with 1.13.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/josh/Library/Caches/go-build"
GOENV="/Users/josh/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/josh"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/josh/go/1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/josh/go/1.14/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/josh/x/skiff/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/1t/n61cbvls5bl293bbb0zyypqw0000gn/T/go-build473522411=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Made a series of http requests from a Go client to a Go server. The requests were vanilla HTTP requests, using http.Get.

I lost connectivity at some point. When I regained connectivity, all subsequent http requests failed with error read: connection reset by peer. I waited quite a long time, and it never recovered.

It has happened a couple of times, but doesn't reproduce reliably (which is unsurprising, since me closing my laptop lid is not exactly a precision affair).

This looks very similar to #34978, although I don't know when exactly the connectivity failed.

Tentatively marking as Go 1.14.

cc @bradfitz @fraenkel @tombergan

@josharian josharian added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 22, 2020
@josharian josharian added this to the Go1.14 milestone Jan 22, 2020
@fraenkel
Copy link
Contributor

You were probably affected by #24138

@josharian
Copy link
Contributor Author

You were probably affected by #24138

Ugh. Thanks. This is about (not) having timeouts on the client side, right? It's unfortunate that it affects brand new requests and that it permanently breaks all new requests.

Could we add a (when-broken-only?) connection re-use timeout without breaking the compatibility promise? It seems like connection re-use decisions are entirely up to net/http.

Any suggestions for workarounds? It's tricky since I am actually doing long polling requests, where I actively want to keep connections open for hours, so I don't want to set aggressive timeouts. Would it help to call http.DefaultClient.CloseIdleConnections() whenever an http request fails? Or start using a new http.Client whenever an http request fails?

@networkimprov
Copy link

A different culprit than lack of timeout might be lack of resume-after-suspend notification. That would let us ping the server on resume and reconnect if nec. Also discussed in #36141 (comment)

Maybe unrelated, but the CL for #23459 (net.Dial TCP keepalive on by default) didn't touch net/http. Does http.Client not use net.Dial, or is this disabled for http.Client by a different CL?

@bcmills
Copy link
Contributor

bcmills commented Jan 23, 2020

Given that we don't know whether 1.13 was affected, I suspect this should not be milestoned to 1.14 at this point.

@josharian
Copy link
Contributor Author

josharian commented Jan 23, 2020

@bcmills I milestoned based on superficial similarity to #34978, which has outstanding backports to 1.13. Feel free to re-milestone, although it might be better to understand a bit more first.

@fraenkel
Copy link
Contributor

fraenkel commented Jan 24, 2020

@josharian Did you enable http2 on the Server? If so, curious what happens with http/1.1.

@josharian
Copy link
Contributor Author

I used net/http out of the box. I will run for a bit with http2 disabled and see what happens.

@networkimprov
Copy link

In 1.13 (and presumably earlier; not sure), http.DefaultTransport uses TCP keepalive. On client suspend/resume where the server drops the link (e.g. due to no traffic), subsequent client Write() disables TCP keepalive, preventing a timeout error from the last client Read() -- see #31490. I don't know whether this is preventing your http.DefaultClient from recovering, but thought I should mention it.

Setting a short keepalive helped in my case (a plain TCP/TLS context, not HTTP). Eventually I plan to disable keepalive and use a server pulse to detect a dead link.

From net/http/transport.go:

// DefaultTransport is the default implementation of Transport and is
// used by DefaultClient. It establishes network connections as needed
// and caches them for reuse by subsequent calls. It uses HTTP proxies
// as directed by the $HTTP_PROXY and $NO_PROXY (or $http_proxy and
// $no_proxy) environment variables.
var DefaultTransport RoundTripper = &Transport{
        Proxy: ProxyFromEnvironment,
        DialContext: (&net.Dialer{
                Timeout:   30 * time.Second,
                KeepAlive: 30 * time.Second,
                DualStack: true,
        }).DialContext,
        ForceAttemptHTTP2:     true,
        MaxIdleConns:          100,
        IdleConnTimeout:       90 * time.Second,
        TLSHandshakeTimeout:   10 * time.Second,
        ExpectContinueTimeout: 1 * time.Second,
}

@josharian josharian modified the milestones: Go1.14, Backlog Jan 25, 2020
@bradfitz
Copy link
Contributor

@networkimprov, that seems unrelated.

@josharian
Copy link
Contributor Author

I finally tracked this down.

It turns out that there was a code path in which in I didn't close resp.Body. Timeouts didn't help.

Although this was user error (sorry and thanks!), I do wonder whether there's a way to detect this, since it was not easy to find.

One option is a vet check that, similar to the lostcancel check, checks that resp.Body.Close gets called on all exits from a function. Another option would be some kind of check at runtime in the http package, although I don't know net/http well enough to know whether that is feasible.

@josharian
Copy link
Contributor Author

Correction: timeouts were also necessary (in addition to properly closing resp.Body).

@josharian
Copy link
Contributor Author

Closing as user error.

@bcmills
Copy link
Contributor

bcmills commented Jan 27, 2020

@josharian, could you provide some more detail about how this resulting from a missing resp.Body.Close?

The documentation for the Body field says:

The default HTTP client's Transport may not
reuse HTTP/1.x "keep-alive" TCP connections if the Body is
not read to completion and closed.

It does not say anything about what happens for HTTP/2 connections, so at the very least there seems to be a documentation issue here.

@bcmills bcmills reopened this Jan 27, 2020
@bcmills bcmills added the Documentation Issues describing a change to documentation. label Jan 27, 2020
@bcmills bcmills changed the title net/http: permanently broken connection with error "read: connection reset by peer" net/http: permanently broken connection with error "read: connection reset by peer" when response body is not closed Jan 27, 2020
@bcmills
Copy link
Contributor

bcmills commented Jan 27, 2020

This also seems like something we could detect and report, using a finalizer-based approach similar to the one described in #24739: if the Body becomes unreachable without being closed, that always indicates a caller error, and could be reported as such (perhaps as a panic, or perhaps by some other means).

@bradfitz
Copy link
Contributor

Yeah, I don't think forgetting to close resp.Body should cause anything worse than a TCP connection leak. It certainly shouldn't affect unrelated HTTP fetches.

@josharian
Copy link
Contributor Author

I'm honestly not sure. I tried to reproduce using multiple simpler ssetup and failed. There was clearly something else going on as well that closing the Body helped with. I went looking to try to understand better, but I've already spent over a full day on this, and I'm not really inclined to spend more time digging.

I do think a finalizer-based unclosed unreachable Body detection would be a good idea, since that is definitely a bug, regardless of what precise consequences it has. I suspect it'd be better to log instead of panicking--net/http already logs when it notices an ineffectual WriteHeader call, so there's some precedent for it.

alvaroaleman added a commit to alvaroaleman/ci-tools that referenced this issue Nov 17, 2020
Not closing the response body presumably claused flakyness in the
registry replacer when it tries to read a response body[0], see[1]
for why this can happen.

As this seems to be a common mistake in our codebase, this change
activates a linter for it.

[0] https://issues.redhat.com/browse/DPTP-1692
[1] golang/go#36700
@rogpeppe
Copy link
Contributor

rogpeppe commented Dec 4, 2020

Possibly related: I just saw this error message in our CI system. I couldn't reproduce it, but I can't see why closing idle connections should cause this kind of error:

Delete "http://127.0.0.1:35617/api/v2/variables/020f755c3c082000": net/http: HTTP/1.x transport connection broken: http: CloseIdleConnections called

@fraenkel
Copy link
Contributor

fraenkel commented Dec 4, 2020

The above might have just been fixed via #41600 and #42026.

@siredwin
Copy link

This makes so much sense.
I have two API calls, one has
c.Response().Header().Set("Connection", "keep-alive")
defer c.Request().Body.Close() // Added this one
and the other one
c.Response().Header().Set("Connection", "keep-alive")

Guess which one keeps giving me write: connection reset by peer?

julienduchesne added a commit to grafana/dskit that referenced this issue Oct 9, 2024
#599 went in a wrong direction, so I revert it here: it makes all `TestMultipleClients` tests flaky since not all members are necessarily joined after the updates

Instead, to fix the `TestMultipleClientsWithMixedLabelsAndExpectFailure` test, I add a couple more labeled members, that way, even if we reach 3 updates (which happened some times), we'll never get to 5 with a single member

Also, try to fix the `TestTLSServerWithLocalhostCertWithClientCertificateEnforcementUsingClientCA1` test by closing GRPC connections. Maybe this (golang/go#36700) is the issue?
julienduchesne added a commit to grafana/dskit that referenced this issue Oct 9, 2024
#599 went in a wrong direction, so I revert it here: it makes all `TestMultipleClients` tests flaky since not all members are necessarily joined after the updates

Instead, to fix the `TestMultipleClientsWithMixedLabelsAndExpectFailure` test, I add a couple more labeled members, that way, even if we reach 3 updates (which happened some times), we'll never get to 5 with a single member

Also, try to fix the `TestTLSServerWithLocalhostCertWithClientCertificateEnforcementUsingClientCA1` test by closing GRPC connections. Maybe this (golang/go#36700) is the issue?
julienduchesne added a commit to grafana/dskit that referenced this issue Oct 10, 2024
* Really fix flaky `TestMultipleClients` tests
#599 went in a wrong direction, so I revert it here: it makes all `TestMultipleClients` tests flaky since not all members are necessarily joined after the updates

Instead, to fix the `TestMultipleClientsWithMixedLabelsAndExpectFailure` test, I add a couple more labeled members, that way, even if we reach 3 updates (which happened some times), we'll never get to 5 with a single member

Also, try to fix the `TestTLSServerWithLocalhostCertWithClientCertificateEnforcementUsingClientCA1` test by closing GRPC connections. Maybe this (golang/go#36700) is the issue?

* Set `MaxIdleConnsPerHost`

* New attempt to fix this test. Could it be aborted connections?

* Retry RSTs
@seankhliao
Copy link
Member

i suppose someone can open a new issue if they can still reproduce it.

@seankhliao seankhliao closed this as not planned Won't fix, can't repro, duplicate, stale Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Issues describing a change to documentation. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

8 participants