Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: TestServerKeepAlivesEnabled_h2 test flake #18701

Closed
minux opened this issue Jan 18, 2017 · 17 comments
Closed

net/http: TestServerKeepAlivesEnabled_h2 test flake #18701

minux opened this issue Jan 18, 2017 · 17 comments
Milestone

Comments

@minux
Copy link
Member

minux commented Jan 18, 2017

A recent trybot run showed this failure (CL is https://go-review.googlesource.com/c/35352/2):

https://storage.googleapis.com/go-build-log/ad057e0a/windows-386-gce_7ecf210b.log

--- FAIL: TestServerKeepAlivesEnabled_h2 (0.08s)
	clientserver_test.go:50: Get https://127.0.0.1:50159: read tcp 127.0.0.1:50161->127.0.0.1:50159: wsarecv: An established connection was aborted by the software in your host machine.
FAIL
FAIL	net/http	6.180s

/cc @bradfitz

@bradfitz bradfitz added this to the Go1.9Maybe milestone Jan 18, 2017
@bradfitz
Copy link
Contributor

I see nothing suspicious or racy in the test.

I'm afraid I won't be able to help on this one, just like I was unable to help last time in #14061.

/cc @alexbrainman

@alexbrainman
Copy link
Member

I run

go test -v -short -run=TestServerKeepAlivesEnabled_h2 net/http

in a loop against https://go-review.googlesource.com/#/c/35352/2 on my both windows xp and windows 7. It didn't fail even once.

Alex

@alexbrainman
Copy link
Member

Another failure from dashboard https://build.golang.org/log/cf6c2853214fb8d7a7667e0872d3bc3b8520b56d

Alex

@josharian
Copy link
Contributor

@bradfitz
Copy link
Contributor

@mdempsky
Copy link
Contributor

@bradfitz bradfitz modified the milestones: Go1.9Maybe, Go1.10 Jul 20, 2017
@mvdan
Copy link
Member

mvdan commented Aug 29, 2017

@mvdan
Copy link
Member

mvdan commented Aug 29, 2017

Err, I just noticed this was supposed to be a windows issue, and that the error is not exactly the same. Should I open a separate issue?

@tombergan
Copy link
Contributor

@mvdan yep, that looks like a separate issue

@bcmills
Copy link
Contributor

bcmills commented Sep 13, 2017

One more: windows, but the error message is closer to this one than #21724.
https://storage.googleapis.com/go-build-log/622d6748/windows-386-2008_eeda70f1.log

@alexbrainman
Copy link
Member

I applied this diff:

diff --git a/src/net/fd_windows.go b/src/net/fd_windows.go
index c2156b2..96f68b0 100644
--- a/src/net/fd_windows.go
+++ b/src/net/fd_windows.go
@@ -125,12 +125,20 @@ func (fd *netFD) connect(ctx context.Context, la, ra syscall.Sockaddr) (syscall.
 		}
 	}
 	// Refresh socket properties.
+	fd.debug("connect", "")
 	return nil, os.NewSyscallError("setsockopt", syscall.Setsockopt(fd.pfd.Sysfd, syscall.SOL_SOCKET, syscall.SO_UPDATE_CONNECT_CONTEXT, (*byte)(unsafe.Pointer(&fd.pfd.Sysfd)), int32(unsafe.Sizeof(fd.pfd.Sysfd))))
 }
 
+func (fd *netFD) debug(cmd, msg string) {
+	println(&fd.pfd, ":", cmd, ":", fd.pfd.Sysfd, ":", msg)
+}
+
 func (fd *netFD) Close() error {
+	fd.debug("close", "about to close")
 	runtime.SetFinalizer(fd, nil)
-	return fd.pfd.Close()
+	err := fd.pfd.Close()
+	fd.debug("close", "closed")
+	return err
 }
 
 func (fd *netFD) shutdown(how int) error {
@@ -148,7 +156,13 @@ func (fd *netFD) closeWrite() error {
 }
 
 func (fd *netFD) Read(buf []byte) (int, error) {
+	fd.debug("read", "about to read")
 	n, err := fd.pfd.Read(buf)
+	if err != nil {
+		fd.debug("read", "err="+err.Error())
+	} else {
+		fd.debug("read", "read "+itoa(n))
+	}
 	runtime.KeepAlive(fd)
 	return n, wrapSyscallError("wsarecv", err)
 }
@@ -220,6 +234,7 @@ func (fd *netFD) accept() (*netFD, error) {
 	rsa, _ := rrsa.Sockaddr()
 
 	netfd.setAddr(netfd.addrFunc()(lsa), netfd.addrFunc()(rsa))
+	fd.debug("accept", "")
 	return netfd, nil
 }

to 538b4ba and I can see this:

C:\Users\Alex>u:\test -test.v -test.run=TestServerKeepAlivesEnabled_h2
=== RUN   TestServerKeepAlivesEnabled_h2
0xc042094b40 : connect : 256 :
0xc042094900 : accept : 244 :
0xc042144000 : read : 312 : about to read
0xc042094b40 : read : 256 : about to read
0xc042144000 : read : 312 : read 160
0xc042144000 : read : 312 : about to read
0xc042094b40 : read : 256 : read 799
0xc042094b40 : read : 256 : about to read
0xc042144000 : read : 312 : read 93
0xc042094b40 : read : 256 : read 51
0xc042144000 : read : 312 : about to read
0xc042094b40 : read : 256 : about to read
0xc042144000 : read : 312 : read 167
0xc042094b40 : read : 256 : read 62
0xc042144000 : read : 312 : about to read
0xc042144000 : close : 312 : about to close
0xc042144000 : close : -1 : closed
0xc042144000 : read : 312 : read 38
0xc042094b40 : read : 256 : about to read
0xc042094b40 : read : 256 : read 239
0xc042094b40 : close : 256 : about to close
0xc0421fe000 : connect : 384 :
0xc042094b40 : close : -1 : closed
0xc0421fe000 : read : 384 : about to read
0xc042094900 : accept : 244 :
0xc042200000 : read : 344 : about to read
0xc042200000 : read : 344 : read 160
0xc042200000 : read : 344 : about to read
0xc0421fe000 : read : 384 : read 799
0xc0421fe000 : read : 384 : about to read
0xc042200000 : read : 344 : read 93
0xc0421fe000 : read : 384 : read 51
0xc042200000 : read : 344 : about to read
0xc0421fe000 : read : 384 : about to read
0xc042200000 : read : 344 : read 167
0xc0421fe000 : read : 384 : read 62
0xc042200000 : close : 344 : about to close
0xc0421fe000 : read : 384 : about to read
0xc042200000 : close : -1 : closed
0xc042200000 : read : 344 : about to read
0xc042200000 : read : -1 : err=use of closed network connection
0xc0421fe000 : read : 384 : err=An existing connection was forcibly closed by the remote host.
0xc0421fe000 : close : 384 : about to close
0xc042094900 : close : 244 : about to close
0xc0421fe000 : close : -1 : closed
0xc042094900 : close : -1 : about to close
0xc042094900 : close : -1 : closed
0xc042094900 : close : 244 : closed
--- FAIL: TestServerKeepAlivesEnabled_h2 (0.34s)
        clientserver_test.go:50: Get https://127.0.0.1:53283: read tcp 127.0.0.1:53285->127.0.0.1:53283: wsarecv: An existing connection was forcibly closed by the remote host.
FAIL

Note how there are two similar connections here. And during first connection client receives 799, 51, 62 and 239 bytes. While during second connection it fails to receive last 239 bytes.

I suspect what is happening here, the server sends these 239 bytes and closes connection immediately. That forces connection closed before client can read the data. If server wants to make sure that client gets the data, it must use CloseWrite followed by multiple Read until connection is closed.

Alex

@tombergan tombergan self-assigned this Oct 14, 2017
@tombergan
Copy link
Contributor

I suspect what is happening here, the server sends these 239 bytes and closes connection immediately.

You are right. I ran the test on a Windows box with GODEBUG=http2debug=2. After receiving a request, the server replies with a HEADERS, DATA, and GOAWAY, then closes the connection. The client's ReadFrame fails immediately, before receive any of those frames.

I understand why this flakes on Windows: The flake depends on whether or not the server reads the client's WINDOW_UPDATE frame before or after the server sends GOAWAY. If the WINDOW_UPDATE is not read when the connection is closed, then the kernel closes the connection with RST (instead of FIN) due to the non-empty receive buffer. I reduced this to the following repro:
https://play.golang.org/p/IEGfOLUQyP

I don't understand why TestServerKeepAlivesEnabled_h2 flakes on Windows but not Linux. Given that I have not seen a Linux flake, I expected the above program to fail on Windows but not Linux, but it fails deterministically on both Windows and Linux. Interestingly, the above program does not fail on play.golang.org. Perhaps docker is doing something weird.

In any case, the fix is simple. Currently, we close the connection immediately if we've responded to all streams that were open before we sent GOAWAY. Instead, we should keep the connection open for a short period of time so we can drain any packets from the kernel receive buffer. Ideally we'd delay about 1 RTT, under the assumption that the client won't send more frames after seeing GOAWAY, but getting a precise RTT measurement is annoying. I think it's sufficient to wait about 1 second (which we are already doing in a similar case).

@alexbrainman
Copy link
Member

I don't understand why TestServerKeepAlivesEnabled_h2 flakes on Windows but not Linux

I really wouldn't know. But you said "If the WINDOW_UPDATE is not read when the connection is closed, then the kernel closes the connection with RST", maybe WINDOW_UPDATE is always read before connection is closed on Linux (for one reason or another). Is it easy enough to make server close connection before reading WINDOW_UPDATE on Linux (just to see what happens)?

Interestingly, the above program does not fail on play.golang.org. Perhaps docker is doing something weird.

I don't think play.golang.org is running docker. Is it?

In any case, the fix is simple.

If you say so. Happy to leave it for you.

Alex

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/71372 mentions this issue: http2: always delay closing the connection after sending GOAWAY

gopherbot pushed a commit to golang/net that referenced this issue Oct 20, 2017
Currently, we close the connection immediately after sending a GOAWAY
frame if all outstanding responses have been completely sent. However,
the client may have had requests in-flight at that time which have been
queued in the kernel receive buffer. On both Windows and Linux, if the
connection is close()'d when the receive buffer is not empty, the kernel
sends RST. This has the effect of aborting both sides of the connection,
meaning the client may not actually receive all responses that were sent
before the GOAWAY.

Instead, we should delay calling close() until after the receive buffer
has been drained. We don't want to delay indefinitely, which means we
need some kind of timeout. Ideally that timeout should be about 1 RTT +
epsilon, under the assumption that the client will not send any more
frames after receiving the GOAWAY. However, 1 RTT is difficult to
measure. It turns out we were already using a 1 second delay in other
cases, so we reuse that same delay here.

Note that we do not call CloseWrite() to half-close the underlying TLS
connection. This seems unnecessary -- GOAWAY is effectively a half-close
at the HTTP/2 level.

Updates golang/go#18701 (fixes after it's bundled into net/http)

Change-Id: I4d68bada6369ba95e5db02afe6dfad0a393c0334
Reviewed-on: https://go-review.googlesource.com/71372
Run-TryBot: Tom Bergan <tombergan@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Joe Tsai <thebrokentoaster@gmail.com>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/75210 mentions this issue: net/http: update bundled http2

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/88319 mentions this issue: [release-branch.go1.9] net/http: update bundled http2

c3mb0 pushed a commit to c3mb0/net that referenced this issue Apr 2, 2018
Currently, we close the connection immediately after sending a GOAWAY
frame if all outstanding responses have been completely sent. However,
the client may have had requests in-flight at that time which have been
queued in the kernel receive buffer. On both Windows and Linux, if the
connection is close()'d when the receive buffer is not empty, the kernel
sends RST. This has the effect of aborting both sides of the connection,
meaning the client may not actually receive all responses that were sent
before the GOAWAY.

Instead, we should delay calling close() until after the receive buffer
has been drained. We don't want to delay indefinitely, which means we
need some kind of timeout. Ideally that timeout should be about 1 RTT +
epsilon, under the assumption that the client will not send any more
frames after receiving the GOAWAY. However, 1 RTT is difficult to
measure. It turns out we were already using a 1 second delay in other
cases, so we reuse that same delay here.

Note that we do not call CloseWrite() to half-close the underlying TLS
connection. This seems unnecessary -- GOAWAY is effectively a half-close
at the HTTP/2 level.

Updates golang/go#18701 (fixes after it's bundled into net/http)

Change-Id: I4d68bada6369ba95e5db02afe6dfad0a393c0334
Reviewed-on: https://go-review.googlesource.com/71372
Run-TryBot: Tom Bergan <tombergan@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Joe Tsai <thebrokentoaster@gmail.com>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/147419 mentions this issue: crypto/tls: implement TLS 1.3 middlebox compatibility mode

gopherbot pushed a commit that referenced this issue Nov 12, 2018
Looks like the introduction of CCS records in the client second flight
gave time to s_server to send NewSessionTicket messages in between the
client application data and close_notify. There seems to be no way of
turning NewSessionTicket messages off, neither by not sending a
psk_key_exchange_modes extension, nor by command line flag.

Interleaving the client write like that tickled an issue akin to #18701:
on Windows, the client reaches Close() before the last record is drained
from the send buffer, the kernel notices and resets the connection,
cutting short the last flow. There is no good way of synchronizing this,
so we sleep for a RTT before calling close, like in CL 75210. Sigh.

Updates #9671

Change-Id: I44dc1cca17b373695b5a18c2741f218af2990bd1
Reviewed-on: https://go-review.googlesource.com/c/147419
Run-TryBot: Filippo Valsorda <filippo@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Adam Langley <agl@golang.org>
@golang golang locked and limited conversation to collaborators Nov 8, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants