Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http: apparent deadlock in TestServerWriteTimeout on plan9-amd64-0intro #57114

Closed
bcmills opened this issue Dec 6, 2022 · 9 comments
Closed
Assignees
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Plan9
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Dec 6, 2022

#!watchflakes
post <- builder ~ `^plan9` && pkg == "net/http" && test == "TestServerWriteTimeout"

As of CL 449935, the plan9-amd64-0intro builder went from failing something like 25% of the time to failing nearly 100% of the time.

The vast majority of the current failures are test timeouts running net/http, and they all seem to involve goroutines parked in net/http.(*http2ClientConn).RoundTrip via net/http_test.testServerWriteTimeout.

I think this is probably because the plan9 netpoll implementation does not support (or does not fully support?) SetDeadline.

(attn @golang/plan9; CC @neild)

@bcmills bcmills added OS-Plan9 NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 6, 2022
@bcmills bcmills added this to the Backlog milestone Dec 6, 2022
@0intro
Copy link
Member

0intro commented Dec 6, 2022

Deadlines should be supported. What is not supported is half closing of TCP connections (closeWrite, closeRead, see issue #17906), but I don't believe it's related here.

I'm not sure why this issue is happening on the amd64 builder only and not on the 386 and arm builders. I'll investigate.

@bcmills
Copy link
Contributor Author

bcmills commented Jan 5, 2023

I'm not sure why this issue is happening on the amd64 builder only and not on the 386 and arm builders. I'll investigate.

Turns out that it does occasionally happen on the plan9-arm builder too:
https://build.golang.org/log/c3901bbe3d4681c4026a08664447fe2629aeebbd

@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
post <- builder ~ `^plan9` && pkg == "net/http" && test == "TestServerWriteTimeout"
2023-01-20 18:12 plan9-arm go@2507e789 net/http.TestServerWriteTimeout (log)
http.test 35687: warning: process exceeds 100 file descriptors
--- FAIL: TestServerWriteTimeout (0.00s)
    --- FAIL: TestServerWriteTimeout/h1 (0.83s)
        clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:55647: write tcp 127.0.0.1:53294->127.0.0.1:55647: i/o timeout
        serve_test.go:784: Get error, retrying: Get "https://127.0.0.1:53294": EOF
        clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:35055: write tcp 127.0.0.1:45038->127.0.0.1:35055: i/o timeout
        serve_test.go:784: Get error, retrying: Get "https://127.0.0.1:45038": EOF
        clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:34693: write tcp 127.0.0.1:58685->127.0.0.1:34693: i/o timeout
        serve_test.go:784: Get error, retrying: Get "https://127.0.0.1:58685": EOF
        clientserver_test.go:210: server log: http: TLS handshake error from 127.0.0.1:54889: write tcp 127.0.0.1:47544->127.0.0.1:54889: i/o timeout
        serve_test.go:784: Get error, retrying: Get "https://127.0.0.1:47544": EOF
        serve_test.go:797: server timed out writing request body: got err client disconnected; want os.ErrDeadlineExceeded

watchflakes

@millerresearch millerresearch self-assigned this Feb 20, 2023
@millerresearch
Copy link
Contributor

The deadlock occurs when the server hits a write deadline while trying to write the response headers. The client's Get() never returns because it's waiting to read the headers. The server never exits because it's waiting for the client to close the connection.

A workaround is to break the deadlock by inserting

cst.c.Transport.(*Transport).ResponseHeaderTimeout = 3 * time.Second

just before the call to Get in testServerWriteTimeout.

Strictly speaking this is not specifically a Plan 9 issue, since the deadlock could occur on any platform given the right timing. I've simulated this on linux by hacking a deadline in the past after the server reads the client's request. That results in the same deadlock.

However in the wild this is very unlikely to happen because the many-millisecond timeout used in the test should give plenty of time for the server to invoke the handler and start writing headers. Why are we seeing it on Plan 9? Because it turns out there is a race condition in the Plan 9 implementation of internal/poll.FD, when an I/O deadline timer fires at the same time as the deadline is being reset to zero. The race can be triggered in this test if the write deadline for the TLS handshake hits just after the handshake completes and the deadline is being reset. The consequence is that the connection remains marked as "timed out", so the server's subsequent write of the headers will fail.

I think I have a fix for the poll.FD race, which I'm testing now. Looking at history, I see some other races were fixed in 2020 by https://golang.org/cl/235820 but I didn't spot this one!

@bcmills should I submit the ResponseHeaderTimeout workaround for this test, or just rely on the poll.FD fix to prevent it from happening?

@bcmills
Copy link
Contributor Author

bcmills commented Feb 21, 2023

Strictly speaking this is not specifically a Plan 9 issue, since the deadlock could occur on any platform given the right timing. I've simulated this on linux by hacking a deadline in the past after the server reads the client's request. That results in the same deadlock.

should I submit the ResponseHeaderTimeout workaround for this test, or just rely on the poll.FD fix to prevent it from happening?

Could you file a separate issue for the cross-platform bug? I don't think that setting a client-side ResponseHeaderTimeout should be necessary.

If a write timeout occurs before (or while) writing headers, I would expect the server to close the connection immediately instead of waiting for the client to close it.

@millerresearch
Copy link
Contributor

If a write timeout occurs before (or while) writing headers, I would expect the server to close the connection immediately instead of waiting for the client to close it.

Is that the right thing to do with respect to http2 persistent connections? Or should the client be able to time out and send another request on the same connection? I don't know what the http spec says.

@bcmills
Copy link
Contributor Author

bcmills commented Feb 21, 2023

I'm not sure! That's part of why I think it merits a separate issue for discussion. 😅

@millerresearch
Copy link
Contributor

Could you file a separate issue for the cross-platform bug? I don't think that setting a client-side ResponseHeaderTimeout should be necessary.

If a write timeout occurs before (or while) writing headers, I would expect the server to close the connection immediately instead of waiting for the client to close it.

Sorry, I was mistaken: my simulation wasn't realistic. In http2 deadlines (after the TLS handshake) are set by an explicit time.AfterFunc, not by setting a deadline on the underlying connection. So this is just a Plan 9 issue.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/470215 mentions this issue: internal/poll: avoid race between SetDeadline and timer expiry in Plan 9

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Plan9
Projects
Archived in project
Development

No branches or pull requests

4 participants