-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net/http: Transport wastes TLS handshake progress #50984
Comments
sounds related to #50657 somehow |
CC @neild |
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as off-topic.
This comment was marked as duplicate.
This comment was marked as duplicate.
I think we can't use a dedicated context for the TLS handshake, (at least not with a simple code modification). If user doesn't use idle pool ( Maybe we can check the connection pool for idle connection, if there're any idle connections, we can skip the dialing process, Another solution is adding a |
Any updates here? It seems like a few things that use this as a core dependency are experiencing intermittent TLS handshake errors (i am one of those people) |
This comment was marked as duplicate.
This comment was marked as duplicate.
Is this the same thing as #59017, which was fixed in 1.23? |
any update? |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes, this is present in the latest stable release, go1.17.6, and the latest tagged beta. It is not present in go1.16.13. It was introduced by the work for #32406.
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
An application I work with uses net/http to make outbound requests over HTTP/1.1+TLS. It uses Context values to control the lifecycle of those requests and to set timeouts. It cleans up the Context values (to avoid leaking timers, as recommended by
go vet
) when the outbound HTTPS call is complete.Here's an overview of the most relevant code:
I've included a full reproducer at the bottom, also available at https://go.dev/play/p/_vg00gqqbpG although the Playground's network stack messes slightly with the results.
What did you expect to see?
I expected that code pattern (use Context to set a timeout, clean up immediately after the call) to be a best practice. I expected it to lead to efficient use of compute resources. I expected it to provide good performance (not much unnecessary latency).
What did you see instead?
After upgrading the application from from Go 1.16 to Go 1.17, the load balancer in front of the HTTPS service this application calls saw about a 4x increase in the volume of inbound TLS handshake requests, while reporting no change in the typical number of concurrent connections. A 4x increase in TLS handshake volume for this service is enough to throw off capacity planning.
The reproducer shows that canceling the Context value that was used for a request that is now complete causes harm to the connection pool. It shows that when a request finds an empty pool and so triggers creation of a new connection, the link between the request that triggered the new connection outlives the request's need for that specific connection.
When another request to the same backend completes and makes its connection available in the Transport's pool, the request that triggered a new dial may use the now-idle connection instead. That allows the request to complete even before the connection it triggered is ready.
All recent Go versions include that link when executing a TCP handshake, leading to some TCP connections being closed as soon as they become ready thereby wasting the work of the DNS lookup and TCP handshake. New in Go 1.17 is an additional link to the TLS handshake, allowing that more-expensive work to be discarded. From the reproducer:
With Go 1.17+, the idle pool remains smaller than what the application needs in the steady state. More requests find an empty pool and need to wait for a connection to become available (generally winning in the race against a freshly-dialed connection). The reproducer shows an increase to the average request duration, though I haven't measured that in a production setting. (TLS handshakes are expensive, not slow.)
Reproducer results
Reproducer code,
dial_test.go
The text was updated successfully, but these errors were encountered: