Issues with client termination of H2 CONNECT streams #3652
That is surprising. cc @nox, who worked on this previously.
Ah, I think I got things a little mixed up. In #3647, I sent a test and a fix.
That being said, I do still see the issue in our application despite dropping everything; I am just unable to reproduce it in a unit test.
Ok I think I figured it out and put up two PRs:
See hyperium/hyper#3652. What I have found is that the final reference to a stream being dropped after `maybe_close_connection_if_no_streams` but before `inner.poll()` completes can leave the connection dangling forever without making any forward progress. No streams/references are alive, but the connection is not complete and never wakes up again. This seems like a classic TOCTOU race condition. In this fix, I check again at the end of poll and, if this state is detected, wake up the task again.

With the test in hyperium/hyper#3655, on my machine, it fails about 5% of the time:

```
1876 runs so far, 100 failures (94.94% pass rate). 95.197349ms avg, 1.097347435s max, 5.398457ms min
```

With that PR, the test is 100% reliable:

```
64010 runs so far, 0 failures (100.00% pass rate). 44.484057ms avg, 121.454709ms max, 1.872657ms min
```

Note: we have also reproduced this using `h2` directly, outside of `hyper`, which is what gives me confidence this issue lies in `h2` and not `hyper`.
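To make the shape of the race concrete, here is a minimal, hypothetical sketch of a connection-driver future using the check/poll/re-check-and-wake ordering described above. The struct and field names are invented for illustration and are not the actual `h2` source:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

// Hypothetical stand-in for the h2 connection driver.
struct ConnDriver {
    // Count of live stream handles, shared with user-facing handles.
    stream_refs: Arc<AtomicUsize>,
}

impl Future for ConnDriver {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut();

        // Check: if no stream handles remain, close the connection.
        if this.stream_refs.load(Ordering::SeqCst) == 0 {
            return Poll::Ready(());
        }

        // ... drive connection I/O here (`inner.poll()` in h2) ...
        // The *last* stream handle may be dropped concurrently at this
        // point, after the check above has already passed (the TOCTOU).

        // Re-check after polling: if the state changed mid-poll, self-wake
        // so the close branch above runs on the next poll, instead of the
        // task sleeping forever with no one left to wake it.
        if this.stream_refs.load(Ordering::SeqCst) == 0 {
            cx.waker().wake_by_ref();
        }
        Poll::Pending
    }
}
```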
Version
Hyper 1.3
Platform
Description
We are seeing a few issues around using H2 CONNECT. Our application uses Tokio, Rustls, and Hyper to communicate between a client and a server built on the same stack. We sometimes multiplex multiple streams over one TCP connection (though the issue occurs regardless of multiplexing).
The first issue that popped up was leaked connections. This was debugged, and an (attempted) fix was opened in #3647; more details, including why this likely went undetected before, are there.
With that fix, everything was working fine. However, a later refactor in our application broke things again: we started dropping the `SendRequest` before we were finished with IO operations on `Upgraded` (before, we dropped the `SendRequest` after). This causes the stream/connection to be closed unexpectedly. It can be reproduced easily by moving the `drop(client)` in #3647 to before the write/read operations. This is easy to work around, but it's not documented, and @seanmonstar suggested on Discord that this was not expected.
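For clarity, a hedged sketch of that ordering using hyper 1.x's `client::conn::http2` API. The address, the tunneled bytes, and the use of the `hyper-util` and `http-body-util` helper crates are assumptions for illustration, not taken from the issue or #3647:

```rust
use http_body_util::Empty;
use hyper::body::Bytes;
use hyper::client::conn::http2;
use hyper::Request;
use hyper_util::rt::{TokioExecutor, TokioIo};
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

async fn connect_tunnel() -> Result<(), Box<dyn std::error::Error>> {
    let tcp = TcpStream::connect("127.0.0.1:8080").await?;
    let (mut client, conn) =
        http2::handshake(TokioExecutor::new(), TokioIo::new(tcp)).await?;
    // Drive the connection on its own task.
    tokio::spawn(conn);

    // H2 CONNECT request; the stream becomes a byte tunnel on success.
    let req = Request::connect("example.internal:443").body(Empty::<Bytes>::new())?;
    let res = client.send_request(req).await?;
    let mut upgraded = TokioIo::new(hyper::upgrade::on(res).await?);

    // Broken ordering (what our refactor did): dropping the SendRequest
    // here closes the stream before the tunneled I/O below completes.
    // drop(client);

    upgraded.write_all(b"ping").await?;
    let mut buf = [0u8; 4];
    upgraded.read_exact(&mut buf).await?;

    // Working ordering: keep the SendRequest alive until the I/O is done.
    drop(client);
    Ok(())
}
```

Keeping the `SendRequest` alive for the lifetime of the tunnel is the easy workaround mentioned above.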