Skip to content

Commit d9a5b8e

Browse files
authored
fix(iroh): Allow gracefully closing connections (#3170)
## Description This adds a test case that checks that we can call `Connection::close` and actually receive the close code on the other side, indicating we've gracefully closed the connection. Even though in the test we call `Endpoint::close` and wait for that, with current iroh 0.32, we don't receive the close frame. The only way I can see to make it work with iroh's 0.32 API, is to choose some time to sleep before calling `Endpoint::close`. I've also attached a revert of #3165 which fixes the test, as that makes `Endpoint::close` wait for the close frames to be acknowledged again (by calling `quinn::Endpoint::wait_idle`). ## Notes & open questions There might be other ways to fix this, I welcome feedback on that. I do believe this is the right way to fix it, though. Given that the `Endpoint::wait_idle` call has been introduced and removed *twice* already, we should finally make a choice on whether it belongs in there or not, and then add sufficient code comments to make sure whoever looks at it next is fully informed. ## Change checklist - [x] Self-review. - [x] Documentation updates following the [style guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant. - [x] Tests if relevant. - [x] All breaking changes documented.
1 parent f640e83 commit d9a5b8e

File tree

2 files changed

+97
-29
lines changed

2 files changed

+97
-29
lines changed

iroh/src/endpoint.rs

+80-21
Original file line numberDiff line numberDiff line change
@@ -1023,17 +1023,32 @@ impl Endpoint {
10231023
/// of `0` and an empty reason. Though it is best practice to close those
10241024
/// explicitly before with a custom error code and reason.
10251025
///
1026-
/// This will not wait for the [`quinn::Endpoint`] to drain connections.
1027-
///
1028-
/// To ensure no data is lost, design protocols so that the last *sender*
1029-
/// of data in the protocol calls [`Connection::closed`], and `await`s until
1030-
/// it receives a "close" message from the *receiver*. Once the *receiver*
1031-
/// gets the last data in the protocol, it should call [`Connection::close`]
1032-
/// to inform the *sender* that all data has been received.
1033-
///
1034-
/// Be aware however that the underlying UDP sockets are only closed
1035-
/// on [`Drop`], bearing in mind the [`Endpoint`] is only dropped once all the clones
1036-
/// are dropped.
1026+
/// It will then make a best effort to wait for all close notifications to be
1027+
/// acknowledged by the peers, re-transmitting them if needed. This ensures the
1028+
/// peers are aware of the closed connections instead of having to wait for a timeout
1029+
/// on the connection. Once all connections are closed or timed out, the future
1030+
/// finishes.
1031+
///
1032+
/// The maximum time-out that this future will wait for depends on QUIC transport
1033+
/// configurations of non-drained connections at the time of calling, and their current
1034+
/// estimates of round trip time. With default parameters and a conservative estimate
1035+
/// of round trip time, this call's future should take 3 seconds to resolve in cases of
1036+
/// bad connectivity or failed connections. In the usual case, this call's future should
1037+
/// return much more quickly.
1038+
///
1039+
/// It is highly recommended you *do* wait for this close call to finish, if possible.
1040+
/// Not doing so will make connections that were still open while closing the endpoint
1041+
/// time out on the remote end. Thus remote ends will assume connections to have failed
1042+
/// even if all application data was transmitted successfully.
1043+
///
1044+
/// Note: Someone used to closing TCP sockets might wonder why it is necessary to wait
1045+
/// for timeouts when closing QUIC endpoints, while they don't have to do this for TCP
1046+
/// sockets. This is due to QUIC and its acknowledgments being implemented in user-land,
1047+
/// while TCP sockets usually get closed and drained by the operating system in the
1048+
/// kernel during the "Time-Wait" period of the TCP socket.
1049+
///
1050+
/// Be aware however that the underlying UDP sockets are only closed once all clones of
1051+
/// the the respective [`Endpoint`] are dropped.
10371052
pub async fn close(&self) {
10381053
if self.is_closed() {
10391054
return;
@@ -1542,16 +1557,21 @@ impl Connection {
15421557
///
15431558
/// # Gracefully closing a connection
15441559
///
1545-
/// Only the peer last **receiving** application data can be certain that all data is
1546-
/// delivered.
1547-
///
1548-
/// To communicate to the last **sender** of the application data that all the data was received, we recommend designing protocols that follow this pattern:
1549-
///
1550-
/// 1) The **sender** sends the last data. It then calls [`Connection::closed`]. This will wait until it receives a CONNECTION_CLOSE frame from the other side.
1551-
/// 2) The **receiver** receives the last data. It then calls [`Connection::close`] and provides an error_code and/or reason.
1552-
/// 3) The **sender** checks that the error_code is the expected error code.
1553-
///
1554-
/// If the `close`/`closed` dance is not done, or is interrupted at any point, the connection will eventually time out, provided that the idle timeout is not disabled.
1560+
/// Only the peer last receiving application data can be certain that all data is
1561+
/// delivered. The only reliable action it can then take is to close the connection,
1562+
/// potentially with a custom error code. The delivery of the final CONNECTION_CLOSE
1563+
/// frame is very likely if both endpoints stay online long enough, calling
1564+
/// [`Endpoint::close`] will wait to provide sufficient time. Otherwise, the remote peer
1565+
/// will time out the connection, provided that the idle timeout is not disabled.
1566+
///
1567+
/// The sending side can not guarantee all stream data is delivered to the remote
1568+
/// application. It only knows the data is delivered to the QUIC stack of the remote
1569+
/// endpoint. Once the local side sends a CONNECTION_CLOSE frame in response to calling
1570+
/// [`close`] the remote endpoint may drop any data it received but is as yet
1571+
/// undelivered to the application, including data that was acknowledged as received to
1572+
/// the local endpoint.
1573+
///
1574+
/// [`close`]: Connection::close
15551575
#[inline]
15561576
pub fn close(&self, error_code: VarInt, reason: &[u8]) {
15571577
self.inner.close(error_code, reason)
@@ -2513,4 +2533,43 @@ mod tests {
25132533

25142534
Ok(())
25152535
}
2536+
2537+
#[tokio::test]
2538+
#[traced_test]
2539+
async fn graceful_close() -> testresult::TestResult {
2540+
let client = Endpoint::builder().bind().await?;
2541+
let server = Endpoint::builder()
2542+
.alpns(vec![TEST_ALPN.to_vec()])
2543+
.bind()
2544+
.await?;
2545+
let server_addr = server.node_addr().await?;
2546+
let server_task = tokio::spawn(async move {
2547+
let incoming = server.accept().await.unwrap();
2548+
let conn = incoming.await?;
2549+
let (mut send, mut recv) = conn.accept_bi().await?;
2550+
let msg = recv.read_to_end(1_000).await?;
2551+
send.write_all(&msg).await?;
2552+
send.finish()?;
2553+
let close_reason = conn.closed().await;
2554+
testresult::TestResult::Ok(close_reason)
2555+
});
2556+
2557+
let conn = client.connect(server_addr, TEST_ALPN).await?;
2558+
let (mut send, mut recv) = conn.open_bi().await?;
2559+
send.write_all(b"Hello, world!").await?;
2560+
send.finish()?;
2561+
recv.read_to_end(1_000).await?;
2562+
conn.close(42u32.into(), b"thanks, bye!");
2563+
client.close().await;
2564+
2565+
let close_err = server_task.await??;
2566+
let ConnectionError::ApplicationClosed(app_close) = close_err else {
2567+
panic!("Unexpected close reason: {close_err:?}");
2568+
};
2569+
2570+
assert_eq!(app_close.error_code, 42u32.into());
2571+
assert_eq!(app_close.reason.as_ref(), b"thanks, bye!");
2572+
2573+
Ok(())
2574+
}
25162575
}

iroh/src/magicsock.rs

+17-8
Original file line numberDiff line numberDiff line change
@@ -1809,20 +1809,29 @@ impl Handle {
18091809
/// Only the first close does anything. Any later closes return nil.
18101810
/// Polling the socket ([`AsyncUdpSocket::poll_recv`]) will return [`Poll::Pending`]
18111811
/// indefinitely after this call.
1812-
///
1813-
/// This will not wait for the [`quinn::Endpoint`] to drain connections.
1814-
///
1815-
/// To ensure no data is lost, design protocols so that the last *sender*
1816-
/// of data in the protocol calls [`crate::endpoint::Connection::closed`], and `await`s until
1817-
/// it receives a "close" message from the *receiver*. Once the *receiver*
1818-
/// gets the last data in the protocol, it should call [`crate::endpoint::Connection::close`]
1819-
/// to inform the *sender* that all data has been received.
18201812
#[instrument(skip_all, fields(me = %self.msock.me))]
18211813
pub(crate) async fn close(&self) {
18221814
trace!("magicsock closing...");
18231815
// Initiate closing all connections, and refuse future connections.
18241816
self.endpoint.close(0u16.into(), b"");
18251817

1818+
// In the history of this code, this call had been
1819+
// - removed: https://github.com/n0-computer/iroh/pull/1753
1820+
// - then added back in: https://github.com/n0-computer/iroh/pull/2227/files#diff-ba27e40e2986a3919b20f6b412ad4fe63154af648610ea5d9ed0b5d5b0e2d780R573
1821+
// - then removed again: https://github.com/n0-computer/iroh/pull/3165
1822+
// and finally added back in together with this comment.
1823+
// So before removing this call, please consider carefully.
1824+
// Among other things, this call tries its best to make sure that any queued close frames
1825+
// (e.g. via the call to `endpoint.close(...)` above), are flushed out to the sockets
1826+
// *and acknowledged* (or time out with the "probe timeout" of usually 3 seconds).
1827+
// This allows the other endpoints for these connections to be notified to release
1828+
// their resources, or - depending on the protocol - that all data was received.
1829+
// With the current quinn API, this is the only way to ensure protocol code can use
1830+
// connection close codes, and close the endpoint properly.
1831+
// If this call is skipped, then connections that protocols close just shortly before the
1832+
// call to `Endpoint::close` will in most cases cause connection time-outs on remote ends.
1833+
self.endpoint.wait_idle().await;
1834+
18261835
if self.msock.is_closed() {
18271836
return;
18281837
}

0 commit comments

Comments
 (0)