
bug: connections that never close #294

Open
Davidson-Souza opened this issue Nov 29, 2024 · 3 comments
Labels: bug Something isn't working

@Davidson-Souza (Collaborator)

If we leave Floresta running for a long time, at least on Linux, we accumulate several connections stuck in the FIN_WAIT2 state. This seems to mean that one side never finished closing the connection (likely the remote peer). These connections don't appear to consume CPU time, but they could become a problem if we eventually run out of file descriptors or similar resources.

Davidson-Souza added the bug label on Nov 29, 2024
@vinny-pereira

Hey @Davidson-Souza, I would like to work on this issue. Would it be OK to assign it to me?

@vinny-pereira

After investigating this issue, I believe I’ve identified the cause.

The FIN_WAIT2 state occurs after a socket's shutdown() sends a FIN packet, transitioning the connection from ESTABLISHED to FIN_WAIT1. Upon receiving an ACK from the peer, the connection moves to FIN_WAIT2, where it remains until it receives the peer's FIN packet. Normally, the Linux kernel reaps orphaned FIN_WAIT2 connections via the tcp_fin_timeout setting. However, since Floresta still holds the socket (the connection is non-orphaned), this mechanism doesn't apply.

In Floresta's Peer implementation, when a shutdown message is propagated to a peer, it closes the write half of the TcpStream and sets its shutdown flag to true. The read loop is then expected to terminate, either upon receiving the peer's FIN packet (a read returning 0) or upon observing the shutdown flag.
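
A minimal sketch of that pattern (hypothetical names such as shutdown_peer, read_loop, and shutdown_flag; not Floresta's actual code):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::tcp::{OwnedReadHalf, OwnedWriteHalf};

// Sketch of the shutdown side: set the flag, then close the write half.
async fn shutdown_peer(
    writer: &mut OwnedWriteHalf,
    shutdown_flag: &AtomicBool,
) -> std::io::Result<()> {
    shutdown_flag.store(true, Ordering::SeqCst);
    // Closing the write half sends a FIN: the connection moves to
    // FIN_WAIT1, then to FIN_WAIT2 once the peer ACKs it.
    writer.shutdown().await
}

// Sketch of the read side: exit on EOF (peer's FIN) or on the flag.
async fn read_loop(mut reader: OwnedReadHalf, shutdown_flag: Arc<AtomicBool>) {
    let mut buf = [0u8; 1024];
    loop {
        if shutdown_flag.load(Ordering::SeqCst) {
            break; // the flag is only observed between reads
        }
        match reader.read(&mut buf).await {
            Ok(0) => break,  // peer's FIN arrived (read returns 0)
            Ok(_n) => { /* process the bytes */ }
            Err(_) => break, // connection error
        }
    }
}
```

Note that the flag can only be checked between reads; while a read is pending, the loop never reaches the check, which is exactly the failure mode described next.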

The problem arises in scenarios where the read loop hangs indefinitely, leaving the connection in FIN_WAIT2. Floresta's TcpStreamActor implementation uses read_exact() from Tokio's AsyncReadExt, which does not complete until the buffer is entirely filled. If no data arrives, read_exact() will wait forever for data or EOF. So if the peer fails to send a FIN packet (due to mishandling or packet loss) and never sends any data either, the connection can never close.

Relevant code:

  1. read_exact:
    This function attempts to fill the provided buffer by repeatedly reading from the stream. There are three possible outcomes:

    • The buffer is completely filled, and the function returns successfully.
    • The stream reaches EOF before the buffer is filled, and the function returns an error.
    • If the buffer is not yet fully filled, the function internally calls poll_read again.
  2. poll_read:
    As per the developers' comment on this specific function, if poll_read is called but no data has actually been transmitted, it returns Poll::Pending until there is data to be consumed. Only then is the output returned.
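
As a concrete illustration of those three outcomes (a hedged sketch; read_header and the 24-byte header size are assumptions, not Floresta's actual code):

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

// Sketch: read a fixed-size header from an already-connected stream.
async fn read_header(stream: &mut TcpStream) -> std::io::Result<[u8; 24]> {
    let mut header = [0u8; 24];
    // 1. 24 bytes arrive: the buffer fills and this returns Ok.
    // 2. The peer sends a FIN first: this returns Err(UnexpectedEof).
    // 3. The peer sends nothing at all: poll_read keeps returning
    //    Poll::Pending, so this await never completes.
    stream.read_exact(&mut header).await?;
    Ok(header)
}
```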

To solve this, I propose either adding a timeout (of 60 seconds, matching the default of the Linux kernel's tcp_fin_timeout) to the read_exact() call, or adding a cancellation token to the spawned task. Either would be a simpler and less intrusive solution than switching from a TcpStream to a TcpSocket for more granular control over keepalive and linger settings.
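
A sketch of the timeout variant (read_exact_with_timeout is a hypothetical helper, not Floresta's API):

```rust
use std::io::{Error, ErrorKind};
use std::time::Duration;

use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

// Hypothetical helper: give up on a read after 60 seconds (mirroring the
// kernel's tcp_fin_timeout default) so the caller can drop the peer.
async fn read_exact_with_timeout<R: AsyncRead + Unpin>(
    reader: &mut R,
    buf: &mut [u8],
) -> std::io::Result<()> {
    match timeout(Duration::from_secs(60), reader.read_exact(buf)).await {
        Ok(result) => result.map(|_| ()),
        Err(_elapsed) => Err(Error::new(ErrorKind::TimedOut, "peer read timed out")),
    }
}
```

One caveat: read_exact() is not cancellation-safe, so a timeout that fires mid-message discards any bytes already read. That is acceptable if the peer is dropped afterwards, but the caller must not retry the read on the same stream.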

I'm also considering how to reliably reproduce this issue for testing without running Floresta indefinitely. I am thinking of a small script that connects two peers locally, where one attempts to close the connection and the other never confirms it. I am open to suggestions.
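
One possible shape for such a script (a standalone sketch; the port and names are arbitrary), where one task plays the peer that never confirms the close, and the stuck socket should then show up in ss -tan | grep FIN-WAIT-2:

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::{TcpListener, TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:18444").await?;

    // The misbehaving peer: accepts, then never reads, writes, or closes.
    tokio::spawn(async move {
        let (_socket, _) = listener.accept().await.unwrap();
        std::future::pending::<()>().await
    });

    let mut stream = TcpStream::connect("127.0.0.1:18444").await?;
    // Our FIN goes out; since no FIN ever comes back, the socket
    // stays in FIN_WAIT2.
    stream.shutdown().await?;

    // Mirrors the stuck read loop: the peer sends neither data nor EOF,
    // so this read pends forever.
    let mut buf = [0u8; 1];
    let _ = stream.read_exact(&mut buf).await;
    Ok(())
}
```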

I would appreciate your input and opinion on this @Davidson-Souza

@Davidson-Souza (Collaborator, Author)

To solve this, I propose either adding a timeout (of 60 seconds, matching the default of the Linux kernel's tcp_fin_timeout) to the read_exact() call, or adding a cancellation token to the spawned task.

But we don't have control over this, do we? This is a system-wide config, not a local one.

Tokio docs say that close in AsyncWrite only closes the writer, not the reader. Perhaps changing the trait bounds inside peer.rs from AsyncWrite to AsyncWriteExt would help here.
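
For reference, a sketch of what that might look like (a hypothetical helper; tokio's AsyncWriteExt is blanket-implemented for every AsyncWrite, so it only needs to be in scope to call shutdown()):

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};

// Hypothetical helper for peer.rs: shutdown() (from AsyncWriteExt) closes
// only the write half, sending our FIN while leaving the read half open.
async fn close_write_half<W: AsyncWrite + Unpin>(writer: &mut W) -> std::io::Result<()> {
    writer.shutdown().await
}
```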
