
bug: connections that never close #294

Open
Davidson-Souza opened this issue Nov 29, 2024 · 3 comments
Labels: bug Something isn't working

@Davidson-Souza (Collaborator)

If we leave Floresta running for a long time, at least on Linux, we accumulate several connections stuck in the FIN_WAIT2 state. This seems to mean that one side never finished closing the connection (likely the remote peer). These connections don't appear to consume CPU time, but they could become a problem if we eventually run out of file descriptors or similar resources.

Davidson-Souza added the bug label on Nov 29, 2024
@vinny-pereira

Hey @Davidson-Souza, I would like to work on this issue. Would it be OK to assign it to me?

@vinny-pereira

After investigating this issue, I believe I’ve identified the cause.

The FIN_WAIT2 state occurs after a socket's shutdown() sends a FIN packet, transitioning the connection from ESTABLISHED to FIN_WAIT1. Upon receiving an ACK from the peer, the connection moves to FIN_WAIT2, where it remains until it receives the peer's FIN packet. Normally, the Linux kernel reaps orphaned FIN_WAIT2 connections via the tcp_fin_timeout setting. However, since Floresta still holds the socket (the connection is non-orphaned), this mechanism doesn't apply.

In Floresta's Peer implementation, when a shutdown message is propagated to a peer, it closes the write half of the TcpStream and sets its shutdown flag to true. The read loop is then expected to terminate, either upon receiving the peer's FIN packet (a read returning 0) or upon observing the shutdown flag.
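
A minimal sketch of that pattern (hypothetical names such as shutdown_peer, read_loop, and shutdown_flag; not Floresta's actual code):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::tcp::{OwnedReadHalf, OwnedWriteHalf};

// Sketch of the shutdown side: set the flag, then close the write half.
async fn shutdown_peer(
    writer: &mut OwnedWriteHalf,
    shutdown_flag: &AtomicBool,
) -> std::io::Result<()> {
    shutdown_flag.store(true, Ordering::SeqCst);
    // Closing the write half sends a FIN: the connection moves to
    // FIN_WAIT1, then to FIN_WAIT2 once the peer ACKs it.
    writer.shutdown().await
}

// Sketch of the read side: exit on EOF (peer's FIN) or on the flag.
async fn read_loop(mut reader: OwnedReadHalf, shutdown_flag: Arc<AtomicBool>) {
    let mut buf = [0u8; 1024];
    loop {
        if shutdown_flag.load(Ordering::SeqCst) {
            break; // the flag is only observed between reads
        }
        match reader.read(&mut buf).await {
            Ok(0) => break,  // peer's FIN arrived (read returns 0)
            Ok(_n) => { /* process the bytes */ }
            Err(_) => break, // connection error
        }
    }
}
```

Note that the flag can only be checked between reads; while a read is pending, the loop never reaches the check, which is exactly the failure mode described next.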

The problem arises in scenarios where the read loop hangs indefinitely, leaving the connection in FIN_WAIT2. Floresta's TcpStreamActor implementation uses read_exact() from Tokio's AsyncReadExt, which does not complete until the buffer is entirely filled. If no data arrives, read_exact() will wait forever for data or EOF. So if the peer fails to send a FIN packet (due to mishandling or packet loss) and never sends any data either, the connection can never close.

Relevant code:

  1. read_exact:
    This function attempts to fill the provided buffer by repeatedly reading from the stream. There are three possible outcomes:

    • The buffer is completely filled, and the function returns successfully.
    • The stream reaches EOF before the buffer is filled, and the function returns an error.
    • If the buffer is not yet fully filled, the function internally calls poll_read again.
  2. poll_read:
    As per the developers' comment on this specific function, if poll_read is called but no data has actually been transmitted, it returns Poll::Pending until there is data to be consumed. Only then is the output returned.
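
As a concrete illustration of those three outcomes (a hedged sketch; read_header and the 24-byte header size are assumptions, not Floresta's actual code):

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

// Sketch: read a fixed-size header from an already-connected stream.
async fn read_header(stream: &mut TcpStream) -> std::io::Result<[u8; 24]> {
    let mut header = [0u8; 24];
    // 1. 24 bytes arrive: the buffer fills and this returns Ok.
    // 2. The peer sends a FIN first: this returns Err(UnexpectedEof).
    // 3. The peer sends nothing at all: poll_read keeps returning
    //    Poll::Pending, so this await never completes.
    stream.read_exact(&mut header).await?;
    Ok(header)
}
```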

To solve this, I propose either adding a timeout (of 60 seconds, matching the default of the Linux kernel's tcp_fin_timeout) to the read_exact() call, or adding a cancellation token to the spawned task. Either would be a simpler and less intrusive solution than switching from a TcpStream to a TcpSocket for more granular control over keepalive and linger settings.
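
A sketch of the timeout variant (read_exact_with_timeout is a hypothetical helper, not Floresta's API):

```rust
use std::io::{Error, ErrorKind};
use std::time::Duration;

use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

// Hypothetical helper: give up on a read after 60 seconds (mirroring the
// kernel's tcp_fin_timeout default) so the caller can drop the peer.
async fn read_exact_with_timeout<R: AsyncRead + Unpin>(
    reader: &mut R,
    buf: &mut [u8],
) -> std::io::Result<()> {
    match timeout(Duration::from_secs(60), reader.read_exact(buf)).await {
        Ok(result) => result.map(|_| ()),
        Err(_elapsed) => Err(Error::new(ErrorKind::TimedOut, "peer read timed out")),
    }
}
```

One caveat: read_exact() is not cancellation-safe, so a timeout that fires mid-message discards any bytes already read. That is acceptable if the peer is dropped afterwards, but the caller must not retry the read on the same stream.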

I'm also considering how to reliably reproduce this issue for testing without running Floresta indefinitely. I am thinking of a small script that connects two peers locally, where one attempts to close the connection and the other never confirms it. I am open to suggestions.
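
One possible shape for such a script (a standalone sketch; the port and names are arbitrary), where one task plays the peer that never confirms the close, and the stuck socket should then show up in ss -tan | grep FIN-WAIT-2:

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::{TcpListener, TcpStream};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:18444").await?;

    // The misbehaving peer: accepts, then never reads, writes, or closes.
    tokio::spawn(async move {
        let (_socket, _) = listener.accept().await.unwrap();
        std::future::pending::<()>().await
    });

    let mut stream = TcpStream::connect("127.0.0.1:18444").await?;
    // Our FIN goes out; since no FIN ever comes back, the socket
    // stays in FIN_WAIT2.
    stream.shutdown().await?;

    // Mirrors the stuck read loop: the peer sends neither data nor EOF,
    // so this read pends forever.
    let mut buf = [0u8; 1];
    let _ = stream.read_exact(&mut buf).await;
    Ok(())
}
```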

I would appreciate your input and opinion on this @Davidson-Souza

@Davidson-Souza (Collaborator, Author)

To solve this, I propose either adding a timeout (of 60 seconds, matching the default of the Linux kernel's tcp_fin_timeout) to the read_exact() call, or adding a cancellation token to the spawned task.

But we don't have control over this, do we? This is a system-wide config, not a local one.

Tokio docs say that close in AsyncWrite only closes the writer, not the reader. Perhaps changing the trait bounds inside peer.rs from AsyncWrite to AsyncWriteExt would help here.
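
For reference, a sketch of what that might look like (a hypothetical helper; tokio's AsyncWriteExt is blanket-implemented for every AsyncWrite, so it only needs to be in scope to call shutdown()):

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};

// Hypothetical helper for peer.rs: shutdown() (from AsyncWriteExt) closes
// only the write half, sending our FIN while leaving the read half open.
async fn close_write_half<W: AsyncWrite + Unpin>(writer: &mut W) -> std::io::Result<()> {
    writer.shutdown().await
}
```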
