
GC Bug #117

Closed

macrocan opened this issue Apr 15, 2021 · 12 comments · Fixed by #156

Comments

macrocan commented Apr 15, 2021

First, consider this case:

    let data = b"hello world";
    stream.write_all(data).await.expect("C write msg 1");
    stream.write_all(data).await.expect("C write msg 2");
    // msg3
    stream.write_all(data).await.expect("C write msg 3");

    // drop it immediately
    drop(stream);

The stream is dropped immediately after writing the messages. As a result, msg3 is never received by the reader.
After debugging, we found that the GC sends a RST frame before msg3.

// garbage_collect()

    if let Some(f) = frame {
        log::trace!("{}: sending: {}", self.id, f.header());
        self.socket.get_mut().send(&f).await.or(Err(ConnectionError::Closed))?
    }

The GC sends the RST frame directly to the reader while msg3 is still queued in the stream receiver.
If the GC instead sent the RST frame through the stream receiver, the bug would be fixed.

// garbage_collect()

    if let Some(f) = frame {
        log::trace!("{}/{}: sending: {}", self.id, stream_id, f.header());
        // send to message process queue to process(/send) orderly
        let cmd = StreamCommand::SendFrame(f.left());
        if self.stream_sender.try_send(cmd).is_err() {
            continue;
        }
    }
macrocan mentioned this issue Apr 15, 2021
mxinden (Member) commented Apr 21, 2021

Yes, indeed, there is a race condition here. Users of yamux have to call poll_close on a stream before dropping it to make sure all messages are sent before closing the stream with the remote.

Let me know in case that works for you.
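
A minimal sketch of that pattern, assuming the stream implements futures' `AsyncWrite` (as yamux streams do), with `AsyncWriteExt::close` driving `poll_close`; the wrapper function is illustrative only:

    use futures::io::{AsyncWrite, AsyncWriteExt};

    // Write a few messages and explicitly close the stream before it is
    // dropped, so all queued frames reach the remote before the close.
    async fn send_and_close<S: AsyncWrite + Unpin>(mut stream: S) -> std::io::Result<()> {
        let data = b"hello world";
        stream.write_all(data).await?;
        stream.write_all(data).await?;
        stream.write_all(data).await?;

        // Drive `poll_close` to completion before the stream is dropped,
        // rather than letting the drop race with still-pending frames.
        stream.close().await?;
        Ok(())
    }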

macrocan (Author) commented:

> Yes, indeed, there is a race condition here. Users of yamux have to call poll_close on a stream before dropping it to make sure all messages are sent before closing the stream with the remote.
>
> Let me know in case that works for you.

For the user's convenience, it seems the GC could close the stream automatically when it is dropped.

tomaka (Member) commented Apr 23, 2021

By sending RST to the remote, you inform it that the messages it has sent on the substream might not have been received and processed. Sending a close message and immediately discarding the substream would go against that idea.

macrocan (Author) commented:

> By sending RST to the remote, you inform it that the messages it has sent on the substream might not have been received and processed. Sending a close message and immediately discarding the substream would go against that idea.

My mistake. When a stream is dropped, the GC should reset it only after all queued messages have been sent.

tomaka (Member) commented Apr 25, 2021

It's not about sending remaining messages, but about receiving the messages that the remote might be sending to you. In your specific situation, you know that the remote doesn't send anything, but that's not the case in general.

macrocan (Author) commented:

> It's not about sending remaining messages, but about receiving the messages that the remote might be sending to you. In your specific situation, you know that the remote doesn't send anything, but that's not the case in general.

You are right; we should send an ACK alongside the RST frame if we have received data from the remote.
I think the GC can do that as well.

folex commented Apr 6, 2022

Is this bug still here?

I'm getting lots of unexpected EOFs when sending big messages between two nodes; could it be related? The EOFs start to happen when the message size exceeds the send_split_size parameter.

UPD: After several days of debugging, it turned out I wasn't calling socket.close().await in OutboundUpgrade.

mxinden (Member) commented Apr 6, 2022

> Is this bug still here?

I am not aware of it.

> UPD: After several days of debugging, it turned out I wasn't calling socket.close().await in OutboundUpgrade.

Do I understand correctly that this resolves the above-mentioned EOFs on large messages? 🎉

folex commented Apr 7, 2022

> Do I understand correctly that this resolves the above-mentioned EOFs on large messages?

Yes! :)

While we're at it, could you kindly elaborate on why the requirement to call socket.close() appeared and why there was no such requirement before?

I'm looking at e.g. rust-ipfs c7a79980bd and I see that they somehow knew they should add socket.close().

I want to learn how they knew about this requirement; was it communicated or documented somewhere?

Many thanks for your great work!

mxinden (Member) commented Apr 8, 2022

> While we're at it, could you kindly elaborate on why the requirement to call socket.close() appeared and why there was no such requirement before?
>
> I'm looking at e.g. rust-ipfs c7a79980bd and I see that they somehow knew they should add socket.close().
>
> I want to learn how they knew about this requirement; was it communicated or documented somewhere?

Puuh, I am not sure to be honest. Maybe @koivunej knows more.

> Many thanks for your great work!

❤️

koivunej commented Apr 9, 2022

Thanks for the ping!

Regarding the linked socket.close().await: I am pretty sure it was due to a related change to write_length_prefixed in rust-libp2p, and the need for it was hinted at in the rust-libp2p change as a required migration ... I'll try to find it; sadly, none of this was recorded in comments or in the commit message. I would, however, not take rust-ipfs as an example, @folex; I haven't had time to work on it for a long while now.

I'll update if/when I find more rationale for that close.

OK, I think I found it: originally, write_one was used in that particular place, and it performed the close dance as well, so the linked commit was just about restoring the old behaviour without the deprecated rust-libp2p code. The related libp2p commit is libp2p/rust-libp2p@c1ef4bf.
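
A hedged sketch of that pattern, assuming `write_length_prefixed` from libp2p-core (the exact module path varies across libp2p versions); the function name is an illustrative stand-in for the outbound upgrade's future, not the actual trait method:

    use futures::io::{AsyncWrite, AsyncWriteExt};
    use libp2p_core::upgrade::write_length_prefixed; // assumed path; varies by libp2p version

    // Length-prefix the message and then close the substream explicitly,
    // which is the part `write_one` used to do implicitly ("the close dance").
    async fn send_request<S>(mut socket: S, request: Vec<u8>) -> std::io::Result<()>
    where
        S: AsyncWrite + Unpin,
    {
        write_length_prefixed(&mut socket, &request).await?;
        socket.close().await?;
        Ok(())
    }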

However, to repeat, I cannot comment on whether that close should be there, as my efforts have recently focused on understanding the latest Windows test failures, which started with the latest rust-libp2p upgrade. The bitswap protocol we have is nothing more than a draft; I'll try to remember to add some disclaimers around it as well.

thomaseizinger (Contributor) commented:

This bug can still be triggered if a user calls flush() first and then drops the stream because, at the moment, poll_flush is a no-op. I have a fix for this in #156.
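
For illustration, the still-racy pattern looks roughly like this (a sketch; `stream` stands for a yamux substream implementing `AsyncWrite`):

    use futures::io::{AsyncWrite, AsyncWriteExt};

    // Before #156, `poll_flush` was a no-op, so `flush()` resolved immediately
    // and frames could still be queued when the stream was dropped.
    async fn flush_then_drop<S: AsyncWrite + Unpin>(mut stream: S) -> std::io::Result<()> {
        stream.write_all(b"hello world").await?;
        stream.flush().await?; // did not guarantee the frame had left the internal queue
        drop(stream);          // the closing frame could then race with the pending data frame
        Ok(())
    }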

mxinden pushed a commit that referenced this issue on May 24, 2023 (#156):

Currently, we have a `garbage_collect` function that checks whether any of our
streams have been dropped. This can cause a race condition where the channel
between a `Stream` and the `Connection` still has pending frames for a stream
but dropping a stream causes us to already send a `FIN` flag for the stream.

We fix this by maintaining a single channel for each stream. When a stream gets
dropped, the `Receiver` becomes disconnected. We use this information to queue
the correct frame (`FIN` vs `RST`) into the buffer. At this point, all previous
frames have already been processed and the race condition is thus not present.

Additionally, this also allows us to implement `Stream::poll_flush` by
forwarding to the underlying `Sender`. Note that, at present, this only
checks whether there is _space_ in the channel, not whether the items have been
emitted by the `Receiver`.

We have a PR upstream that might fix this:
rust-lang/futures-rs#2746

Fixes: #117.
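
A minimal sketch of the channel behaviour this fix relies on, using `futures::channel::mpsc` directly rather than yamux's actual types: items queued before the sender is dropped are still delivered, and the receiver only observes the disconnect afterwards.

    use futures::{channel::mpsc, executor::block_on, SinkExt, StreamExt};

    fn main() {
        block_on(async {
            let (mut tx, mut rx) = mpsc::channel::<&'static str>(8);
            tx.send("msg1").await.unwrap();
            tx.send("msg2").await.unwrap();
            tx.send("msg3").await.unwrap();
            drop(tx); // the stream side goes away here

            // All previously queued items are still delivered...
            while let Some(msg) = rx.next().await {
                println!("delivered: {}", msg);
            }
            // ...and only now does the receiver see the disconnect, which is
            // the point at which the closing frame (FIN vs RST) can be queued.
            println!("sender dropped, channel drained");
        });
    }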