Endless buffering of outbound traffic when network is loaded #181
Comments
@daonb That's what happens when you don't check bufferedAmount.
Related topic: #77
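For illustration, a minimal sketch of the backpressure being suggested here, assuming pion/webrtc v3's DataChannel API (BufferedAmount, SetBufferedAmountLowThreshold, OnBufferedAmountLow); the 1 MiB cap is an arbitrary value chosen for the sketch, not anything from this thread:

package example

import "github.com/pion/webrtc/v3"

// backpressuredSend wraps a data channel so sends block while too much data
// is already queued, instead of letting bufferedAmount grow without bound.
func backpressuredSend(dc *webrtc.DataChannel) func([]byte) error {
	const maxBuffered = 1 << 20 // 1 MiB cap, arbitrary for this sketch

	drained := make(chan struct{}, 1)
	dc.SetBufferedAmountLowThreshold(maxBuffered / 2)
	dc.OnBufferedAmountLow(func() {
		select {
		case drained <- struct{}{}:
		default:
		}
	})

	return func(p []byte) error {
		// Wait for the SCTP send buffer to fall below the cap before queuing more.
		for dc.BufferedAmount() > maxBuffered {
			<-drained
		}
		return dc.Send(p)
	}
}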
@enobufs thanks for the reply. I don't think that's the issue I'm seeing for two reasons:
I believe there's a bug in congestion recovery but I have no tests to prove it :-( To test it I need a client that can be in "buffering mode", where outgoing datagrams are buffered for XXX ms rather than sent. This way we can write tests that simulate congestion and better test the congestion control code.
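A hypothetical test helper in that spirit (not part of pion/sctp; the name and shape are made up for illustration): a net.Conn wrapper that holds each outgoing datagram for a fixed delay before forwarding it, so a test can simulate a congested path:

package congestiontest

import (
	"net"
	"time"
)

// delayedConn holds every outbound datagram for a fixed delay before writing
// it to the wrapped connection, roughly simulating a congested uplink.
type delayedConn struct {
	net.Conn
	delay time.Duration
}

func (d *delayedConn) Write(p []byte) (int, error) {
	buf := make([]byte, len(p))
	copy(buf, p)
	// Report success now; deliver the datagram after the delay. Ordering is
	// only approximately preserved, which is acceptable for a congestion test.
	time.AfterFunc(d.delay, func() {
		d.Conn.Write(buf) //nolint:errcheck // best effort in this sketch
	})
	return len(p), nil
}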
@daonb It's hard to debug without a reproducible case... What I saw:
- Many data channels are used
- SACK stops coming in from the peer after 23:23:33.054958. Last TSN sack'ed was 903304684
- One T3-rtx timeout is in the log, but it was the 5th RTO
- Two retransmissions occur right after the T3-rtx timeout, but there's no SACK
- Two errors
My Thoughts
@daonb I have no clue other than that the peer stopped sending SACK (Selective ACK). Can you describe more about your environment?
Recently, I came across a bug in usrsctp (which most browsers use) that occurred when the partial reliability option was used over a lossy connection - see #138 - possibly related. I am not sure if iPadOS uses it, and I don't know whether the fix has already been applied to the browsers out there.
Thanks for diving into this, I appreciate the help. Here are the answers to your questions, followed by one question I have.
I'm using pion for a backend - webexec - used by terminal7, a hybrid tablet app. Each terminal pane has its own data channel used to pipe data in & out. In addition there's one data channel used for control. The 37 incoming messages are 37 keystrokes. Every time my server gets this kind of message it passes it to the shell, which causes the shell to print it on its pseudo terminal, and webexec sends it back to the client. I know my server calls sctp to send the data but it doesn't get to
I only set the ICETimeouts and the LoggerFactory (a rough sketch of this kind of setup follows below)
I use a home network to connect to a low-latency host server (latest speedtest: 24.42 Mbps download, 2.8 Mbps upload)
It happens only when the network is overloaded, which is quite often: I share the network with my two daughters and sometimes they are both on Zoom at the same time. I'm sure my uplink is badly congested, so the SACKs suffer huge delays and are sometimes dropped.
I should have it running on chrome-android soon, so I'll test it once I have it
Why only two retransmissions? I suspect the client sends the SACKs but they don't get to the server, and the same happens with the ACKs on both retransmits. Shouldn't sctp keep retransmitting until it gets an ACK?
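A hedged sketch of that kind of configuration, assuming pion/webrtc v3's SettingEngine (SetICETimeouts and the LoggerFactory field); the timeout values below are placeholders, not the ones webexec actually uses:

package example

import (
	"time"

	"github.com/pion/logging"
	"github.com/pion/webrtc/v3"
)

// newAPI builds a webrtc.API where only the ICE timeouts and the logger
// factory are customized, mirroring the setup described above.
func newAPI() *webrtc.API {
	se := webrtc.SettingEngine{LoggerFactory: logging.NewDefaultLoggerFactory()}
	se.SetICETimeouts(
		5*time.Second,  // disconnected timeout (placeholder)
		25*time.Second, // failed timeout (placeholder)
		2*time.Second,  // keep-alive interval (placeholder)
	)
	return webrtc.NewAPI(webrtc.WithSettingEngine(se))
}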
Looking at this line:
The retransmission in the log was made after the 5th attempt (nRtos: the number of retransmission timeouts), and the log you gave me only had the last 23 seconds. pion/sctp implements the backoff algorithm as the RFC says, doubling the interval every time T3-rtx expires, so at that point the interval was at least 32 seconds (depending on the measured RTT).
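For reference, a rough sketch of that backoff rule (RFC 4960 suggests RTO.Initial = 3 s and RTO.Max = 60 s); this mirrors the behavior described, not pion/sctp's actual internals:

package example

import "time"

const (
	rtoInitial = 3 * time.Second  // RFC 4960 suggested RTO.Initial
	rtoMax     = 60 * time.Second // RFC 4960 suggested RTO.Max
)

// nextRTO doubles the retransmission timeout after a T3-rtx expiry, capped at
// rtoMax, which is how the interval grows to tens of seconds after several RTOs.
func nextRTO(current time.Duration) time.Duration {
	next := current * 2
	if next > rtoMax {
		next = rtoMax
	}
	return next
}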
Great! We've solved this. IMO, the doubling of retransmission timeout is wrong for real time communications. I'm not saying you should change it, it's in the RFC:
I think the authors were too lazy here. TCP was made for file transfer, not real-time communication. Luckily, in C7 the RFC defines RTO.max, which lets me limit the retransmission timeout. Reading the code I learned we do have an I guess I need to add
@daonb we don't control those; I don't know if we can add stuff. Here are all the values. Since we also compile to WASM, it will cause portability issues. I think the best thing we could do is create an issue against webrtc-pc and see if there is any interest or ideas. I would call out that webrtc-nv matches up with your use case. If they completely push back then I would say adding a
It was the fifth retransmission that timed out, and the resulting retransmission also failed (I don't see a SACK for it). Could you tell me why you'd think reducing the max RTO would help? I think something else is going on. We could make the max RTO configurable, but once a "pipe" is clogged, it usually takes a long time for the congestion to clear, and I cannot think of any other way than the exponential backoff, particularly when you cannot rule out that the timeout was caused by congestion. I think you should find out why the browser is not responding to all these retransmissions in the first place. Maybe you could modify the value by hand (using
In the old TCP days everyone was transferring files, so congestion at the edge router was common and the backoff made sense. It was a way to wait for other transfers to finish. Usually, large files were transferred in the background and the backoff timer worked great. We didn't really care when the transfer ended as long as the magic happened and the files appeared on the target server.
Today's congestion is completely different. It's a result of rate-controlled live video streaming, and waiting for it to end does not make any sense. What's more, my application takes very little bandwidth compared to video streaming. With an RTOMax of 2-3 seconds I have a better chance of getting some packets through before the user is frustrated. I know that 60 seconds for RTOMax comes from the standard, but it's wrong for real-time applications. No one will wait 60 seconds to see the key they hit appear on the screen. I wanted to set RTOMax per channel because I want to add file transfer, and in that case I need the exponential backoff. No matter, I'll cross that bridge when I get there. For now, I think I should open an issue in pion/webrtc to add RTOMax to
@daonb you can skip right to the PR even (and link to this issue in the commit). I am happy to merge this right away so you can get back to building :) I would love to bring it up in the W3C. I bet lots of other devs are hitting it, they just don't understand this stuff at the level you are. Even in the WebRTC working group only a small subset of people understand SCTP.
I would love to take it up with the W3C. It's fun to be reminded of products I developed back in the '90s and what the net was like back in the last millennium.
@enobufs I've found a case where this seems to consistently fail. See coder/coder-v1-cli@5ff4902. If you copy the DataChannel to
Running the tests with
go test -v -bench=. cdr.dev/coder-cli/wsnet
should result in output similar to...
I've isolated it further in this commit. Uncomment the following line, and it works as expected: coder/coder-v1-cli@0dc0e56#diff-34c91981f059bb5d86d78bb53990a009247009fcfc946b00884b03f5c1a63d55R95
My assumption is there's a race condition somewhere when we write a lot of individual messages at the same time... but I'm not exactly sure where to look.
Looking into this now. The commit hash @kylecarbs reported does not seem to exist, but I was able to find a commit that is, I believe, close to the one Kyle referred to:
It runs, as he explained, with the following command:
It passes with no issue, but when I enabled the DEBUG log, it reproduced what Kyle reported:
What I am seeing is: the sender-side SCTP is retransmitting packets (T3 retransmission timeout), but the receiver side does not respond with a SACK at all. Adding a debug log at the ICE level revealed that the receiver side's packetio.Buffer was full, and all the retransmitted packets were discarded at the receiver (at the ICE level).
At SCTP level,
Next, I will try to simplify the reproducible code...
@daonb @Sean-Der Do you have any thoughts on what could cause
FYI: I made a slight modification to the log line, like this:
ice % git diff
diff --git a/candidate_base.go b/candidate_base.go
index c4f6f1b..5a7420f 100644
--- a/candidate_base.go
+++ b/candidate_base.go
@@ -256,7 +256,7 @@ func handleInboundCandidateMsg(ctx context.Context, c Candidate, buffer []byte,
// NOTE This will return packetio.ErrFull if the buffer ever manages to fill up.
if _, err := c.agent().buffer.Write(buffer); err != nil {
- log.Warnf("failed to write packet")
+ log.Warnf("failed to write packet: %v", err)
}
}
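For context, a hedged sketch of what that log line is reporting: packetio.Buffer rejects writes with packetio.ErrFull once its limit is reached, and the inbound packet is dropped at that point. The import path and logger type are assumptions based on the pion stack at the time of this thread:

package example

import (
	"errors"

	"github.com/pion/logging"
	"github.com/pion/transport/packetio"
)

// writeOrDrop mirrors the failure mode above: when the receive buffer is full,
// Write returns packetio.ErrFull and the packet never reaches the SCTP layer.
func writeOrDrop(buf *packetio.Buffer, pkt []byte, log logging.LeveledLogger) {
	if _, err := buf.Write(pkt); err != nil {
		// ErrFull means the buffer hit its limit; the packet is lost here.
		if errors.Is(err, packetio.ErrFull) {
			log.Warnf("buffer full, dropping packet: %v", err)
			return
		}
		log.Warnf("failed to write packet: %v", err)
	}
}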
@enobufs I don't believe it's related. It could be that the buffer is too small or that your code doesn't read it and it fills up. Thanks for the comment, it's the nudge I needed to start working on a PR and close this. |
This change adds RTOMax to Config letting the user cap the retransmission timer without breaking compatibility. If RTOMax is 0 the current value of 60 seconds is used.
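A sketch of just the defaulting rule described here; the function name and the use of time.Duration are illustrative, not necessarily pion/sctp's actual definitions:

package example

import "time"

// defaultRTOMax is the existing 60-second cap that applies when the user
// leaves RTOMax at its zero value, so current behavior is unchanged.
const defaultRTOMax = 60 * time.Second

func effectiveRTOMax(configured time.Duration) time.Duration {
	if configured == 0 {
		return defaultRTOMax
	}
	return configured
}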
I've also put up pion/transport#278, which removes the limit on the buffer; code that cares already puts a limit on it. Avoiding dropping packets feels nice when there are already multiple layers of buffering. pion/datachannel#196 as well, since it messes with ordering/unordering testing.
@daonb Thanks for looking into this. Just to be clear... what I saw is that the receiver side sctp.Association's readLoop is blocked on a.nextConn.Read(), where
Your environment.
What did you do?
I developed a terminal server over pion/webrtc, using data channels to pipe input & output. Things work well on a clean network and on a mobile one. I even used netem to emulate a high-jitter scenario and all works well. Only when I use my home network - a small pipe with two Zoom sessions going on in parallel - does the terminal stop receiving data.
I'm attaching a log where I have three data channels opened on the same peer. From the logs it looks like the buffer keeps growing and growing and growing
frozen2.log