-
Notifications
You must be signed in to change notification settings - Fork 865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] CUDTGroup::recv() may cause the "No room" issue #1805
Comments
Epoll is likely subscribed to a listener socket, but the procedure likely doesn't check if the socket that reported read-readiness is a group member. Listener socket definitely isn't a group member. |
The test wasn't conducted against the latest version and there were many changes around here. I checked the procedure that extracts sockets, which are later considered to be added to the |
ok, I will test again the latest master. |
I have test again the latest master and found a problem: |
I think we should drop packets smaller than |
This comment has been minimized.
This comment has been minimized.
Hard to say, why this CEPoll::wait log reports 0/0, unfortunately I didn't put in this log even the EID, so I don't have this on the plate that the notice has been missed. The general reason for "No room" is: too slow reading of the packets by the application (the logs with "DELIVERING" should show also how much "too late" the packet was read). There could be some not exactly clear reasons for that beside the fact that the application doesn't call In case of groups this is a little bit more complicated and in some situation the group readiness can be spurious; the group readiness is declared when TSBPD declares readiness on the socket, but there's no guarantee that the ready socket has provided the packet with the sequence number that succeeds the previously read one. Not sure if that's the case, but it seems so. Would you be able to provide me with the socket options you used and more-less the bitrate of the stream, plus the full debug log you collected? I have prepared also a little improved log for CEPoll::wait on my branch: |
Sorry, I have two srt listener listen on two ports, and I made a mistake that log them into same file, so the log I posted is mess up :(
once my application wake up by |
Please attach the full log, I know how to analyze it. You can even put everything in one log. I can't determine what happened without tracing simultaneously what TSBPD is providing on particular sockets. |
Here is the full log: |
Ok, this is a really long log, so let me just ask again to clarify the behavior: since the moment when |
yes, keep repeating:
the |
I think the |
Might be. I just need to track which socket is which here. Don't forget that this listener socket is there for a reason - the group reader must also track the listener for any incoming new connection, if it's on the listener side. This is in order to interrupt the waiting function in case when none of the existing connections provide data, but the newly connected socket would. Therefore the newly connected socket causes setting SRT_EPOLL_UPDATE event on the listener socket, that should interrupt the waiting and add this newly accepted socket to the pool. |
two suggestions:
|
Not so fast :) There are limitations of the current implementation. Unfortunately I could not reuse the UDT legacy code to make the "common buffer" solution without rewriting half of the receiver facility from scratch, hence I have created an "application-like" receiver: every socket provides packets as they are ready, according to their own TSBPD (that's why they must be in a perfect time sync). Therefore it's only possible to blindly read whatever is coming from the socket (when the play time comes) and simply discard the packet when it's past the base. TSBPD then works for every socket separately, and it cannot decide about any packet drops (except for jumping over an empty cell as per TLPKTDROP), it only decides whether the next available packet's time has come or not. Then some reader must read the packet and possibly discard it. As I understand from a rough review of an extracted key fragment of the log is that the transmission gets stopped for a moment, and during this time SRT group reader falls into a spin-rolling wait-checker. Still, not yet exactly figured out what happened. The design of CEPoll::swait was such for a reason, and the 0-return (no sockets ready up to the timeout) should handled just as well as exceptions, except that 0-return is a normal expected result, while exceptions should interrupt the reading function. |
Ok, what I can see additionally here is that the packets over the socket @56331894 after first few, come in very slowly. Once it also happens that from one link a packet is 3 packets behind the other. Times when packets come in are distant much more than they were in the beginning. You can extract the INCOMING phase from the log to see how the packets come in into this connection. Not sure, but it looks for me as if simply the transmission has stopped, the link didn't deliver packets at all, hence this "ALL LINKS ELEPHANTS" log. When this happens, in the non-blocking mode you should get the group readiness obviously cleared. Would you be able to reproduce this and have some shorter session, or it happens only if you perform multiple transmissions on multiple groups? |
once my application wake up by I can reproduce this issue since my test is under unstable network and different links via different ISP. |
I added
I think the connection didn't lost, it still occasionally got packets, but
|
Ah ok. I I'll need to review this again, but I think I understand the problem. |
Ok, would you be able to help me with testing it? The same branch on my repo replica: |
It has been test for a day, so far so good, thanks a lot! |
Ok, I consider this a fix then. If you think this needs more testing, go on, I'll prepare a PR out of it and only mention this issue. |
Sad :( |
Hmm, I don't think I introduced anything that would cause it. I'll research it. |
Do you still have it this way? Can you forcefully break it with coredump (using e.g. Ctrl+\ or killing with SIGQUIT)? |
At 14:50, I found it stuck at
Then I restarted the server, it quickly stuck again after printing |
I will SIGQUIT it or gdb attach it next time 😂 |
This looks like something during the handshake, so rather unrelated to what I have changed. I can see some potential out-of-order locking at the moment when it locks the socket for adding it to the group, but the problem is potentially tough only in blocking mode, while as I saw you are using a non-blocking mode. At least this is a trace I can follow. |
Does srt have some global lock? Otherwise some other threads should still logging? |
There are mutexes that guard some particular data, that's all. Other threads should be logging, unless they are hanged up, too. |
Ok, I found another lock disorder. Lock on CUDTSocket::m_ControlLock is probably unnecessary, and it may cause a deadlock as it orders before m_ConnectionLock. Fortunately, m_ConnectionLock orders before m_GlobControlLock, which's locking is actually absolutely necessary. I updated the branch, you might want to use it for further testing. |
It has been test for a day, so far so good, thanks a lot! |
It's me to thank you for testing, I couldn't have done it without you. |
Hi guys. @gou4shi1 @ethouris |
this comment is already fixed by #2026 |
#1835 says it fixes the problem described here, in #1805. Does it refer to the initial issue then? |
The "initial issue" is only a description of a symptom. |
I would appreciate someone to describe the issue remaining here, and what is fixed in #1835. |
Sure, in non-blocking mode, if |
Describe the bug
m_Positions
inCUDTGroup::recv()
should not consider the listen socket,otherwise it may led to that the listen socket becomes HORSE, and
m_RcvBaseSeqNo
becomes0
,then
group->updateReadState()
never set group to readable becauseseqcmp(1676307874, m_RcvBaseSeqNo)
always return negative,then "No room to store incoming packets".
To Reproduce
SRT broadcast video to server (live mode) and the network is very bad.
Expected behavior
not use the listen socket as HORSE, no "No room"
Screenshots
Desktop (please provide the following information):
The text was updated successfully, but these errors were encountered: