Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] CUDTGroup::recv() may cause the "No room" issue #1805

Closed
gou4shi1 opened this issue Feb 9, 2021 · 40 comments · Fixed by #2203
Closed

[BUG] CUDTGroup::recv() may cause the "No room" issue #1805

gou4shi1 opened this issue Feb 9, 2021 · 40 comments · Fixed by #2203
Assignees
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@gou4shi1
Copy link
Contributor

gou4shi1 commented Feb 9, 2021

Describe the bug

m_Positions in CUDTGroup::recv() should not consider the listen socket,
otherwise it may led to that the listen socket becomes HORSE, and m_RcvBaseSeqNo becomes 0,
then group->updateReadState() never set group to readable because seqcmp(1676307874, m_RcvBaseSeqNo) always return negative,
then "No room to store incoming packets".

To Reproduce
SRT broadcast video to server (live mode) and the network is very bad.

Expected behavior
not use the listen socket as HORSE, no "No room"

Screenshots

image

Desktop (please provide the following information):

  • OS: Linux
  • SRT Version / commit ID: master/481e7f7
@gou4shi1 gou4shi1 added the Type: Bug Indicates an unexpected problem or unintended behavior label Feb 9, 2021
@maxsharabayko maxsharabayko added the [core] Area: Changes in SRT library core label Feb 9, 2021
@ethouris
Copy link
Collaborator

Epoll is likely subscribed to a listener socket, but the procedure likely doesn't check if the socket that reported read-readiness is a group member. Listener socket definitely isn't a group member.

@maxsharabayko maxsharabayko added this to the v1.4.3 milestone Feb 10, 2021
@ethouris
Copy link
Collaborator

The test wasn't conducted against the latest version and there were many changes around here. I checked the procedure that extracts sockets, which are later considered to be added to the m_Positions container and they are extracted out of member sockets that are read-ready (and listener socket is not a member socket, it's just added to epoll container to track connecting new links within the group).

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Feb 10, 2021

ok, I will test again the latest master.
btw, when do you plant to relase 1.4.3?
btw, when will the bonding feature not experimental?

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Feb 24, 2021

I have test again the latest master and found a problem:
For example, my group had two members: A and B.
In one recv() call, A was horse, B was elephant, m_RcvBaseSeqNo was 100, but B only had packets up to 95.
After a while, B received packets from 96-110, but A didn't receive any packet ready to play.
As a result, the tsbpd() of B still tried to PLAYING 96, and the group->updateReadState() failed.

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Feb 24, 2021

I think we should drop packets smaller than m_RcvBaseSeqNo in tsbpd(),
not only drop in recv()

@gou4shi1

This comment has been minimized.

@ethouris
Copy link
Collaborator

ethouris commented Feb 25, 2021

Hard to say, why this CEPoll::wait log reports 0/0, unfortunately I didn't put in this log even the EID, so I don't have this on the plate that the notice has been missed.

The general reason for "No room" is: too slow reading of the packets by the application (the logs with "DELIVERING" should show also how much "too late" the packet was read). There could be some not exactly clear reasons for that beside the fact that the application doesn't call srt_recvmsg at the right time, some of them are under research.

In case of groups this is a little bit more complicated and in some situation the group readiness can be spurious; the group readiness is declared when TSBPD declares readiness on the socket, but there's no guarantee that the ready socket has provided the packet with the sequence number that succeeds the previously read one. Not sure if that's the case, but it seems so.

Would you be able to provide me with the socket options you used and more-less the bitrate of the stream, plus the full debug log you collected?

I have prepared also a little improved log for CEPoll::wait on my branch: dev-test-spurious-group-epoll (in my fork of the repo)

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Feb 25, 2021

Sorry, I have two srt listener listen on two ports, and I made a mistake that log them into same file, so the log I posted is mess up :(
But I found the real root cause of "no room" of latest master:

08:32:40.778325/SRT:Listener D:SRT.gr: group/recv: @1130073714 BEGIN FUNCTION
08:32:40.778369/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E11 for @56331891 +
08:32:40.778328/SRT:Listener D:SRT.ea: CEPoll::wait: EVENT WAITING: TRIGGERED
08:32:40.778383/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:40.778407/SRT:Listener D:SRT.ei: epoll/update: @56331891 +[W]: E11 TRACKING: @56331881:[R][E] @56331889:[R][E] @56331891:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:40.778425/SRT:Listener D:SRT.ei: epoll/update: @56331891 +[W]: E12 TRACKING: @56331881:[W][E] @56331889:[W][E] @56331891:[W][E]  NOT updated: no changes
08:32:40.778437/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E11 for @56331889 +
08:32:40.778448/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:40.778472/SRT:Listener D:SRT.ei: epoll/update: @56331889 +[W]: E11 TRACKING: @56331881:[R][E] @56331889:[R][E] @56331891:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:40.778489/SRT:Listener D:SRT.ei: epoll/update: @56331889 +[W]: E12 TRACKING: @56331881:[W][E] @56331889:[W][E] @56331891:[W][E]  NOT updated: no changes
08:32:40.778500/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E11 for @56331881 +
08:32:40.778512/SRT:Listener D:SRT.br: getRcvReadyPacket: Found next packet seq=%893789679 (0 empty cells skipped)
08:32:40.778531/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.464394733 [STDY] + dTS=3063084021us + LATENCY=120ms + uDRIFT=0
08:32:40.778549/SRT:Listener D:SRT.br: isRcvDataReady: packet extracted seqdistance=-1 TsbPdTime=08:31:12.668415733 [STDY]
08:32:40.778569/SRT:Listener D:SRT.ei: epoll/update: @56331881 +[R]: E11 TRACKING: @56331881:[R][E] @56331889:[R][E] @56331891:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:40.778587/SRT:Listener D:SRT.ei: epoll/update: @56331881 +[R]: E12 TRACKING: @56331881:[W][E] @56331889:[W][E] @56331891:[W][E]  NOT updated: no changes
08:32:40.778606/SRT:Listener D:SRT.ei: epoll/update: @56331881 +[W]: E11 TRACKING: @56331881:[R][E] @56331889:[R][E] @56331891:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:40.778620/SRT:Listener D:SRT.ei: epoll/update: @56331881 +[W]: E12 TRACKING: @56331881:[W][E] @56331889:[W][E] @56331891:[W][E]  NOT updated: no changes
08:32:40.778636/SRT:Listener D:SRT.ea: E11 rdy=1: @56331881:[R]  TRACKED: @56331881:[R][E] @56331889:[R][E] @56331891:[R][E] @56331908:[R][^U] 
08:32:40.778653/SRT:Listener D:SRT.gr: group/recv: 1 RDY: @56331881:[R] 
08:32:40.778667/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:40.778680/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:40.778694/SRT:Listener D:SRT.ar: @56331881:receiveMessage: BEGIN ASYNC MODE. Going to extract payload size=1316
08:32:40.778707/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.464394733 [STDY] + dTS=3063084021us + LATENCY=120ms + uDRIFT=0
08:32:40.778727/SRT:Listener D:SRT.br: getRcvReadyMsg: POS=6186 +0 pkt %893789679 ready to play (delayed -529ms)
08:32:40.778740/SRT:Listener D:SRT.br: readMsg: checking unit POS=6186
08:32:40.778750/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.464394733 [STDY] + dTS=3063084021us + LATENCY=120ms + uDRIFT=0
08:32:40.778772/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.464394733 [STDY] + dTS=3063084051us + LATENCY=120ms + uDRIFT=0
08:32:40.778789/SRT:Listener D:SRT.br: readMsg: DELIVERED seq=893789679 T=08:31:12.668415733 [STDY] in 529ms - TIME-PREVIOUS: PKT: 0 LOCAL: 0 !08018153 NEXT pkt T=08:31:12.668445733 [STDY]
08:32:40.778803/SRT:Listener D:SRT.br: readMsg: FREEING UNIT POS=6186
08:32:40.778817/SRT:Listener D:SRT.br: rcvBuf/extractData: begin=6187 reporting extraction size=1316
08:32:40.778830/SRT:Listener D:SRT.ar: @56331881:AFTER readMsg: (NON-BLOCKING) result=1316
08:32:40.778842/SRT:Listener D:SRT.br: getRcvReadyPacket: Found next packet seq=%893789680 (0 empty cells skipped)
08:32:40.778857/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.464394733 [STDY] + dTS=3063084051us + LATENCY=120ms + uDRIFT=0
08:32:40.778874/SRT:Listener D:SRT.br: isRcvDataReady: packet extracted seqdistance=-1 TsbPdTime=08:31:12.668445733 [STDY]
08:32:40.778888/SRT:Listener D:SRT.gr: group/recv: @56331881 EXTRACTED data with %893789679 #30763: 08018153
08:32:40.778904/SRT:Listener D:SRT.gr: group/recv: @56331881 %893789679 #30763 BEHIND base=%893789681 - discarding
08:32:40.778915/SRT:Listener D:SRT.ar: @56331881:receiveMessage: BEGIN ASYNC MODE. Going to extract payload size=1316
08:32:40.778930/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.464394733 [STDY] + dTS=3063084051us + LATENCY=120ms + uDRIFT=0
08:32:40.778950/SRT:Listener D:SRT.br: getRcvReadyMsg: POS=6187 +0 pkt %893789680 ready to play (delayed -529ms)
08:32:40.778962/SRT:Listener D:SRT.br: readMsg: checking unit POS=6187
08:32:40.778976/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.464394733 [STDY] + dTS=3063084051us + LATENCY=120ms + uDRIFT=0
08:32:40.778996/SRT:Listener D:SRT.br: readMsg: DELIVERED seq=893789680 T=08:31:12.668445733 [STDY] in 529ms - TIME-PREVIOUS: PKT: 0 LOCAL: 0 !E3CBC2CF NEXT pkt T=NONE
08:32:40.779010/SRT:Listener D:SRT.br: readMsg: FREEING UNIT POS=6187
08:32:40.779024/SRT:Listener D:SRT.br: rcvBuf/extractData: begin=6188 reporting extraction size=565
08:32:40.779039/SRT:Listener D:SRT.ar: @56331881:AFTER readMsg: (NON-BLOCKING) result=565
08:32:40.779053/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:40.779067/SRT:Listener D:SRT.ar: receiveMessage: DATA READ, but nothing more - kicking TSBPD.
08:32:40.779095/SRT:Listener D:SRT.ei: epoll/update: @56331881 -[R]: E11 TRACKING: @56331881:[R][E] @56331889:[R][E] @56331891:[R][E] @56331908:[R][^U] 
08:32:40.779113/SRT:Listener D:SRT.ei: epoll/update: @56331881 -[R]: E12 TRACKING: @56331881:[W][E] @56331889:[W][E] @56331891:[W][E]  NOT updated: not subscribed
08:32:40.779124/SRT:Listener D:SRT.ar: @56331881:CURRENT BANDWIDTH: 0.011648Mbps (1 buffers per second)
08:32:40.779147/SRT:Listener D:SRT.gr: group/recv: @56331881 EXTRACTED data with %893789680 #30764: E3CBC2CF
08:32:40.779160/SRT:Listener D:SRT.gr: group/recv: @56331881 %893789680 #30764 BEHIND base=%893789681 - discarding
08:32:40.779260/SRT:Listener D:SRT.ar: @56331881:receiveMessage: BEGIN ASYNC MODE. Going to extract payload size=1316
08:32:40.779276/SRT:Listener D:SRT.br: getRcvReadyMsg: nothing to deliver: NOT RECEIVED
08:32:40.779286/SRT:Listener D:SRT.ar: @56331881:AFTER readMsg: (NON-BLOCKING) result=0
08:32:40.779298/SRT:Listener D:SRT.ar: receiveMessage: nothing to read, kicking TSBPD, return AGAIN
08:32:40.779326/SRT:Listener D:SRT.ei: epoll/update: @56331881 -[R]: E11 TRACKING: @56331881:[R][E] @56331889:[R][E] @56331891:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:40.779344/SRT:Listener D:SRT.ei: epoll/update: @56331881 -[R]: E12 TRACKING: @56331881:[W][E] @56331889:[W][E] @56331891:[W][E]  NOT updated: no changes
08:32:40.779366/SRT:Listener D:SRT.gr: group/recv: @56331881 EXTRACTED data with %-1 #-1: (NOTHING)
08:32:40.779382/SRT:Listener D:SRT.gr: group/recv @56331881: SPURIOUS epoll, ignoring
08:32:40.779418/SRT:Listener D:SRT.gr: group/recv: NOT extracted anything - checking for a need to kick kangaroos
08:32:40.779452/SRT:Listener D:SRT.gr: group/recv: ALL LINKS ELEPHANTS. Re-polling.

once my application wake up by srt_epoll_wait(), I will do while (srt_recvmsg2() > 0)...
from the log we can see that, the last srt_recvmsg2() discard some packets behind m_RcvBaseSeqNo
then keep ALL LINKS ELEPHANTS. Re-polling. forever
I think we should not re-poll in nonblocking mode

@ethouris
Copy link
Collaborator

Please attach the full log, I know how to analyze it. You can even put everything in one log. I can't determine what happened without tracing simultaneously what TSBPD is providing on particular sockets.

@gou4shi1
Copy link
Contributor Author

Here is the full log:
https://1drv.ms/u/s!AuaosNlJ5ELxjyenqzz6Y25TV2Dv?e=QyWioU
Thanks for the quick reply!

@ethouris
Copy link
Collaborator

Ok, this is a really long log, so let me just ask again to clarify the behavior: since the moment when ALL LINKS ELEPHANTS message appears the transmission is stuck, and rest of the log shows that it simply repeats the whole thing all over again?

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Feb 25, 2021

yes, keep repeating:

08:32:41.303176/SRT:Listener D:SRT.gr: group/recv: ALL LINKS ELEPHANTS. Re-polling.
08:32:41.303187/SRT:Listener D:SRT.gr: group/recv: Reviewing member sockets for polling
08:32:41.303197/SRT:Listener D:SRT.gr: group/recv: E(7) @56331898[READ] @56331896[READ] @56331895[READ]  --> EPOLL/SWAIT
08:32:41.303207/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331898 +
08:32:41.303217/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.303233/SRT:Listener D:SRT.ei: epoll/update: @56331898 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.303247/SRT:Listener D:SRT.ei: epoll/update: @56331898 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.303260/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331896 +
08:32:41.303270/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.303284/SRT:Listener D:SRT.ei: epoll/update: @56331896 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.303296/SRT:Listener D:SRT.ei: epoll/update: @56331896 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.303307/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331895 +
08:32:41.303316/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.303331/SRT:Listener D:SRT.ei: epoll/update: @56331895 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.303343/SRT:Listener D:SRT.ei: epoll/update: @56331895 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.303354/SRT:Listener D:SRT.ea: E7 rdy=1: @56331908:[R]  TRACKED: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U] 
08:32:41.303366/SRT:Listener D:SRT.gr: group/recv: 1 RDY: @56331908:[R] 
08:32:41.303377/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.303387/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.303397/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.303406/SRT:Listener D:SRT.gr: group/recv: NOT extracted anything - checking for a need to kick kangaroos
08:32:41.303416/SRT:Listener D:SRT.gr: group/recv: ALL LINKS ELEPHANTS. Re-polling.

the @56331908 is the listener socket

@gou4shi1
Copy link
Contributor Author

I think the swait() in recv_WaitForReadReady() should exclude listener socket?

@ethouris
Copy link
Collaborator

Might be. I just need to track which socket is which here. Don't forget that this listener socket is there for a reason - the group reader must also track the listener for any incoming new connection, if it's on the listener side. This is in order to interrupt the waiting function in case when none of the existing connections provide data, but the newly connected socket would. Therefore the newly connected socket causes setting SRT_EPOLL_UPDATE event on the listener socket, that should interrupt the waiting and add this newly accepted socket to the pool.

@gou4shi1
Copy link
Contributor Author

two suggestions:

  • also throw exception if readReady.size() == 0 just likes what you did for nready == 0
  • drop packets smaller than m_RcvBaseSeqNo in tsbpd(), not only drop in recv()

@ethouris
Copy link
Collaborator

Not so fast :) There are limitations of the current implementation. Unfortunately I could not reuse the UDT legacy code to make the "common buffer" solution without rewriting half of the receiver facility from scratch, hence I have created an "application-like" receiver: every socket provides packets as they are ready, according to their own TSBPD (that's why they must be in a perfect time sync). Therefore it's only possible to blindly read whatever is coming from the socket (when the play time comes) and simply discard the packet when it's past the base.

TSBPD then works for every socket separately, and it cannot decide about any packet drops (except for jumping over an empty cell as per TLPKTDROP), it only decides whether the next available packet's time has come or not. Then some reader must read the packet and possibly discard it.

As I understand from a rough review of an extracted key fragment of the log is that the transmission gets stopped for a moment, and during this time SRT group reader falls into a spin-rolling wait-checker. Still, not yet exactly figured out what happened.

The design of CEPoll::swait was such for a reason, and the 0-return (no sockets ready up to the timeout) should handled just as well as exceptions, except that 0-return is a normal expected result, while exceptions should interrupt the reading function.

@ethouris
Copy link
Collaborator

Ok, what I can see additionally here is that the packets over the socket @56331894 after first few, come in very slowly. Once it also happens that from one link a packet is 3 packets behind the other. Times when packets come in are distant much more than they were in the beginning. You can extract the INCOMING phase from the log to see how the packets come in into this connection.

Not sure, but it looks for me as if simply the transmission has stopped, the link didn't deliver packets at all, hence this "ALL LINKS ELEPHANTS" log. When this happens, in the non-blocking mode you should get the group readiness obviously cleared.

Would you be able to reproduce this and have some shorter session, or it happens only if you perform multiple transmissions on multiple groups?

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Feb 25, 2021

once my application wake up by srt_epoll_wait(), I will do while (srt_recvmsg2() > 0)..., I only back to srt_epoll_wait() again if srt_recvmsg2() <= 0
so if the connection lost packets while I am in the loop, srt_recvmsg2() should return error immediately instead of waiting for packets? (in nonblocking mode)

I can reproduce this issue since my test is under unstable network and different links via different ISP.

@gou4shi1
Copy link
Contributor Author

I added BEGIN FUNCTION log at the beginning of group::recv()

08:32:41.313368/SRT:Listener D:SRT.gr: group/recv: @1130073721 BEGIN FUNCTION
08:32:41.325222/SRT:TsbPd:@99 D:SRT.ei: epoll/update: @1130073726 +[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R
][E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.347528/SRT:TsbPd:@99 D:SRT.ei: epoll/update: @1130073726 +[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R
][E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.356572/SRT:TsbPd:@85 D:SRT.ei: epoll/update: @1130073717 +[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R
][E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.369798/SRT:TsbPd:@81 D:SRT.ei: epoll/update: @1130073714 +[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R
][E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.375232/SRT:TsbPd:@85 D:SRT.ei: epoll/update: @1130073717 +[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R
][E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.392347/SRT:TsbPd:@81 D:SRT.ei: epoll/update: @1130073714 +[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R
][E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.392983/SRT:TsbPd:@00 D:SRT.ei: epoll/update: @1130073729 +[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R
][E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.396260/SRT:Listener D:SRT.ei: epoll/update: @1130073721 -[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R]
[E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.396360/SRT:Listener D:SRT.gr: group/recv: @1130073721 BEGIN FUNCTION
08:32:41.397249/SRT:Listener D:SRT.ei: epoll/update: @1130073721 -[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R]
[E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.397278/SRT:Listener D:SRT.gr: group/recv: @1130073721 BEGIN FUNCTION
08:32:41.397919/SRT:Listener D:SRT.ei: epoll/update: @1130073721 -[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R]
[E] @1130073729:[R][E]  NOT updated: no changes
08:32:41.398010/SRT:Listener D:SRT.gr: group/recv: @1130073721 BEGIN FUNCTION

I think the connection didn't lost, it still occasionally got packets, but srt_recvmsg2() just keep re-polling until it got packets (in nonblocking mode), after it delivered some packets, it keep re-polling again.

08:32:41.312042/SRT:Listener D:SRT.gr: group/recv: ALL LINKS ELEPHANTS. Re-polling.
08:32:41.312069/SRT:Listener D:SRT.gr: group/recv: Reviewing member sockets for polling
08:32:41.312101/SRT:Listener D:SRT.gr: group/recv: E(7) @56331898[READ] @56331896[READ] @56331895[READ]  --> EPOLL/SWAIT
08:32:41.312132/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331898 +
08:32:41.312145/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.312181/SRT:Listener D:SRT.ei: epoll/update: @56331898 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.312209/SRT:Listener D:SRT.ei: epoll/update: @56331898 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.312238/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331896 +
08:32:41.312285/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.312303/SRT:Listener D:SRT.ei: epoll/update: @56331896 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.312327/SRT:Listener D:SRT.ei: epoll/update: @56331896 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.312355/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331895 +
08:32:41.312604/SRT:Listener D:SRT.br: getRcvReadyPacket: Found next packet seq=%1746779043 (0 empty cells skipped)
08:32:41.312626/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.049993086 [STDY] + dTS=3064336505us + LATENCY=120ms + uDRIFT=0
08:32:41.312642/SRT:Listener D:SRT.br: isRcvDataReady: packet extracted seqdistance=-1 TsbPdTime=08:31:13.506498086 [STDY]
08:32:41.312657/SRT:Listener D:SRT.ei: epoll/update: @56331895 +[R]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.312669/SRT:Listener D:SRT.ei: epoll/update: @56331895 +[R]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.312682/SRT:Listener D:SRT.ei: epoll/update: @56331895 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.312692/SRT:Listener D:SRT.ei: epoll/update: @56331895 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.312709/SRT:Listener D:SRT.ea: E7 rdy=2: @56331908:[R] @56331895:[R]  TRACKED: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U] 
08:32:41.312772/SRT:Listener D:SRT.gr: group/recv: 2 RDY: @56331895:[R] @56331908:[R] 
08:32:41.312790/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.312823/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.312848/SRT:Listener D:SRT.ar: @56331895:receiveMessage: BEGIN ASYNC MODE. Going to extract payload size=1316
08:32:41.312896/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.049993086 [STDY] + dTS=3064336505us + LATENCY=120ms + uDRIFT=0
08:32:41.312926/SRT:Listener D:SRT.br: getRcvReadyMsg: POS=3447 +0 pkt %1746779043 ready to play (delayed -225ms)
08:32:41.312958/SRT:Listener D:SRT.br: readMsg: checking unit POS=3447
08:32:41.312989/SRT:Listener D:SRT.br: getPktTsbPdTime: TIMEBASE=07:40:09.049993086 [STDY] + dTS=3064336505us + LATENCY=120ms + uDRIFT=0
08:32:41.313020/SRT:Listener D:SRT.br: readMsg: DELIVERED seq=1746779043 T=08:31:13.506498086 [STDY] in 225ms - TIME-PREVIOUS: PKT: 0 LOCAL: 0 !86973B4B NEXT pkt T=NONE
08:32:41.313048/SRT:Listener D:SRT.br: readMsg: FREEING UNIT POS=3447
08:32:41.313074/SRT:Listener D:SRT.br: rcvBuf/extractData: begin=3448 reporting extraction size=684
08:32:41.313104/SRT:Listener D:SRT.ar: @56331895:AFTER readMsg: (NON-BLOCKING) result=684
08:32:41.313136/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.313148/SRT:Listener D:SRT.ar: receiveMessage: DATA READ, but nothing more - kicking TSBPD.
08:32:41.313169/SRT:Listener D:SRT.ei: epoll/update: @56331895 -[R]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U] 
08:32:41.313184/SRT:Listener D:SRT.ei: epoll/update: @56331895 -[R]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: not subscribed
08:32:41.313210/SRT:Listener D:SRT.ar: @56331895:CURRENT BANDWIDTH: 0.011648Mbps (1 buffers per second)
08:32:41.313243/SRT:Listener D:SRT.gr: group/recv: @56331895 EXTRACTED data with %1746779043 #28024: 86973B4B
08:32:41.313263/SRT:Listener D:SRT.gr: group/recv: @56331895 %1746779043 #28024 DELIVERING
08:32:41.313287/SRT:Listener D:SRT.ei: epoll/update: @1130073721 -[R]: E1 TRACKING: @56331908:[R][E] @1130073697:[R][E] @1130073701:[R][E] @1130073714:[R][E] @1130073717:[R][E] @1130073721:[R][E] @1130073726:[R][E] @1130073729:[R][E] 
08:32:41.313300/SRT:Listener D:SRT.gr: group/recv: successfully extracted packet size=684 m_RcvBaseSeqNo=1746779043 - returning
08:32:41.313368/SRT:Listener D:SRT.gr: group/recv: @1130073721 BEGIN FUNCTION
08:32:41.313405/SRT:Listener D:SRT.gr: group/recv: Reviewing member sockets for polling
08:32:41.313418/SRT:Listener D:SRT.gr: group/recv: E(7) @56331898[READ] @56331896[READ] @56331895[READ]  --> EPOLL/SWAIT
08:32:41.313434/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331898 +
08:32:41.313463/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.313484/SRT:Listener D:SRT.ei: epoll/update: @56331898 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.313515/SRT:Listener D:SRT.ei: epoll/update: @56331898 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.313527/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331896 +
08:32:41.313557/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.313607/SRT:Listener D:SRT.ei: epoll/update: @56331896 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.313633/SRT:Listener D:SRT.ei: epoll/update: @56331896 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.313644/SRT:Listener D:SRT.ea: srt_epoll_update_usock: UPDATED E7 for @56331895 +
08:32:41.313653/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.313664/SRT:Listener D:SRT.ei: epoll/update: @56331895 +[W]: E7 TRACKING: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U]  NOT updated: no changes
08:32:41.313704/SRT:Listener D:SRT.ei: epoll/update: @56331895 +[W]: E8 TRACKING: @56331895:[W][E] @56331896:[W][E] @56331898:[W][E]  NOT updated: no changes
08:32:41.313737/SRT:Listener D:SRT.ea: E7 rdy=1: @56331908:[R]  TRACKED: @56331895:[R][E] @56331896:[R][E] @56331898:[R][E] @56331908:[R][^U] 
08:32:41.313803/SRT:Listener D:SRT.gr: group/recv: 1 RDY: @56331908:[R] 
08:32:41.313848/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.313888/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.313955/SRT:Listener D:SRT.br: isRcvDataReady: packet NOT extracted.
08:32:41.314005/SRT:Listener D:SRT.gr: group/recv: NOT extracted anything - checking for a need to kick kangaroos
08:32:41.314053/SRT:Listener D:SRT.gr: group/recv: ALL LINKS ELEPHANTS. Re-polling.

@ethouris
Copy link
Collaborator

Ah ok. I I'll need to review this again, but I think I understand the problem.

@ethouris
Copy link
Collaborator

Ok, would you be able to help me with testing it?

The same branch on my repo replica: dev-test-spurious-group-epoll. I have only made a rough test with non-blocking mode and pause-5s-resume on the sender, at least it stops and then continues reading.

@gou4shi1
Copy link
Contributor Author

It has been test for a day, so far so good, thanks a lot!

@gou4shi1 gou4shi1 changed the title [BUG] m_Positions in CUDTGroup::recv() should exclude listen socket [BUG] CUDTGroup::recv() cause the "No room" issue Feb 28, 2021
@gou4shi1 gou4shi1 changed the title [BUG] CUDTGroup::recv() cause the "No room" issue [BUG] CUDTGroup::recv() causes the "No room" issue Feb 28, 2021
@gou4shi1 gou4shi1 changed the title [BUG] CUDTGroup::recv() causes the "No room" issue [BUG] CUDTGroup::recv() may cause the "No room" issue Mar 1, 2021
@ethouris
Copy link
Collaborator

ethouris commented Mar 1, 2021

Ok, I consider this a fix then. If you think this needs more testing, go on, I'll prepare a PR out of it and only mention this issue.

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Mar 1, 2021

Sad :(
It seems that your branch may stuck, the log will stop printing, maybe deadlock?
https://1drv.ms/u/s!AuaosNlJ5ELxjyv2ZUVo7_E09J4D?e=Fy5Q51

@ethouris
Copy link
Collaborator

ethouris commented Mar 1, 2021

Hmm, I don't think I introduced anything that would cause it. I'll research it.

@ethouris
Copy link
Collaborator

ethouris commented Mar 1, 2021

Do you still have it this way? Can you forcefully break it with coredump (using e.g. Ctrl+\ or killing with SIGQUIT)?

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Mar 1, 2021

At 14:50, I found it stuck at

13:03:19.159320/SRT:RcvQ:w1 D:SRT.cn: createSrtHandshake: buf size=1456 hsx=EXT:hsrsp kmx=EXT:kmrsp kmdata_wordsize=0 version=5
13:03:19.159314/SRT:Listener D:SRT.gm: GROUP $1443329016 OPTION: #37 value:00
13:03:19.159326/SRT:RcvQ:w1 D:SRT.cn: createSrtHandshake: NOT  Advertising PBKEYLEN - value = 0
13:03:19.159374/SRT:Listener D:SRT.gm: ... SPREADING to existing sockets.
13:03:19.159380/SRT:RcvQ:w1 D:SRT.cn: createSrtHandshake: (ext: HSX,GROUP) data: version=5 type=0x5 ISN=648483814 MSS=1500 FLW=8192 reqtype=conclusion srcID=369587191 cookie=25415686 srcIP=218.107.32.5.0.0.0.0.0.0.0.0.0.0.0.0.FLAGS:  hsx config no-pbklen
13:03:19.159401/SRT:RcvQ:w1 D:SRT.cn: HSRSP/snd: HSv5 peer uses TSBPD, responding TSBPDSND latency=120
13:03:19.159395/SRT:Listener D:SRT.ac: @369587193:OPTION: #37 value:00
13:03:19.159406/SRT:RcvQ:w1 D:SRT.cn: HSRSP/snd: AGENT UNDERSTANDS REXMIT flag and PEER reported that it does, too.
13:03:19.159448/SRT:RcvQ:w1 D:SRT.cn: HSRSP/snd: LATENCY[SND:120 RCV:120] FLAGS[+TSBPD-snd +TSBPD-rcv +haicrypt +TLPktDrop +NAKReport +ReXmitFlag -StreamAPI +unknown]
13:03:19.159455/SRT:RcvQ:w1 D:SRT.cn: createSrtHandshake: after HSREQ: offset=13 HSREQ size=3 space left: 351

Then I restarted the server, it quickly stuck again after printing createSrtHandshake (the log is posted above).
Then I restarted the server again, it back to normal now.

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Mar 1, 2021

I will SIGQUIT it or gdb attach it next time 😂

@ethouris
Copy link
Collaborator

ethouris commented Mar 1, 2021

This looks like something during the handshake, so rather unrelated to what I have changed.

I can see some potential out-of-order locking at the moment when it locks the socket for adding it to the group, but the problem is potentially tough only in blocking mode, while as I saw you are using a non-blocking mode. At least this is a trace I can follow.

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Mar 1, 2021

Does srt have some global lock? Otherwise some other threads should still logging?

@ethouris
Copy link
Collaborator

ethouris commented Mar 1, 2021

There are mutexes that guard some particular data, that's all. Other threads should be logging, unless they are hanged up, too.

@ethouris
Copy link
Collaborator

ethouris commented Mar 1, 2021

Ok, I found another lock disorder. Lock on CUDTSocket::m_ControlLock is probably unnecessary, and it may cause a deadlock as it orders before m_ConnectionLock. Fortunately, m_ConnectionLock orders before m_GlobControlLock, which's locking is actually absolutely necessary. I updated the branch, you might want to use it for further testing.

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Mar 3, 2021

It has been test for a day, so far so good, thanks a lot!

@ethouris
Copy link
Collaborator

ethouris commented Mar 3, 2021

It's me to thank you for testing, I couldn't have done it without you.

@maxsharabayko
Copy link
Collaborator

Hi guys. @gou4shi1 @ethouris
Do I understand correctly that the initial issue is no longer valid. And instead, this ticket now tracks the issue described in this comment? 🙂

@maxsharabayko maxsharabayko modified the milestones: v1.4.4, v1.4.5 Aug 12, 2021
@gou4shi1
Copy link
Contributor Author

this comment is already fixed by #2026
The last remaining issue is #1835

@maxsharabayko
Copy link
Collaborator

#1835 says it fixes the problem described here, in #1805. Does it refer to the initial issue then?

@ethouris
Copy link
Collaborator

The "initial issue" is only a description of a symptom.

@maxsharabayko
Copy link
Collaborator

The "initial issue" is only a description of a symptom.

I would appreciate someone to describe the issue remaining here, and what is fixed in #1835.

@gou4shi1
Copy link
Contributor Author

gou4shi1 commented Aug 13, 2021

I would appreciate someone to describe the issue remaining here, and what is fixed in #1835.

Sure, in non-blocking mode, if recv_WaitForReadReady() return empty, recv() should throw CUDTException() instead of keep re-polling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[core] Area: Changes in SRT library core Type: Bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants