
UDP socket closed while still used by other threads #306

Closed
fboucher67 opened this issue Mar 16, 2018 · 25 comments · Fixed by #808
Labels
[core] Area: Changes in SRT library core · Priority: Medium · Type: Bug
Comments

@fboucher67
Contributor

The CUDTUnited::removeSocket function closes the channel before destroying the RcvQueue and SndQueue objects. The RcvQueue is still actively using the UDP socket handle in a call to recvmsg(). Closing the socket does NOT reliably cause recvmsg() to return with an error; it sometimes hangs indefinitely. This causes random deadlocks, corruption, or crashes involving the file descriptor.
A possible fix is to close the channel only after the RcvQueue and SndQueue have been deleted.

@ethouris
Collaborator

Please check whether this still happens on the dev branch version - I remember fixing a similar problem.

@fboucher67
Contributor Author

With the dev branch I get the following error message:
CChannel reported ERROR DURING TRANSMISSION - IPE. INTERRUPTING worker anyway.
This should mitigate the problem.
However, there is still a risk, since removeSocket closes the socket while the RcvQueue and SndQueue worker threads are still running.
Would it be possible to call "m->second.m_pChannel->close()" after calling "delete m->second.m_pRcvQueue"?

@ethouris
Collaborator

Well, if you see any IPE (Internal Program Error) in the logs, it's wise to raise the alarm.

This message appears when the call to ::recvmsg on the underlying UDP socket fails with an error classified as a "read error"; apart from the predictable errors, the only such error is EBADF, which most likely means the socket was closed. This is the intended way to interrupt the worker thread when the socket is being closed. The problem is that the m_bClosing flag is not set, hence the message.

This message is probably not a big deal, but it still means that something isn't completely taken care of. In this particular case the receiver worker is forced to stop reading by having its socket destroyed; the sender shouldn't send any more data when this happens in file mode, and in live mode you just interrupt the transmission.

You can experiment with this reordering, but I have some doubts about it. First and foremost, before you delete the queue for a particular direction you must be sure that its worker thread has exited, otherwise you'll get an immediate crash. The channel should also not be in active use at the moment it is deleted; closing is usually initiated in the main thread, whereas the physical deletion happens in the GC thread. From your report it looks like the receiver worker thread was still running at the moment the channel was being closed.

On the other hand, the destructors of CRcvQueue and CSndQueue join their worker threads, so you must somehow force the threads to exit before calling these destructors, otherwise you'll cause a hang.

@fboucher67
Contributor Author

It took longer, but the same deadlock reproduced with the dev branch. The GC thread is waiting for the RcvQueue worker thread to stop, while the RcvQueue worker thread is stuck indefinitely in a call to recvmsg() on a file descriptor that has already been closed.

This problem is easier to reproduce when calling the srt_cleanup() function, since that forces the GC to remove all remaining sockets.

Please do not force the receive worker to stop by destroying its socket. That is not a good idea for Unix or Linux file descriptors.

On Unix or Linux, the SO_RCVTIMEO option is set on the UDP socket, so the CRcvQueue worker thread will exit at the next timeout (as long as the socket handle remains valid).
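
For illustration only, this is the generic POSIX mechanism referred to here (a sketch, not the actual SRT channel code): with a receive timeout set, a blocked recvmsg() returns with EAGAIN/EWOULDBLOCK after the timeout, which gives the worker loop a chance to re-check its closing flag between calls.

    #include <sys/socket.h>
    #include <sys/time.h>

    // Sketch: give a UDP socket a 100 ms receive timeout so that a blocking
    // recvmsg() wakes up periodically instead of hanging forever.
    static void set_recv_timeout(int fd)
    {
        struct timeval tv;
        tv.tv_sec  = 0;
        tv.tv_usec = 100 * 1000;   // 100 ms
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    }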

@ethouris added the Priority: Medium, Type: Bug, and Status: In Progress labels and removed the Status: Pending label on Mar 22, 2018
@ethouris
Collaborator

Ok, here's what needs to be done:

  1. Add something like a deactivate() method to both the CSndQueue and CRcvQueue classes. This method should simply set the m_bClosing flag to true. Make sure this flag is volatile. Compilers don't guarantee that this provides memory barriers, but there are also mutexes in use, so it should suffice.
  2. Call the deactivate() method on both the sending and receiving queues before closing the channel.

This should, most likely, make the queues exit as soon as the previous read from the channel (or send) completes and the flag checkpoint in the while loop breaks the loop. If the worker thread was in the middle of reading while removeSocket was closing the channel, the read operation will still fail, but this time, by checking the flag, the worker thread will know that it should simply exit anyway.
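
A minimal sketch of the idea, assuming a hypothetical deactivate() method and a simplified worker loop (this is not the actual SRT code; readFromChannel() is a made-up helper):

    // Sketch only - simplified shape of the proposed change.
    class CRcvQueue
    {
    public:
        CRcvQueue(): m_bClosing(false) {}

        // Proposed deactivate(): raise the closing flag so the worker exits
        // at its next checkpoint instead of relying on the fd being destroyed.
        void deactivate() { m_bClosing = true; }

    private:
        volatile bool m_bClosing;

        void worker()
        {
            while (!m_bClosing)
            {
                // Blocking read on the channel; returns false on error or
                // timeout, after which m_bClosing is checked again.
                if (!readFromChannel())
                    continue;
                // ... process the received packet ...
            }
        }

        bool readFromChannel();   // hypothetical helper
    };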

@gegles
Contributor

gegles commented Oct 16, 2018

@fboucher67 @ethouris I think I am facing this same issue! About 50% of the time, after closing the SRT socket, the thread seems stuck in ::recvmsg()... Any fix or workaround?

@fboucher67
Contributor Author

The workaround that I'm currently using is to close the socket/channel after deleting the RcvQueue (this is from the CUDTUnited::removeSocket function):

if (0 == m->second.m_iRefCount)
{
   // m->second.m_pChannel->close();   // <- original position of close()
   delete m->second.m_pSndQueue;
   delete m->second.m_pRcvQueue;
   m->second.m_pChannel->close();      // close only after the queues are gone

@gegles
Contributor

gegles commented Oct 16, 2018

Thanks @fboucher67. I applied your fix; at first I thought it would help (at least it seemed to in a localhost test), but on a real connection I am still seeing the deadlock. I will test further to validate this theory. Cheers. G.

@ethouris
Collaborator

If this change fixes the problem, could you please submit a PR?

@gegles
Contributor

gegles commented Oct 23, 2018

As I said, it may have helped, but the problem does not seem to be completely gone. I just don't know enough to tell.

But if it doesn't hurt and it helps in some situations, @fboucher67 should definitely submit a PR.

@mrfrodl

mrfrodl commented Mar 5, 2019

Hi, I've noticed occasional crashes of the library and wonder if they are the same issue as described here.

This is how it crashed last time:

Program terminated with signal 11, Segmentation fault.
#0  0x00007fd798ad4fe9 in CChannel::sendto (this=0xd77f000090922f88, addr=0x7fd7889792c0, packet=...) at /home/user1/srt/srtcore/channel.cpp:459
459	      mh.msg_namelen = m_iSockAddrSize;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.3.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libgcc-4.8.5-36.el7.x86_64 libselinux-2.5-12.el7.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64 openssl-libs-1.0.2k-16.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007fd798ad4fe9 in CChannel::sendto (this=0xd77f000090922f88, addr=0x7fd7889792c0, packet=...) at /home/user1/srt/srtcore/channel.cpp:459
#1  0x00007fd798aff33f in CSndQueue::sendto (this=0x7fd7880248a0, addr=0x7fd7889792c0, packet=...) at /home/user1/srt/srtcore/queue.cpp:610
#2  0x00007fd798ae6d67 in CUDT::sendCtrl (this=0x7fd7882dcd30, pkttype=UMSG_ACK, lparam=0x0, rparam=0x0, size=0) at /home/user1/srt/srtcore/core.cpp:6444
#3  0x00007fd798aeb380 in CUDT::checkTimers (this=0x7fd7882dcd30) at /home/user1/srt/srtcore/core.cpp:8411
#4  0x00007fd798b010b4 in CRcvQueue::worker_ProcessAddressedPacket (this=0x7fd78806b570, id=855480738, unit=0x7fd5347007f8, addr=0x7fd6a37fdd10) at /home/user1/srt/srtcore/queue.cpp:1599
#5  0x00007fd798b00a4f in CRcvQueue::worker (param=0x7fd78806b570) at /home/user1/srt/srtcore/queue.cpp:1208
#6  0x00007fd797d46dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fd797a6fead in clone () from /lib64/libc.so.6
(gdb) p *this
Cannot access memory at address 0xd77f000090922f88
(gdb) up
#1  0x00007fd798aff33f in CSndQueue::sendto (this=0x7fd7880248a0, addr=0x7fd7889792c0, packet=...) at /home/user1/srt/srtcore/queue.cpp:610
610	   m_pChannel->sendto(addr, packet);
(gdb) p *this
$1 = {m_WorkerThread = 15456072446169808801, m_pSndUList = 0xd77f000020234b89, m_pChannel = 0xd77f000090922f88, m_pTimer = 0xd77f000010d69988, m_WindowLock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 16777216, __kind = 0, __spins = 0, 
      __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 15 times>, "\001", '\000' <repeats 23 times>, __align = 0}, m_WindowCond = {__data = {__lock = 0, __futex = 16777216, __total_seq = 16777216, __wakeup_seq = 0, 
      __woken_seq = 0, __mutex = 0xd77f0000c0480288, __nwaiters = 33554432, __broadcast_seq = 0}, __size = "\000\000\000\000\000\000\000\001\000\000\000\001", '\000' <repeats 20 times>, "\210\002H\300\000\000\177\327\000\000\000\002\000\000\000", 
    __align = 72057594037927936}, m_bClosing = 136, m_ExitCond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}}
(gdb) up
#2  0x00007fd798ae6d67 in CUDT::sendCtrl (this=0x7fd7882dcd30, pkttype=UMSG_ACK, lparam=0x0, rparam=0x0, size=0) at /home/user1/srt/srtcore/core.cpp:6444
6444	         nbsent = m_pSndQueue->sendto(m_pPeerAddr, ctrlpkt);
(gdb) p *m_pSndQueue
$2 = {m_WorkerThread = 15456072446169808801, m_pSndUList = 0xd77f000020234b89, m_pChannel = 0xd77f000090922f88, m_pTimer = 0xd77f000010d69988, m_WindowLock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 16777216, __kind = 0, __spins = 0, 
      __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 15 times>, "\001", '\000' <repeats 23 times>, __align = 0}, m_WindowCond = {__data = {__lock = 0, __futex = 16777216, __total_seq = 16777216, __wakeup_seq = 0, 
      __woken_seq = 0, __mutex = 0xd77f0000c0480288, __nwaiters = 33554432, __broadcast_seq = 0}, __size = "\000\000\000\000\000\000\000\001\000\000\000\001", '\000' <repeats 20 times>, "\210\002H\300\000\000\177\327\000\000\000\002\000\000\000", 
    __align = 72057594037927936}, m_bClosing = 136, m_ExitCond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}}
(gdb) 

@maxsharabayko
Collaborator

@mrfrodl Please specify the version of SRT you are using, and the OS.

@maxsharabayko added this to the v.1.3.3 milestone on Mar 5, 2019
@mrfrodl

mrfrodl commented Mar 5, 2019

The OS is CentOS 7. The version is ethouris@49406b2. I am also trying to reproduce the issue with v1.3.2 (no crash yet), but it happens so rarely that I thought it best to ask.

@mrfrodl

mrfrodl commented Mar 7, 2019

Hi, I have an update: version 1.3.2 crashed due to heap corruption.

Crash message:

*** Error in `./transcoder': free(): corrupted unsorted chunks: 0x00007fdce0011760 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81489)[0x7fdceda01489]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CRcvBufferD1Ev+0x8b)[0x7fdcee20109d]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN4CUDTD1Ev+0x6d)[0x7fdcee20acc7]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CUDTSocketD2Ev+0x70)[0x7fdcee1e1db8]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CUDTUnited12removeSocketEi+0x5c4)[0x7fdcee1e843a]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CUDTUnited18checkBrokenSocketsEv+0x622)[0x7fdcee1e7de6]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CUDTUnited14garbageCollectEPv+0x5b)[0x7fdcee1e95e7]
/lib64/libpthread.so.0(+0x7dd5)[0x7fdcedd54dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fdceda7dead]
======= Memory map: ========
[...]

@maxsharabayko pinned this issue on Mar 13, 2019
@ethouris
Collaborator

I've experienced some crashes in this code when I tried to add new objects that would later be deleted by array delete; I had to find an alternative solution.

If you can find a procedure that reproduces this, I'd really appreciate it. Even if it has to be repeated 1000 times in a row to succeed once - that doesn't matter, I can leave it running on a machine until it crashes.

Such a repro, even if it doesn't immediately point to a solution, may at least help me get closer to one and later prove that the problem has been taken care of.

At least for this backtrace, even without having the direct pointer, I believe the crash happens in this line:

   delete [] m_pUnit;

There's no direct reason for this line to crash - it's a private field, allocated once in the constructor and never accessed outside the class. This looks more like corruption happening at the memory level.

@mrfrodl

mrfrodl commented Mar 21, 2019

Steps to reproduce

  • sender and receiver processes are on one machine
  • sender is listener
  • artificial packet loss and latency on localhost using netem (sudo tc qdisc add dev lo root netem limit 40000 delay 90ms loss 2.0%)
  • have the linger option enabled

Sender process

  1. listen and accept one connection
  2. send data for ~30 seconds
  3. close connection and repeat from 1.

Receiver process

  1. start the process (initialize SRT)
  2. connect and keep receiving for about 22 seconds
  3. close the connection
  4. cleanup SRT, exit process and start again from 1.

Notes

  • sender has the same SRT context the whole time (srt_startup -> srt_cleanup) as opposed to the receiver where the context is created and destroyed in each iteration
  • iteration times differ slightly for sender and receiver so that they meet in different phases of their cycle every time

Observed outcome

It usually takes hours (sometimes even days) to reproduce. Either of the processes (sender, receiver) may crash. I have seen the receiver crash with SIGSEGV and the sender with SIGABRT (corrupted heap), but I suppose both could have the same underlying cause (writing outside allocated memory), which could end either way.
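
For reference, a bare-bones outline of the receiver cycle described above, written against the public SRT C API (address setup, socket options, and error handling are omitted; this is a sketch of the loop, not the actual test program):

    #include <srt/srt.h>
    #include <sys/socket.h>

    // One receiver iteration: startup -> connect -> receive -> close -> cleanup.
    static void receiver_iteration(const struct sockaddr* sa, int salen)
    {
        srt_startup();                                     // 1. initialize SRT
        SRTSOCKET s = srt_socket(AF_INET, SOCK_DGRAM, 0);
        srt_connect(s, sa, salen);                         // 2. connect
        char buf[1500];
        // ... keep calling srt_recvmsg(s, buf, sizeof buf) for ~22 seconds ...
        srt_close(s);                                      // 3. close the connection
        srt_cleanup();                                     // 4. cleanup, then start over
    }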

@maxsharabayko
Collaborator

@ethouris
The issue is likely due to the SRTO_LINGER socket option. We have had a short e-mail discussion with @mrfrodl: he reports no crashes for five days now with linger disabled (please correct me if that has changed).
With linger enabled, the program is likely to crash if there is some data left in the sending buffer at the moment the socket is closed. I think this can be reproduced more easily by modifying the CSndBuffer::getCurrBufSize() function to always return a positive value; this enables the linger path in CUDT::close() and onwards.
I've hit the issue with srt-live-transmit (non-blocking mode).
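
For reproduction only, the kind of modification meant here might look like this (a sketch; the real function returns the number of packets currently pending in the send buffer):

    // Debug-only hack (sketch): pretend the send buffer is never empty so
    // that CUDT::close() always takes the linger path.
    int CSndBuffer::getCurrBufSize() const
    {
        return 1;   // instead of the real pending-packet count
    }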

@mrfrodl

mrfrodl commented Mar 25, 2019

Update: I've had the reproducer running with SRTO_LINGER disabled and it hasn't crashed in over a week. I consider this a stable workaround for the issue.
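
For anyone wanting to apply the same workaround, disabling linger on an SRT socket looks roughly like this (a sketch; error handling omitted):

    #include <srt/srt.h>
    #include <sys/socket.h>   // struct linger

    // Workaround sketch: turn SRTO_LINGER off so srt_close() does not keep
    // the socket half-alive while unsent data drains.
    static void disable_linger(SRTSOCKET s)
    {
        struct linger lin;
        lin.l_onoff  = 0;   // linger disabled
        lin.l_linger = 0;
        srt_setsockopt(s, 0, SRTO_LINGER, &lin, sizeof(lin));
    }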

@ethouris
Collaborator

Ok, I have reproduced this with this procedure and 10% netem loss. In the log I can see that the main thread is closing the socket and reports that closing has finished, while at that very moment the sender buffer is preparing a packet to send (in the SndQ:worker thread), and that is exactly when the crash happens. Interestingly, the "lingering" has already finished by then. Investigating.

@maxsharabayko
Collaborator

@mrfrodl What are the values of SRTO_RCVSYN and SRTO_SNDSYN in your experiments? By default both are set to true.

@mrfrodl

mrfrodl commented Mar 26, 2019

I set both SRTO_RCVSYN and SRTO_SNDSYN to false on the sender side and to true on the receiver side.

@ethouris
Collaborator

@mrfrodl I made a fix: #627.

During my experiments I found that the problem is mainly due to the undefined order of destruction of global objects in C++. If srt_cleanup is properly called at the very end of the program, this doesn't happen.

The problem is that with non-blocking mode and linger on, the socket is not really closed; it is only flagged for closing, and the rest of the facilities keep working until there is no more data to process, at which point the GC thread takes over. When srt_cleanup is called, all these sockets are wiped out.

Could you please check whether this fix resolves the problem?

@maxsharabayko added the Status: Abandoned label and removed the Status: In Progress label on May 29, 2019
@maxsharabayko modified the milestones: v.1.3.3, v.1.3.4 on May 29, 2019
@maxsharabayko unpinned this issue on Jul 5, 2019
@maxsharabayko removed this from the v1.3.4 milestone on Aug 9, 2019
@maxsharabayko
Collaborator

Assuming #627 fixes this. Closing. But don't hesitate to reopen if required.

@fboucher67
Contributor Author

Calling setClosing() on recvQ and sendQ does not fix the issue. Silencing the messages does not fix the issue. When close() is called on a file descriptor, another thread may create a new descriptor and obtain the same value; recvQ and sendQ then perform recvfrom and sendto operations on a file descriptor that belongs to another thread. Before closing a socket, we must guarantee that it is no longer being used: after calling setClosing() we must wait until recvQ and sendQ have actually tested the m_bClosing flag. Synchronization on the m_bClosing flag is missing.
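
A sketch of the kind of ordering being asked for; setClosing() is the flag setter mentioned above, while the stand-in WorkerQueue type and joinWorker() represent whatever mechanism guarantees the worker has observed the flag and stopped touching the descriptor (illustrative only, not the eventual fix):

    #include <pthread.h>
    #include <unistd.h>

    // Illustrative stand-in for a queue that owns a worker thread using the fd.
    struct WorkerQueue
    {
        volatile bool m_bClosing;
        pthread_t     m_Thread;

        void setClosing() { m_bClosing = true; }
        // Joining guarantees the worker has observed m_bClosing and can no
        // longer touch the file descriptor.
        void joinWorker() { pthread_join(m_Thread, NULL); }
    };

    // Safe teardown order: flag both workers, wait for them, then close the fd.
    static void teardown(WorkerQueue& rcvQ, WorkerQueue& sndQ, int udp_fd)
    {
        rcvQ.setClosing();
        sndQ.setClosing();
        rcvQ.joinWorker();
        sndQ.joinWorker();
        close(udp_fd);   // no thread can still be using (or reusing) this fd
    }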

@maxsharabayko reopened this on Aug 9, 2019
@maxsharabayko added this to the v1.4.1 milestone on Aug 9, 2019
@maxsharabayko removed the Status: Abandoned label on Aug 9, 2019
@maxsharabayko
Collaborator

@fboucher67 Thanks for the update! Reopened.
