
UDP socket closed while still used by other threads #306

Closed
fboucher67 opened this issue Mar 16, 2018 · 25 comments · Fixed by #808
Labels
[core] Area: Changes in SRT library core · Priority: Medium · Type: Bug
Comments

@fboucher67
Contributor

The CUDTUnited::removeSocket function closes the channel before destroying the RcvQueue and SndQueue objects. The RcvQueue is still actively using the UDP socket handle in a call to recvmsg(). Closing the socket does NOT reliably cause recvmsg() to return with an error; it sometimes hangs indefinitely. This causes random deadlocks, corruption, or crashes involving the file descriptor.
A possible fix is to close the channel only after the RcvQueue and SndQueue have been deleted.

@ethouris
Collaborator

Please check whether this still happens on the dev branch version - I remember fixing a similar problem.

@fboucher67
Contributor Author

With the dev branch I get the following error message:
CChannel reported ERROR DURING TRANSMISSION - IPE. INTERRUPTING worker anyway.
This should mitigate the problem.
However, there is still a risk, since removeSocket closes the socket while the RcvQueue and SndQueue worker threads are still running.
Would it be possible to call "m->second.m_pChannel->close()" after calling "delete m->second.m_pRcvQueue"?

@ethouris
Collaborator

Well, if you see any IPE (Internal Program Error) in the logs, it's wise to raise the alarm.

This message appears when the call to ::recvmsg on the underlying UDP socket fails with an error classified as a "read error"; apart from the predictable errors, the only such error is EBADF, which most likely means the socket was closed. This is the intended way to interrupt the worker thread when the socket is being closed. The problem is that the m_bClosing flag is not set, hence the message.

This message is probably not a big deal, but it still means that something isn't completely taken care of. In this particular case the receiver worker is forced to stop reading by having its socket destroyed; the sender shouldn't send any more data when this happens in file mode, and in live mode you just interrupt the transmission.

You can experiment with this reordering, but I have some doubts about it. First and foremost, before you delete the queue for a particular direction you must be sure that its worker thread has exited, otherwise you'll get an immediate crash. The channel should also not be in active use at the moment it is deleted; closing is usually initiated in the main thread, whereas the physical deletion happens in the GC thread. From your report it looks like the receiver worker thread was still running at the moment the channel was being closed.

On the other hand, the destructors of CRcvQueue and CSndQueue join their worker threads, so you must somehow force the threads to exit before calling these destructors, otherwise you'll cause a hang.

@fboucher67
Contributor Author

It took longer, but the same deadlock reproduced with the dev branch. The GC thread is waiting for the RcvQueue worker thread to stop, while the RcvQueue worker thread is stuck indefinitely in a call to recvmsg() on a file descriptor that has already been closed.

This problem is easier to reproduce when calling the srt_cleanup() function, since that forces the GC to remove all remaining sockets.

Please do not force the receive worker to stop by destroying its socket. That is not a good idea for Unix or Linux file descriptors.

On Unix or Linux, the SO_RCVTIMEO option is set on the UDP socket, so the CRcvQueue worker thread will exit at the next timeout (as long as the socket handle remains valid).
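
For illustration only, this is the generic POSIX mechanism referred to here (a sketch, not the actual SRT channel code): with a receive timeout set, a blocked recvmsg() returns with EAGAIN/EWOULDBLOCK after the timeout, which gives the worker loop a chance to re-check its closing flag between calls.

    #include <sys/socket.h>
    #include <sys/time.h>

    // Sketch: give a UDP socket a 100 ms receive timeout so that a blocking
    // recvmsg() wakes up periodically instead of hanging forever.
    static void set_recv_timeout(int fd)
    {
        struct timeval tv;
        tv.tv_sec  = 0;
        tv.tv_usec = 100 * 1000;   // 100 ms
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    }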

@ethouris added the Priority: Medium, Type: Bug, and Status: In Progress labels and removed the Status: Pending label on Mar 22, 2018
@ethouris
Collaborator

Ok, here's what needs to be done:

  1. Add something like a deactivate() method to both the CSndQueue and CRcvQueue classes. This method should simply set the m_bClosing flag to true. Make sure this flag is volatile. Compilers don't guarantee that this provides memory barriers, but there are also mutexes in use, so it should suffice.
  2. Call the deactivate() method on both the sending and receiving queues before closing the channel.

This should, most likely, make the queues exit as soon as the previous read from the channel (or send) completes and the flag checkpoint in the while loop breaks the loop. If the worker thread was in the middle of reading while removeSocket was closing the channel, the read operation will still fail, but this time, by checking the flag, the worker thread will know that it should simply exit anyway.
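
A minimal sketch of the idea, assuming a hypothetical deactivate() method and a simplified worker loop (this is not the actual SRT code; readFromChannel() is a made-up helper):

    // Sketch only - simplified shape of the proposed change.
    class CRcvQueue
    {
    public:
        CRcvQueue(): m_bClosing(false) {}

        // Proposed deactivate(): raise the closing flag so the worker exits
        // at its next checkpoint instead of relying on the fd being destroyed.
        void deactivate() { m_bClosing = true; }

    private:
        volatile bool m_bClosing;

        void worker()
        {
            while (!m_bClosing)
            {
                // Blocking read on the channel; returns false on error or
                // timeout, after which m_bClosing is checked again.
                if (!readFromChannel())
                    continue;
                // ... process the received packet ...
            }
        }

        bool readFromChannel();   // hypothetical helper
    };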

@gegles
Contributor

gegles commented Oct 16, 2018

@fboucher67 @ethouris I think I am facing this same issue! About 50% of the time, after closing the SRT socket, the thread seems stuck in ::recvmsg()... Any fix or workaround?

@fboucher67
Contributor Author

The workaround that I'm currently using is to close the socket/channel after deleting the RcvQueue (this is from the CUDTUnited::removeSocket function):

if (0 == m->second.m_iRefCount)
{
   // m->second.m_pChannel->close();   // <- original position of close()
   delete m->second.m_pSndQueue;
   delete m->second.m_pRcvQueue;
   m->second.m_pChannel->close();      // close only after the queues are gone

@gegles
Contributor

gegles commented Oct 16, 2018

Thanks @fboucher67. I applied your fix; at first I thought it would help (at least it seemed to in a localhost test), but on a real connection I am still seeing the deadlock. I will test further to validate this theory. Cheers. G.

@ethouris
Collaborator

If this change fixes the problem, could you please submit a PR?

@gegles
Contributor

gegles commented Oct 23, 2018

As I said, it may have helped, but the problem does not seem to be completely gone. I just don't know enough to tell.

But if it doesn't hurt and it helps in some situations, @fboucher67 should definitely submit a PR.

@mrfrodl

mrfrodl commented Mar 5, 2019

Hi, I've noticed occasional crashes of the library and wonder if they are the same issue as described here.

This is how it crashed last time:

Program terminated with signal 11, Segmentation fault.
#0  0x00007fd798ad4fe9 in CChannel::sendto (this=0xd77f000090922f88, addr=0x7fd7889792c0, packet=...) at /home/user1/srt/srtcore/channel.cpp:459
459	      mh.msg_namelen = m_iSockAddrSize;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.3.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libgcc-4.8.5-36.el7.x86_64 libselinux-2.5-12.el7.x86_64 libstdc++-4.8.5-28.el7_5.1.x86_64 openssl-libs-1.0.2k-16.el7.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007fd798ad4fe9 in CChannel::sendto (this=0xd77f000090922f88, addr=0x7fd7889792c0, packet=...) at /home/user1/srt/srtcore/channel.cpp:459
#1  0x00007fd798aff33f in CSndQueue::sendto (this=0x7fd7880248a0, addr=0x7fd7889792c0, packet=...) at /home/user1/srt/srtcore/queue.cpp:610
#2  0x00007fd798ae6d67 in CUDT::sendCtrl (this=0x7fd7882dcd30, pkttype=UMSG_ACK, lparam=0x0, rparam=0x0, size=0) at /home/user1/srt/srtcore/core.cpp:6444
#3  0x00007fd798aeb380 in CUDT::checkTimers (this=0x7fd7882dcd30) at /home/user1/srt/srtcore/core.cpp:8411
#4  0x00007fd798b010b4 in CRcvQueue::worker_ProcessAddressedPacket (this=0x7fd78806b570, id=855480738, unit=0x7fd5347007f8, addr=0x7fd6a37fdd10) at /home/user1/srt/srtcore/queue.cpp:1599
#5  0x00007fd798b00a4f in CRcvQueue::worker (param=0x7fd78806b570) at /home/user1/srt/srtcore/queue.cpp:1208
#6  0x00007fd797d46dd5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007fd797a6fead in clone () from /lib64/libc.so.6
(gdb) p *this
Cannot access memory at address 0xd77f000090922f88
(gdb) up
#1  0x00007fd798aff33f in CSndQueue::sendto (this=0x7fd7880248a0, addr=0x7fd7889792c0, packet=...) at /home/user1/srt/srtcore/queue.cpp:610
610	   m_pChannel->sendto(addr, packet);
(gdb) p *this
$1 = {m_WorkerThread = 15456072446169808801, m_pSndUList = 0xd77f000020234b89, m_pChannel = 0xd77f000090922f88, m_pTimer = 0xd77f000010d69988, m_WindowLock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 16777216, __kind = 0, __spins = 0, 
      __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 15 times>, "\001", '\000' <repeats 23 times>, __align = 0}, m_WindowCond = {__data = {__lock = 0, __futex = 16777216, __total_seq = 16777216, __wakeup_seq = 0, 
      __woken_seq = 0, __mutex = 0xd77f0000c0480288, __nwaiters = 33554432, __broadcast_seq = 0}, __size = "\000\000\000\000\000\000\000\001\000\000\000\001", '\000' <repeats 20 times>, "\210\002H\300\000\000\177\327\000\000\000\002\000\000\000", 
    __align = 72057594037927936}, m_bClosing = 136, m_ExitCond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}}
(gdb) up
#2  0x00007fd798ae6d67 in CUDT::sendCtrl (this=0x7fd7882dcd30, pkttype=UMSG_ACK, lparam=0x0, rparam=0x0, size=0) at /home/user1/srt/srtcore/core.cpp:6444
6444	         nbsent = m_pSndQueue->sendto(m_pPeerAddr, ctrlpkt);
(gdb) p *m_pSndQueue
$2 = {m_WorkerThread = 15456072446169808801, m_pSndUList = 0xd77f000020234b89, m_pChannel = 0xd77f000090922f88, m_pTimer = 0xd77f000010d69988, m_WindowLock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 16777216, __kind = 0, __spins = 0, 
      __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 15 times>, "\001", '\000' <repeats 23 times>, __align = 0}, m_WindowCond = {__data = {__lock = 0, __futex = 16777216, __total_seq = 16777216, __wakeup_seq = 0, 
      __woken_seq = 0, __mutex = 0xd77f0000c0480288, __nwaiters = 33554432, __broadcast_seq = 0}, __size = "\000\000\000\000\000\000\000\001\000\000\000\001", '\000' <repeats 20 times>, "\210\002H\300\000\000\177\327\000\000\000\002\000\000\000", 
    __align = 72057594037927936}, m_bClosing = 136, m_ExitCond = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}}
(gdb) 

@maxsharabayko
Collaborator

@mrfrodl Please specify the version of SRT you are using, and the OS.

@maxsharabayko added this to the v.1.3.3 milestone on Mar 5, 2019
@mrfrodl

mrfrodl commented Mar 5, 2019

The OS is CentOS 7. The version is ethouris@49406b2. I am also trying to reproduce the issue with v1.3.2 (no crash yet), but it happens so rarely that I thought it best to ask.

@mrfrodl

mrfrodl commented Mar 7, 2019

Hi, I have an update: version 1.3.2 crashed due to heap corruption.

Crash message:

*** Error in `./transcoder': free(): corrupted unsorted chunks: 0x00007fdce0011760 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81489)[0x7fdceda01489]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CRcvBufferD1Ev+0x8b)[0x7fdcee20109d]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN4CUDTD1Ev+0x6d)[0x7fdcee20acc7]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CUDTSocketD2Ev+0x70)[0x7fdcee1e1db8]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CUDTUnited12removeSocketEi+0x5c4)[0x7fdcee1e843a]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CUDTUnited18checkBrokenSocketsEv+0x622)[0x7fdcee1e7de6]
/opt/transcoder/srt-v1.3.2-debug/lib64/libsrt.so.1(_ZN10CUDTUnited14garbageCollectEPv+0x5b)[0x7fdcee1e95e7]
/lib64/libpthread.so.0(+0x7dd5)[0x7fdcedd54dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fdceda7dead]
======= Memory map: ========
[...]

@maxsharabayko pinned this issue on Mar 13, 2019
@ethouris
Collaborator

I've experienced some crashes in this code when I tried to add new objects that would later be deleted by array delete; I had to find an alternative solution.

If you can find a procedure that reproduces this, I'd really appreciate it. Even if it has to be repeated 1000 times in a row to succeed once - that doesn't matter, I can leave it running on a machine until it crashes.

Such a repro, even if it doesn't immediately point to a solution, may at least help me get closer to one and later prove that the problem has been taken care of.

At least for this backtrace, even without having the direct pointer, I believe the crash happens in this line:

   delete [] m_pUnit;

There's no direct reason for this line to crash - it's a private field, allocated once in the constructor and never accessed outside the class. This looks more like corruption happening at the memory level.

@mrfrodl

mrfrodl commented Mar 21, 2019

Steps to reproduce

  • sender and receiver processes are on one machine
  • sender is listener
  • artificial packet loss and latency on localhost using netem (sudo tc qdisc add dev lo root netem limit 40000 delay 90ms loss 2.0%)
  • have the linger option enabled

Sender process

  1. listen and accept one connection
  2. send data for ~30 seconds
  3. close connection and repeat from 1.

Receiver process

  1. start the process (initialize SRT)
  2. connect and keep receiving for about 22 seconds
  3. close the connection
  4. cleanup SRT, exit process and start again from 1.

Notes

  • sender has the same SRT context the whole time (srt_startup -> srt_cleanup) as opposed to the receiver where the context is created and destroyed in each iteration
  • iteration times differ slightly for sender and receiver so that they meet in different phases of their cycle every time

Observed outcome

It usually takes hours (sometimes even days) to reproduce. Either of the processes (sender, receiver) may crash. I have seen the receiver crash with SIGSEGV and the sender with SIGABRT (corrupted heap), but I suppose both could have the same underlying cause (writing outside allocated memory), which could end either way.
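
For reference, a bare-bones outline of the receiver cycle described above, written against the public SRT C API (address setup, socket options, and error handling are omitted; this is a sketch of the loop, not the actual test program):

    #include <srt/srt.h>
    #include <sys/socket.h>

    // One receiver iteration: startup -> connect -> receive -> close -> cleanup.
    static void receiver_iteration(const struct sockaddr* sa, int salen)
    {
        srt_startup();                                     // 1. initialize SRT
        SRTSOCKET s = srt_socket(AF_INET, SOCK_DGRAM, 0);
        srt_connect(s, sa, salen);                         // 2. connect
        char buf[1500];
        // ... keep calling srt_recvmsg(s, buf, sizeof buf) for ~22 seconds ...
        srt_close(s);                                      // 3. close the connection
        srt_cleanup();                                     // 4. cleanup, then start over
    }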

@maxsharabayko
Collaborator

@ethouris
The issue is likely due to the SRTO_LINGER socket option. We have had a short e-mail discussion with @mrfrodl: he reports no crashes for five days now with linger disabled (please correct me if that has changed).
With linger enabled, the program is likely to crash if there is some data left in the sending buffer at the moment the socket is closed. I think this can be reproduced more easily by modifying the CSndBuffer::getCurrBufSize() function to always return a positive value; this enables the linger path in CUDT::close() and onwards.
I've hit the issue with srt-live-transmit (non-blocking mode).
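
For reproduction only, the kind of modification meant here might look like this (a sketch; the real function returns the number of packets currently pending in the send buffer):

    // Debug-only hack (sketch): pretend the send buffer is never empty so
    // that CUDT::close() always takes the linger path.
    int CSndBuffer::getCurrBufSize() const
    {
        return 1;   // instead of the real pending-packet count
    }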

@mrfrodl

mrfrodl commented Mar 25, 2019

Update: I've had the reproducer running with SRTO_LINGER disabled and it hasn't crashed in over a week. I consider this a stable workaround for the issue.
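
For anyone wanting to apply the same workaround, disabling linger on an SRT socket looks roughly like this (a sketch; error handling omitted):

    #include <srt/srt.h>
    #include <sys/socket.h>   // struct linger

    // Workaround sketch: turn SRTO_LINGER off so srt_close() does not keep
    // the socket half-alive while unsent data drains.
    static void disable_linger(SRTSOCKET s)
    {
        struct linger lin;
        lin.l_onoff  = 0;   // linger disabled
        lin.l_linger = 0;
        srt_setsockopt(s, 0, SRTO_LINGER, &lin, sizeof(lin));
    }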

@ethouris
Collaborator

Ok, I have reproduced this with this procedure and 10% netem loss. In the log I can see that the main thread is closing the socket and reports that closing has finished, while at that very moment the sender buffer is preparing a packet to send (in the SndQ:worker thread), and that is exactly when the crash happens. Interestingly, the "lingering" has already finished by then. Investigating.

@maxsharabayko
Collaborator

@mrfrodl What are the values of SRTO_RCVSYN and SRTO_SNDSYN in your experiments? By default both are set to true.

@mrfrodl

mrfrodl commented Mar 26, 2019

I set both SRTO_RCVSYN and SRTO_SNDSYN to false on the sender side and to true on the receiver side.

@ethouris
Collaborator

@mrfrodl I made a fix: #627.

During my experiments I found that the problem is mainly due to the undefined order of destruction of global objects in C++. If srt_cleanup is properly called at the very end of the program, this doesn't happen.

The problem is that with non-blocking mode and linger on, the socket is not really closed; it is only flagged for closing, and the rest of the facilities keep working until there is no more data to process, at which point the GC thread takes over. When srt_cleanup is called, all these sockets are wiped out.

Could you please check whether this fix resolves the problem?

@maxsharabayko added the Status: Abandoned label and removed the Status: In Progress label on May 29, 2019
@maxsharabayko modified the milestones: v.1.3.3, v.1.3.4 on May 29, 2019
@maxsharabayko unpinned this issue on Jul 5, 2019
@maxsharabayko removed this from the v1.3.4 milestone on Aug 9, 2019
@maxsharabayko
Collaborator

Assuming #627 fixes this. Closing. But don't hesitate to reopen if required.

@fboucher67
Contributor Author

Calling setClosing() on recvQ and sendQ does not fix the issue. Silencing the messages does not fix the issue. When close() is called on a file descriptor, another thread may create a new descriptor and obtain the same value; recvQ and sendQ then perform recvfrom and sendto operations on a file descriptor that belongs to another thread. Before closing a socket, we must guarantee that it is no longer being used: after calling setClosing() we must wait until recvQ and sendQ have actually tested the m_bClosing flag. Synchronization on the m_bClosing flag is missing.
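
A sketch of the kind of ordering being asked for; setClosing() is the flag setter mentioned above, while the stand-in WorkerQueue type and joinWorker() represent whatever mechanism guarantees the worker has observed the flag and stopped touching the descriptor (illustrative only, not the eventual fix):

    #include <pthread.h>
    #include <unistd.h>

    // Illustrative stand-in for a queue that owns a worker thread using the fd.
    struct WorkerQueue
    {
        volatile bool m_bClosing;
        pthread_t     m_Thread;

        void setClosing() { m_bClosing = true; }
        // Joining guarantees the worker has observed m_bClosing and can no
        // longer touch the file descriptor.
        void joinWorker() { pthread_join(m_Thread, NULL); }
    };

    // Safe teardown order: flag both workers, wait for them, then close the fd.
    static void teardown(WorkerQueue& rcvQ, WorkerQueue& sndQ, int udp_fd)
    {
        rcvQ.setClosing();
        sndQ.setClosing();
        rcvQ.joinWorker();
        sndQ.joinWorker();
        close(udp_fd);   // no thread can still be using (or reusing) this fd
    }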

@maxsharabayko reopened this on Aug 9, 2019
@maxsharabayko added this to the v1.4.1 milestone on Aug 9, 2019
@maxsharabayko removed the Status: Abandoned label on Aug 9, 2019
@maxsharabayko
Collaborator

@fboucher67 Thanks for the update! Reopened.
