UDP socket closed while still used by other threads #306
Comments
Please check if this is still so in |
With dev branch I get the following error message: |
Well, if you see any IPE (Internal Program Error) in the logs, it's wise to raise the alarm. This message appears when the call to This message is probably not a big deal, but still, it means that something isn't completely taken care of. In this particular case it forces the receiver worker to stop reading by destroying its socket; actually the sender shouldn't send any more data when this happens in file mode, and in live mode you just interrupt the transmission. You can experiment with this reordering, but I have some doubts about it. The first thing to ensure before you delete the queue for a particular direction is that the worker thread for that queue has definitely exited; otherwise you'll get an immediate crash. Also, the channel should not be actively in use at the moment it is deleted, because the close is usually initiated in the main thread, whereas the physical deletion happens in the GC thread. From your report it looks like the receiver worker thread was still running at the moment the channel was being closed. OTOH, the destruction of CRcvQueue and CSndQueue includes joining their worker threads, so you must somehow force them to exit before you call the destructor; otherwise you'll cause a hang. |
It took more time, but the same deadlock reproduced using the dev branch. The GC thread is waiting for the RcvQueue worker thread to stop. The RcvQueue worker thread is stuck indefinitely in a call to recvmsg() with a file descriptor that has been closed. This problem is easier to reproduce when calling the srt_cleanup() function, since that forces the GC to remove all remaining sockets. Please do not force the receive worker to stop by destroying its socket. That is not a good idea for Unix or Linux file descriptors. On Unix or Linux, the SO_RCVTIMEO option is set on the UDP socket, so the CRcvQueue worker thread will exit at the next timeout (if the socket handle remains valid). |
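To illustrate the SO_RCVTIMEO point above, here is a minimal sketch of a receive loop that wakes up on every timeout and checks a closing flag. It is not SRT code: `open_udp_with_timeout` and `receive_until_closing` are hypothetical names, and the flag only stands in for the role that `m_bClosing` plays in the discussion.

```cpp
#include <arpa/inet.h>
#include <atomic>
#include <cassert>
#include <cerrno>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not part of SRT): open a UDP socket bound to an
// ephemeral loopback port, with a receive timeout so recv() cannot block forever.
int open_udp_with_timeout(int timeout_ms)
{
    assert(timeout_ms > 0);
    int fd = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    timeval tv{};
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (::setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) != 0)
    {
        ::close(fd);
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; // let the kernel pick a free port
    if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) != 0)
    {
        ::close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical worker loop: keeps reading until the closing flag is set.
// Returns the number of receive timeouts seen, which are the points where
// the flag gets re-checked even if no packet ever arrives.
int receive_until_closing(int fd, const std::atomic<bool>& closing)
{
    char buf[1500];
    int timeouts = 0;
    while (!closing.load())
    {
        ssize_t n = ::recv(fd, buf, sizeof buf, 0);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        {
            ++timeouts; // timeout expired: loop back and test the flag
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;   // interrupted by a signal: retry
        if (n < 0)
            break;      // real error, e.g. an invalid descriptor
        // a packet of n bytes would be processed here
    }
    return timeouts;
}
```

The loop can only exit cleanly while `fd` stays valid, which is why the descriptor has to outlive the worker thread.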
OK, here is what needs to be done:
This should most probably make the queues exit as soon as the previous read from the channel (or send) is done, and the flag checkpoint in the |
@fboucher67 @ethouris I think I am facing this same issue! About 50% of the time, after closing the SRT socket, the thread seems stuck in ::recvmsg()... Any fix or workaround? |
The workaround that I'm currently using is to close the socket/channel after deleting the RcvQueue (this is from CUDTUnited::removeSocket function): if (0 == m->second.m_iRefCount) |
Thanks @fboucher67. I used your fix, at first I thought it would help (at least on a localhost test it seemed to), but on a real connection, I am still seeing the deadlock. I will test further to validate this theory. Cheers. G. |
If this change fixes the problem, could you please submit a PR? |
Like I said, it may have helped, but it does not seem that the problem is completely gone. I just don't know enough to tell. But if it doesn't hurt and helps some situation, @fboucher67 should definitely submit a PR. |
Hi, I noticed occasional crashes of the library and wonder if they were the same issue as described here. This is how it crashed last time
|
@mrfrodl Please specify the version of SRT you are using, and the OS. |
OS is CentOS 7. The version is ethouris@49406b2. I am also trying to reproduce the issue with v1.3.2 (no crash yet), but it happens so rarely that I thought it best to ask. |
Hi, I have an update. Version 1.3.2 crashed due to heap corruption. Crash message
|
I've experienced some crashes in this code when I tried to add new objects to be deleted later by array delete. I had to find an alternative solution. If you can find a procedure that reproduces this, I'd really appreciate it. Even if it has to be repeated 1000 times in a row to succeed once, it doesn't matter: I can leave it running on a machine until it crashes. Such a repro, even if it doesn't lead me directly to a solution, may at least help me get closer to it and later prove that the problem is taken care of. At least for this backtrace, even without the direct pointer, I believe the crash happens in this line:
There's no direct reason for it to crash - it's a private field, just once allocated in the constructor, not accessed outside the class. This looks then rather like a manipulation on the memory level. |
Steps to reproduce
Sender process
Receiver process
Notes
Observed outcome
It usually takes hours (sometimes even days) to reproduce. Either of the processes (sender, receiver) may crash. I saw the receiver crash with SIGSEGV and the sender with SIGABRT (corrupted heap). But I suppose both could have the same underlying condition (writing outside allocated memory), which could end either way. |
@ethouris |
Update: I had the reproducer running with disabled SRTO_LINGER and it hasn't crashed in over a week. I consider this a stable workaround of the issue. |
OK, I have reproduced this with this procedure and netem loss 10%. In the log I can see that the main thread is closing the socket and has been informed that closing has finished, whereas at this moment the sender buffer is preparing a packet to send (in the SndQ:worker thread), and the crash happens exactly then. Interestingly, the "lingering" is already finished at this moment. Investigating. |
@mrfrodl What is the value of |
I set both |
During experiments I found that the problem is mainly due to the undefined order of destruction of global objects in C++. If The problem is that with non-blocking mode and linger on, the socket is not really closed; it's only flagged for closing, but the rest of the facilities continue to work until there is no more data to process, at which point the GC thread takes over. When Could you please check whether this fix solves the problem? |
Assuming #627 fixes this. Closing. But don't hesitate to reopen if required. |
Calling setClosing() on recvQ and sendQ does not fix the issue. Silencing messages does not fix the issue. When calling close() on a file descriptor, another thread may create a file descriptor and obtain the same value. recvQ and sendQ then do recvfrom and sendto operations on a file descriptor that belongs to another thread. Before closing a socket, we must guarantee that it is not being used. After calling setClosing() we must wait for recvQ and sendQ to have tested the m_bClosing flag. A synchronization is missing for the m_bClosing flag. |
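The descriptor reuse hazard described above can be observed with a few lines. This is a minimal sketch, assuming a Linux or other POSIX system and a single-threaded process, where a new descriptor gets the lowest free number; `descriptor_number_is_reused` is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical demo (not SRT code): returns true when a socket created right
// after a close() receives the same descriptor number as the closed one.
bool descriptor_number_is_reused()
{
    int first = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (first < 0)
        return false;

    int rc = ::close(first); // the number held in 'first' is now free
    assert(rc == 0);
    (void)rc;

    // Any later socket(), open(), accept(), ... may be handed that number.
    int second = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (second < 0)
        return false;

    bool reused = (second == first);
    ::close(second);
    return reused;
}
```

A worker thread that still holds the old number would then call recvfrom or sendto on a socket it does not own, which is why the close has to wait until the workers have stopped using the descriptor.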
@fboucher67 Thanks for the update! Reopened. |
The CUDTUnited::removeSocket function closes the channel before destroying the RcvQueue and SndQueue objects. RcvQueue is actively using the UDP socket handle in a call to recvmsg(). Closing the socket does NOT cause recvmsg() to return with an error; it sometimes hangs indefinitely. This causes random deadlocks, corruption or crashes, because the file descriptor is still in use after it has been closed.
A possible fix is to close the channel after deleting the RcvQueue and SndQueue objects.
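The proposed ordering can be sketched as follows. This is only an illustration under stated assumptions, not the actual SRT change: `ReceiverSketch` and `remove_socket_in_safe_order` are hypothetical stand-ins for the receive queue and the removal routine, and the socket uses a short SO_RCVTIMEO so the worker notices the closing flag.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <thread>
#include <unistd.h>

// Hypothetical stand-in for a receive queue (not SRT's CRcvQueue).
// It borrows the descriptor; it never closes it.
class ReceiverSketch
{
public:
    explicit ReceiverSketch(int fd) : m_fd(fd), m_worker([this] { run(); }) {}

    // The destructor sets the closing flag and joins the worker, so once it
    // returns nothing is reading from the descriptor any more.
    ~ReceiverSketch()
    {
        m_closing.store(true);
        m_worker.join();
    }

private:
    void run()
    {
        char buf[1500];
        while (!m_closing.load())
        {
            ssize_t n = ::recv(m_fd, buf, sizeof buf, 0);
            // A timeout or a signal just loops back to the flag check.
            if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
                break;
        }
    }

    int m_fd;
    std::atomic<bool> m_closing{false};
    std::thread m_worker; // declared last so it starts after the other members
};

// Hypothetical removal routine: destroy the queue first, close the socket last.
bool remove_socket_in_safe_order()
{
    int fd = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return false;

    timeval tv{};
    tv.tv_usec = 20 * 1000; // 20 ms receive timeout
    if (::setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) != 0)
    {
        ::close(fd);
        return false;
    }

    {
        ReceiverSketch rcv(fd); // worker starts using fd
    }                           // step 1: destructor joins the worker

    return ::close(fd) == 0;    // step 2: only now release the descriptor
}
```

With the two steps swapped, the worker could still be inside recv() on a descriptor number that the kernel is free to hand out again.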