Runaway LWPs/threads on recording daemon #808
Since only a static number of poller threads is created by the recording daemon itself, a reasonable supposition for the runaway thread growth is that some or all of the libraries used by the recording daemon do their work in threads of their own, and that these threads are not exiting properly:
But I have no way of identifying which library might be the problem. One thing I did try was to remove:
One thing I did learn from attaching:
Here are the ffmpeg library versions in use:
I suppose one thing I haven't tried is sourcing them from a place other than https://linuxize.com/post/how-to-install-ffmpeg-on-centos-7/
One thing I am trying now is a much newer version of the ffmpeg packages, from something called http://awel.domblogger.net/7/media/x86_64/repoview/awel-media-release.html. Much newer versions:
So far it is looking promising, but since it is after business hours and call volumes have collapsed, I can't really get truly meaningful feedback until tomorrow at the earliest.
So far, I've got this situation:
And about 30 LWPs. That number drops to 28 or 26 from time to time and spikes to 32 or so. It doesn't seem to be moving much beyond this level, but neither are the call volumes. Is there any insight into the relationship between the threads spawned by the recording daemon and the call volumes? It's very difficult to tell whether the upgrade of the ffmpeg libraries is making a difference.
I wasn't aware that libavfilter (or ffmpeg libs in general) would spawn any threads. There's certainly nothing in the code that would instruct it to do that. Gonna have to look into what it's doing there.
Just as a data point following the ffmpeg libs update 👍 It is now after 21:00 here, well outside of business hours, and there are no calls -- there have not been any for quite some time. Yet, there are 38 LWPs spawned off of the recording daemon process.
The recording daemon was invoked without `--num-threads`. Here is the state of the 38 processes:
Attaching to a process at random:
The same seems to be true of the others which are in the `futex` state.
I guess a key question is: given the zero call load, and no calls in the sink...
Overall, the state is:
... why are there 38 LWPs? Why aren't these processes being wound back down?
Another interesting wrinkle -- it looks like the core process is still holding file handles open for a number of calls.
Looking at all these calls, they seem to have one thing in common, e.g.
Not sure if that somehow bears upon the issue.
Lastly, the handles still held open by the core process -- they number 14:
And these are the precise descriptors held open by each LWP/subprocess:
As another data point from this morning ("Serious Call Volumes" have not started yet):
Needless to say, it's a bit hard to make sense of this, though it does seem to be an improvement over the runaway increase from before. But after 9 AM calls will spike into much higher territory, and then we can say more. All the "superfluous" processes beyond the initial workers spawned are from `libavutil`, as before:
And the number of WAV file handles held by the recording daemon as a whole has increased -- to 34 on this particular host. As before, a salient characteristic of the Call-IDs of all the calls whose handles are being held open is that they seem to have been timed-out streams:
I cannot help but think that there is some clearer relationship between the number of "stale" file handles left open from "timed out" calls and the number of deadlocked processes, though I cannot find it. There is certainly a correlation; overall, the more such handles, the more processes. But exactly how much more I am unable to establish; it seems to vary, and the process count isn't accounted for by the number of stale handles per se.
Now that we have had production loads all day, I think the verdict is in: the LWP count has continued to grow.
Moreover, the stale WAV file handles have grown commensurately:
Can you tell if it's leaking memory also?
That's hard to say. But with 1000+ LWPs, we can be sure it is using a prodigious amount of memory. :-)
Well yes, I suppose the threads themselves are using up memory too... My guess is that there's some kind of close/destroy/free/cleanup invocation missing somewhere. Are you able to run this under valgrind? Not recommended for production as performance is horrible, but in a test/lab environment?
I don't think I can do that, no. What do you make of the fact that the stale WAV handles seem to be tied to streams which disappeared due to a timeout?
Can you confirm that for sure? Because the recording daemon doesn't really care about how a call was closed, timeout or otherwise. Once the metadata spool file gets deleted, the call is closed. Assuming the metadata spool files actually do get deleted?
I can confirm that all the file handles that remain held open, as rendered by:
And I can say that there are only 20 entries at the moment in:
I can also say that the chronology of the stuck LWPs and the timestamps of timed out calls line up oddly well:
And:
What about the reverse though? Did any calls that were not closed from a timeout also result in a stale file/LWP?
I can confirm that every single one of the file handles held open by the recording daemon corresponds to a timed out call:
The line count there is precisely identical to the one returned by:
But the metadata spool files have been deleted regardless?
Correct -- none of the Call-IDs found in:
Oh, no, those are not the metadata spool files. Check the directory you have configured as the spool directory.
Oh, I see. I put the metadata spool files in the same directory as the recordings themselves.
Ah. Ok. Don't do that. Use a separate spool directory. Try that for starters.
Okay. But can I ask why? :) This was non-obvious.
Because the recording daemon watches the spool directory for changes using inotify and reads each file when it changes; if it writes the recordings into the same directory, it gets confused. I'm not sure whether that fixes what you're seeing, but it's a first step.
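To make that concrete, here is a minimal sketch of what such an inotify watch loop looks like (simplified for illustration only, not the actual daemon code; the spool path is a placeholder):

```c
/* Minimal sketch of watching a spool directory with inotify.
 * Not the actual recording-daemon code; the directory path is a placeholder. */
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void) {
    const char *spool_dir = "/path/to/spool";   /* placeholder */

    int ifd = inotify_init1(IN_CLOEXEC);
    if (ifd < 0) {
        perror("inotify_init1");
        return 1;
    }
    if (inotify_add_watch(ifd, spool_dir, IN_CLOSE_WRITE | IN_MODIFY | IN_DELETE) < 0) {
        perror("inotify_add_watch");
        return 1;
    }

    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    for (;;) {
        ssize_t len = read(ifd, buf, sizeof(buf));
        if (len <= 0)
            break;
        for (char *p = buf; p < buf + len; ) {
            const struct inotify_event *ev = (const struct inotify_event *) p;
            if (ev->len) {
                if (ev->mask & (IN_CLOSE_WRITE | IN_MODIFY))
                    printf("file changed: %s -> (re)parse it as call metadata\n", ev->name);
                if (ev->mask & IN_DELETE)
                    printf("file deleted: %s -> treat the call as closed\n", ev->name);
                /* If the recordings themselves are written into this same
                 * directory, every write to a WAV file also shows up here
                 * and gets treated as metadata -- which is why the spool
                 * and output directories need to be distinct. */
            }
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
    close(ifd);
    return 0;
}
```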
I have changed the spool directory to:
Well, that's odd. Now there are no metadata files being written to the spool directory, even though new calls are coming in, e.g.
Did you change the spool directory on rtpengine's side too?
No, I didn't. I just realised that. I can't do that without dropping production calls, which would be a problem. Let me see what I can do to handle that 'gracefully'.
I assume that:
Yes.
You can also leave the spool dir unchanged and just change the output dir for the recordings.
I just thought of that too. :-) Stand by...
Okay, I have moved the recordings to a separate output directory instead.
Well, call volumes have collapsed since it's nearly 18:00. I only have 8 targets up right now, so the pickings are slim for large-scale troubleshooting. But I have a sense this did not fix the issue -- there are a few new LWPs stuck in the `futex` state.
However, all the file handles held open right now are for live calls, so I'm going to have to see if those handles disappear afterward.
Well, one promising sign ... there was a call for which a file handle was held open before:
... which has since closed due to a timeout:
... and the file handle has disappeared:
On the other hand, the LWP count still seems wildly at odds with the total number of streams on the system:
It'll occasionally decrease by 2 or so, but overall the trend is to increase and increase. That makes me pessimistic that the directory change fixed the issue.
We're just going to have to wait until tomorrow to get any real results.
Well, some cause for optimism, though I don't want to call it prematurely until we see tomorrow's production call loads. Nevertheless, since I made the suggested change, we have dropped to zero call load on that RTPEngine and got back down to the default ten threads (absent an explicit `--num-threads` setting):
This is not a result I had seen before.
Hi @rfuchs, your suggestion to separate the spool and recording directories appears to have solved the problem. Thank you very much! If you don't mind, I'm going to submit a PR with amendments to the README to caution other users against this. Putting the .meta files in the same directory as the actual recording files was probably not a behaviour you anticipated, but the adverse consequences of doing so were neither documented nor obvious to those who don't know how the daemon works. :-)
Sounds good, thanks. I'm even thinking of having a check that refuses startup when something like this is configured.
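Something along these lines, roughly (a sketch of the idea only; the paths are placeholders and real option handling is omitted):

```c
/* Sketch of a startup sanity check (not yet in the daemon): refuse to run
 * if the spool directory and the recording output directory are the same. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int same_dir(const char *a, const char *b) {
    /* Resolve both paths so symlinks and trailing slashes don't hide a match. */
    char *ra = realpath(a, NULL);
    char *rb = realpath(b, NULL);
    int same = ra && rb && strcmp(ra, rb) == 0;
    free(ra);
    free(rb);
    return same;
}

int main(void) {
    /* Placeholder values; in the daemon these would come from the parsed
     * configuration (spool directory and output directory). */
    const char *spool_dir = "/path/to/spool";
    const char *output_dir = "/path/to/recordings";

    if (same_dir(spool_dir, output_dir)) {
        fprintf(stderr, "refusing to start: output directory must not be the spool directory\n");
        return 1;
    }
    printf("directories are distinct, continuing startup\n");
    return 0;
}
```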
#810 has been submitted.
reported in #808 Change-Id: I00e26d09d7557221dfdaf105559fb7eaa5ab3e50
I am running RTPEngine mr6.5.4.2 built from source on EL7, plus the recording daemon from the same suite. The `libav*` dependencies come from the nux-dextop repo. RTPEngine is writing frames into the `/proc` sink (`--recording-method=proc`) and the recording daemon is writing out mixed mono WAVs, with `file`-only metadata, no DB, and all in all the following invocation options:

What I am seeing is runaway growth in the number of worker threads spawned by the recording daemon, wildly disproportionate to the number of RTPEngine targets:

Almost all of them appear to be in a `futex` state, so I assume some sort of deadlock, e.g.:

The way this issue was detected is that the recording daemon started complaining about running into file descriptor limits ("Too many open files" errors), which struck me as curious given the relatively small number of concurrent streams recorded and the fact that the recording daemon is running as EUID/EGID `root`.

However, what I have found is that every one of those LWPs has several hundred open descriptors. For instance, PID 8635 above:

This seems to be the story with all the LWPs:

Since the descriptor count is exactly the same across all the LWPs, I assume this is because they are cloned into every LWP. But regardless, it contributes to a rather large cumulative descriptor count across all the LWPs for that process:
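For what it's worth, this is consistent with all LWPs of a process sharing a single file descriptor table. A small stand-alone test program (nothing to do with rtpengine, just an illustration) shows that a descriptor opened in one thread is listed under /proc/<pid>/task/<tid>/fd for every thread:

```c
/* Stand-alone illustration (unrelated to rtpengine): all threads of a
 * process share one file descriptor table, so a descriptor opened in one
 * thread shows up under /proc/<pid>/task/<tid>/fd for every LWP.
 * Build with: cc -pthread fdshare.c */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    (void) arg;
    int fd = open("/etc/hostname", O_RDONLY);   /* opened in this thread only */
    printf("worker opened fd %d\n", fd);
    sleep(10);                                  /* keep it open for inspection */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1);
    /* While the worker sleeps, compare the listings of
     *   /proc/<pid>/task/<tid>/fd
     * for each thread: they are identical, including the worker's fd. */
    printf("inspect /proc/%d/task/*/fd now\n", getpid());
    pthread_join(t, NULL);
    return 0;
}
```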
The number of LWPs steadily increases. We found it at a peak of 1200 before restarting the recording daemon. At that point, we seem to have bumped into the system-wide FD limit:

This situation appears to play out regardless of whether the recording daemon is invoked with an explicit `--num-threads=...` or left at the defaults (as now).

There is nothing interesting in the logs (until the "Too many open files" messages start). Just fairly routine things like:
And: