Don't try to maintain a pool of buffers #340
base: master
Conversation
I believe this was causing memory leaks, because we would allocate (and keep in memory!) a pool of buffers for each unique message size, and the set of "unique message sizes" is essentially unbounded, at least in my case.
Hmm, okay, I did some more digging and see that the pools themselves were an attempt to resolve memory fragmentation issues (e.g. in #202). So this is unlikely to fix the underlying issue, but I do think that, at least for my use case of a single camera, pooling by size only exacerbates those issues for the reasons outlined above, and should be rolled back unless I'm greatly misunderstanding what's going on here. One pooling strategy that I could see working: if we can use buffers larger than a given message, we could create a fixed number of buckets across a range of common sizes, round the message size up to the nearest bucket, and pull from the appropriate pool. I don't know whether that would upset anything downstream that expects the buffer size to match the message size exactly.
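As a rough illustration of the bucketing idea (the bucket boundaries here are made up, and plain `usize` sizes stand in for the real buffers/pools):

```rust
/// Hypothetical bucket boundaries; a real implementation would pick these
/// from the observed distribution of message sizes.
const BUCKETS: &[usize] = &[1 << 10, 4 << 10, 16 << 10, 64 << 10, 256 << 10];

/// Round a message size up to the nearest bucket; messages bigger than the
/// largest bucket just use their exact size.
fn bucket_for(msg_size: usize) -> usize {
    BUCKETS
        .iter()
        .copied()
        .find(|&b| b >= msg_size)
        .unwrap_or(msg_size)
}

fn main() {
    // Many distinct message sizes map onto a small, fixed set of pools.
    for size in [100, 900, 3_000, 20_000, 70_000, 1_000_000] {
        println!("message of {size} bytes -> pool for {} bytes", bucket_for(size));
    }
}
```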
I haven't managed to find a good solution for the memory issue and I've tried a lot of things. From valgrind and the other profiling tools I have used, it appears to be in the buffers. That is the reason I added the pools: I did some tests and, although yes we do get an assortment of pools, it didn't create too many of them, since most messages I saw were of similar sizes. I did work out the padding NAL unit for h264 and h265 at one point, with buckets like you suggested, but it was removed to simplify the code. I am tempted to drop the pools again and try just plain buffers, but that can slow down the frame rate because of excessive allocations and cause memory fragmentation.
Hmm, that's many more pools than the ones I saw in my tests. The 512 one is probably the audio pool, since it is small and usually the same size. I think I should look into putting the NAL padding back in and using pools within a certain range. Might want to work out the distribution to see what a good bucket size would be.
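A quick way to eyeball that distribution might be something like this (a hypothetical helper for logging, not existing neolink code), which counts messages per power-of-two size class:

```rust
use std::collections::BTreeMap;

/// Hypothetical histogram of message sizes, bucketed by power-of-two class.
#[derive(Default)]
struct SizeHistogram {
    counts: BTreeMap<u32, u64>,
}

impl SizeHistogram {
    /// Record one message; the class is the exponent of the next power of two.
    fn record(&mut self, msg_size: usize) {
        let class = msg_size.next_power_of_two().trailing_zeros();
        *self.counts.entry(class).or_insert(0) += 1;
    }

    /// Print how many messages fell into each size class.
    fn dump(&self) {
        for (&class, &count) in &self.counts {
            println!("<= {:>9} bytes: {} messages", 1u64 << class, count);
        }
    }
}

fn main() {
    let mut hist = SizeHistogram::default();
    for size in [512usize, 700, 3_000, 3_100, 20_000, 21_000, 21_500] {
        hist.record(size);
    }
    hist.dump();
}
```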
I've got to get to sleep so I'll check in later, but thanks for the back and forth, and for your work keeping this thing going and hunting down the bugs! I very much appreciate this handy tool, it has allowed me to have a second angle on the little interactive holiday tree livestream I've run for the last four years. If you want to have a little fun and see it in action, you can change the lights on my tree at https://tree.c9a.dev/ - the lefthand (far side) view is being served courtesy of neolink :)
I ran into memory issues similar to those in #286, notably the `gst_poll_read_control: assertion 'set != NULL' failed` errors.

My first clue was watching `lsof`: neolink continually opened new fd's until it hit a limit (~1024 on my system), at which point it started throwing the assertion errors, so it seemed like there was something opening way too many fd's.

I then did some poking around with `gdb` and `G_DEBUG=fatal-criticals` as recommended in this mailing list, which gave me a backtrace for those failed assertions and led me to this code.

I believe this was causing excess memory usage because we were allocating (and keeping permanently in memory, as best I can tell) a pool of pools: for each unique size of message, we allocated a new pool of buffers for that size, and then held onto it forever.
This means the memory usage of the pool of pools is a function of the number of unique message sizes seen - and at least in my case, judging by some debug logging, the message sizes were essentially random, or at least fairly evenly distributed. Thus, the number of pools we were allocating was effectively unbounded, and we ended up holding onto a lot of permanent pools.
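To make the shape of the problem concrete, here is a simplified sketch of the pattern as I understand it (plain Rust, with `Vec<u8>` standing in for the real GStreamer buffers; the names are illustrative, not the actual neolink code):

```rust
use std::collections::HashMap;

/// One free-list of reusable buffers per unique message size, kept forever.
/// `Vec<u8>` stands in for the real GStreamer buffers.
#[derive(Default)]
struct PoolOfPools {
    pools: HashMap<usize, Vec<Vec<u8>>>,
}

impl PoolOfPools {
    /// Reuse a buffer of exactly `msg_size` bytes, or allocate a new one.
    fn get_buffer(&mut self, msg_size: usize) -> Vec<u8> {
        self.pools
            .entry(msg_size)
            .or_default()
            .pop()
            .unwrap_or_else(|| vec![0; msg_size])
    }

    /// Return the buffer to the pool keyed by its size; nothing is ever freed.
    fn return_buffer(&mut self, buf: Vec<u8>) {
        self.pools.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pools = PoolOfPools::default();
    // Every distinct message size creates (and permanently retains) a new pool.
    for size in 1_000..1_100 {
        let buf = pools.get_buffer(size);
        pools.return_buffer(buf);
    }
    println!("distinct pools retained: {}", pools.pools.len()); // prints 100
}
```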
I definitely don't understand the details of what's going on in the rest of the code around this, but this seems like a pretty self-contained change. It seems to me that size-specific buffer pools are intended to be an optimization for the case where you have a small, defined set of common sizes of messages. This does not seem to be the case here: the message sizes, at least in my case, were essentially random.
I thus removed the global pool of pools altogether, and instead simply allocate a new (scoped/limited-lifetime) buffer for every message, which presumably only lives as long as the message.
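With this change, the per-message path reduces to something like the following sketch (again, `Vec<u8>` standing in for the real GStreamer buffer, not the literal code):

```rust
/// Sketch of the per-message path after this change: allocate a fresh buffer
/// sized to the message, hand it downstream, and let it drop with the message.
fn push_message(msg: &[u8], downstream: impl FnOnce(Vec<u8>)) {
    // Fresh allocation per message; no long-lived pool keyed by size.
    let buf = msg.to_vec();
    // Ownership moves downstream; when the consumer is done with it, the memory
    // goes back to the allocator instead of sitting in a permanent pool.
    downstream(buf);
}

fn main() {
    push_message(b"example frame bytes", |buf| {
        println!("pushed {} bytes", buf.len());
    });
}
```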