
Don't try to maintain a pool of buffers #340

Open
wants to merge 2 commits into master

Conversation

@cincodenada commented Dec 29, 2024

I ran into memory issues similar to those in #286, notably the `gst_poll_read_control: assertion 'set != NULL' failed` errors.

My first clue was watching `lsof`: neolink continually opened new fd's until it hit a limit (~1024 on my system), at which point it started throwing the assertion errors, so it seemed like something was opening way too many fd's.

I then did some poking around with `gdb` and `G_DEBUG=fatal-criticals` as recommended in this mailing list, which gave me a backtrace for those failed assertions and led me to this code.

I believe this was causing excess memory usage because we were allocating (and keeping permanently in memory, as best I can tell) a pool of pools: for each unique size of message, we allocated a new pool of buffers for that size, and then held onto it forever.

This means the memory usage of this pool of pools is a function of the number of unique message sizes seen - and at least in my case, judging by some debug logging, the message sizes were essentially random, or at least fairly evenly distributed. Thus the number of pools we were allocating was effectively unbounded, and we ended up trying to allocate a lot of permanent pools.

I definitely don't understand the details of what's going on in the rest of the code around this, but this seems like a pretty self-contained change. It seems to me that size-specific buffer pools are intended as an optimization for the case where you have a small, well-defined set of common message sizes - which, as noted above, does not seem to be the case here.

I thus removed the global pool of pools altogether, and instead simply allocate a new (scoped/limited-lifetime) buffer for every message, which presumably only lives as long as the message.
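
To make the shape of the change concrete, here's a rough before/after sketch, assuming the gstreamer-rs `BufferPool` API and the `anyhow` crate; the `PoolOfPools` struct and the `get_buffer`/`get_plain_buffer` names are illustrative simplifications, not the actual neolink code:

```rust
use std::collections::HashMap;

use gstreamer as gst;
use gst::prelude::*;

// Before (simplified): one permanent BufferPool per unique message size.
// Nothing ever removes entries from `pools`, so every previously-unseen
// size adds a pool (and, apparently, an fd) that is never released.
#[derive(Default)]
struct PoolOfPools {
    pools: HashMap<usize, gst::BufferPool>,
}

impl PoolOfPools {
    fn get_buffer(&mut self, size: usize) -> anyhow::Result<gst::Buffer> {
        // (assumes gst::init() has already been called)
        if !self.pools.contains_key(&size) {
            let pool = gst::BufferPool::new();
            let mut config = pool.config();
            config.set_params(None, size as u32, 0, 0);
            pool.set_config(config)?;
            pool.set_active(true)?;
            self.pools.insert(size, pool);
        }
        Ok(self.pools[&size].acquire_buffer(None)?)
    }
}

// After (this PR, simplified): allocate a plain buffer per message and let
// it drop when the message does.
fn get_plain_buffer(size: usize) -> Result<gst::Buffer, gst::glib::BoolError> {
    gst::Buffer::with_size(size)
}
```

The key difference is that the "before" version's memory (and fd) footprint grows with the number of distinct sizes ever seen, while the "after" version's footprint is bounded by the number of messages alive at once.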

I believe this was causing memory leaks, because we would try to
allocate (and keep in memory!) a pool of buffers for each unique size of
message, and the set of "unique sizes of messages" is essentially
unbounded, at least in my case.
@cincodenada (Author) commented Dec 29, 2024

Hmm, okay, I did some more digging and I see that the pools themselves were an attempt to resolve memory fragmentation issues (e.g. in #202).

So, this is unlikely to fix the underlying issues, but I do think that, at least for my use case of a single camera, the pooling by size only exacerbates those issues, for the reasons outlined above, and should be rolled back unless I'm greatly misunderstanding what's going on here.

One pooling strategy that I could see working: if we can use buffers larger than a given message, we could create a fixed number of buckets across the range of common sizes, round each message size up to the nearest bucket, and pull from that bucket's pool. I don't know whether that would break anything downstream that expects buffer size == message size, though.
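
Roughly what I have in mind - an illustrative sketch with made-up bucket sizes, not a proposal for the actual values:

```rust
// Hypothetical bucket sizes; in practice they'd be chosen from the observed
// distribution of message sizes.
const BUCKETS: &[usize] = &[1024, 4 * 1024, 16 * 1024, 64 * 1024, 256 * 1024];

/// Round a message size up to the smallest bucket that can hold it.
/// `None` means "oversized" - those messages would fall back to a plain
/// one-off buffer instead of a pool.
fn bucket_for(size: usize) -> Option<usize> {
    BUCKETS.iter().copied().find(|&b| b >= size)
}
```

The pool map would then be keyed by `bucket_for(size)` rather than by `size` itself, which bounds the number of pools at `BUCKETS.len() + 1` no matter how many distinct message sizes show up.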

@QuantumEntangledAndy (Owner)

I haven't managed to find a good solution for the memory issue, and I've tried a lot of things. From valgrind and other profiling tools I have used, it appears to be in the buffers. That is the reason I added the pools: I did some tests, and although yes, we do get an assortment of pools, it didn't create too many of them, since most messages I saw were of similar sizes.

I did work out the padding NAL unit for h264 and h265 at one point, with buckets like you suggested, but it was removed to simplify the code. I am tempted to drop the pools again and try just plain buffers, but that can slow down the frame rate because of excessive allocations and cause memory fragmentation.
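
For reference, the padding idea rounds a frame up to the bucket size by appending a filler NAL unit, so the buffer stays a valid stream. Below is an illustrative sketch for H.264 Annex B framing - not the code that was removed - assuming the filler-data NAL type (12); H.265 would need its own filler NAL (FD_NUT, type 38) with a two-byte header:

```rust
/// Pad an Annex B H.264 access unit up to `target` bytes by appending a
/// filler-data NAL unit (nal_unit_type 12). Illustrative sketch only.
fn pad_h264_to(mut data: Vec<u8>, target: usize) -> Vec<u8> {
    // start code (4 bytes) + NAL header (1 byte) + rbsp_trailing_bits (1 byte)
    const OVERHEAD: usize = 6;
    if data.len() + OVERHEAD > target {
        return data; // not enough room to pad; leave the frame as-is
    }
    let filler_len = target - data.len() - OVERHEAD;
    data.extend_from_slice(&[0x00, 0x00, 0x00, 0x01, 0x0C]); // start code + filler NAL header
    data.resize(data.len() + filler_len, 0xFF); // ff_byte filler payload
    data.push(0x80); // rbsp_trailing_bits
    data
}
```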

The (scoped/limited-lifetime) buffer doesn't seem to be dropped by gstreamer correctly and seems to persist long after it has been displayed by the client, causing memory leaks.

@cincodenada (Author)

Yeah, memory issues are frustrating! I do think it's worth removing the pools, or at least limiting pooling to common message sizes somehow - maybe just the one you mention. Especially since, at least on my machine, each pool seems to open an fd, so neolink very quickly runs out of fd's (and, I believe, fails to create new pools from then on).

I think the problem is that while it may be technically true that most messages are of similar sizes, that's not the metric that matters here: what matters is how many unique sizes we see.

I added a little debug line to log the size of each message being requested, and after running for less than 15 minutes, I had logged just over 25,000 messages. About half of those were indeed the same size (519 bytes), but the rest were spread across 6,000+ distinct sizes - i.e. 6,000+ pools - each used at most 11 times in those 15 minutes. More than half of those pools were used only once, and another quarter were used twice:
[chart: poolusage - how many times each pool was used]

So, while the one 519-byte pool is indeed reducing fragmentation (I'm guessing this is the NAL unit size you mentioned?), we've also created 6,000 pools that aren't really doing much. And again, since each pool seems to allocate an fd, I doubt the pools that are used at most once a minute are worth the cost of maintaining them.
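
For what it's worth, a tally like the one above only needs a small counter on the allocation path; here's a rough sketch (the names are illustrative, not my actual debug code):

```rust
use std::collections::HashMap;

/// Count how many times each requested message size (i.e. each would-be pool)
/// is seen, so we can tell how many sizes are used once, twice, etc.
#[derive(Default)]
struct SizeStats {
    counts: HashMap<usize, u64>,
}

impl SizeStats {
    fn record(&mut self, size: usize) {
        *self.counts.entry(size).or_insert(0) += 1;
    }

    /// Number of distinct sizes that were requested exactly `n` times.
    fn sizes_used_exactly(&self, n: u64) -> usize {
        self.counts.values().filter(|&&c| c == n).count()
    }
}
```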

@QuantumEntangledAndy (Owner)

Hmm, that's many more pools than in the tests I did. The 519-byte one is probably the audio pool, since it is small and usually the same size. I think I should look into putting the NAL padding back in and using pools over a certain range of sizes. Might want to work out the distribution to see what a good bucket size would be.

@cincodenada (Author) commented Dec 29, 2024

Here's the distribution of sizes for my little test:
[chart: pooldist - distribution of message sizes]

Obviously there's a cluster around 15 KB, but again, because the pooling is so granular, no single size was used more than 11 times in those 15 minutes. I just ran the stats again, and the distribution looks similar: in nearly an hour, the most-used sizes (besides 519 bytes) were used fewer than 40 times - so once a minute seems to be the best case - and it remained true that about half were used only once.

@cincodenada (Author) commented Dec 29, 2024

I've got to get to sleep so I'll check in later, but thanks for the back and forth, and for your work keeping this thing going and hunting down the bugs!

I very much appreciate this handy tool, it has allowed me to have a second angle on the little interactive holiday tree livestream I've run for the last four years. If you want to have a little fun and see it in action, you can change the lights on my tree at https://tree.c9a.dev/ - the lefthand (far side) view is being served courtesy of neolink :)
