Potential memory leak #202
I'm running mine on Windows, and resorted to just using Task Scheduler to kill and restart the neolink service every 3 hours. Not sure the memory issues have been worked out, and unfortunately I haven't seen the maintainer around here for a few weeks.
I'm running in an LXC container on a Proxmox server.
I'm aware. I just don't have the time to deal with this at the moment. It will be a while before my other commitments clear up and I can get back to coding on this one. I suspect this is happening on the gstreamer side of things and I want to look into using a gstreamer buffer pool instead of constantly creating new buffers for each frame. But I can't handle it yet.
Perhaps this is frigate creating many client connections. There was a similar issue elsewhere with the connections from frigate not closing fully and frigate just opening more and more connections. Can't remember what we did to fix that. I might spin up a frigate docker to test this against.
As another thing to try, I could write a modified Docker that dumps the valgrind info. Maybe you could run it?
I am happy to help with testing, but I would greatly appreciate it if you could upload it as a new branch (so I can just replace the image source in Portainer), as I'm on a course this week.
I can also easily recreate this problem. I currently have a 3G memory limit on the container, and the container gets killed roughly every 2-3 hours. If you need any help to collect more information on this, I'm more than happy to help.
@Dinth What is the architecture of your Portainer machine? Can I build just the x86_64 or am I going to need to go the extra mile and build arm as well?
I'm on x86_64, many thanks!
Ok, so the docker will be here (in half an hour): docker pull quantumentangledandy/neolink:test-valgrind. The binding of the config and ports are the same as usual, BUT there is an extra volume you should mount. Valgrind output is only created when the app exits normally, not when killed. P.S. the docker image is still building here https://github.com/QuantumEntangledAndy/neolink/actions/runs/8611052761, please wait half an hour or so until it is ready before you pull it.
P.S. the docker build succeeded. Please run it when you can and post the massif.out.
Here's the generated file: massif.out.zip
Hey. It's a really weird thing, but it seems that since I moved to the test-valgrind branch, my neolink RAM usage has stopped uncontrollably growing. I have restarted the container twice since and it stops growing at 400mb (as shown by Portainer). I think I still need some more time to test that (since the RAM usage was not growing immediately but after some time).
Ahh, that's because it sigterms.
That sigterm is expected. It's the way the app is shut down cleanly so that valgrind can write its output.
Can you try pulling the latest master branch of the docker? Perhaps something very recent fixed it.
On the other hand, I can see that you're measuring the memory usage of a neolink process, but is it possible that:
I will try to check it, but I have already updated to Docker 26 and something is broken with the exec command. Will get back to you on this. Regarding the master branch, I have been on :latest with Watchtower automatically updating neolink.
Nope, it will be a single process. Even if children were spawned, valgrind would track them too. There are two main changes here
It's possible the release build breaks something with its code optimisations.
I could try to run valgrind on the release build. It will just make figuring out what is wrong much harder without the debug symbols.
Anyway, I'm off to sleep now, so I'll do that tomorrow.
I've just been running the :latest release for 40 mins and top shows: Looks like it's actually the neolink process using that memory. smaps dump: https://paste.ubuntu.com/p/jd7W6rGDw2/
I'm building the release version of the valgrind docker here https://github.com/QuantumEntangledAndy/neolink/actions/runs/8624926076; it should be ready soon.
Alright, the valgrind docker is ready. I will test it too.
I could, but that would mean a full rebuild. It would be faster if you just changed the CMD in portainer to something like this:

```
timeout 7200 valgrind --tool=massif --massif-out-file=/valgrind/massif.out /usr/local/bin/neolink rtsp --config /etc/neolink.toml
```
Ahh, that's simple, thanks. I am still on a course, but I will try to generate a new valgrind log today.
Given the symptoms we are seeing, I suspect we have memory fragmentation, not a memory leak. The fragmentation differs depending on the OS's memory allocator, and valgrind replaces the default allocator with its own. This is why it is not showing up in valgrind.
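A rough illustration of why fragmentation can masquerade as a leak: in the toy first-fit allocator below (a hypothetical sketch, not neolink's or any real allocator), freeing half of the small blocks leaves half the arena free, yet a larger request still cannot be placed in any single hole, so a real process in this state has to keep growing its heap even though total live bytes are flat.

```python
# Toy first-fit allocator over a fixed-size arena, purely to illustrate
# fragmentation. This is NOT neolink's code, just a demonstration.

class ToyArena:
    def __init__(self, size):
        self.size = size
        self.allocs = {}   # block id -> (offset, length)
        self.next_id = 0

    def _free_runs(self):
        """Yield (offset, length) of each contiguous free run in the arena."""
        cursor = 0
        for off, length in sorted(self.allocs.values()):
            if off > cursor:
                yield (cursor, off - cursor)
            cursor = off + length
        if cursor < self.size:
            yield (cursor, self.size - cursor)

    def alloc(self, length):
        """First-fit allocation; returns a block id, or None if no run fits."""
        for off, run in self._free_runs():
            if run >= length:
                self.allocs[self.next_id] = (off, length)
                self.next_id += 1
                return self.next_id - 1
        return None

    def free(self, block_id):
        del self.allocs[block_id]

arena = ToyArena(16 * 1024)
ids = [arena.alloc(1024) for _ in range(16)]  # fill the arena with 1 KiB blocks
for i in ids[::2]:                            # free every other block
    arena.free(i)
# 8 KiB is now free in total, but only as sixteen separate 1 KiB holes,
# so a contiguous 2 KiB request cannot be satisfied:
print(arena.alloc(2048))  # None
```

Swapping allocators (as valgrind does) changes the placement policy, which is why the growth pattern can disappear under massif while still occurring under the default allocator.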
Here is a valgrind dump from the latest image you built today: massif.out. But even with the valgrind image I still see the growth in memory. The first two spikes were a freshly built container from the master branch from yesterday; the small peak at 14:00 was the 30min run of the valgrind container:
Here is my massif.out:
I'm going to run some optimisations based on valgrind. The only trouble with this plan is that buffer pools work best with a constant size, and the h264 frames from the camera are not constant. I've had an idea to create and insert NAL filler frames to pad up to 4kb blocks, and it seems to work with h264, but I can't test for h265.
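The padding idea above can be sketched roughly as follows. This is a hedged illustration, not neolink's implementation: it assumes an Annex-B h264 stream, a 4 KiB target block, and uses the filler-data NAL (type 12, payload of 0xFF bytes); h265 uses a different NAL header layout, which is why it needs separate testing.

```python
# Hedged sketch: pad an Annex-B h264 access unit up to a 4 KiB boundary by
# appending a filler-data NAL (nal_unit_type = 12). Constants are illustrative.

BLOCK = 4096
START_CODE = b"\x00\x00\x00\x01"
FILLER_HEADER = b"\x0c"   # forbidden_zero=0, nal_ref_idc=0, nal_unit_type=12
TRAILING = b"\x80"        # rbsp_trailing_bits

# Smallest filler NAL we can emit: start code + header + trailing bits.
MIN_FILLER = len(START_CODE) + len(FILLER_HEADER) + len(TRAILING)

def pad_to_block(frame: bytes) -> bytes:
    """Append a filler NAL so len(result) is a multiple of BLOCK."""
    shortfall = (-len(frame)) % BLOCK
    if shortfall == 0:
        return frame                      # already block-aligned
    if shortfall < MIN_FILLER:
        shortfall += BLOCK                # filler won't fit: pad to the next block
    payload_len = shortfall - MIN_FILLER  # number of 0xFF filler payload bytes
    return frame + START_CODE + FILLER_HEADER + b"\xff" * payload_len + TRAILING

padded = pad_to_block(b"\x00" * 5000)
print(len(padded) % BLOCK)  # 0
```

With every frame rounded up to a constant block size, a fixed-size buffer pool can recycle allocations instead of creating a fresh, oddly-sized buffer per frame.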
@fleaz this is your massif.out put through the visualiser. Not seeing any spiking in there.
I've pushed my buffer pool onto the test/valgrind branch and I will try to see what other things I could do to reduce fragmentation. It can be difficult to address, though.
It seems allocations of the gstreamer buffers are now reduced. There's still a lot in libcrypto, which is either the AES or the push notifications.
Seems that the libcrypto allocs are coming from the push notifications. Not sure why there are so many of those that they take more blocks than the gstreamer stuff. If you can, please try the latest with:

```toml
[[cameras]]
# Other usual things
push_notifications = false
```

That should turn it off, and we can test if this still has memory issues.
I'll need a massif.out of it. The graph by itself is not helpful, but the massif tool tracks the allocations and the deallocations and shows me what part of the code is creating the memory.
Might have found something. I added some prints to debug the length of all the various buffers that are used. One of the buffers went up to 87808 overnight, which is much too large. The code is only meant to hold 15s of buffered time, and it does this by filtering out all packets that are >15s behind the last received frame. I think the issue, though, is how reolink timestamps frames: I suspect it gave me lots of frames in the future that aren't being pruned. I'll see if I can buffer this a different way.
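The pruning rule described above can be sketched like this (a hypothetical model, not neolink's actual code): packets more than 15s behind the last received frame are dropped, so a packet stamped in the future is never "behind" and survives every prune, letting the buffer grow without bound.

```python
# Hedged sketch of the 15s history pruning described above (illustrative only).
# buffer is a list of (timestamp_seconds, payload) in arrival order.

HISTORY_SECS = 15.0

def prune(buffer):
    """Drop packets more than HISTORY_SECS behind the last received frame."""
    if not buffer:
        return buffer
    last_ts = buffer[-1][0]  # timestamp of the most recently received frame
    return [(ts, d) for ts, d in buffer if last_ts - ts <= HISTORY_SECS]

# Normal case: old packets fall out of the 15s window.
buf = [(float(t), b"frame") for t in range(30)]
print(len(prune(buf)))  # 16 packets survive: timestamps 14..29

# Bad camera clock: packets stamped an hour in the future are never >15s
# behind the last received frame, so they are never pruned and accumulate.
buf = [(3600.0 + t, b"future") for t in range(10)] + buf
print(len(prune(buf)))  # 26: all 10 future packets plus timestamps 14..29
```

Bounding the history by packet count or wall-clock arrival time, rather than by the camera's own timestamps, would avoid this failure mode.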
If you could test the next build, that would be good; it has a lot of debug prints to show all the buffer lengths, but hopefully the adjustment to the history length will help.
Thanks for diving into this and trying to find the problem! Here is one of the "bad" runs according to my memory graph in Grafana. Hope it helps: I also updated the container and will report if I see any change.
Any updates since the last build? I think it's addressed, and if so I'd like to push a release.
Pulled a new image (quantumentangledandy/neolink:latest) yesterday, and after ~20h the container is at 400MB memory usage.
My understanding is that the fix is currently only in the test/valgrind branch and will be merged to :latest only after some testing.
It's actually in both: since it was an identifiable bug, I felt it should be pushed to fix issues for those on docker:latest. It's not in a release yet though, so no version number for it yet.
That's what I assumed after checking the commit history, therefore I pulled ":latest", which is by (my) definition always the current main branch and not the current release. And otherwise I would be really surprised why the leak is gone if the change were not in this image, haha :D
@QuantumEntangledAndy I have now tried the neolink:test-valgrind image for 2 weeks and haven't had a memory leak issue since. I would say it works.
So the leakage is fixed in the docker release?
You can pull the latest versions from the Actions tab in github by finding one from the CI workflow, for example this one https://github.com/QuantumEntangledAndy/neolink/actions/runs/8934277800. Builds are made for EVERY push.
At the moment I am addressing issues with pause and VLC: I reduced the buffer size, and it seems to not be enough for some clients to determine the stream type before the pause. So I want to fix that before a release.
I whipped up something quick for those wanting to use neolink without having to babysit it constantly: just make sure you set "restart: always".
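A minimal docker-compose sketch along those lines; the image tag and config path come from this thread, while the service name and port mapping are assumptions you should adjust to your setup. The only essential line is `restart: always`:

```yaml
version: "3"
services:
  neolink:
    image: quantumentangledandy/neolink:latest
    restart: always            # re-spawn the container whenever neolink dies
    ports:
      - "8554:8554"            # RTSP port; adjust to your configuration
    volumes:
      - ./neolink.toml:/etc/neolink.toml
```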
Describe the bug
Hi. I might have found a memory leak, but apologies: I am not really able to provide more details, as I've been dealing with an IT disaster at home and my only access to the docker machine is via a Java client connected to iDRAC.
During last night, my pfSense died, together with the DHCP server.
I can only provide some screenshots, as my docker machine lost its DHCP lease and is offline.
Neolink logs - hundreds of screens of this:
Versions
NVR software: Frigate
Neolink software: (apologies, currently my docker machine is down, but it's the latest docker image as of 9/12/2023, maintained by Watchtower)