
Potential memory leak #202

Open · Dinth opened this issue Dec 10, 2023 · 58 comments
Labels: bug (Something isn't working)

Dinth commented Dec 10, 2023

Describe the bug
Hi. I might have found a memory leak, but apologies, I am not really able to provide more details: I have been dealing with an IT disaster at home and my only access to the Docker machine is via a Java client connected to iDRAC.
During the night my pfSense died, together with the DHCP server.

I can only provide some screenshots, as my Docker machine lost its DHCP lease and is offline.
Screenshot from 2023-12-10 12-06-57

Neolink logs - hundreds of screens of this:
image

Versions
NVR software: Frigate
Neolink software: (apologies, currently my Docker machine is down, but it is the latest Docker image as of 9/12/2023, kept up to date by Watchtower)

Dinth added the bug label on Dec 10, 2023
bwthor commented Dec 11, 2023

I'm running mine on Windows, and resorted to just using Task Scheduler to kill and restart the neolink service every 3 hours. I'm not sure the memory issues have been worked out, and unfortunately I haven't seen the maintainer around here for a few weeks.

zeluisg commented Jan 25, 2024

I'm running it in an LXC container on a Proxmox server.
I still have a steady memory leak that ends up crashing the container (4 GB).

QuantumEntangledAndy (Owner) commented

I'm aware. I just don't have the time to deal with this at the moment. It will be a while before my other commitments clear up and I can get back to coding on this one.

I suspect this is happening on the gstreamer side of things, and I want to look into using a gstreamer buffer pool instead of constantly creating new buffers for each frame. But I can't tackle it yet.
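
For context, the buffer-pool idea usually looks something like the following with the gstreamer-rs bindings. This is only a rough sketch: the caps, buffer size, pool limits, and the wrap_frame helper name are placeholders for illustration, not anything taken from neolink itself.

use gstreamer as gst;
use gst::prelude::*;

fn make_pool() -> Result<gst::BufferPool, Box<dyn std::error::Error>> {
    gst::init()?;
    let pool = gst::BufferPool::new();
    let mut config = pool.config();
    // Fixed-size buffers (4 KiB here, purely illustrative) with a small
    // minimum kept alive; the pool recycles them instead of allocating a
    // fresh buffer for every frame.
    let caps = gst::Caps::builder("video/x-h264").build();
    config.set_params(Some(&caps), 4096, 2, 0);
    pool.set_config(config)?;
    pool.set_active(true)?;
    Ok(pool)
}

fn wrap_frame(pool: &gst::BufferPool, frame: &[u8]) -> Result<gst::Buffer, Box<dyn std::error::Error>> {
    // Buffers come out of the pool and return to it once downstream drops
    // them, so the per-frame heap allocation (and the fragmentation it can
    // cause) goes away.
    let mut buffer = pool.acquire_buffer(None)?;
    buffer
        .get_mut()
        .expect("freshly acquired buffer is writable")
        .copy_from_slice(0, frame)
        .map_err(|_| "frame larger than the pooled buffer size")?;
    Ok(buffer)
}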

QuantumEntangledAndy commented Apr 8, 2024

I've posted this on the other issue too but thought I'd post here in case you aren't subscribed there. I set up valgrind in some Docker containers to test things:

Camera and Client Setup

  • E1
  • Substream
  • Connected over RTSP with ffmpeg for 27 minutes

Debian Bookworm gstreamer1.0-x (1.22.0-3+deb12u1)

Screenshot 2024-04-08 at 15 16 17

Debian Sid gstreamer1.0-x (1.24.1-1 and others)

Screenshot 2024-04-08 at 15 16 31

It seems to be stable at 4 MB. Are there any more setup details that could help me find out what is causing this? How are you connecting, etc.?

Dinth commented Apr 8, 2024

Hey. I have just restarted the neolink container and noticed that it is misbehaving again and taking over 1 GB of RAM.
After the restart it was using 60 MB of RAM, but that had already grown to 100 MB after only a couple of minutes.

I've got one Argus 2 camera and I'm hooking neolink up to the Frigate NVR.

image

Dinth commented Apr 8, 2024

After around an hour:
image

QuantumEntangledAndy (Owner) commented

Perhaps this is Frigate creating many client connections. There was a similar issue elsewhere where the connections from Frigate were not closing fully and Frigate just kept opening more and more connections. I can't remember what we did to fix that.

I might spin up a frigate docker to test this against.

QuantumEntangledAndy (Owner) commented

As another thing to try, I could write a modified Docker image that dumps the valgrind info. Maybe you could run it?

Dinth commented Apr 8, 2024

I am happy to help with testing, but I would greatly appreciate it if you could upload it as a new branch (so I can just replace the image source in Portainer), as I'm on a course this week.

Dinth commented Apr 8, 2024

image
after 10h

fleaz commented Apr 8, 2024

I can also easily recreate this problem. I currently have a 3 GB memory limit on the container, and the container gets killed roughly every 2-3 hours.

If you need any help to collect more information on this, I'm more than happy to help.

QuantumEntangledAndy (Owner) commented

@Dinth What is the architecture of your Portainer machine? Can I build just the x86_64 or am I going to need to go the extra mile and build arm as well?

Dinth commented Apr 9, 2024

@Dinth What is the architecture of your Portainer machine? Can I build just the x86_64 or am I going to need to go the extra mile and build arm as well?

I'm on x86_64, many thanks!

QuantumEntangledAndy commented Apr 9, 2024

OK, so the Docker image will be here (in half an hour):

docker pull quantumentangledandy/neolink:test-valgrind

The bindings for the config and ports are the same as usual.

BUT there is an extra volume you should mount: /valgrind/. The valgrind output goes in there (specifically /valgrind/massif.out, but binding the whole dir is fine too).

Valgrind output is only created when the app exits normally, not when it is killed by docker stop (or other forceful stopping methods). To help with that I have added a timeout of 30 minutes, after which it will stop itself and write the file I need, which will be massif.out. If you really need to stop it early, send a SIGINT (not sure how Portainer can do that).

P.S. the Docker image is still building here: https://github.com/QuantumEntangledAndy/neolink/actions/runs/8611052761; please wait half an hour or so until it is ready before you pull it.
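
For anyone following along, a run command for that image could look roughly like the line below. The image name, the /etc/neolink.toml config path, and the /valgrind volume come from this comment, while the host-side paths and the 8554 RTSP port mapping are assumptions to adjust for your own setup:

docker run --rm -v /path/to/neolink.toml:/etc/neolink.toml -v /path/to/valgrind-out:/valgrind -p 8554:8554 quantumentangledandy/neolink:test-valgrind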

QuantumEntangledAndy (Owner) commented

P.S. the Docker build succeeded. Please run it when you can and post the massif.out.

Dinth commented Apr 9, 2024

Here's the generated file: massif.out.zip

QuantumEntangledAndy (Owner) commented

Screenshot 2024-04-09 at 19 17 39

This is the memory profile of that massif.out

Dinth commented Apr 9, 2024

Hey. It's a really weird thing, but it seems that since I moved to the test-valgrind branch, my neolink RAM usage has stopped growing uncontrollably. I have restarted the container twice since and it stops growing at 400 MB (as shown by Portainer). I think I still need some more time to test this (since the RAM usage did not start growing immediately but only after some time).

Dinth commented Apr 9, 2024

Ahh, that's because it gets a SIGTERM:

2024-04-09T12:58:42.005180861Z ==9== 

2024-04-09T12:58:42.005248333Z ==9== Process terminating with default action of signal 15 (SIGTERM)

2024-04-09T12:58:42.005593091Z ==9==    at 0x558D719: syscall (syscall.S:38)

2024-04-09T12:58:42.005701720Z ==9==    by 0x1D40836: parking_lot_core::thread_parker::imp::ThreadParker::futex_wait (linux.rs:112)

2024-04-09T12:58:42.005734066Z ==9==    by 0x1D405E3: <parking_lot_core::thread_parker::imp::ThreadParker as parking_lot_core::thread_parker::ThreadParkerT>::park (linux.rs:66)

2024-04-09T12:58:42.005793619Z ==9==    by 0x1D49C9D: parking_lot_core::parking_lot::park::{{closure}} (parking_lot.rs:635)

2024-04-09T12:58:42.005846198Z ==9==    by 0x1D4882E: with_thread_data<parking_lot_core::parking_lot::ParkResult, parking_lot_core::parking_lot::park::{closure_env#0}<parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#0}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#1}, parking_lot::condvar::{impl#1}::wait_until_internal::{closure_env#2}>> (parking_lot.rs:207)

2024-04-09T12:58:42.005869718Z ==9==    by 0x1D4882E: parking_lot_core::parking_lot::park (parking_lot.rs:600)

2024-04-09T12:58:42.005889943Z ==9==    by 0x1D4EFD8: parking_lot::condvar::Condvar::wait_until_internal (condvar.rs:333)

2024-04-09T12:58:42.005942304Z ==9==    by 0x1C3484D: parking_lot::condvar::Condvar::wait (condvar.rs:256)

2024-04-09T12:58:42.006016975Z ==9==    by 0x1CED5BB: tokio::loom::std::parking_lot::Condvar::wait (parking_lot.rs:149)

2024-04-09T12:58:42.006041703Z ==9==    by 0x1C636F4: tokio::runtime::park::Inner::park (park.rs:116)

2024-04-09T12:58:42.006061069Z ==9==    by 0x1C642D6: tokio::runtime::park::CachedParkThread::park::{{closure}} (park.rs:254)

2024-04-09T12:58:42.006077276Z ==9==    by 0x1C64475: tokio::runtime::park::CachedParkThread::with_current::{{closure}} (park.rs:268)

2024-04-09T12:58:42.006175963Z ==9==    by 0x1C21327: std::thread::local::LocalKey<T>::try_with (local.rs:286)

2024-04-09T12:58:42.006196923Z ==9==    by 0x1C64415: tokio::runtime::park::CachedParkThread::with_current (park.rs:268)

2024-04-09T12:58:42.006212884Z ==9==    by 0x1C6425D: tokio::runtime::park::CachedParkThread::park (park.rs:254)

2024-04-09T12:58:42.006268175Z ==9==    by 0x976218: tokio::runtime::park::CachedParkThread::block_on (park.rs:285)

2024-04-09T12:58:42.006436287Z ==9==    by 0xA9C604: tokio::runtime::context::blocking::BlockingRegionGuard::block_on (blocking.rs:66)

2024-04-09T12:58:42.006534954Z ==9==    by 0x397195: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}} (mod.rs:87)

2024-04-09T12:58:42.006628761Z ==9==    by 0xBD5663: tokio::runtime::context::runtime::enter_runtime (runtime.rs:65)

2024-04-09T12:58:42.006654879Z ==9==    by 0x397150: tokio::runtime::scheduler::multi_thread::MultiThread::block_on (mod.rs:86)

2024-04-09T12:58:42.006696927Z ==9==    by 0xA9FE94: tokio::runtime::runtime::Runtime::block_on (runtime.rs:351)

2024-04-09T12:58:42.006766288Z ==9==    by 0x909AAF: neolink::main (main.rs:85)

2024-04-09T12:58:42.006823569Z ==9==    by 0x98384A: core::ops::function::FnOnce::call_once (function.rs:250)

2024-04-09T12:58:42.006968811Z ==9==    by 0x64D28D: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:155)

2024-04-09T12:58:42.007054233Z ==9==    by 0x449190: std::rt::lang_start::{{closure}} (rt.rs:166)

2024-04-09T12:58:42.007226028Z ==9==    by 0x24E21F0: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284)

2024-04-09T12:58:42.007281819Z ==9==    by 0x24E21F0: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:554)

2024-04-09T12:58:42.007343138Z ==9==    by 0x24E21F0: try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:518)

2024-04-09T12:58:42.007402426Z ==9==    by 0x24E21F0: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:142)

2024-04-09T12:58:42.007460927Z ==9==    by 0x24E21F0: {closure#2} (rt.rs:148)

2024-04-09T12:58:42.007525706Z ==9==    by 0x24E21F0: do_call<std::rt::lang_start_internal::{closure_env#2}, isize> (panicking.rs:554)

2024-04-09T12:58:42.007574276Z ==9==    by 0x24E21F0: try<isize, std::rt::lang_start_internal::{closure_env#2}> (panicking.rs:518)

2024-04-09T12:58:42.007616162Z ==9==    by 0x24E21F0: catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> (panic.rs:142)

2024-04-09T12:58:42.007658013Z ==9==    by 0x24E21F0: std::rt::lang_start_internal (rt.rs:148)

2024-04-09T12:58:42.007672256Z ==9==    by 0x449169: std::rt::lang_start (rt.rs:165)

2024-04-09T12:58:42.007696468Z ==9==    by 0x909B5D: main (in /usr/local/bin/neolink)

2024-04-09T12:58:42.009966184Z ==9== 

2024-04-09T12:59:03.310544775Z ==9== Massif, a heap profiler

2024-04-09T12:59:03.310618694Z ==9== Copyright (C) 2003-2017, and GNU GPL'd, by Nicholas Nethercote

2024-04-09T12:59:03.310633571Z ==9== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info

2024-04-09T12:59:03.310646373Z ==9== Command: /usr/local/bin/neolink rtsp --config /etc/neolink.toml

2024-04-09T12:59:03.310658339Z ==9== 

QuantumEntangledAndy (Owner) commented

That SIGTERM is expected. It's the way the timeout command signals the app to stop. Nothing wrong with it at all.

QuantumEntangledAndy (Owner) commented

Can you try pulling the latest master branch of the Docker image? Perhaps something very recent fixed it.

Dinth commented Apr 9, 2024

On the other hand, I can see that you're measuring the memory usage of the neolink process, but is it possible that:

  • Multiple neolink processes are being spawned?
  • something else in the container is hogging memory?

I will try to check it, but I have already updated to Docker 26 and something is broken with the exec command. I will get back to you on this.

Regarding the master branch, I have been on :latest with Watchtower automatically updating neolink.

QuantumEntangledAndy (Owner) commented

Nope, it will be a single process. Even if children were spawned, valgrind would track them too.

There are two main changes here

  • ulimit -n 1024
  • using the debug build not the release

It's possible the release build breaks something with its code optimisations.

QuantumEntangledAndy (Owner) commented

I could try to run valgrind on the release build. It will just make figuring out what is wrong much harder without the debug symbols

QuantumEntangledAndy (Owner) commented

Anyways I'm off to sleep now so I'll do that tomorrow

Dinth commented Apr 9, 2024

I've just been running the :latest release for 40 minutes and top shows:
image

Looks like it's actually the neolink process using that memory.

smaps dump: https://paste.ubuntu.com/p/jd7W6rGDw2/

QuantumEntangledAndy (Owner) commented

I'm building the release version of the valgrind Docker image here: https://github.com/QuantumEntangledAndy/neolink/actions/runs/8624926076; it should be ready soon.

QuantumEntangledAndy (Owner) commented

Alright, the valgrind Docker image is ready. I will test it too.

QuantumEntangledAndy commented Apr 10, 2024

I could, but that would mean a full rebuild. It would be faster if you just changed the CMD in Portainer to something like this:

timeout 7200 valgrind --tool=massif --massif-out-file=/valgrind/massif.out /usr/local/bin/neolink rtsp --config /etc/neolink.toml
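
Once a run finishes, the massif output can be summarised with valgrind's standard ms_print tool (or a GUI such as massif-visualizer); the path below assumes the /valgrind volume binding described earlier:

ms_print /valgrind/massif.out | less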

Dinth commented Apr 10, 2024

Ahh, that's simple, thanks. I am still on a course, but I will try to generate a new valgrind log today.

QuantumEntangledAndy (Owner) commented

Given the symptoms we are seeing, I suspect we have memory fragmentation, not a memory leak. Fragmentation differs depending on the OS's memory allocator, and valgrind replaces the default allocator with its own. This is why it is not showing up in valgrind.
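
One way to probe that hypothesis, sketched here as a hypothetical experiment rather than anything done in neolink, is to swap Rust's global allocator (for example to jemalloc via the tikv-jemallocator crate) and compare resident memory under the same workload; a large difference between allocators points at fragmentation rather than a true leak:

// Hypothetical experiment, not part of neolink: route all Rust heap
// allocations through jemalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() {
    // Run the normal workload here; only the allocator behind Vec, Box,
    // String, etc. changes, so any difference in RSS growth comes from
    // allocator behaviour (fragmentation), not from what the code retains.
}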

fleaz commented Apr 10, 2024

Here is a valgrind dump from the latest image you built today: massiv.out

But even with the valgrind image I still see the growth in memory. The first two spikes were a freshly built container from yesterday's master branch; the small peak at 14:00 was the 30 min run of the valgrind container:
2024-04-10T14:49:59,535882416+02:00

Dinth commented Apr 10, 2024

Here is my massif.out:
massif.out.zip
I have been running it for 2 hours and the memory usage grew to 800 MB during that time. It seems to grow more slowly with valgrind running, but it is still growing. Anyhow, I have already opened the file in massif-visualizer and it is not showing any excessive memory usage at all.
Maybe it is just fragmentation and not real memory usage, as you mentioned, but I haven't noticed similar behaviour in any of my other containers; even a large PostgreSQL database container sticks to 6 GB. Not to mention the crashes when the memory usage grows too much.

QuantumEntangledAndy (Owner) commented

I'm going to run some optimisations based on valgrind dhat and see if I can reduce allocations with some buffer pools. That should help a bit with the fragmentation.

The only trouble with this plan is that buffer pools work best with a constant size, and the H.264 frames from the camera are not a constant size. I've had an idea to create and insert NAL filler units to pad frames up to 4 kB blocks; it seems to work with H.264, but I can't test it for H.265.
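
To illustrate the padding idea (a sketch only, not neolink's code; H.265 uses a two-byte NAL header, so it would need different handling), an Annex B H.264 access unit can be padded up to a 4 KiB boundary by appending a filler-data NAL unit (type 12):

// Pad an Annex B H.264 access unit to a multiple of `block` bytes by
// appending a filler-data NAL unit (nal_unit_type = 12). Sketch only.
fn pad_with_filler(frame: &mut Vec<u8>, block: usize) {
    const OVERHEAD: usize = 4 + 1 + 1; // start code + NAL header + rbsp stop bit
    let mut pad = (block - frame.len() % block) % block;
    if pad == 0 {
        return; // already aligned
    }
    if pad < OVERHEAD {
        pad += block; // not enough room for a filler NAL, pad to the next block
    }
    frame.extend_from_slice(&[0x00, 0x00, 0x00, 0x01, 0x0C]); // start code + filler NAL header
    frame.extend(std::iter::repeat(0xFFu8).take(pad - OVERHEAD)); // ff_byte run
    frame.push(0x80); // rbsp_trailing_bits
}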

QuantumEntangledAndy (Owner) commented

@fleaz this is your massif.out put through the visualiser

Screenshot 2024-04-11 at 11 11 30

Not seeing any spiking in there

QuantumEntangledAndy (Owner) commented

I've pushed my buffer pool onto the test/valgrind branch and I will try to see what other things I could do to reduce fragmentation. It can be difficult to address though

QuantumEntangledAndy (Owner) commented

It seems the gstreamer buffer allocations are now reduced. There's still a lot in libcrypto, which is either the AES or the push notifications.

QuantumEntangledAndy (Owner) commented

Seems that the libcrypto allocations are coming from the push notifications. Not sure why there are so many of those that they take more blocks than the gstreamer stuff.

If you can, please try the latest test/valgrind with:

[[cameras]]
# Other usual things
push_notifications = false

That should turn it off and we can test if this still has memory issues

Dinth commented Apr 11, 2024

It's still happening on the latest test-valgrind:
image
but so far I have only tested WITHOUT push_notifications = false.
I will disable push notifications in the afternoon and try again.

Dinth commented Apr 11, 2024

Same thing with push_notifications = false:
image

fleaz commented Apr 11, 2024

Apparently the problem is sometimes way worse on some runs than on others 🤔
I have created tooling to move the massif file after every run so they don't get overwritten, and I will try to provide you with a file from one of those bad runs. Maybe they are more helpful to you :)

Screenshot shows the 30min runs of the Valgrind container over the last 12h:
2024-04-11T22:19:35,833406447+02:00

QuantumEntangledAndy (Owner) commented

I'll need a massif.out of it. The graph by itself is not helpful, but the massif tooling tracks the allocations and deallocations and shows me what part of the code is creating the memory.

QuantumEntangledAndy (Owner) commented

Might have found something. I added some prints to debug the length of all the various buffers that are used.

One of the buffers went up to a length of 87808 overnight, which is much too large. The code is only meant to hold 15 s of buffered time, and it does this by filtering out all packets that are more than 15 s behind the last received frame.

I think the issue, though, is how Reolink timestamps things: I suspect it gave me lots of frames in the future that aren't being pruned. I'll see if I can buffer this a different way.
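
To illustrate the failure mode with hypothetical code (not the actual neolink buffer): if frames are only dropped when they fall more than 15 s behind the newest timestamp, frames stamped in the future never age out and the buffer can grow without bound, so a hard cap on the buffered frame count is one cheap guard:

use std::collections::VecDeque;
use std::time::Duration;

// Hypothetical frame buffer: keeps roughly `window` of history behind the
// newest camera timestamp, plus a hard cap so bogus timestamps cannot make
// it grow without bound.
struct FrameBuffer {
    frames: VecDeque<(Duration, Vec<u8>)>, // (camera timestamp, frame payload)
    window: Duration,                      // e.g. 15 s of history
    max_frames: usize,                     // safety net against clock nonsense
}

impl FrameBuffer {
    fn push(&mut self, ts: Duration, payload: Vec<u8>) {
        self.frames.push_back((ts, payload));
        let newest = ts;
        // Drop anything more than `window` older than the newest frame.
        while self.frames.front().map_or(false, |f| newest.saturating_sub(f.0) > self.window) {
            self.frames.pop_front();
        }
        // Frames stamped in the "future" are never `window` behind the newest
        // arrival, so the check above never removes them; an absolute length
        // cap catches that case.
        while self.frames.len() > self.max_frames {
            self.frames.pop_front();
        }
    }
}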

QuantumEntangledAndy (Owner) commented

If you could test the next build, that would be good. It has a lot of debug prints to show all the buffer lengths, but hopefully the adjustment to the history length will help.

fleaz commented Apr 17, 2024

Thanks for diving into this and trying to find the problem!

Here is one of the "bad" runs according to my memory graph in Grafana. Hope it helps:
massif_2024-04-17_2025.txt

I also updated the container and will report if I see any change

QuantumEntangledAndy (Owner) commented

Any updates since the last build? I think it's addressed, and if so I'd like to push a release.

Dinth commented Apr 26, 2024

Much better on my side.
After 4 h:
image

fleaz commented Apr 29, 2024

Pulled a new image (quantumentangledandy/neolink:latest) yesterday and after ~20 h the container is at 400 MB memory usage.
Still feels a bit high for what it does, but the actual leak appears to be gone. Great work and thanks for investing so much time and effort into this 🥳

Dinth commented Apr 29, 2024

Pulled a new image (quantumentangledandy/neolink:latest) yesterday and after ~20 h the container is at 400 MB memory usage. Still feels a bit high for what it does, but the actual leak appears to be gone. Great work and thanks for investing so much time and effort into this 🥳

My understanding is that the fix is currently only in the test/valgrind branch and will be merged into :latest only after some testing.
Try the test branch, but you can replace the default command with the command from the :latest image to prevent the automatic restarts.

QuantumEntangledAndy (Owner) commented

It's actually in both: since it was an identifiable bug, I felt it should go out to fix things for those on docker :latest. It's not in a tagged release yet, though, so there's no version number for it yet.

fleaz commented Apr 30, 2024

That's what I assumed after checking the commit history, so I pulled ":latest", which by (my) definition is always the current main branch and not the current release.

And otherwise I would be really surprised that the leak is gone if the change were not in this image haha :D

Pytonballoon810 commented

@QuantumEntangledAndy I have now tried the neolink:test-valgrind image for 2 weeks and haven't had a memory leak issue since. I would say it works.

pr0phe commented May 3, 2024

So the leak is fixed for the Docker release?
@QuantumEntangledAndy When will the regular version be fixed too?

QuantumEntangledAndy (Owner) commented

You can pull the latest versions from the Actions tab in GitHub by finding one from the CI workflow, for example this one: https://github.com/QuantumEntangledAndy/neolink/actions/runs/8934277800. Builds are made for EVERY push.

QuantumEntangledAndy (Owner) commented

At the moment I am addressing issues with pause and VLC: I reduced the buffer size and it seems not to be enough for some clients to determine the stream type before the pause. So I want to fix that before a release.

Pytonballoon810 commented

I whipped up something quick for those wanting to use neolink without having to babysit it constantly:
#286 (comment)

just make sure you set "restart: always"
