[ROS2][BUG] Possible memory leak: bad_alloc thrown #136
So some fluctuation in memory is to be expected. I've run the ROS1 version of this on a 10-hour bag to test this and found that it grows, but it also drops a bunch at regular intervals. Because of the scheduler and other non-deterministic effects, after running it a few times I'm reasonably confident the ROS1 version doesn't actually grow with time. ROS2, admittedly, I haven't profiled. However, did you make sure to have the interactive mode param off? If not, there will be a linear increase in memory; check the README, I outline it there. Also make sure the issue is in this package and not ROS2 itself, which wouldn't surprise me. For the bad_alloc, it's best to have symbols and figure out where that happened specifically. I don't have the cycles today to look at your zip; can you give me the tl;dr on that? Also, you're welcome to address some of these problems with PRs ;-) This time of year, spare time to look at problems like this is few and far between for me.
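(For reference, a minimal sketch of double-checking that flag on the running node from a separate process; the node name "slam_toolbox" and parameter name "interactive_mode" are assumptions here, so check the README and your launch files for the actual names.)

```cpp
// Hedged sketch: query another node's parameter to confirm interactive mode is off.
// "slam_toolbox" and "interactive_mode" are assumed names, not verified against the package.
#include <chrono>
#include <memory>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("param_check");

  // Synchronous parameter client pointed at the remote node.
  rclcpp::SyncParametersClient client(node, "slam_toolbox");
  if (client.wait_for_service(std::chrono::seconds(2))) {
    const bool interactive = client.get_parameter<bool>("interactive_mode", false);
    RCLCPP_INFO(node->get_logger(), "interactive mode: %s", interactive ? "on" : "off");
  } else {
    RCLCPP_WARN(node->get_logger(), "parameter service not available");
  }

  rclcpp::shutdown();
  return 0;
}
```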
Interactive mode should be off, as the localization node is running.
Stacktrace of the crash after ~4 hours of running:
Need to find out if it has something to do with these messages coming up:
Another crash with SIGSEGV after ~1.5 hours:
These are crashes in what looks like message_filters; can you open a ticket with them (feel free to link this ticket)? I don't see anything in the traceback that makes me believe there's anything wrong with this codebase, unless I'm missing a key line. It never even hits this codebase before the crash; it's in the message filters. It's likely DDS related. What DDS vendor are you using? Have you tried Cyclone? Are you using ROS2 Dashing or Eloquent? The signal failures are likely related.
Yep, you are right. The stacktraces were taken when using OpenSplice and Dashing. I will also test tomorrow with Eloquent and Cyclone.
Might also be OpenSplice; that's a little less well supported than Fast RTPS and Cyclone. I'd open the ticket in message_filters; those folks are the right people to bring this to.
Still fighting with this issue and the crashes, and it seems it's never the same. Latest crash log:
Eloquent, Ubuntu 18.04, kernel 5.0, Cyclone DDS
Again, it looks like now there is a bug in
This is really interesting; I have never run into these issues in ROS2. I don't think any of these issues relate to this code at all; none of these tracebacks even get to the toolbox, they're all crashing at the ROS, TF, or message filters layer. For your TF one (not the last one you posted, the one before that), what line did it fail on? Is that TF related to the toolbox's TF subscribers, or TF related to the message filters (since those use TF to filter for valid transforms)? You should submit a ticket to these upstream repos and see if they have any suggestions, and link the ticket back here. How are you building ROS2 Eloquent, from source or debians? Maybe there's something not playing well in your source build? Have you tried cleaning everything out and, with clean paths, trying again? I don't think you've mentioned in the thread - what's the processor and memory you have?
I also don't think so, but it's something that this library uses and my other nodes don't, like message_filters.
Sorry, but I don't know anymore. I can't see any message_filters calls in the trace, so I think it's the toolbox subscriber.
No source build; released debian packages, no testing packages.
I think I cleaned the workspace a few times, at least when changing to Eloquent.
Core i5, 8 GB RAM. My robot is active 24/7, not driving all the time but sitting in its charging station and being monitored for memory, disk, crashes, etc.
I also saw ticket https://github.com/ros2/message_filters/issues/43#issue-531922468 (to link the issues).
@maxlein an interesting question to me would be what happens if you throttle your laser from 100 Hz to 40 Hz. If it's an async issue, maybe you'd see this go away, since the issue is in message filters: ask for fewer of them at a given time. If it were to stop happening, or be greatly reduced, that would be a valuable insight.
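(A minimal sketch of what such a throttle relay could look like as a standalone node; the topic names and the ~12.5 Hz period are illustrative assumptions, not anything from the toolbox.)

```cpp
// Hedged sketch: republish "scan" on "scan_throttled" at a reduced rate so
// downstream subscribers (e.g. message_filters) receive fewer messages.
#include <cstdint>
#include <memory>
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/laser_scan.hpp>

class ScanThrottle : public rclcpp::Node
{
public:
  ScanThrottle()
  : Node("scan_throttle"),
    period_ns_(80LL * 1000 * 1000),  // 80 ms, roughly 12.5 Hz
    last_pub_(0, 0, RCL_ROS_TIME)
  {
    pub_ = create_publisher<sensor_msgs::msg::LaserScan>(
      "scan_throttled", rclcpp::SensorDataQoS());
    sub_ = create_subscription<sensor_msgs::msg::LaserScan>(
      "scan", rclcpp::SensorDataQoS(),
      [this](sensor_msgs::msg::LaserScan::SharedPtr msg) {
        const rclcpp::Time now = this->now();
        // Republish only if at least one period has passed since the last publish.
        if ((now - last_pub_).nanoseconds() >= period_ns_) {
          pub_->publish(*msg);
          last_pub_ = now;
        }
      });
  }

private:
  int64_t period_ns_;
  rclcpp::Time last_pub_;
  rclcpp::Publisher<sensor_msgs::msg::LaserScan>::SharedPtr pub_;
  rclcpp::Subscription<sensor_msgs::msg::LaserScan>::SharedPtr sub_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<ScanThrottle>());
  rclcpp::shutdown();
  return 0;
}
```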
So I have now throttled the scan rate from 25 Hz to 12.5 Hz, and it has been running for about 10 hours now.
That makes sense then; it's probably a race condition like Jacob mentioned. Something interesting would be to see whether this goes away if you change the QoS to only have a depth of 1, since then there aren't multiple messages queued up.
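(For illustration only, a hedged sketch of a depth-1 laser scan subscription through message_filters; the topic name and standalone setup are assumptions, not the toolbox's actual code.)

```cpp
// Hedged sketch: a message_filters LaserScan subscriber whose QoS history
// depth is clamped to 1 so only the newest scan is ever queued.
#include <memory>
#include <message_filters/subscriber.h>
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/laser_scan.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("depth_one_scan_sub");

  // Start from the sensor-data profile and reduce the queue depth to 1.
  rmw_qos_profile_t scan_qos = rmw_qos_profile_sensor_data;
  scan_qos.depth = 1;

  message_filters::Subscriber<sensor_msgs::msg::LaserScan> scan_sub;
  scan_sub.subscribe(node.get(), "scan", scan_qos);
  scan_sub.registerCallback(
    [&node](const sensor_msgs::msg::LaserScan::ConstSharedPtr & msg) {
      RCLCPP_INFO(node->get_logger(), "scan with %zu ranges", msg->ranges.size());
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```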
FYI: Seems to work right now with a message depth of 1.
Since it's below this package and we really don't NEED more than 1, can you PR the updates? Let's just square this away.
Any updates?
This issue isn't technically "fixed", but it's functionally patched and the appropriate ticket is filed with message_filters. Closing now.
So yesterday I tried letting the localization_node run overnight on a real robot to see what happens:
And it looks like a memory leak after an hour of running:
Memory is increasing by about 1 to 2 MB/minute.
From the size I would say the scans are leaking, but I will need to test with valgrind...
Scan bandwidth: average 148.05 KB/s at 25 Hz.
Also, system memory was only half used, so there seems to be a memory limit inside the node (stack_size_to_use?).
Update:
Valgrind log of the localization node with the Gazebo simulation:
slam_dbg.zip