[ROS2][BUG] Possible memory leak: bad_alloc thrown #136

Closed
maxlein opened this issue Nov 29, 2019 · 19 comments

maxlein (Contributor) commented Nov 29, 2019

So yesterday I let the localization_node run overnight on a real robot to see what happens.

It looks like a memory leak shows up after about an hour of running:

[screenshot: node memory usage climbing over time]

Memory is increasing by about 1 to 2 MB/minute.
From the size I would say scans are leaking, but I will need to test with valgrind...
Scan bandwidth:
average: 148.05 KB/s at 25 Hz

[slam_toolbox-3] 1574957432.331870833: [osg_1.slam_toolbox] [INFO]   [signalFailure] Drop message: frame 'microscan_link' at time 1574957431.934 for reason(0)
[slam_toolbox-3] 1574958451.382982459: [osg_1.slam_toolbox] [INFO]   [signalFailure] Drop message: frame 'microscan_link' at time 1574957871.735 for reason(0)
[slam_toolbox-3] 1574958451.422358798: [osg_1.slam_toolbox] [INFO]   [signalFailure] Drop message: frame 'microscan_link' at time 1574958451.381 for reason(0)
[slam_toolbox-3] terminate called after throwing an instance of 'std::bad_alloc'
[slam_toolbox-3]   what():  std::bad_alloc
[ERROR] [slam_toolbox-3]: process has died

Also, system memory was only half used, so there may be a memory limit inside the node (stack_size_to_use?).
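For reference, a minimal sketch of how a process can raise its own stack soft limit with POSIX setrlimit, assuming that is roughly what stack_size_to_use controls; the parameter's actual behavior and the 40 MB value below are assumptions, not slam_toolbox's code:

// Minimal sketch (not slam_toolbox code): raise this process's stack soft
// limit with POSIX setrlimit. Whether stack_size_to_use works this way is
// an assumption; the 40 MB value is made up.
#include <sys/resource.h>
#include <cstdio>

int main()
{
  struct rlimit limits;
  if (getrlimit(RLIMIT_STACK, &limits) != 0) {
    std::perror("getrlimit");
    return 1;
  }

  limits.rlim_cur = 40000000;  // new soft limit in bytes; must stay <= rlim_max
  if (setrlimit(RLIMIT_STACK, &limits) != 0) {
    std::perror("setrlimit");
    return 1;
  }

  std::printf("stack soft limit: %llu bytes\n",
              static_cast<unsigned long long>(limits.rlim_cur));
  return 0;
}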

Update:
Valgrind log of the localization node with the Gazebo simulation:
slam_dbg.zip

maxlein changed the title from "[ROS2][BUG] bad_alloc thrown" to "[ROS2][BUG] Possible memory leak: bad_alloc thrown" on Nov 29, 2019
SteveMacenski (Owner) commented Nov 29, 2019

So some fluctuation in memory is to be expected. I've run the ROS1 version of this on a 10-hour bag to test this and found that while it grows, it also drops a bunch at regular intervals. Because of the scheduler and other non-deterministic effects, after running it a few times I'm reasonably confident the ROS1 version doesn't actually grow over time.

Admittedly, I haven't profiled the ROS2 version yet. However, did you make sure the interactive mode parameter is off? If not, there will be a linear increase in memory; check the README, I outline it there. Also make sure the issue is in this package and not in ROS2 itself, which wouldn't surprise me.

For the bad_alloc, it's best to build with debug symbols and figure out where that happened specifically. I don't have the cycles today to look at your zip; can you give me the tl;dr on it?
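As a generic aside (not something this package sets up), one way to at least get a rough location without a core dump is a terminate handler that dumps a raw backtrace before aborting; a sketch using glibc's execinfo, illustrative only:

// Generic sketch: print a raw backtrace when an uncaught exception
// (e.g. std::bad_alloc) terminates the process. Needs glibc; build with
// -rdynamic to get readable symbol names.
#include <cstdlib>
#include <exception>
#include <new>
#include <execinfo.h>
#include <unistd.h>

void terminate_with_backtrace()
{
  void * frames[64];
  const int depth = backtrace(frames, 64);
  backtrace_symbols_fd(frames, depth, STDERR_FILENO);  // dump raw frames to stderr
  std::abort();
}

int main()
{
  std::set_terminate(terminate_with_backtrace);
  // ... node setup and rclcpp::spin() would go here ...
  throw std::bad_alloc();  // demo: triggers the handler above
}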

Also, you're welcome to address some of these problems with PRs ;-) At this time of year, spare time to look at problems like this is few and far between for me.

maxlein (Contributor, Author) commented Nov 30, 2019

Interactive mode should be off, since the localization node is running.
Also, I forgot to set ulimit -c, so I have no core dump of the bad_alloc.
But at the moment it's running over the weekend and should give results.

maxlein (Contributor, Author) commented Dec 2, 2019

Stack trace of the crash after ~4 hours of running:

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f5780231f40 (LWP 531))]
(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f577e30e801 in __GI_abort () at abort.c:79
#2  0x00007f577e963957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f577e969ab6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f577e969af1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f577e969d24 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f577e96a29c in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f577fb2e4ff in __gnu_cxx::new_allocator<unsigned long>::allocate (this=0x7ffe4913fac0, __n=<optimized out>) at /usr/include/c++/7/ext/new_allocator.h:111
#8  std::allocator_traits<std::allocator<unsigned long> >::allocate (__a=..., __n=<optimized out>) at /usr/include/c++/7/bits/alloc_traits.h:436
#9  std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_allocate (this=0x7ffe4913fac0, __n=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:172
#10 std::vector<unsigned long, std::allocator<unsigned long> >::_M_allocate_and_copy<std::move_iterator<unsigned long*> > (this=0x7ffe4913fac0, __last=..., __first=..., __n=<optimized out>)
    at /usr/include/c++/7/bits/stl_vector.h:1260
#11 std::vector<unsigned long, std::allocator<unsigned long> >::reserve (__n=<optimized out>, this=0x7ffe4913fac0) at /usr/include/c++/7/bits/vector.tcc:73
#12 tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::add (this=0x560ef462ac00, evt=...) at /opt/ros/dashing/include/tf2_ros/message_filter.h:314
#13 0x00007f577fb2c5dc in std::function<void (message_filters::MessageEvent<sensor_msgs::msg::LaserScan_<std::allocator<void> > const> const&)>::operator()(message_filters::MessageEvent<sensor_msgs::msg::LaserScan_<std::allocator<void> > const> const&) const (__args#0=..., this=0x560ef4513958) at /usr/include/c++/7/bits/std_function.h:706
#14 message_filters::CallbackHelper1T<message_filters::MessageEvent<sensor_msgs::msg::LaserScan_<std::allocator<void> > const> const&, sensor_msgs::msg::LaserScan_<std::allocator<void> > >::call (this=0x560ef4513950, 
    event=..., nonconst_force_copy=<optimized out>) at /opt/ros/dashing/include/message_filters/signal1.h:74
#15 0x00007f577fb2a6b0 in message_filters::Signal1<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::call (event=..., this=0x560ef4628108) at /opt/ros/dashing/include/message_filters/signal1.h:117
#16 message_filters::SimpleFilter<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::signalMessage (event=..., this=0x560ef4628108) at /opt/ros/dashing/include/message_filters/simple_filter.h:133
#17 message_filters::Subscriber<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::cb (e=..., this=0x560ef4628100) at /opt/ros/dashing/include/message_filters/subscriber.h:235
#18 message_filters::Subscriber<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::subscribe(rclcpp::Node*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rmw_qos_profile_t)::{lambda(std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>)#1}::operator()(std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>) const (msg=..., __closure=<optimized out>)
    at /opt/ros/dashing/include/message_filters/subscriber.h:174
#19 std::_Function_handler<void (std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>), message_filters::Subscriber<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::subscribe(rclcpp::Node*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rmw_qos_profile_t)::{lambda(std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>&&) (__functor=..., __args#0=...) at /usr/include/c++/7/bits/std_function.h:316
#20 0x00007f577fb53d24 in std::function<void (std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>)>::operator()(std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>) const (
    __args#0=std::shared_ptr<const sensor_msgs::msg::LaserScan_<std::allocator<void> >> (empty) = {...}, this=0x560ef4628950) at /usr/include/c++/7/bits/std_function.h:706
#21 rclcpp::AnySubscriptionCallback<sensor_msgs::msg::LaserScan_<std::allocator<void> >, std::allocator<void> >::dispatch (this=0x560ef4628910, 
    message=std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> >> (use count 7, weak count 0) = {...}, message_info=...) at /opt/ros/dashing/include/rclcpp/any_subscription_callback.hpp:163
#22 0x00007f577fb53f41 in rclcpp::Subscription<sensor_msgs::msg::LaserScan_<std::allocator<void> >, std::allocator<void> >::handle_message (this=0x560ef4628880, message=..., message_info=...)
    at /opt/ros/dashing/include/rclcpp/subscription.hpp:146
#23 0x00007f577f33d29f in rclcpp::executor::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) () from /opt/ros/dashing/lib/librclcpp.so
#24 0x00007f577f33e2e5 in rclcpp::executor::Executor::execute_any_executable(rclcpp::executor::AnyExecutable&) () from /opt/ros/dashing/lib/librclcpp.so
#25 0x00007f577f3451af in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/dashing/lib/librclcpp.so

I need to find out whether it has something to do with these messages coming up:
[signalFailure] Drop message: frame 'microscan_link' at time 1575060324.799 for reason(0)

maxlein (Contributor, Author) commented Dec 2, 2019

Another crash, this time with SIGSEGV, after ~1.5 hours:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/kin/playground_ws/install/slam_toolbox/lib/slam_toolbox/localization_slam'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f6958cf2977 in tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::add (this=0x7f69003d1850, evt=...) at /opt/ros/dashing/include/tf2_ros/message_filter.h:378
378               bc_.cancelTransformableRequest(*it);
[Current thread is 1 (Thread 0x7f69593f5f40 (LWP 23208))]
(gdb) backtrace
#0  0x00007f6958cf2977 in tf2_ros::MessageFilter<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::add (this=0x7f69003d1850, evt=...) at /opt/ros/dashing/include/tf2_ros/message_filter.h:378
#1  0x00007f6958cf05dc in std::function<void (message_filters::MessageEvent<sensor_msgs::msg::LaserScan_<std::allocator<void> > const> const&)>::operator()(message_filters::MessageEvent<sensor_msgs::msg::LaserScan_<std::allocator<void> > const> const&) const (__args#0=..., this=0x561eabee6018) at /usr/include/c++/7/bits/std_function.h:706
#2  message_filters::CallbackHelper1T<message_filters::MessageEvent<sensor_msgs::msg::LaserScan_<std::allocator<void> > const> const&, sensor_msgs::msg::LaserScan_<std::allocator<void> > >::call (this=0x561eabee6010, 
    event=..., nonconst_force_copy=<optimized out>) at /opt/ros/dashing/include/message_filters/signal1.h:74
#3  0x00007f6958cee6b0 in message_filters::Signal1<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::call (event=..., this=0x561eac00bb98) at /opt/ros/dashing/include/message_filters/signal1.h:117
#4  message_filters::SimpleFilter<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::signalMessage (event=..., this=0x561eac00bb98) at /opt/ros/dashing/include/message_filters/simple_filter.h:133
#5  message_filters::Subscriber<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::cb (e=..., this=0x561eac00bb90) at /opt/ros/dashing/include/message_filters/subscriber.h:235
#6  message_filters::Subscriber<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::subscribe(rclcpp::Node*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rmw_qos_profile_t)::{lambda(std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>)#1}::operator()(std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>) const (msg=..., __closure=<optimized out>)
    at /opt/ros/dashing/include/message_filters/subscriber.h:174
#7  std::_Function_handler<void (std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>), message_filters::Subscriber<sensor_msgs::msg::LaserScan_<std::allocator<void> > >::subscribe(rclcpp::Node*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rmw_qos_profile_t)::{lambda(std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>&&) (__functor=..., __args#0=...) at /usr/include/c++/7/bits/std_function.h:316
#8  0x00007f6958d17d24 in std::function<void (std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>)>::operator()(std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> > const>) const (
    __args#0=std::shared_ptr<const sensor_msgs::msg::LaserScan_<std::allocator<void> >> (empty) = {...}, this=0x561eac00c370) at /usr/include/c++/7/bits/std_function.h:706
#9  rclcpp::AnySubscriptionCallback<sensor_msgs::msg::LaserScan_<std::allocator<void> >, std::allocator<void> >::dispatch (this=0x561eac00c330, 
    message=std::shared_ptr<sensor_msgs::msg::LaserScan_<std::allocator<void> >> (use count 7, weak count 0) = {...}, message_info=...) at /opt/ros/dashing/include/rclcpp/any_subscription_callback.hpp:163
#10 0x00007f6958d17f41 in rclcpp::Subscription<sensor_msgs::msg::LaserScan_<std::allocator<void> >, std::allocator<void> >::handle_message (this=0x561eac00c2a0, message=..., message_info=...)
    at /opt/ros/dashing/include/rclcpp/subscription.hpp:146
#11 0x00007f695850129f in rclcpp::executor::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) () from /opt/ros/dashing/lib/librclcpp.so
#12 0x00007f69585022e5 in rclcpp::executor::Executor::execute_any_executable(rclcpp::executor::AnyExecutable&) () from /opt/ros/dashing/lib/librclcpp.so
#13 0x00007f69585091af in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/dashing/lib/librclcpp.so
#14 0x00007f6958505ea2 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) () from /opt/ros/dashing/lib/librclcpp.so
#15 0x0000561eabb25ef6 in main (argc=<optimized out>, argv=<optimized out>) at /

SteveMacenski (Owner) commented Dec 2, 2019

These are crashes in what looks like message_filters; can you open a ticket with them (feel free to link this ticket)? I don't see anything in the traceback that makes me believe there's anything wrong with this codebase, unless I'm missing a key line. It never even hits this codebase before the crash; it's in the message filters.

It's likely related. What DDS vendor are you using? Have you tried Cyclone? Are you using ROS2 Dashing or Eloquent?

The signal failures are likely related.

maxlein (Contributor, Author) commented Dec 2, 2019

Yep, you are right.
I was just saving all the info here in one place so I can take a better look later.

The stack traces were taken when using OpenSplice and Dashing.

I will also test tomorrow with Eloquent and Cyclone, because Cyclone with Dashing had some other bugs which were fixed in Eloquent.

SteveMacenski (Owner) commented:
It might also be OpenSplice; that's a little less well supported than Fast RTPS and Cyclone. I'd open the ticket in message_filters; those folks are the right people to bring this to.

maxlein (Contributor, Author) commented Dec 6, 2019

Still fighting with this issue and the crashes. And it seems it's never the same one.

Latest crash log:

double free or corruption

0x00007fb0a2366801 in __GI_abort () at abort.c:79
#2  0x00007fb0a23af897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fb0a24dcb9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007fb0a23b690a in malloc_printerr (str=str@entry=0x7fb0a24de828 "double free or corruption (fasttop)") at malloc.c:5350
#4  0x00007fb0a23b91d5 in _int_free (av=0x7fb0a2711c40 <main_arena>, p=0x55a6cc8c5090, have_lock=<optimized out>) at malloc.c:4230
#5  0x00007fb0a23bb79e in _int_realloc (av=av@entry=0x7fb0a2711c40 <main_arena>, oldp=oldp@entry=0x55a6cc8c4700, oldsize=oldsize@entry=2496, nb=nb@entry=2448) at malloc.c:4650
#6  0x00007fb0a23bef9b in __GI___libc_realloc (oldmem=0x55a6cc8c4710, bytes=2440) at malloc.c:3230
#7  0x00007fb0a1c68349 in rcl_clock_remove_jump_callback () from /opt/ros/eloquent/lib/librcl.so
#8  0x00007fb0a1c69d62 in rcl_timer_fini () from /opt/ros/eloquent/lib/librcl.so
#9  0x00007fb0a342fa92 in ?? () from /opt/ros/eloquent/lib/librclcpp.so
#10 0x00007fb0a342fe29 in rclcpp::TimerBase::~TimerBase() () from /opt/ros/eloquent/lib/librclcpp.so
#11 0x00007fb0a062a179 in ?? () from /opt/ros/eloquent/lib/libtf2_ros.so
#12 0x00007fb0a062a642 in tf2_ros::CreateTimerROS::remove(unsigned long const&) () from /opt/ros/eloquent/lib/libtf2_ros.so
#13 0x00007fb0a06259e3 in ?? () from /opt/ros/eloquent/lib/libtf2_ros.so
#14 0x00007fb09dd33fd4 in tf2::BufferCore::testTransformableRequests() () from /opt/ros/eloquent/lib/libtf2.so
#15 0x00007fb09dd351a2 in tf2::BufferCore::setTransformImpl(tf2::Transform const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) () from /opt/ros/eloquent/lib/libtf2.so
#16 0x00007fb09dd35597 in tf2::BufferCore::setTransform(geometry_msgs::msg::TransformStamped_<std::allocator<void> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
    () from /opt/ros/eloquent/lib/libtf2.so
#17 0x00007fb0a062c41e in tf2_ros::TransformListener::subscription_callback(std::shared_ptr<tf2_msgs::msg::TFMessage_<std::allocator<void> > >, bool) () from /opt/ros/eloquent/lib/libtf2_ros.so
#18 0x00007fb0a0630c88 in std::_Function_handler<void (std::shared_ptr<tf2_msgs::msg::TFMessage_<std::allocator<void> > >), std::_Bind<void (tf2_ros::TransformListener::*(tf2_ros::TransformListener*, std::_Placeholder<1>, bool))(std::shared_ptr<tf2_msgs::msg::TFMessage_<std::allocator<void> > >, bool)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<tf2_msgs::msg::TFMessage_<std::allocator<void> > >&&) ()
   from /opt/ros/eloquent/lib/libtf2_ros.so
#19 0x00007fb0a0636a1d in rclcpp::AnySubscriptionCallback<tf2_msgs::msg::TFMessage_<std::allocator<void> >, std::allocator<void> >::dispatch(std::shared_ptr<tf2_msgs::msg::TFMessage_<std::allocator<void> > >, rmw_message_info_t const&) () from /opt/ros/eloquent/lib/libtf2_ros.so
#20 0x00007fb0a0636e80 in rclcpp::Subscription<tf2_msgs::msg::TFMessage_<std::allocator<void> >, std::allocator<void>, rclcpp::message_memory_strategy::MessageMemoryStrategy<tf2_msgs::msg::TFMessage_<std::allocator<void> >, std::allocator<void> > >::handle_message(std::shared_ptr<void>&, rmw_message_info_t const&) () from /opt/ros/eloquent/lib/libtf2_ros.so
#21 0x00007fb0a33a4cef in rclcpp::executor::Executor::execute_subscription(std::shared_ptr<rclcpp::SubscriptionBase>) () from /opt/ros/eloquent/lib/librclcpp.so
#22 0x00007fb0a33a6015 in rclcpp::executor::Executor::execute_any_executable(rclcpp::executor::AnyExecutable&) () from /opt/ros/eloquent/lib/librclcpp.so
#23 0x00007fb0a33ac6ef in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/eloquent/lib/librclcpp.so

Eloquent, Ubuntu 18.04, Kernel 5.0, CycloneDDS

maxlein (Contributor, Author) commented Dec 6, 2019

Again, double free or corruption:

It now looks like there is a bug in librclcpp.
The interesting thing is that I have these crashes in this node only; my other 6 nodes run without any of these problems.

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f66f8bc4380 (LWP 23341))]
(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f66f6c60801 in __GI_abort () at abort.c:79
#2  0x00007f66f6ca9897 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f66f6dd6b9a "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007f66f6cb090a in malloc_printerr (str=str@entry=0x7f66f6dd8828 "double free or corruption (fasttop)") at malloc.c:5350
#4  0x00007f66f6cb31d5 in _int_free (av=0x7f66f700bc40 <main_arena>, p=0x5606adb31200, have_lock=<optimized out>) at malloc.c:4230
#5  0x00007f66f6cb579e in _int_realloc (av=av@entry=0x7f66f700bc40 <main_arena>, oldp=oldp@entry=0x5606adb2fc40, oldsize=oldsize@entry=5616, nb=nb@entry=5568) at malloc.c:4650
#6  0x00007f66f6cb8f9b in __GI___libc_realloc (oldmem=0x5606adb2fc50, bytes=5560) at malloc.c:3230
#7  0x00007f66f6562349 in rcl_clock_remove_jump_callback () from /opt/ros/eloquent/lib/librcl.so
#8  0x00007f66f6563d62 in rcl_timer_fini () from /opt/ros/eloquent/lib/librcl.so
#9  0x00007f66f7d29a92 in ?? () from /opt/ros/eloquent/lib/librclcpp.so
#10 0x00007f66f7cb05d2 in rclcpp::memory_strategies::allocator_memory_strategy::AllocatorMemoryStrategy<std::allocator<void> >::remove_null_handles(rcl_wait_set_t*) () from /opt/ros/eloquent/lib/librclcpp.so
#11 0x00007f66f7ca2327 in rclcpp::executor::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/eloquent/lib/librclcpp.so
#12 0x00007f66f7ca2775 in rclcpp::executor::Executor::get_next_executable(rclcpp::executor::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/eloquent/lib/librclcpp.so
#13 0x00007f66f7ca66e0 in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/eloquent/lib/librclcpp.so
#14 0x00007f66f7ca33e2 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) () from /opt/ros/eloquent/lib/librclcpp.so

SteveMacenski (Owner) commented:
This is really interesting; I have never run into these issues in ROS2.

I don't think any of these issues relate to this code at all; none of these tracebacks even get into the toolbox, they're all crashing at the ROS, TF, or message_filters layer. For your TF one (not the last one you posted, the one before that), what line did it fail on? Is that TF related to the toolbox's TF subscribers, or TF related to the message filters (since those use TF to filter for valid transforms)?

You should submit a ticket to these upstream repos and see if they have any suggestions; link the ticket back here. How are you building ROS2 Eloquent, from source or debians? Maybe there's something not playing well in your source build? Have you tried cleaning everything out and, with clean paths, trying again?

I don't think you've mentioned it in the thread - what processor and memory do you have?

maxlein (Contributor, Author) commented Dec 6, 2019

I also don't think so, but it's something this library uses that my other nodes don't, like message_filters.

Your TF one (not the last one you posted, the one before that) what line did it fail on?

Sorry, but I don't know anymore. I can't see any message_filters calls in that trace, so I think it's the toolbox's subscriber.

How are you building ROS2 eloquent, source or debians? Maybe there's something not playing well in your source build?

No source build, released debian packages, no testing packages.

Have you tried cleaning everything out, and with clean paths, trying again?

I think I cleaned the workspace a few times, at least when changing to Eloquent.
But I can try that again.

I don't think you've mentioned in the thread - what's the processor and memory you have?

Core i5, 8GB RAM
Eloquent, Ubuntu 18.04, Kernel 5.0, CycloneDDS

My robot is active 24/7, not driving all the time but sitting in its charging station, and it is monitored for memory, disk, crashes, etc...

SteveMacenski (Owner) commented:
@maxlein an interesting question to me would be what happens if you throttle your laser, e.g. from 100 Hz to 40 Hz. If it's an async issue, maybe you'll see this go away: since the issue is in a message filter, ask for fewer messages at a given time.

If it were to stop happening, or be greatly reduced, that would be a valuable insight.
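If there's no convenient way to throttle at the driver, a standalone relay that republishes every N-th scan is enough for the experiment. A minimal sketch; the topic names "scan"/"scan_throttled" and the 2:1 ratio are placeholders, not the toolbox's configuration:

// Minimal throttle-relay sketch (not slam_toolbox code): republish every
// 2nd LaserScan, e.g. 25 Hz in -> ~12.5 Hz out. Topic names are placeholders.
#include <cstddef>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/laser_scan.hpp"

class ScanThrottle : public rclcpp::Node
{
public:
  ScanThrottle()
  : Node("scan_throttle")
  {
    pub_ = create_publisher<sensor_msgs::msg::LaserScan>(
      "scan_throttled", rclcpp::SensorDataQoS());
    sub_ = create_subscription<sensor_msgs::msg::LaserScan>(
      "scan", rclcpp::SensorDataQoS(),
      [this](sensor_msgs::msg::LaserScan::SharedPtr msg) {
        if (count_++ % 2 == 0) {  // forward every other scan, drop the rest
          pub_->publish(*msg);
        }
      });
  }

private:
  rclcpp::Publisher<sensor_msgs::msg::LaserScan>::SharedPtr pub_;
  rclcpp::Subscription<sensor_msgs::msg::LaserScan>::SharedPtr sub_;
  std::size_t count_{0};
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<ScanThrottle>());
  rclcpp::shutdown();
  return 0;
}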

maxlein (Contributor, Author) commented Dec 11, 2019

So I have now throttled the scan rate from 25 Hz to 12.5 Hz, and it has been running for about 10 hours.
As I never reached 10 hours before, it looks like this at least reduces the error frequency.

SteveMacenski (Owner) commented:
That makes sense then; it's probably a race condition, like Jacob mentioned.

Something interesting would be to see whether this goes away if you change the QoS to have a depth of only 1, so there aren't multiple messages queued up.
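For reference, a minimal sketch of what that could look like with the subscribe(rclcpp::Node*, topic, rmw_qos_profile_t) overload visible in the stack traces above; the function, variable, and topic names are illustrative, not the toolbox's actual code:

// Sketch only: give the scan message_filters::Subscriber a QoS with depth 1,
// using the subscribe(rclcpp::Node*, topic, rmw_qos_profile_t) overload seen
// in the backtraces above. Names here are illustrative.
#include "message_filters/subscriber.h"
#include "rclcpp/rclcpp.hpp"
#include "rmw/qos_profiles.h"
#include "sensor_msgs/msg/laser_scan.hpp"

void subscribe_scan_depth_one(
  rclcpp::Node * node,
  message_filters::Subscriber<sensor_msgs::msg::LaserScan> & scan_sub)
{
  rmw_qos_profile_t scan_qos = rmw_qos_profile_sensor_data;  // best-effort sensor profile
  scan_qos.depth = 1;  // keep at most one scan queued

  scan_sub.subscribe(node, "scan", scan_qos);
}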

maxlein (Contributor, Author) commented Jan 10, 2020

FYI: it seems to work right now with a message depth of 1.
But I am not fully convinced yet.

SteveMacenski (Owner) commented Jan 10, 2020

Since it's below this package and we really don't NEED more than 1, can you PR the updates? Let's just square this away.

SteveMacenski (Owner) commented:
Any updates?

SteveMacenski (Owner) commented:
This issue isn't technically "fixed", but it's functionally patched and the appropriate ticket is filed with message_filters. Closing now.
