
Exception boost::lock_error thrown from shutdown method #838

Closed
mikepurvis opened this issue Jul 16, 2016 · 8 comments · Fixed by #1656

Comments

@mikepurvis
Member

We see crashes like this as multi-threaded processes like rqt_gui shut down:

../../../../../ros-comm-c9e259436a855db5e5f50f3efe2cec60d1f5786e/clients/roscpp/src/libros/init.cpp in ros::shutdown at line 596

../../../../../ros-comm-c9e259436a855db5e5f50f3efe2cec60d1f5786e/clients/roscpp/src/libros/topic_manager.cpp in ros::TopicManager::shutdown at line 93

/usr/include/boost/thread/lock_types.hpp in unique_lock at line 124

/usr/include/boost/thread/lock_types.hpp in boost::unique_lock<boost::mutex>::lock at line 346

/usr/include/boost/thread/pthread/mutex.hpp in lock at line 116

/usr/include/boost/throw_exception.hpp in boost::throw_exception<boost::lock_error> at line 67

/usr/lib/x86_64-linux-gnu/libstdc++.so.6 in __cxa_throw ()

/usr/lib/x86_64-linux-gnu/libstdc++.so.6 in std::terminate() ()

This is the code:

void TopicManager::shutdown()
{
  boost::mutex::scoped_lock shutdown_lock(shutting_down_mutex_);
  if (shutting_down_)
  {
    return;
  }

  {
    boost::recursive_mutex::scoped_lock lock1(advertised_topics_mutex_);
    boost::mutex::scoped_lock lock2(subs_mutex_);
    shutting_down_ = true;
  }

  ...

It looks like the expected behaviour is to block until the mutex can be acquired, so maybe the issue is one thread tearing down the TopicManager and another thread then finding a mutex which is no longer valid? Maybe there's a better solution here than just wrapping a try block around the whole thing?

@dirk-thomas
Member

Is there a sequence of steps which demonstrates the issue reliably? And which ROS version are you seeing the problem on?

The topic manager is being shut down by this location:

TopicManager::instance()->shutdown();
I don't see any other call sites at first glance, so it would be really interesting to see what the stack looks like when this problem happens.

@mikepurvis
Member Author

mikepurvis commented Aug 6, 2016

Nominally we're on Indigo, but with ros_comm 1.12.2 (kinetic) and an older rqt (from prior to the change that broke rqt_rviz).

Our crash reporter shows these primarily coming from rqt instances, but we also see them from gazebo (running various ROS-connected plugins), and even from rosserial_server (e.g., when embedded in a process with other ROS code). Here is another rqt example:

  File "libc-start.c", line 287, in __libc_start_main
  ?, in Py_Main ()
  ?, in PyRun_SimpleFileExFlags ()
  ?, in PyErr_PrintEx ()
  ?, in ?? ()
  ?, in Py_Exit ()
  File "exit.c", line 104, in __GI_exit
  File "exit.c", line 82, in __run_exit_handlers
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in ros::atexitCallback() ()
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in ros::shutdown() ()
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in ros::TopicManager::shutdown() ()
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in boost::unique_lock<boost::mutex>::lock() ()
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in void boost::throw_exception<boost::lock_error>(boost::lock_error const&) ()
  File "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", in __cxa_throw ()
  File "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", in std::terminate() ()
  File "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", in ?? ()
  File "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", in __gnu_cxx::__verbose_terminate_handler() ()
  File "abort.c", line 89, in __GI_abort

What kind of stack trace would be more helpful to see? I might be able to get a reproduction together for this, but I suspect it would be as simple as instantiating NodeHandle instances on a bunch of different threads and then exiting the process.


EDIT: Okay, it's not quite that easy. I've built the proposed trivial example and it seems to exit gracefully each time, even with many threads. I shall continue to experiment and see if I can get a MWE together.

Looks like this may be related to #318, where a user reports similar issues with lock_error instances thrown during cleanup.

@dirk-thomas
Member

Thanks for looking into it. If we have a case which reproducibly crashes (even if only with a certain probability) I am happy to help.

@scott-eddy

Hi @dirk-thomas and @mikepurvis

I've run into a similar boost::lock_error as reported here and in #318. I encountered this error when moving from Indigo to Kinetic and have done some debugging on it. My minimal reproducible example can be cloned from here.

I have a singleton class that has static storage but stores a reference to a node handle. In that singleton class there is a Subscriber and/or ServiceServer member that, when advertised, will cause the exception to be thrown at exit.

I believe the reason this happens is that the *Manager singletons (I've looked at Topic and Service, but likely all are the same) are held as static boost::shared_ptrs. When ros::start() is called, this pointer is created: the underlying *Manager type is constructed with new and the reference count of the shared_ptr is incremented. When the process terminates, the *Managers call shutdown and the reference count of the shared_ptr drops to 0, causing the destruction of the underlying *Manager and all of its data members.

In my case the user-declared Singleton will then be destructed, as it has static storage, causing the Subscriber::Impl to call TopicManager::instance()->unsubscribe(). Here the boost::shared_ptr returned by TopicManager::instance() is still valid, but the instance it points to has already been deleted. The lookup of the unsubscribe() function succeeds, but when it attempts to lock TopicManager::subs_mutex_, boost tries to lock a pthreads mutex that is no longer initialized, causing the exception to be thrown. Specifically, the pthread lock call returns EINVAL, which happens when "The value specified by mutex does not refer to an initialized mutex object."

I don't know what the best fix for this is. I think the problem could be avoided by having every NodeHandle hold a private boost::shared_ptr to each *Manager. That way, creating a NodeHandle would increase the reference count of the underlying *Managers, and their destructors would not run until all NodeHandles were destroyed. I'd be happy to submit a PR for this fix if it would be well received.

@mikepurvis
Member Author

Wow, thanks for digging into this! Your proposed fix sounds reasonable to me.

@goretkin

goretkin commented Apr 6, 2018

Is there a workaround for this? I am coming across this issue when exiting from a roscpp node.

@cwecht
Contributor

cwecht commented Feb 23, 2019

The solution @scott-eddy proposed works, but it introduces an ABI break. The approach in #1630 avoids this.

@cwecht
Contributor

cwecht commented Feb 23, 2019

There is a similar problem with a Timer in topic_tools/relay. This is the reason why the relay_stealth.test crashes with an exception.
