
Exception boost::lock_error thrown from shutdown method #838

Closed
mikepurvis opened this issue Jul 16, 2016 · 8 comments · Fixed by #1656

Comments

@mikepurvis
Member

We see crashes like this as multi-threaded processes like rqt_gui shut down:

../../../../../ros-comm-c9e259436a855db5e5f50f3efe2cec60d1f5786e/clients/roscpp/src/libros/init.cpp in ros::shutdown at line 596

../../../../../ros-comm-c9e259436a855db5e5f50f3efe2cec60d1f5786e/clients/roscpp/src/libros/topic_manager.cpp in ros::TopicManager::shutdown at line 93

/usr/include/boost/thread/lock_types.hpp in unique_lock at line 124

/usr/include/boost/thread/lock_types.hpp in boost::unique_lock<boost::mutex>::lock at line 346

/usr/include/boost/thread/pthread/mutex.hpp in lock at line 116

/usr/include/boost/throw_exception.hpp in boost::throw_exception<boost::lock_error> at line 67

/usr/lib/x86_64-linux-gnu/libstdc++.so.6 in __cxa_throw ()

/usr/lib/x86_64-linux-gnu/libstdc++.so.6 in std::terminate() ()

This is the code:

void TopicManager::shutdown()
{
  boost::mutex::scoped_lock shutdown_lock(shutting_down_mutex_);
  if (shutting_down_)
  {
    return;
  }

  {
    boost::recursive_mutex::scoped_lock lock1(advertised_topics_mutex_);
    boost::mutex::scoped_lock lock2(subs_mutex_);
    shutting_down_ = true;
  }

  ...

It looks like the expected behaviour is to block until the mutex can be acquired, so maybe the issue is one thread tearing down the TopicManager and another thread then finding a mutex which is no longer valid? Maybe there's a better solution here than just wrapping a try block around the whole thing?

@dirk-thomas
Member

Is there a sequence of steps which demonstrates the issue reliably? And which ROS version are you seeing the problem on?

The topic manager is being shut down by this location:

TopicManager::instance()->shutdown();
I don't see any other call sites at first glance, so it would be really interesting to see what the stack looks like when this problem happens.

@mikepurvis
Member Author

mikepurvis commented Aug 6, 2016

Nominally we're on Indigo, but with ros_comm 1.12.2 (kinetic) and an older rqt (from prior to the change that broke rqt_rviz).

Our crash reporter shows these primarily coming from rqt instances, but we also see them from gazebo (running various ROS-connected plugins), and even from rosserial_server (e.g., when embedded in a process with other ROS code). Here is another rqt example:

  File "libc-start.c", line 287, in __libc_start_main
  ?, in Py_Main ()
  ?, in PyRun_SimpleFileExFlags ()
  ?, in PyErr_PrintEx ()
  ?, in ?? ()
  ?, in Py_Exit ()
  File "exit.c", line 104, in __GI_exit
  File "exit.c", line 82, in __run_exit_handlers
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in ros::atexitCallback() ()
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in ros::shutdown() ()
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in ros::TopicManager::shutdown() ()
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in boost::unique_lock<boost::mutex>::lock() ()
  File "/opt/clearpath/2.1devel/sdk/lib/libroscpp.so", in void boost::throw_exception<boost::lock_error>(boost::lock_error const&) ()
  File "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", in __cxa_throw ()
  File "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", in std::terminate() ()
  File "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", in ?? ()
  File "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", in __gnu_cxx::__verbose_terminate_handler() ()
  File "abort.c", line 89, in __GI_abort

What kind of stack trace would be more helpful to see? I might be able to get a reproduction together for this, but I suspect it would be as simple as instantiating NodeHandle instances on a bunch of different threads and then exiting the process.


EDIT: Okay, it's not quite that easy. I've built the proposed trivial example and it seems to exit gracefully each time, even with many threads. I shall continue to experiment and see if I can get a MWE together.

Looks like this may be related to #318, where a user reports similar issues with lock_error instances thrown during cleanup.

@dirk-thomas
Member

Thanks for looking into it. If we have a case which reproducibly crashes (even if only with a certain probability) I am happy to help.

@scott-eddy

Hi @dirk-thomas and @mikepurvis

I've run into a similar boost::lock_error as reported here and in #318. I encountered this error when moving from Indigo to Kinetic and have done some debugging on it. My minimal reproducible example can be cloned from here.

I have a singleton class that has static storage but stores a reference to a node handle. In that singleton class there is a Subscriber and/or ServiceServer member that, when advertised, will cause the exception to be thrown at exit.

I believe the reason this happens is that the *Manager singletons (I've looked at Topic and Service, but likely all are the same) are held as static boost::shared_ptrs. When ros::start() is called, this pointer is created: the underlying *Manager type is constructed with new and the reference count of the shared_ptr is incremented. When the process terminates, the *Managers call shutdown and the reference count of the shared_ptr drops to 0, causing the destruction of the underlying *Manager and all of its data members.

In my case the user-declared Singleton will then be destructed, as it has static storage, causing the Subscriber::Impl to call TopicManager::instance()->unsubscribe(). Here the boost::shared_ptr returned by TopicManager::instance() is still valid, but the instance it points to has already been deleted. The lookup of the unsubscribe() function succeeds, but when it attempts to lock TopicManager::subs_mutex_, boost tries to lock a pthreads mutex that is no longer initialized, causing the exception to be thrown. Specifically, the pthread lock call returns EINVAL, which happens when "The value specified by mutex does not refer to an initialized mutex object."

I don't know what the best fix for this is. I think the problem could be avoided by having every NodeHandle hold a private boost::shared_ptr to each *Manager. That way, creating a NodeHandle would increase the reference count of the underlying *Managers, and their destructors would not run until all NodeHandles were destroyed. I'd be happy to submit a PR for this fix if it would be well received.

@mikepurvis
Member Author

Wow, thanks for digging into this! Your proposed fix sounds reasonable to me.

@goretkin

goretkin commented Apr 6, 2018

Is there a workaround for this? I am coming across this issue when exiting from a roscpp node.

@cwecht
Contributor

cwecht commented Feb 23, 2019

The solution @scott-eddy proposed works, but it introduces an ABI break. The approach in #1630 avoids this.

@cwecht
Contributor

cwecht commented Feb 23, 2019

There is a similar problem with a Timer in topic_tools/relay. This is the reason why the relay_stealth.test crashes with an exception.
