
fix multi-threaded spinning #867

Merged: dirk-thomas merged 14 commits into ros:kinetic-devel from spinner_monitoring on Sep 17, 2016

Conversation

rhaschke (Contributor)

This is an attempt to fix #277 in a more fundamental fashion.

As @po1 pointed out in #277 (comment), due to the global mutex, only a single thread was allowed to run/start spinners, even if operating on different callback queues.

As @tfoote pointed out in #277 (comment), the mutex was probably introduced to prevent interleaving access to a callback queue thus guaranteeing in-order execution of queued callbacks.

Obviously, this protection should be local per callback queue (instead of using a global mutex) and it should only be considered for SingleThreadedSpinners as multi-threaded spinners deliberately request asynchronous processing.

This PR attempts to solve that issue by replacing the global mutex with a SpinnerMonitor that keeps track of all callback queues that are currently spinning. If a single-threaded spinner wants to spin a queue in parallel to another spinner, an error is issued. IMHO, this error should even be fatal.

Alternatively, the callback queue itself could monitor who is spinning it. However, this would require an API change, which is why I decided in favor of the SpinnerMonitor.
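For illustration, here is a minimal sketch of the SpinnerMonitor idea (simplified names and logic; not the exact code in this PR): per-queue bookkeeping behind a local mutex instead of one global mutex.

```cpp
// Minimal sketch of the SpinnerMonitor idea (simplified; not the merged code).
#include <map>
#include <boost/thread/mutex.hpp>

namespace ros { class CallbackQueue; }

class SpinnerMonitor
{
public:
  // Try to register a spinner for 'queue'. A single-threaded spinner needs
  // exclusive access to guarantee in-order callback execution; any number of
  // multi-threaded spinners may share a queue.
  bool add(ros::CallbackQueue* queue, bool single_threaded)
  {
    boost::mutex::scoped_lock lock(mutex_);
    std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
    if (it == spinning_queues_.end())
    {
      spinning_queues_.insert(std::make_pair(queue, Entry(single_threaded)));
      return true;
    }
    if (single_threaded || it->second.single_threaded)
      return false;            // conflicting spinner already active on this queue
    ++it->second.count;        // another multi-threaded spinner is fine
    return true;
  }

  // remove(queue) (not shown) decrements the count and erases the entry once
  // the last spinner of that queue stops; add()/remove() calls are paired.

private:
  struct Entry
  {
    explicit Entry(bool single_threaded) : single_threaded(single_threaded), count(1) {}
    bool single_threaded;
    unsigned int count;
  };
  boost::mutex mutex_;                                    // local, not global
  std::map<ros::CallbackQueue*, Entry> spinning_queues_;  // currently spinning queues
};
```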

@rhaschke (Contributor Author)

rhaschke commented Aug 15, 2016

PR #377, introducing AsyncSpinner::canStart(), is incomplete: as another spinner might have been started between the two calls to canStart() and start(), start() can still silently fail.

A proper solution would be to change the API and return the success of start(). However, to avoid this API change, I suggest either to ROS_ISSUE_BREAK in response to the fatal error or to start spinning in any case, thus making canStart() obsolete again. The next commit implements the latter.
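A hypothetical user-code sketch of that race:

```cpp
#include <ros/ros.h>

// Hypothetical user code: the gap between canStart() and start() is racy.
void start_spinner()
{
  ros::AsyncSpinner spinner(1);
  if (spinner.canStart())   // time of check ...
  {
    // ... another thread may start a spinner on the same queue right here ...
    spinner.start();        // ... so this time of use can still silently fail
  }
}
```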

@rhaschke force-pushed the spinner_monitoring branch from ff6c27e to bce1f4b on August 15, 2016 10:04
@rhaschke (Contributor Author)

Instead of starting spinning in any case as suggested in bce1f4b, I decided to throw an exception when spinning cannot be safely started. This allows for proper feedback as requested in #277, but generally should abort the program and force the developer to fix their program logic.

Having this in place allows activating the spinner tests as proper unit tests. They were not automatically run before, because (i) there was no introspection into the spinner state available and (ii) individual tests had to be run independently of each other.
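For illustration, a hypothetical user-side sketch of the new failure mode (the exception type matches the diff hunks of this PR; the surrounding code is an assumption):

```cpp
#include <ros/ros.h>
#include <stdexcept>

// Hypothetical user code: a misconfigured spinner setup now fails loudly
// instead of silently running without processing callbacks.
void start_background_spinner()
{
  try
  {
    ros::AsyncSpinner spinner(2);
    spinner.start();        // throws if the queue already has a conflicting spinner
    ros::waitForShutdown();
  }
  catch (const std::runtime_error& e)
  {
    ROS_FATAL("Could not start spinner: %s", e.what());
    // abort / fix the program logic instead of running without callbacks
  }
}
```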

@rhaschke (Contributor Author)

Friendly ping.

*/
bool canStart();
ROS_DEPRECATED bool canStart();
Member

I am not convinced that we should add a deprecation warning to an already released distribution. Maybe the deprecation warning can be added for L-turtle.

Contributor Author

OK. However, if this gets merged, we should directly create an L-turtle branch and apply the deprecation.

Member

Since we will not start building from that branch until early 2017 I don't see a reason to create that branch now. We can create a separate ticket so that the deprecation is not forgotten but done as soon as the branch exists.

@dirk-thomas (Member)

Besides my minor comments this looks good to me. @ros/ros_team Any comments?

@wjwwood (Member)

wjwwood commented Sep 6, 2016

The changes lgtm, +1.

@tfoote (Member)

tfoote commented Sep 7, 2016

This looks like a good approach to me. +1 for deferring the deprecation warning though.

@dirk-thomas (Member)

Thanks. Once the comments have been addressed it can be merged.

Using a single recursive mutex prevents running several spinners in
parallel (started from different threads), even if they operate on
different callback queues.

The SpinnerMonitor keeps a list of spinning callback queues, thus
making the monitoring local to callback queues.
This correctly indicates the fatality of the error but allows for graceful quitting too.
- moved test/test_roscpp/test/test_spinners.cpp -> test/test_roscpp/test/src/spinners.cpp
- created rostest test/test_roscpp/test/launch/spinners.xml
- use thrown exception to evaluate error conditions
@rhaschke (Contributor Author)

rhaschke commented Sep 9, 2016

Rebased to latest kinetic-devel. The buildfarm seems to check for CMake warnings now, which made the old branch fail.

@dirk-thomas (Member)

The CMake warnings will still be there. They require a genmsg release to be fixed (ros/rosdistro#12586). If those are the only warnings they won't prevent this PR from being merged.

@rhaschke (Contributor Author)

rhaschke commented Sep 9, 2016

Good to know. Looks like all other issues are resolved.

@rhaschke (Contributor Author)

@dirk-thomas: Friendly ping.

AsyncSpinner s(thread_count_, queue);
s.start();

ros::waitForShutdown();
s.stop();
Member

This call is redundant and should be removed.

@rhaschke If you could update this that would be great. Otherwise I can apply it during the merge.

Contributor Author

Will do.

@tfoote (Member) left a comment

Overall this looks good, but I'm concerned that this changes the behavior significantly due to possibly throwing and tearing down running systems.

{
boost::mutex::scoped_lock lock(mutex_);
std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
if (it != spinning_queues_.end())
Member

It would be good to have a way to catch the error condition of the else branch of this if. It should at least ROS_ERROR. And this method should probably have a bool return code.

Contributor Author

This is private code, only used in spinner.cpp. As calls to add() and remove() are always paired in the existing code, the condition should always be fulfilled and can thus be replaced by a ROS_ASSERT. I will do so.
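A sketch of the intended change, building on the simplified SpinnerMonitor sketched earlier (ROS_ASSERT_MSG from ros/assert.h; not the final code):

```cpp
#include <ros/assert.h>

// Sketch only: remove() asserts the add()/remove() pairing instead of
// silently ignoring an unknown queue (members as in the earlier sketch).
void SpinnerMonitor::remove(ros::CallbackQueue* queue)
{
  boost::mutex::scoped_lock lock(mutex_);
  std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
  ROS_ASSERT_MSG(it != spinning_queues_.end(),
                 "SpinnerMonitor::remove() called without a matching add()");
  if (--it->second.count == 0)
    spinning_queues_.erase(it);
}
```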

if (it != spinning_queues_.end())
{
if (it->second.tid != boost::thread::id() && it->second.tid != boost::this_thread::get_id())
ROS_ERROR("SpinnerMonitor::remove() called from different thread than add().");
Member

Is this an error or a warning? If it's an error it should return false and not continue.

Contributor Author

OK, I will turn this into a warning.

if (!spinner_monitor.add(callback_queue_, false))
{
ROS_FATAL_STREAM("AsyncSpinnerImpl: " << DEFAULT_ERROR_MESSAGE);
throw std::runtime_error("There is already a single-threaded spinner on this queue");
Member

This is a notable change in behavior: escalating to a runtime_error where the old behavior was to simply not register the callback queue. We could consider making this change in an upcoming release, but we should not make it in an existing distro.

Contributor Author

I did this on purpose: Indeed, the old behavior simply warned on the console, but otherwise silently continued without registering the callback queue. So, if the user/developer doesn't notice the warning in the first place, they might assume everything is fine, but events are never processed on this queue!
The code out there either doesn't encounter this error/exception, because AsyncSpinner was used correctly beforehand, or - if it did encounter the warning - it wasn't processing its events at all, which (hopefully) would have triggered the developer to look for the error too. Hence, such code was probably/hopefully never released.

Member

I agree with the arguments from both of you. But since Kinetic has already been released for a while any behavior change should be avoided (no matter how unlikely it is that code relies on it). Therefore the same conclusion as above:

  • ROS_FATAL_STREAM -> ROS_ERROR_STREAM
  • throw -> return.

Contributor Author

I see. However, this is not as trivial as changing throw -> return.

Previous behavior was to accept multiple spinners on a queue as long as they were started from the same thread. To retain this wrong behavior, I need to remember which thread initially started spinning on a particular queue and then allow further spinners from that thread as well.

Hence, the main improvement remaining from this PR for Kinetic is the ability to have multiple spinners on different queues.

Member

I don't see why we need initial_tid and the condition check based on it. We only need tid to distinguish single-threaded and multi-threaded spinners. The same queue can only be handled by one single-threaded spinner or any number of multi-threaded spinners - no matter what the thread id of the multi-threaded spinner is.

Contributor Author

@dirk-thomas What you describe is the new and intended behavior and I fully agree. However, @tfoote asked to maintain the previous behavior and not throw. Previously, we could have the following situations:

  1. S1 (single-threaded spinner started in thread 1): will block thread 1 until shutdown. Any further spinners in different threads were not allowed (with an error message).
  2. M1 (multi-threaded spinner started in thread 1): Further spinners started from different threads were not allowed (with an error message).
  3. M1 ... M1 S1 (multi-threaded spinners started in thread 1 and afterwards a single-threaded one started): This was accepted without any errors. But the new behavior is to reject S1!

Restrictions of cases 1 + 2 are relaxed with this PR: Other spinners are allowed as long as they operate on a different queue. The thread doesn't matter.

The tricky part is case 3, which - although nonsense - was perfectly valid code before.
In order to maintain the old behavior, I need to remember which thread the first M-spinner was started in (the initial_tid). If I didn't store the initial_tid but allowed S on any thread, this would relax the behavior even beyond what was previously (unintentionally) permitted.
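To make the three cases concrete, here are hypothetical user-code snippets (each an alternative entry point, all using the global callback queue; not from this PR's tests):

```cpp
#include <ros/ros.h>

// Case 1 (S1): a single-threaded spinner blocks the calling thread until shutdown.
void case1() { ros::spin(); }

// Case 2 (M1): a multi-threaded spinner; callbacks are handled by a thread pool,
// but the calling thread is blocked as well.
void case2() { ros::MultiThreadedSpinner spinner(4); spinner.spin(); }

// Case 3 (M1 ... M1 S1): async (multi-threaded) spinners started first,
// then a single-threaded spin on the same (global) queue.
void case3()
{
  ros::AsyncSpinner async(2);  // multi-threaded, non-blocking
  async.start();
  ros::spin();                 // single-threaded spinner joins the same queue
}
```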

@dirk-thomas (Member) Sep 16, 2016

The "collision" in case 3 doesn't matter in my opinion. The fact that from which thread the multi-threaded spinners have been started from shouldn't be considered. The only collision to avoid is which queues they handle and to avoid that a single-threaded spinner handles the same queue as a multi-threaded spinner.

Contributor Author

I fully agree. But it wasn't like this before!
To retain the old behavior as requested by Tully, I need to introduce initial_tid. Actually, case 3 might be a common use case: starting some AsyncSpinners before finally entering the ros::spin() loop.

Member

As long as both spinners don't spin on the same queue I think that is totally fine - independent of which thread the multi-threaded one was started from. That should be achievable by just removing initial_tid and its check again, correct?

Contributor Author

If spinners operate on different queues, there is no conflict at all and spinners will be started without problems (this is the basic improvement we gain with this PR).

However, I thought we were discussing the case where spinners want to operate on the same queue. There might be code out there which hits case 3. Removing initial_tid and the corresponding check, but allowing the S* spinner to be started from any thread after M1 is operating on the queue, would be a weaker check than before: the old code, before this PR, at least rejected spinning if S* and M1 were started from different threads. Of course, we would like to always reject this, but Tully requested not to do so.

if (!spinner_monitor.add(queue, true))
{
ROS_FATAL_STREAM("SingleThreadedSpinner: " << DEFAULT_ERROR_MESSAGE);
throw std::runtime_error("There is already another spinner on this queue");
@tfoote (Member) Sep 15, 2016

See other comment about API stability.

Member

The original behavior was to print an error message and then return gracefully, ignoring the new request. This behavior should be maintained. Therefore the ROS_FATAL_STREAM should be replaced with ROS_ERROR_STREAM (since it's not fatal anymore, but indicates an error in using the API) and the throw should be replaced with a return.
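In other words, a sketch of the suggested shape, mirroring the hunk above (not the final merged code):

```cpp
// Suggested Kinetic-compatible shape (sketch): report the misuse but keep the
// old, non-throwing behavior.
if (!spinner_monitor.add(queue, true))
{
  ROS_ERROR_STREAM("SingleThreadedSpinner: " << DEFAULT_ERROR_MESSAGE);
  return;  // ignore the request instead of tearing down the running system
}
```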

@rhaschke (Contributor Author)

The last commit 47ed5e9 should be reverted on the L-turtle branch. @dirk-thomas If you create an L-turtle branch, I will file a corresponding PR and enable the deprecation warning in the header (which was removed in 835579f).

@dirk-thomas (Member)

dirk-thomas commented Sep 16, 2016

This is getting really complicated 😟 I will try to summarize before / goal / after for easier understanding. Just writing this summary based on reading the code took me a good amount of time. Hopefully it will help others to get an overview in the future:

  1. Without this patch
    1. The queue is not relevant in any decision
    2. The code doesn't consider thread ids at all
    3. Every spinner tries to get a recursive lock on a global mutex
    4. Therefore multiple spinners started from different threads are ignored
    5. Since a multi-threaded spinner blocks the thread it was called from (while doing nothing in it), that thread can't be used to start more spinners
    6. Since single-threaded spinners execute the callbacks in the calling thread, that thread could be used (from within a callback) to start other spinners
    7. The async spinners are not blocking so the same thread can be used to start other spinners
  2. Desired change of behavior
    1. The queue should be considered in the decision if multiple spinners can operate concurrently
    2. If spinners operate on different queues there is no need to restrict them
    3. Multiple multi-threaded spinners can handle the same queue
  3. With this patch
    1. If the queues don't overlap, allow new spinners (bool can_spin = (it == spinning_queues_.end() || ...))
    2. A multi-threaded spinner can be started for the same queue if the already running spinner is also a multi-threaded spinner (bool can_spin = (... || it->second.tid == tid);)
    3. If an additional spinner is added for the same queue and they are of different type (it->second.tid == tid and it->second.initial_tid == tid) they are allowed for backward compatibility and print a message warning the user that events might not be handled in order
    4. If a single-threaded spinner is being started from the same thread as another single-threaded spinner (operating on the same queue) it is being allowed (it->second.tid == tid)

For case 3.iv I see a problem with the bookkeeping. When the second spinner is added it overwrites the existing entry in spinning_queues_ and removes the entry when it finishes. I am not sure if this is a relevant case, but after the second spinner has finished the data structure is inconsistent and doesn't know that the first spinner is still around. Arguably that can only happen when ok() returns false, which might not be a real-world problem.

Please let me know if I got something wrong (or also if you agree with the summary).
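To make the summarized decision concrete, a hedged reconstruction of the add() logic (assuming the Entry(tid, initial_tid) layout from the quoted hunks; the merged code may differ in details, and the per-queue spinner counting discussed above is omitted):

```cpp
// Sketch of SpinnerMonitor::add() as summarized in 3.i-3.iv (not the exact
// merged code). Assumes the members of the SpinnerMonitor sketched earlier,
// plus <boost/thread/thread.hpp> and <ros/console.h>.
// 'tid' is a default-constructed ("not-a-thread") id for multi-threaded
// spinners and the calling thread's id for single-threaded ones;
// 'initial_tid' is the thread that started the first spinner on this queue.
bool SpinnerMonitor::add(ros::CallbackQueue* queue, bool single_threaded)
{
  boost::mutex::scoped_lock lock(mutex_);
  boost::thread::id tid;                       // "not-a-thread" by default
  if (single_threaded)
    tid = boost::this_thread::get_id();

  std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
  bool can_spin = (it == spinning_queues_.end()    // 3.i: queue not spun yet
                   || it->second.tid == tid);      // 3.ii / 3.iv: same kind resp. same thread

  if (!can_spin)
  {
    if (it->second.initial_tid != boost::this_thread::get_id())
      return false;                                // real conflict: refuse to spin
    // 3.iii: tolerated for backward compatibility only, so warn loudly
    ROS_ERROR("SpinnerMonitor: single-threaded spinner added to a queue already "
              "spun by a multi-threaded spinner; callback ordering is not guaranteed.");
  }

  if (it == spinning_queues_.end())
    spinning_queues_.insert(std::make_pair(queue, Entry(tid, boost::this_thread::get_id())));
  return true;                                     // spinner counting omitted for brevity
}
```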

@tfoote (Member) left a comment

That looks like a good summary @dirk-thomas I've been through it too and believe that it's as you've summarized. I had two usability comments on error messages, but otherwise it looks good to me.

// single-threaded spinner after several multi-threaded ones, given that they
// were started from the same initial thread
if (it->second.initial_tid == tid)
ROS_ERROR_STREAM("SpinnerMonitor: " << DEFAULT_ERROR_MESSAGE);
Member

This error message should be different to reflect the backwards compatibility, in case anyone reads it in depth.

Contributor Author

Done.

Entry(const boost::thread::id &tid,
const boost::thread::id &initial_tid) : tid(tid), initial_tid(initial_tid), num_multi_threaded(0) {}

boost::thread::id tid; // thread id of single-threaded spinner
Member

It would be great to have a comment here that it will be the default value, which represents 'Not-a-Thread', if multi-threaded.

Contributor Author

I mentioned it would be NULL in the reworked commit.

spinning_queues_.erase(it);
else
{
ROS_ASSERT(it->second.num_multi_threaded > 0);
Member

It would be great to have a message for this assert like: "Call to SpinnerMonitor::remove() for a multi-threaded spinner cannot be achieved since reference count is not greater than 0."

Contributor Author

I opted for a shorter one: "SpinnerMonitor::remove(): Invalid spinner count (0) encountered."

@tfoote (Member)

tfoote commented Sep 17, 2016

With respect to the double entry for the single-threaded spinners: it will likely lead to some possible errors on teardown. It will likely hit this assert, and since we know this, we could soften it so that we can catch it appropriately.

@rhaschke (Contributor Author)

I fully agree with 1. and 2., with a minor remark on 1.ii: It's correct that the previous code didn't explicitly deal with thread ids. However, this was done implicitly by the recursive mutex, which remembers the thread holding the lock.

In principle I also agree with 3. However, I have some remarks:

  • 3.ii: it->second.tid and tid can be either 0 (indicating a multi-threaded spinner) or a proper thread id (indicating a single-threaded spinner). Hence, the condition bool can_spin = (... || it->second.tid == tid); evaluates to true when both the existing spinners and the new one are multi-threaded (both tids == 0) or if the new (single-threaded) spinner originates from the same thread as the existing one.
  • 3.iii: The given condition is wrong (the description is correct):
    !can_spin == (it != spinning_queues_.end() && it->second.tid != tid) (same queue, but different type). Thus, the backwards-compatibility condition is (it->second.tid != tid and it->second.initial_tid == tid).
  • 3.iv: To be honest, I wasn't (actively) aware of this situation anymore. However, I remember that I thought about it. Currently, the SpinnerMonitor is tailored towards the existing SingleThreadedSpinner, which blocks until ROS is shut down (while (ros::ok()) {...}). Hence, even if we have nested single-threaded spinners, only the inner-most one would remain active. And if it finishes, it finishes because ROS was shut down, i.e. the outer spinner won't spin anything anymore. Thus, currently 3.iv is somewhat artificial and wouldn't do harm, as you noticed as well.
    However, thinking about future single-threaded spinners that might be able to finish without a ROS shutdown (a nested scenario is sketched below), your argument is perfectly valid. I adapted the code to do "spinner counting" for both single- and multi-threaded spinners. This also simplifies the code structure.
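As an aside, a hypothetical user-code sketch of the nested single-threaded spinner scenario from 3.iv (not from this PR's tests; topic and node names are made up):

```cpp
#include <ros/ros.h>
#include <std_msgs/Empty.h>

// Hypothetical: a single-threaded spinner started from within a callback of
// another single-threaded spinner (same thread, same queue). With per-queue
// spinner counting, both add()/remove() pairs stay balanced in this case.
void onTrigger(const std_msgs::Empty::ConstPtr&)
{
  ros::spin();   // inner spinner; only returns once ROS shuts down
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "nested_spinner_example");
  ros::NodeHandle nh;
  ros::Subscriber sub = nh.subscribe("trigger", 1, onTrigger);
  ros::spin();   // outer spinner
  return 0;
}
```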

Allow multiple single-threaded spinners (in same thread)
and count their number.
Thus single-threaded and multi-threaded spinners are handled similarly.
Previously, we could have the following situations:

1. `S1` (single-threaded spinner started in `thread 1`): will block `thread 1` until shutdown. Any further spinners in different threads were not allowed (with an error message).
2. `M1` (multi-threaded spinner started in `thread 1`): Further spinners started from _different_ threads were not allowed (with an error message).
3. `M1 ... M1 S1` (multi-threaded spinners started in `thread 1` and afterwards a single-threaded one started): This was accepted without any errors. But the new behavior is to reject `S1`!

Restrictions of cases 1 + 2 are relaxed with this PR: Other spinners are allowed as long as they operate on a different queue. The thread doesn't matter.

The tricky part is case 3, which - although nonsense - was perfectly valid code before.
In order to maintain the old behavior, I need to remember which thread the first M-spinner was started in, using the new variable `initial_tid`.

* allow spinning of a single-threaded spinner after some multi-threaded ones, as long as they are started from the same thread
* don't throw exceptions
* disabled corresponding unittests
@dirk-thomas (Member)

Thank you for updating the logic. It is definitely easier to follow the flow now. I think this is ready to be merged. Thanks for iterating on this!

@dirk-thomas dirk-thomas merged commit cd255f8 into ros:kinetic-devel Sep 17, 2016
@rhaschke (Contributor Author)

Cool. Thanks for your patience. Could you create an L-turtle branch to revert the backwards-compatibility commit there? I will file a corresponding PR then. Or should I simply file an issue as a reminder to revert 91be0e5?

@rhaschke rhaschke deleted the spinner_monitoring branch September 18, 2016 01:15
@dirk-thomas (Member)

I don't plan to create a branch for the next ROS distro until we start building Debian packages for it on the build farm, simply because it implies additional overhead. Please go ahead and create an issue so we don't forget about it. I will also search through the code and look for TODO comments or comments mentioning l-turtle after branching.

Jntzko pushed a commit to Jntzko/moveit that referenced this pull request Oct 7, 2016
removed PSM::syncSceneUpdates() (and PSM::spinner_, PSM::callback_queue_)

Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads.
Filed PR: ros/ros_comm#867

The spinner is now only needed to serve our own callback_queue_ for
scene updates, which is only required for syncSceneUpdates() that
syncs all kind of scene updates, not only the robot state.
davetcoleman pushed a commit to moveit/moveit that referenced this pull request Jan 5, 2017
* PSM::waitForCurrentRobotState() + PSM::syncSceneUpdates()

* renamed wall_last_state_update_ to last_robot_state_update_wall_time_

* removed PSM::syncSceneUpdates() (and PSM::spinner_, PSM::callback_queue_)

Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads.
Filed PR: ros/ros_comm#867

The spinner is now only needed to serve our own callback_queue_ for
scene updates, which is only required for syncSceneUpdates() that
syncs all kind of scene updates, not only the robot state.

* rviz: execute state update in background

... because we might wait up to 1s for a robot state update

* add robot_state update test

* waitForRobotToStop()

* Revert "wait a second before updating "current" in RViz (#291)"

This reverts commit e3ef9a6.

* addressed Dave's comments
k-okada pushed a commit to k-okada/moveit that referenced this pull request Dec 12, 2017
* PSM::waitForCurrentRobotState() + PSM::syncSceneUpdates()

* renamed wall_last_state_update_ to last_robot_state_update_wall_time_

* removed PSM::syncSceneUpdates() (and PSM::spinner_, PSM::callback_queue_)

Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads.
Filed PR: ros/ros_comm#867

The spinner is now only needed to serve our own callback_queue_ for
scene updates, which is only required for syncSceneUpdates() that
syncs all kind of scene updates, not only the robot state.

* rviz: execute state update in background

... because we might wait up to 1s for a robot state update

* add robot_state update test

* waitForRobotToStop()

* Revert "wait a second before updating "current" in RViz (moveit#291)"

This reverts commit e3ef9a6.

* addressed Dave's comments

Conflicts:
	moveit_ros/planning/planning_scene_monitor/include/moveit/planning_scene_monitor/current_state_monitor.h
	moveit_ros/planning/planning_scene_monitor/src/planning_scene_monitor.cpp
	moveit_ros/planning_interface/move_group_interface/src/move_group.cpp
	moveit_ros/visualization/motion_planning_rviz_plugin/src/motion_planning_frame_planning.cpp