
fix multi-threaded spinning #867

Merged: dirk-thomas merged 14 commits into ros:kinetic-devel from spinner_monitoring on Sep 17, 2016

Conversation

rhaschke (Contributor)

This is an attempt to fix #277 in a more fundamental fashion.

As @po1 pointed out in #277 (comment), due to the global mutex, only a single thread was allowed to run/start spinners, even if operating on different callback queues.

As @tfoote pointed out in #277 (comment), the mutex was probably introduced to prevent interleaving access to a callback queue thus guaranteeing in-order execution of queued callbacks.

Obviously, this protection should be local per callback queue (instead of using a global mutex) and it should only be considered for SingleThreadedSpinners as multi-threaded spinners deliberately request asynchronous processing.

This PR attempts to solve that issue by replacing the global mutex with a SpinnerMonitor that keeps track of all callback queues that are currently spinning. If a single-threaded spinner wants to spin a queue in parallel to another spinner, an error is issued. IMHO, this error should even be fatal.

Alternatively, the callback queue itself could monitor who is spinning it. However, this would require an API change, which is why I decided in favor of the SpinnerMonitor.
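For illustration, here is a minimal sketch of the SpinnerMonitor idea (simplified names and logic; not the exact code in this PR): per-queue bookkeeping behind a local mutex instead of one global mutex.

```cpp
// Minimal sketch of the SpinnerMonitor idea (simplified; not the merged code).
#include <map>
#include <boost/thread/mutex.hpp>

namespace ros { class CallbackQueue; }

class SpinnerMonitor
{
public:
  // Try to register a spinner for 'queue'. A single-threaded spinner needs
  // exclusive access to guarantee in-order callback execution; any number of
  // multi-threaded spinners may share a queue.
  bool add(ros::CallbackQueue* queue, bool single_threaded)
  {
    boost::mutex::scoped_lock lock(mutex_);
    std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
    if (it == spinning_queues_.end())
    {
      spinning_queues_.insert(std::make_pair(queue, Entry(single_threaded)));
      return true;
    }
    if (single_threaded || it->second.single_threaded)
      return false;            // conflicting spinner already active on this queue
    ++it->second.count;        // another multi-threaded spinner is fine
    return true;
  }

  // remove(queue) (not shown) decrements the count and erases the entry once
  // the last spinner of that queue stops; add()/remove() calls are paired.

private:
  struct Entry
  {
    explicit Entry(bool single_threaded) : single_threaded(single_threaded), count(1) {}
    bool single_threaded;
    unsigned int count;
  };
  boost::mutex mutex_;                                    // local, not global
  std::map<ros::CallbackQueue*, Entry> spinning_queues_;  // currently spinning queues
};
```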

@rhaschke (Contributor Author)

rhaschke commented Aug 15, 2016

PR #377, introducing AsyncSpinner::canStart(), is incomplete: as another spinner might have been started between the two calls to canStart() and start(), start() can still silently fail.

A proper solution would be to change the API and return the success of start(). However, to avoid this API change, I suggest either to ROS_ISSUE_BREAK in response to the fatal error or to start spinning in any case, thus making canStart() obsolete again. The next commit implements the latter.
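A hypothetical user-code sketch of that race:

```cpp
#include <ros/ros.h>

// Hypothetical user code: the gap between canStart() and start() is racy.
void start_spinner()
{
  ros::AsyncSpinner spinner(1);
  if (spinner.canStart())   // time of check ...
  {
    // ... another thread may start a spinner on the same queue right here ...
    spinner.start();        // ... so this time of use can still silently fail
  }
}
```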

@rhaschke force-pushed the spinner_monitoring branch from ff6c27e to bce1f4b on August 15, 2016 10:04
@rhaschke (Contributor Author)

Instead of starting spinning in any case as suggested in bce1f4b, I decided to throw an exception when spinning cannot be safely started. This allows for proper feedback as requested in #277, but generally should abort the program and force the developer to fix their program logic.

Having this in place allows activating the spinner tests as proper unit tests. They were not automatically run before, because (i) there was no introspection into the spinner state available and (ii) individual tests had to be run independently of each other.
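For illustration, a hypothetical user-side sketch of the new failure mode (the exception type matches the diff hunks of this PR; the surrounding code is an assumption):

```cpp
#include <ros/ros.h>
#include <stdexcept>

// Hypothetical user code: a misconfigured spinner setup now fails loudly
// instead of silently running without processing callbacks.
void start_background_spinner()
{
  try
  {
    ros::AsyncSpinner spinner(2);
    spinner.start();        // throws if the queue already has a conflicting spinner
    ros::waitForShutdown();
  }
  catch (const std::runtime_error& e)
  {
    ROS_FATAL("Could not start spinner: %s", e.what());
    // abort / fix the program logic instead of running without callbacks
  }
}
```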

@rhaschke (Contributor Author)

Friendly ping.

*/
bool canStart();
ROS_DEPRECATED bool canStart();
Member

I am not convinced that we should add a deprecation warning to an already released distribution. Maybe the deprecation warning can be added for L-turtle.

Contributor Author

OK. However, if this gets merged, we should directly create an L-turtle branch and apply the deprecation.

Member

Since we will not start building from that branch until early 2017 I don't see a reason to create that branch now. We can create a separate ticket so that the deprecation is not forgotten but done as soon as the branch exists.

@dirk-thomas (Member)

Besides my minor comments this looks good to me. @ros/ros_team Any comments?

@wjwwood (Member)

wjwwood commented Sep 6, 2016

The changes lgtm, +1.

@tfoote (Member)

tfoote commented Sep 7, 2016

This looks like a good approach to me. +1 for deferring the deprecation warning though.

@dirk-thomas (Member)

Thanks. Once the comments have been addressed it can be merged.

Using a single recursive mutex prevents running several spinners in
parallel (started from different threads), even if they operate on
different callback queues.

The SpinnerMonitor keeps a list of spinning callback queues, thus
making the monitoring local to callback queues.
This correctly indicates the fatality of the error but allows for graceful quitting too.
- moved test/test_roscpp/test/test_spinners.cpp -> test/test_roscpp/test/src/spinners.cpp
- created rostest test/test_roscpp/test/launch/spinners.xml
- use thrown exception to evaluate error conditions
@rhaschke (Contributor Author)

rhaschke commented Sep 9, 2016

Rebased to latest kinetic-devel. The buildfarm seems to check for CMake warnings now, which made the old branch fail.

@dirk-thomas (Member)

The CMake warnings will still be there. They require a genmsg release to be fixed (ros/rosdistro#12586). If those are the only warnings they won't prevent this PR from being merged.

@rhaschke (Contributor Author)

rhaschke commented Sep 9, 2016

Good to know. Looks like all other issues are resolved.

@rhaschke (Contributor Author)

@dirk-thomas: Friendly ping.

AsyncSpinner s(thread_count_, queue);
s.start();

ros::waitForShutdown();
s.stop();
Member

This call is redundant and should be removed.

@rhaschke If you could update this that would be great. Otherwise I can apply it during the merge.

Contributor Author

Will do.

@tfoote (Member) left a comment

Overall this looks good, but I'm concerned that this changes the behavior significantly due to possibly throwing and tearing down running systems.

{
boost::mutex::scoped_lock lock(mutex_);
std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
if (it != spinning_queues_.end())
Member

It would be good to have a way to catch the error condition of the else branch of this if. It should at least ROS_ERROR. And this method should probably have a bool return code.

Contributor Author

This is private code, only used in spinner.cpp. As calls to add() and remove() are always paired in the existing code, the condition should always be fulfilled and can thus be replaced by a ROS_ASSERT. I will do so.
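A sketch of the intended change, building on the simplified SpinnerMonitor sketched earlier (ROS_ASSERT_MSG from ros/assert.h; not the final code):

```cpp
#include <ros/assert.h>

// Sketch only: remove() asserts the add()/remove() pairing instead of
// silently ignoring an unknown queue (members as in the earlier sketch).
void SpinnerMonitor::remove(ros::CallbackQueue* queue)
{
  boost::mutex::scoped_lock lock(mutex_);
  std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
  ROS_ASSERT_MSG(it != spinning_queues_.end(),
                 "SpinnerMonitor::remove() called without a matching add()");
  if (--it->second.count == 0)
    spinning_queues_.erase(it);
}
```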

if (it != spinning_queues_.end())
{
if (it->second.tid != boost::thread::id() && it->second.tid != boost::this_thread::get_id())
ROS_ERROR("SpinnerMonitor::remove() called from different thread than add().");
Member

Is this an error or a warning? If it's an error it should return false and not continue.

Contributor Author

OK, I will turn this into a warning.

if (!spinner_monitor.add(callback_queue_, false))
{
ROS_FATAL_STREAM("AsyncSpinnerImpl: " << DEFAULT_ERROR_MESSAGE);
throw std::runtime_error("There is already a single-threaded spinner on this queue");
Member

This is a notable change in behavior: escalating to a runtime_error where the old behavior was to simply not register the callback queue. We could consider making this change in an upcoming release, but we should not make it in an existing distro.

Contributor Author

I did this on purpose: Indeed, the old behavior simply warned on the console, but otherwise silently continued without registering the callback queue. So, if the user/developer doesn't notice the warning in the first place, they might assume everything is fine, but events are never processed on this queue!
The code out there either doesn't encounter this error/exception, because AsyncSpinner was used correctly beforehand, or - if it did encounter the warning - it wasn't processing its events at all, which (hopefully) would have triggered the developer to look for the error too. Hence, such code was probably/hopefully never released.

Member

I agree with the arguments from both of you. But since Kinetic has already been released for a while any behavior change should be avoided (no matter how unlikely it is that code relies on it). Therefore the same conclusion as above:

  • ROS_FATAL_STREAM -> ROS_ERROR_STREAM
  • throw -> return.

Contributor Author

I see. However, this is not as trivial as changing throw -> return.

Previous behavior was to accept multiple spinners on a queue as long as they were started from the same thread. To retain this wrong behavior, I need to remember which thread initially started spinning on a particular queue and then allow further spinners from that thread as well.

Hence, the main improvement remaining from this PR for Kinetic is the ability to have multiple spinners on different queues.

Member

I don't see why we need initial_tid and the condition check based on it. We only need tid to distinguish single-threaded and multi-threaded spinners. The same queue can only be handled by one single-threaded spinner or any number of multi-threaded spinners - no matter what the thread id of the multi-threaded spinner is.

Contributor Author

@dirk-thomas What you describe is the new and intended behavior and I fully agree. However, @tfoote asked to maintain the previous behavior and not throw. Previously, we could have the following situations:

  1. S1 (single-threaded spinner started in thread 1): will block thread 1 until shutdown. Any further spinners in different threads were not allowed (with an error message).
  2. M1 (multi-threaded spinner started in thread 1): Further spinners started from different threads were not allowed (with an error message).
  3. M1 ... M1 S1 (multi-threaded spinners started in thread 1 and afterwards a single-threaded one started): This was accepted without any errors. But the new behavior is to reject S1!

Restrictions of cases 1 + 2 are relaxed with this PR: Other spinners are allowed as long as they operate on a different queue. The thread doesn't matter.

The tricky part is case 3, which - although nonsense - was perfectly valid code before.
In order to maintain the old behavior, I need to remember which thread the first M-spinner was started in (the initial_tid). If I didn't store the initial_tid but allowed S on any thread, this would relax the behavior even beyond what was previously (unintentionally) permitted.
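To make the three cases concrete, here are hypothetical user-code snippets (each an alternative entry point, all using the global callback queue; not from this PR's tests):

```cpp
#include <ros/ros.h>

// Case 1 (S1): a single-threaded spinner blocks the calling thread until shutdown.
void case1() { ros::spin(); }

// Case 2 (M1): a multi-threaded spinner; callbacks are handled by a thread pool,
// but the calling thread is blocked as well.
void case2() { ros::MultiThreadedSpinner spinner(4); spinner.spin(); }

// Case 3 (M1 ... M1 S1): async (multi-threaded) spinners started first,
// then a single-threaded spin on the same (global) queue.
void case3()
{
  ros::AsyncSpinner async(2);  // multi-threaded, non-blocking
  async.start();
  ros::spin();                 // single-threaded spinner joins the same queue
}
```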

@dirk-thomas (Member) Sep 16, 2016

The "collision" in case 3 doesn't matter in my opinion. The fact that from which thread the multi-threaded spinners have been started from shouldn't be considered. The only collision to avoid is which queues they handle and to avoid that a single-threaded spinner handles the same queue as a multi-threaded spinner.

Contributor Author

I fully agree. But it wasn't like this before!
To retain the old behavior as requested by Tully, I need to introduce initial_tid. Actually, case 3 might be a common use case: starting some AsyncSpinners before finally entering the ros::spin() loop.

Member

As long as both spinners don't spin on the same queue I think that is totally fine - independent of which thread the multi-threaded one was started from. That should be achievable by just removing initial_tid and its check again, correct?

Contributor Author

If spinners operate on different queues, there is no conflict at all and spinners will be started without problems (this is the basic improvement we gain with this PR).

However, I thought we were discussing the case where spinners want to operate on the same queue. There might be code out there which hits case 3. Removing initial_tid and the corresponding check, but allowing the S* spinner to be started from any thread after M1 is operating on the queue, would be a weaker check than before: the old code, before this PR, at least rejected spinning if S* and M1 were started from different threads. Of course, we would like to always reject this, but Tully requested not to do so.

if (!spinner_monitor.add(queue, true))
{
ROS_FATAL_STREAM("SingleThreadedSpinner: " << DEFAULT_ERROR_MESSAGE);
throw std::runtime_error("There is already another spinner on this queue");
@tfoote (Member) Sep 15, 2016

See other comment about API stability.

Member

The original behavior was to print an error message and then return gracefully, ignoring the new request. This behavior should be maintained. Therefore the ROS_FATAL_STREAM should be replaced with ROS_ERROR_STREAM (since it's not fatal anymore, but indicates an error in using the API) and the throw should be replaced with a return.
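In other words, a sketch of the suggested shape, mirroring the hunk above (not the final merged code):

```cpp
// Suggested Kinetic-compatible shape (sketch): report the misuse but keep the
// old, non-throwing behavior.
if (!spinner_monitor.add(queue, true))
{
  ROS_ERROR_STREAM("SingleThreadedSpinner: " << DEFAULT_ERROR_MESSAGE);
  return;  // ignore the request instead of tearing down the running system
}
```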

@rhaschke (Contributor Author)

The last commit 47ed5e9 should be reverted on the L-turtle branch. @dirk-thomas If you create an L-turtle branch, I will file a corresponding PR and enable the deprecation warning in the header (which was removed in 835579f).

@dirk-thomas (Member)

dirk-thomas commented Sep 16, 2016

This is getting really complicated 😟 I will try to summarize before / goal / after for easier understanding. Just writing this summary based on reading the code took me a good amount of time. Hopefully it will help others to get an overview in the future:

  1. Without this patch
    1. The queue is not relevant in any decision
    2. The code doesn't consider thread ids at all
    3. Every spinner tries to get a recursive lock on a global mutex
    4. Therefore multiple spinners started from different threads are ignored
    5. Since a multi-threaded spinner blocks the thread it was called from (while doing nothing in it), that thread can't be used to start more spinners
    6. Since single-threaded spinners execute the callbacks in the calling thread, that thread could be used (from within a callback) to start other spinners
    7. The async spinners are not blocking so the same thread can be used to start other spinners
  2. Desired change of behavior
    1. The queue should be considered in the decision if multiple spinners can operate concurrently
    2. If spinners operate on different queues there is no need to restrict them
    3. Multiple multi-threaded spinners can handle the same queue
  3. With this patch
    1. If the queues don't overlap, allow new spinners (bool can_spin = (it == spinning_queues_.end() || ...))
    2. A multi-threaded spinner can be started for the same queue if the already running spinner is also a multi-threaded spinner (bool can_spin = (... || it->second.tid == tid);)
    3. If an additional spinner is added for the same queue and they are of different type (it->second.tid == tid and it->second.initial_tid == tid) they are allowed for backward compatibility and print a message warning the user that events might not be handled in order
    4. If a single-threaded spinner is being started from the same thread as another single-threaded spinner (operating on the same queue) it is being allowed (it->second.tid == tid)

For case 3.iv I see a problem with the bookkeeping. When the second spinner is added it overwrites the existing entry in spinning_queues_ and removes the entry when it finishes. I am not sure if this is a relevant case, but after the second spinner has finished the data structure is inconsistent and doesn't know that the first spinner is still around. Arguably that can only happen when ok() returns false, which might not be a real-world problem.

Please let me know if I got something wrong (or also if you agree with the summary).
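To make the summarized decision concrete, a hedged reconstruction of the add() logic (assuming the Entry(tid, initial_tid) layout from the quoted hunks; the merged code may differ in details, and the per-queue spinner counting discussed above is omitted):

```cpp
// Sketch of SpinnerMonitor::add() as summarized in 3.i-3.iv (not the exact
// merged code). Assumes the members of the SpinnerMonitor sketched earlier,
// plus <boost/thread/thread.hpp> and <ros/console.h>.
// 'tid' is a default-constructed ("not-a-thread") id for multi-threaded
// spinners and the calling thread's id for single-threaded ones;
// 'initial_tid' is the thread that started the first spinner on this queue.
bool SpinnerMonitor::add(ros::CallbackQueue* queue, bool single_threaded)
{
  boost::mutex::scoped_lock lock(mutex_);
  boost::thread::id tid;                       // "not-a-thread" by default
  if (single_threaded)
    tid = boost::this_thread::get_id();

  std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
  bool can_spin = (it == spinning_queues_.end()    // 3.i: queue not spun yet
                   || it->second.tid == tid);      // 3.ii / 3.iv: same kind resp. same thread

  if (!can_spin)
  {
    if (it->second.initial_tid != boost::this_thread::get_id())
      return false;                                // real conflict: refuse to spin
    // 3.iii: tolerated for backward compatibility only, so warn loudly
    ROS_ERROR("SpinnerMonitor: single-threaded spinner added to a queue already "
              "spun by a multi-threaded spinner; callback ordering is not guaranteed.");
  }

  if (it == spinning_queues_.end())
    spinning_queues_.insert(std::make_pair(queue, Entry(tid, boost::this_thread::get_id())));
  return true;                                     // spinner counting omitted for brevity
}
```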

@tfoote (Member) left a comment

That looks like a good summary @dirk-thomas I've been through it too and believe that it's as you've summarized. I had two usability comments on error messages, but otherwise it looks good to me.

// single-threaded spinner after several multi-threaded ones, given that they
// were started from the same initial thread
if (it->second.initial_tid == tid)
ROS_ERROR_STREAM("SpinnerMonitor: " << DEFAULT_ERROR_MESSAGE);
Member

This error message should be different to reflect the backwards compatibility, in case anyone reads it in depth.

Contributor Author

Done.

Entry(const boost::thread::id &tid,
const boost::thread::id &initial_tid) : tid(tid), initial_tid(initial_tid), num_multi_threaded(0) {}

boost::thread::id tid; // thread id of single-threaded spinner
Member

It would be great to have a comment here that it will be the default value, which represents 'Not-a-Thread', if multi-threaded.

Contributor Author

I mentioned it would be NULL in the reworked commit.

spinning_queues_.erase(it);
else
{
ROS_ASSERT(it->second.num_multi_threaded > 0);
Member

It would be great to have a message for this assert like: "Call to SpinnerMonitor::remove() for a multi-threaded spinner cannot be achieved since reference count is not greater than 0."

Contributor Author

I opted for a shorter one: "SpinnerMonitor::remove(): Invalid spinner count (0) encountered."

@tfoote (Member)

tfoote commented Sep 17, 2016

With respect to the double entry for the single-threaded spinners: it will likely lead to some possible errors on teardown. It will likely hit this assert, and since we know this, we could soften it so that we can catch it appropriately.

@rhaschke (Contributor Author)

I fully agree with 1. and 2., with a minor remark on 1.ii: It's correct that the previous code didn't explicitly deal with thread ids. However, this was done implicitly by the recursive mutex, which remembers the thread holding the lock.

In principle I also agree with 3. However, I have some remarks:

  • 3.ii: it->second.tid and tid can be either 0 (indicating a multi-threaded spinner) or a proper thread id (indicating a single-threaded spinner). Hence, the condition bool can_spin = (... || it->second.tid == tid); evaluates to true when both the existing spinners and the new one are multi-threaded (both tids == 0) or if the new (single-threaded) spinner originates from the same thread as the existing one.
  • 3.iii: The given condition is wrong (the description is correct):
    !can_spin == (it != spinning_queues_.end() && it->second.tid != tid) (same queue, but different type). Thus, the backwards-compatibility condition is (it->second.tid != tid and it->second.initial_tid == tid).
  • 3.iv: To be honest, I wasn't (actively) aware of this situation anymore. However, I remember that I thought about it. Currently, the SpinnerMonitor is tailored towards the existing SingleThreadedSpinner, which blocks until ROS is shut down (while (ros::ok()) {...}). Hence, even if we have nested single-threaded spinners, only the inner-most one would remain active. And if it finishes, it finishes because ROS was shut down, i.e. the outer spinner won't spin anything anymore. Thus, currently 3.iv is somewhat artificial and wouldn't do harm, as you noticed as well.
    However, thinking about future single-threaded spinners that might be able to finish without a ROS shutdown (a nested scenario is sketched below), your argument is perfectly valid. I adapted the code to do "spinner counting" for both single- and multi-threaded spinners. This also simplifies the code structure.
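As an aside, a hypothetical user-code sketch of the nested single-threaded spinner scenario from 3.iv (not from this PR's tests; topic and node names are made up):

```cpp
#include <ros/ros.h>
#include <std_msgs/Empty.h>

// Hypothetical: a single-threaded spinner started from within a callback of
// another single-threaded spinner (same thread, same queue). With per-queue
// spinner counting, both add()/remove() pairs stay balanced in this case.
void onTrigger(const std_msgs::Empty::ConstPtr&)
{
  ros::spin();   // inner spinner; only returns once ROS shuts down
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "nested_spinner_example");
  ros::NodeHandle nh;
  ros::Subscriber sub = nh.subscribe("trigger", 1, onTrigger);
  ros::spin();   // outer spinner
  return 0;
}
```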

Allow multiple single-threaded spinners (in same thread)
and count their number.
Thus single-threaded and multi-threaded spinners are handled similarly.
Previously, we could have the following situations:

1. `S1` (single-threaded spinner started in `thread 1`): will block `thread 1` until shutdown. Any further spinners in different threads were not allowed (with an error message).
2. `M1` (multi-threaded spinner started in `thread 1`): Further spinners started from _different_ threads were not allowed (with an error message).
3. `M1 ... M1 S1` (multi-threaded spinners started in `thread 1` and afterwards a single-threaded one started): This was accepted without any errors. But the new behavior is to reject `S1`!

Restrictions of cases 1 + 2 are relaxed with this PR: Other spinners are allowed as long as they operate on a different queue. The thread doesn't matter.

The tricky part is case 3, which - although nonsense - was perfectly valid code before.
In order to maintain the old behavior, I need to remember which thread the first M-spinner was started in, using the new variable `initial_tid`.

* allow spinning of a single-threaded spinner after some multi-threaded ones, as long as they are started from the same thread
* don't throw exceptions
* disabled corresponding unittests
@dirk-thomas (Member)

Thank you for updating the logic. It is definitely easier to follow the flow now. I think this is ready to be merged. Thanks for iterating on this!

@dirk-thomas dirk-thomas merged commit cd255f8 into ros:kinetic-devel Sep 17, 2016
@rhaschke (Contributor Author)

Cool. Thanks for your patience. Could you create an L-turtle branch to revert the backwards-compatibility commit there? I will file a corresponding PR then. Or should I simply file an issue as a reminder to revert 91be0e5?

@rhaschke rhaschke deleted the spinner_monitoring branch September 18, 2016 01:15
@dirk-thomas (Member)

I don't plan to create a branch for the next ROS distro until we start building Debian packages for it on the build farm, simply because it implies additional overhead. Please go ahead and create an issue so we don't forget about it. I will also search through the code and look for TODO comments or comments mentioning l-turtle after branching.

Jntzko pushed a commit to Jntzko/moveit that referenced this pull request Oct 7, 2016
removed PSM::syncSceneUpdates() (and PSM::spinner_, PSM::callback_queue_)

Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads.
Filed PR: ros/ros_comm#867

The spinner is now only needed to serve our own callback_queue_ for
scene updates, which is only required for syncSceneUpdates() that
syncs all kind of scene updates, not only the robot state.
davetcoleman pushed a commit to moveit/moveit that referenced this pull request Jan 5, 2017
* PSM::waitForCurrentRobotState() + PSM::syncSceneUpdates()

* renamed wall_last_state_update_ to last_robot_state_update_wall_time_

* removed PSM::syncSceneUpdates() (and PSM::spinner_, PSM::callback_queue_)

Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads.
Filed PR: ros/ros_comm#867

The spinner is now only needed to serve our own callback_queue_ for
scene updates, which is only required for syncSceneUpdates() that
syncs all kind of scene updates, not only the robot state.

* rviz: execute state update in background

... because we might wait up to 1s for a robot state update

* add robot_state update test

* waitForRobotToStop()

* Revert "wait a second before updating "current" in RViz (#291)"

This reverts commit e3ef9a6.

* addressed Dave's comments
k-okada pushed a commit to k-okada/moveit that referenced this pull request Dec 12, 2017
* PSM::waitForCurrentRobotState() + PSM::syncSceneUpdates()

* renamed wall_last_state_update_ to last_robot_state_update_wall_time_

* removed PSM::syncSceneUpdates() (and PSM::spinner_, PSM::callback_queue_)

Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads.
Filed PR: ros/ros_comm#867

The spinner is now only needed to serve our own callback_queue_ for
scene updates, which is only required for syncSceneUpdates() that
syncs all kind of scene updates, not only the robot state.

* rviz: execute state update in background

... because we might wait up to 1s for a robot state update

* add robot_state update test

* waitForRobotToStop()

* Revert "wait a second before updating "current" in RViz (moveit#291)"

This reverts commit e3ef9a6.

* addressed Dave's comments

Conflicts:
	moveit_ros/planning/planning_scene_monitor/include/moveit/planning_scene_monitor/current_state_monitor.h
	moveit_ros/planning/planning_scene_monitor/src/planning_scene_monitor.cpp
	moveit_ros/planning_interface/move_group_interface/src/move_group.cpp
	moveit_ros/visualization/motion_planning_rviz_plugin/src/motion_planning_frame_planning.cpp