fix multi-threaded spinning #867
Conversation
PR #377 introducing [...] A proper solution would be to change the API and return the success of [...]
Force-pushed from ff6c27e to bce1f4b.
Instead of starting spinning in any case as suggested in bce1f4b, I decided to throw an exception when spinning cannot be safely started. This allows for proper feedback as requested in #277, but generally should abort the program and force the developer to fix their program logic. Having this in place allows the spinner tests to be activated as proper unit tests. They were not run automatically because (i) there was no introspection into spinner state available and (ii) individual tests had to be run independently of each other.
Friendly ping.
*/
bool canStart();
ROS_DEPRECATED bool canStart();
I am not convinced that we should add a deprecation warning to an already released distribution. Maybe the deprecation warning can be added for L-turtle.
OK. However, if this gets merged, we should directly create an L-turtle branch and apply the deprecation.
Since we will not start building from that branch until early 2017, I don't see a reason to create that branch now. We can create a separate ticket so that the deprecation is not forgotten but done as soon as the branch exists.
Besides my minor comments this looks good to me. @ros/ros_team Any comments?
The changes lgtm, +1.
This looks like a good approach to me. +1 for deferring the deprecation warning though.
Thanks. Once the comments have been addressed it can be merged.
Using a single recursive mutex prevents running several spinners in parallel (started from different threads), even if they operate on different callback queues. The SpinnerMonitor keeps a list of spinning callback queues, thus making the monitoring local to callback queues.
This correctly indicates the fatality of the error but allows for graceful quitting too.
- moved test/test_roscpp/test/test_spinners.cpp -> test/test_roscpp/test/src/spinners.cpp
- created rostest test/test_roscpp/test/launch/spinners.xml
- use thrown exception to evaluate error conditions
fixes unittest Spinners.async
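The commit messages above describe the SpinnerMonitor approach. A simplified sketch of that idea (not the PR's exact code; the per-queue entry is reduced here to a single flag): bookkeeping is kept per callback queue in a map guarded by a plain mutex, instead of one global recursive mutex.

#include <map>
#include <boost/thread/mutex.hpp>
#include <ros/callback_queue.h>

class SpinnerMonitor
{
public:
  // Try to register a spinner for the given queue. Returns false if the queue
  // is already served by an incompatible spinner: a single-threaded spinner
  // needs exclusive access, while multi-threaded spinners may share a queue.
  bool add(ros::CallbackQueue* queue, bool single_threaded)
  {
    boost::mutex::scoped_lock lock(mutex_);
    std::map<ros::CallbackQueue*, bool>::iterator it = spinning_queues_.find(queue);
    if (it != spinning_queues_.end() && (single_threaded || it->second))
      return false;
    spinning_queues_[queue] = single_threaded;
    return true;
  }

  // remove() (erasing the entry once the last spinner on the queue stops)
  // is omitted here for brevity.

private:
  std::map<ros::CallbackQueue*, bool> spinning_queues_; // queue -> served by a single-threaded spinner?
  boost::mutex mutex_;
};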
Force-pushed from a72d2cd to 14700b4.
Rebased to latest kinetic-devel. The buildfarm seems to check for CMake warnings now, which made the old branch fail.
The CMake warnings will still be there. They require a genmsg release to be fixed (ros/rosdistro#12586). If those are the only warnings they won't prevent this PR from being merged.
Good to know. Looks like all other issues are resolved.
@dirk-thomas: Friendly ping.
AsyncSpinner s(thread_count_, queue);
s.start();

ros::waitForShutdown();
s.stop();
This call is redundant and should be removed.
@rhaschke If you could update this that would be great. Otherwise I can apply it during the merge.
Will do.
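For context, a minimal usage sketch (assuming the standard roscpp API; the node name is made up) of why the explicit stop() adds nothing: the AsyncSpinner stops automatically when it goes out of scope, and waitForShutdown() only returns once the node is shutting down anyway.

#include <ros/ros.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "async_spinner_example");
  ros::NodeHandle nh;

  ros::AsyncSpinner spinner(4); // 4 worker threads serving the global callback queue
  spinner.start();

  ros::waitForShutdown();       // blocks until Ctrl-C / ros::shutdown()
  return 0;                     // no spinner.stop() needed: the destructor stops it
}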
Overall this looks good, but I'm concerned that this changes the behavior significantly due to possibly throwing and tearing down running systems.
{
boost::mutex::scoped_lock lock(mutex_);
std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
if (it != spinning_queues_.end())
It would be good to have a way to catch the error condition of the else for this if. It should at least ROS_ERROR. And this method should probably have a bool return code.
This is private code, only used in spinner.cpp. Since, in the existing code, calls to add() and remove() are always paired, the condition should always be fulfilled and can thus be replaced by a ROS_ASSERT. I will do so.
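A sketch of the change described here (a fragment of SpinnerMonitor::remove(); illustrative only, not the merged code): since add() and remove() are always paired, a missing entry indicates a programming error and can be asserted on.

boost::mutex::scoped_lock lock(mutex_);
std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
// add() and remove() are always paired, so the queue must have been registered
ROS_ASSERT_MSG(it != spinning_queues_.end(),
               "SpinnerMonitor::remove() called for a queue that was never added");
spinning_queues_.erase(it);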
if (it != spinning_queues_.end())
{
if (it->second.tid != boost::thread::id() && it->second.tid != boost::this_thread::get_id())
ROS_ERROR("SpinnerMonitor::remove() called from different thread than add().");
Is this an error or a warning? If it's an error it should return false and not continue.
OK, I will turn this into a warning.
if (!spinner_monitor.add(callback_queue_, false))
{
ROS_FATAL_STREAM("AsyncSpinnerImpl: " << DEFAULT_ERROR_MESSAGE);
throw std::runtime_error("There is already a single-threaded spinner on this queue");
This is a notable change in behavior to escalate to a runtime_error where the old behavior is to not register the callback queue. We could consider making this change in an upcoming release, but we should not make this change in an existing distro.
I did this on purpose: Indeed, the old behavior simply warned on the console, but otherwise silently continued without registering the callback queue. So, if the user / developer doesn't notice the warning in the first place, he might assume everything is fine, but events are never processed on this queue!
The code out there either doesn't encounter this error/exception because AsyncSpinner was correctly used beforehand or - if it encountered the warning - it wasn't processing its events at all, which (hopefully) would have triggered the developer to look for the error too. Hence, probably / hopefully such code was never released.
I agree with the arguments from both of you. But since Kinetic has already been released for a while any behavior change should be avoided (no matter how unlikely it is that code relies on it). Therefore the same conclusion as above: ROS_FATAL_STREAM -> ROS_ERROR_STREAM, throw -> return.
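In code, the agreed-upon change amounts to something like this (a sketch based on the snippet above, keeping the old non-throwing behavior):

if (!spinner_monitor.add(callback_queue_, false))
{
  ROS_ERROR_STREAM("AsyncSpinnerImpl: " << DEFAULT_ERROR_MESSAGE);
  return; // old behavior: the queue is simply not registered, no exception is thrown
}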
I see. However, this is not as trivial as changing throw -> return.
Previous behavior was to accept multiple spinners on a queue as long as they were started from the same thread. To retain this wrong behavior, I need to remember which thread initially started spinning on a particular queue and then allow other threads as well.
Hence, the main improvement remaining from this PR in Kinetic is the ability to have multiple spinners on different queues.
I don't see why we need initial_tid and the condition check based on it. We only need tid to distinguish single threaded and multi threaded spinners. The same queue can only be handled by one single threaded spinner or any number of multi threaded spinners - no matter what the thread id of the multi threaded spinner is.
@dirk-thomas What you describe is the new and intended behavior and I fully agree. However, @tfoote asked to maintain the previous behavior and not throw. Previously, we could have the following situations:
1. `S1` (single-threaded spinner started in `thread 1`): will block `thread 1` until shutdown. Any further spinners in different threads were not allowed (with an error message).
2. `M1` (multi-threaded spinner started in `thread 1`): Further spinners started from different threads were not allowed (with an error message).
3. `M1 ... M1 S1` (multi-threaded spinners started in `thread 1` and afterwards a single-threaded one started): This was accepted without any errors. But the new behavior is to reject `S1`!
Restrictions of cases 1 + 2 are relaxed with this PR: other spinners are allowed as long as they operate on a different queue. The thread doesn't matter.
The tricky part is case 3, which - although nonsense - was perfectly valid code before.
In order to maintain the old behavior, I need to remember which thread the first M-spinner was started in (the `initial_tid`). If I didn't store the `initial_tid`, but allowed S on any thread, this would relax unintended behavior even beyond the previous behavior.
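A hedged sketch of that bookkeeping (the Entry fields follow the snippets in this review; the helper function is hypothetical and only illustrates the rule for case 3):

#include <boost/thread.hpp>

struct Entry
{
  Entry(const boost::thread::id &tid, const boost::thread::id &initial_tid)
    : tid(tid), initial_tid(initial_tid), num_multi_threaded(0) {}

  boost::thread::id tid;           // thread id of a single-threaded spinner,
                                   // default-constructed ("not-a-thread") for multi-threaded ones
  boost::thread::id initial_tid;   // thread that first started spinning on this queue
  unsigned int num_multi_threaded; // number of multi-threaded spinners on this queue
};

// Case 3 rule: a single-threaded spinner is still tolerated after multi-threaded
// ones, but only when started from the thread they were originally started in.
bool allowSingleThreadedAfterMulti(const Entry &existing, const boost::thread::id &requesting_tid)
{
  return existing.initial_tid == requesting_tid;
}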
The "collision" in case 3 doesn't matter in my opinion. The fact that from which thread the multi-threaded spinners have been started from shouldn't be considered. The only collision to avoid is which queues they handle and to avoid that a single-threaded spinner handles the same queue as a multi-threaded spinner.
I fully agree. But it wasn't like this before!
To retain old behavior as requested by Tully, I need to introduce `initial_tid`. Actually, case 3 might be a common use case: starting some AsyncSpinners before finally entering the `ros::spin` loop.
As long as both spinners don't spin on the same queue I think that is totally fine - independent from which thread the multi-threaded one was started from. That should be achieved by just removing the `initial_tid` again and its check, correct?
If spinners operate on different queues, there is no conflict at all and spinners will be started without problems (this is the basic improvement we gain with this PR).
However, I thought we were discussing the case where spinners want to operate on the same queue. There might be code out there which hits case 3. Removing `initial_tid` and the corresponding check, but allowing the S* spinner to be started from any thread after M1 is operating on the queue, would be a weaker check than before: the old code before this PR at least rejected spinning if S* and M1 were started from different threads. Of course, we would like to always reject, but Tully requested not to do so.
if (!spinner_monitor.add(queue, true))
{
ROS_FATAL_STREAM("SingleThreadedSpinner: " << DEFAULT_ERROR_MESSAGE);
throw std::runtime_error("There is already another spinner on this queue");
See other comment about API stability.
The original behavior was to print an error message and then return gracefully, ignoring the new request. This behavior should be maintained. Therefore the ROS_FATAL_STREAM should be replaced with ROS_ERROR_STREAM (since it's not fatal anymore, but indicates an error in using the API) and throw should be replaced with a return.
The last commit 47ed5e9 should be reverted on the L-turtle branch. @dirk-thomas If you create an L-turtle branch, I will file a corresponding PR and enable the deprecation warning in the header (which was removed in 835579f).
Force-pushed from 47ed5e9 to 37b30a8.
This is getting really complicated 😟 I will try to summarize before / goal / after for easier understanding. Just writing this summary based on reading the code took me a good amount of time. Hopefully it will help others to get an overview in the future:
For the case 3.iv I see a problem with the bookkeeping: when the second spinner is added it overrides the existing entry in `spinning_queues_`. Please let me know if I got something wrong (or also if you agree with the summary).
That looks like a good summary @dirk-thomas. I've been through it too and believe that it's as you've summarized. I had two usability comments on error messages, but otherwise it looks good to me.
// single-threaded spinner after several multi-threaded ones, given that they
// were started from the same initial thread
if (it->second.initial_tid == tid)
ROS_ERROR_STREAM("SpinnerMonitor: " << DEFAULT_ERROR_MESSAGE);
This error message should be different to reflect the backwards compatibility, in case anyone reads it in depth.
Done.
Entry(const boost::thread::id &tid,
      const boost::thread::id &initial_tid) : tid(tid), initial_tid(initial_tid), num_multi_threaded(0) {}

boost::thread::id tid; // thread id of single-threaded spinner
It would be great to have a comment here that it will be the default value, which represents `Not-a-Thread`, if multithreaded.
I mentioned it would be NULL in the reworked commit.
spinning_queues_.erase(it);
else
{
ROS_ASSERT(it->second.num_multi_threaded > 0);
It would be great to have a message for this assert like: "Call to SpinnerMonitor::remove() for a multi-threaded spinner cannot be achieved since reference count is not greater than 0."
I opted for a shorter one: "SpinnerMonitor::remove(): Invalid spinner count (0) encountered."
With respect to the double entry for the single-threaded spinners: it will likely lead to some possible errors on teardown. It will likely hit this assert, and since we know this we could soften it so that we can catch it appropriately.
Force-pushed from 37b30a8 to cf2c6b0.
I fully agree with 1. and 2., with a minor remark on 1.ii: it's correct that the previous code didn't explicitly deal with thread ids. However, this was implicitly done by the recursive mutex remembering the thread that holds the lock. In principle I also agree with 3. However, I have some remarks:
Allow multiple single-threaded spinners (in same thread) and count their number. Thus single-threaded and multi-threaded spinners are handled similarly.
Previously, we could have the following situations:

1. `S1` (single-threaded spinner started in `thread 1`): will block `thread 1` until shutdown. Any further spinners in different threads were not allowed (with an error message).
2. `M1` (multi-threaded spinner started in `thread 1`): Further spinners started from _different_ threads were not allowed (with an error message).
3. `M1 ... M1 S1` (multi-threaded spinners started in `thread 1` and afterwards a single-threaded one started): This was accepted without any errors. But the new behavior is to reject `S1`!

Restrictions of cases 1 + 2 are relaxed with this PR: other spinners are allowed as long as they operate on a different queue. Thread doesn't matter. The tricky part is case 3, which - although nonsense - was perfectly valid code before. In order to maintain the old behavior, I need to remember which thread the first M-spinner was started in, using the new variable `initial_tid`.

* allow spinning of a single-threaded spinner after some multi-threaded ones, as long as they are started from the same thread
* don't throw exceptions
* disabled corresponding unittests
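A sketch of the symmetric counting the commit messages above describe (illustrative only; the num_spinners field name is an assumption, not the PR's actual field): every add() increments a per-queue count, every remove() decrements it, and the entry is only erased once the last spinner on that queue has stopped.

void SpinnerMonitor::remove(ros::CallbackQueue* queue)
{
  boost::mutex::scoped_lock lock(mutex_);
  std::map<ros::CallbackQueue*, Entry>::iterator it = spinning_queues_.find(queue);
  ROS_ASSERT_MSG(it != spinning_queues_.end(),
                 "SpinnerMonitor::remove() called for a queue that was never added");
  ROS_ASSERT_MSG(it->second.num_spinners > 0,
                 "SpinnerMonitor::remove(): Invalid spinner count (0) encountered.");
  if (--it->second.num_spinners == 0)
    spinning_queues_.erase(it);
}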
Force-pushed from cf2c6b0 to 91be0e5.
Thank you for updating the logic. It is definitely easier to follow the flow now. I think this is ready to be merged. Thanks for iterating on this!
Cool. Thanks for your patience. Could you create an L-turtle branch to revert the backwards compatibility commit there? I will file a corresponding PR then. Or should I simply file an issue as a reminder to revert 91be0e5?
I don't plan to create a branch for the next ROS distro until we start building Debian packages for it on the build farm, simply because it implies additional overhead. Please go ahead and create an issue so it is not forgotten. I will also search through the code and look for TODO comments or comments mentioning l-turtle after branching.
…ue_) Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads. Filed PR: ros/ros_comm#867 The spinner is now only needed to serve our own callback_queue_ for scene updates, which is only required for syncSceneUpdates() that syncs all kind of scene updates, not only the robot state.
* PSM::waitForCurrentRobotState() + PSM::syncSceneUpdates()
* renamed wall_last_state_update_ to last_robot_state_update_wall_time_
* removed PSM::syncSceneUpdates() (and PSM::spinner_, PSM::callback_queue_)
  Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads. Filed PR: ros/ros_comm#867
  The spinner is now only needed to serve our own callback_queue_ for scene updates, which is only required for syncSceneUpdates() that syncs all kind of scene updates, not only the robot state.
* rviz: execute state update in background ... because we might wait up to 1s for a robot state update
* add robot_state update test
* waitForRobotToStop()
* Revert "wait a second before updating "current" in RViz (#291)"
  This reverts commit e3ef9a6.
* addressed Dave's comments

* PSM::waitForCurrentRobotState() + PSM::syncSceneUpdates()
* renamed wall_last_state_update_ to last_robot_state_update_wall_time_
* removed PSM::syncSceneUpdates() (and PSM::spinner_, PSM::callback_queue_)
  Due to an upstream bug, it's not possible to start multiple AsyncSpinners from different threads. Filed PR: ros/ros_comm#867
  The spinner is now only needed to serve our own callback_queue_ for scene updates, which is only required for syncSceneUpdates() that syncs all kind of scene updates, not only the robot state.
* rviz: execute state update in background ... because we might wait up to 1s for a robot state update
* add robot_state update test
* waitForRobotToStop()
* Revert "wait a second before updating "current" in RViz (moveit#291)"
  This reverts commit e3ef9a6.
* addressed Dave's comments

Conflicts:
  moveit_ros/planning/planning_scene_monitor/include/moveit/planning_scene_monitor/current_state_monitor.h
  moveit_ros/planning/planning_scene_monitor/src/planning_scene_monitor.cpp
  moveit_ros/planning_interface/move_group_interface/src/move_group.cpp
  moveit_ros/visualization/motion_planning_rviz_plugin/src/motion_planning_frame_planning.cpp
This is an attempt to fix #277 in a more fundamental fashion.
As @po1 pointed out in #277 (comment), due to the global mutex, only a single thread was allowed to run/start spinners, even if operating on different callback queues.
As @tfoote pointed out in #277 (comment), the mutex was probably introduced to prevent interleaving access to a callback queue thus guaranteeing in-order execution of queued callbacks.
Obviously, this protection should be local per callback queue (instead of using a global mutex) and it should only be considered for `SingleThreadedSpinner`s, as multi-threaded spinners deliberately request asynchronous processing.
This PR attempts to solve that issue by replacing the global mutex with a `SpinnerMonitor` that keeps track of all callback queues that are currently spinning. If a single-threaded spinner wants to spin a queue in parallel to another spinner, an error can be issued. IMHO, this error should even be fatal.
Alternatively, the callback queue itself could monitor who is spinning it. However, this would require an API change, which is why I decided on the `SpinnerMonitor`.
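An illustrative example of what this change enables (a sketch using standard roscpp APIs; the node name, queue names, and the spinQueue() helper are made up): two AsyncSpinners serving two different callback queues, started from two different threads, which the old global mutex did not allow.

#include <ros/ros.h>
#include <ros/callback_queue.h>
#include <boost/thread.hpp>

void spinQueue(ros::CallbackQueue* queue)
{
  ros::AsyncSpinner spinner(1, queue); // one worker thread dedicated to this queue
  spinner.start();                     // with the old global mutex this failed when
                                       // another thread was already spinning
  ros::waitForShutdown();
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "multi_queue_spinning_example");
  ros::NodeHandle nh;

  // subscriptions/timers can be bound to these queues, e.g. via SubscribeOptions::callback_queue
  ros::CallbackQueue queue_a, queue_b;

  boost::thread spin_a(spinQueue, &queue_a);
  boost::thread spin_b(spinQueue, &queue_b);

  spin_a.join();
  spin_b.join();
  return 0;
}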