Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow mixing latched and unlatched publishers. #1991

Merged
merged 4 commits into from
Jul 20, 2020
Merged

Conversation

mikepurvis
Copy link
Member

@mikepurvis mikepurvis commented Jul 7, 2020

This is a naive change (so far) that just removes the check against mixing latched and unlatched publishers and updates the test added in #1544 so that it covers this scenario (it still passes). As far as I can tell, this should be totally safe and in line with the design intent of the original change:

  1. Each Publisher instance now stores its own latched/unlatched status, and its own last_message_ pointer, rather than this being at the TopicManager/singleton level.
  2. When a new subscriber connects, the Publication::peerConnect method iterates its list of callbacks, which will call the push_latched_message_ std::function for any that have it defined.

Would love to hear from @meyerj if there's something lurking here that I've missed.

Fixes #1966.

@meyerj
Copy link
Contributor

meyerj commented Jul 7, 2020

Would love to hear from @meyerj if there's something lurking here that I've missed.

Just commented in #1966 (comment):

Allowing mixed latched and volatile (non-latching) publishers in a single node is error-prone. The singleton publication is either reporting itself as latching or not in the TCPROS connection header. Normal subscribers do not consider this flag and mixed publishers indeed behave as intended.

But rosbag record does not see individual publishers. The bag only contains the information that topic /tf_static advertised by /node_a was latched or not. If the same topic would have been advertised by /node_b as volatile, that is still okay and can be recorded correctly. But during rosbag play there is again only one singleton publication for both, which is latched or not depending on which connection gets advertised first (Re-recording would not work anymore and the connections would be merged into one. That is probably a rare case and acceptable.).

Overall, I agree that removing the check is a viable solution for rosbag play to at least work for the case of two different nodes (i.e. connections) advertising heterogeneously on the same topic. On the other hand, if there would indeed be multiple latching publishers or even a mix of latching and non-latching publishers within the same node, rosbag record cannot record this situation anyway and only pushes the last message from any of the original publishers to new subscribers. This is still an unresolved problem, especially when recording and replaying /tf_static.

@mikepurvis
Copy link
Member Author

mikepurvis commented Jul 8, 2020

@meyerj Thanks for the in-depth explanation. Given these situations:

  1. Topic with one latched publisher each in multiple processes.
  2. Topic with multiple latched publishers in the same process (or multiple processes).

My understanding is that prior to #1544, case 1 would would work initially but not play back from bag properly, and case 2 would not even work initially (you'd get whichever was the most recent message, rather than one per latched publisher).

With #1544, case 1 is now able to be played back from bag—even historical bags, curiously, since the limitation was always just in rosbag's ability to create multiple latched publishers at playback time. Case 2 now works initially, but still can't be accurately bagged since the TCPROS protocol only sees processes, not individual publishers within them.

And since rosbag playback of case 1 is itself case 2, rebagging that will also mess it up (I believe rosbag filter dumps the connection info too, though possibly for different reasons?).

Anyway, I think all that makes sense, but this is the bit that I'm hazy on:

Allowing mixed latched and volatile (non-latching) publishers in a single node is error-prone. The singleton publication is either reporting itself as latching or not in the TCPROS connection header. Normal subscribers do not consider this flag and mixed publishers indeed behave as intended.

As you say, the only/main case where the flag matters is for special subscribers like rosbag record and topic_tools/relay. Since they didn't do the right thing previously, and still can't do it now unless there are protocol changes, I think the only realistic thing to do is just report latched at the process level if any of the underlying publishers are latched— I think that's effectively what the old behaviour was anyway, that as soon as one latched Publisher was created, the underlying singleton marked the topic as such and that was that.

This obviously leaves a race where a subscriber may see the topic as unlatched because the latched publisher in that same process hasn't been instantiated yet. But if I'm correct about the old behaviour then that at least wouldn't be a regression.

@meyerj
Copy link
Contributor

meyerj commented Jul 8, 2020

@mikepurvis Yes, that's a concise summary of the situation, with the following correction:

I think that's effectively what the old behaviour was anyway, that as soon as one latched Publisher was created, the underlying singleton marked the topic as such and that was that.

The first publisher created in a roscpp process on a given topic decides whether the publication is advertised as latching or not in the connection header:

From topic_manager.cpp:

 bool TopicManager::advertise(const AdvertiseOptions& ops, const SubscriberCallbacksPtr& callbacks)
 {
   // [...]
   pub = lookupPublicationWithoutLock(ops.topic);
   // [...]

   if (pub)
   {
     // [...]

-    if (pub->isLatched() != ops.latch)
-    {
-      ROS_ERROR("Tried to advertise on topic [%s] with latch=%d but the topic is already advertised with latch=%d",
-                ops.topic.c_str(), ops.latch, pub->isLatched());
-      return false;
-    }

     pub->addCallbacks(callbacks);

     return true;
   }

   pub = PublicationPtr(boost::make_shared<Publication>(ops.topic, ops.datatype, ops.md5sum, ops.message_definition, ops.queue_size, ops.latch, ops.has_header));
   pub->addCallbacks(callbacks);
   advertised_topics_.push_back(pub);
}

So if there is already a publisher on the given topic and the check is removed as suggested in this PR, then ops.latch is ignored for the second and subsequent publishers. The latching Publisher instances themselves will still push their latest sample to new subscribers in their push_latched_message_ callback, with or without reporting that in the connection header. Or in other words the header flag and as a result the behavior of "special subscribers" is undefined if the check was removed.

As you also suggested, maybe it would be best to remove or deprecate the Publication::latch_ member and return true from Publication::isLatched() and redundant Publication::isLatching() if and only if there is at least one latching Publisher (from V_Callback callbacks)? On the other hand existing subscribers won't be notified when a new latching publisher is added to an existing publication, or when the last latching publisher has been removed. And neither of rosbag record and topic_tools/relay works without hidden pitfalls if they are multiple latched publishers within the same node.

The standard example for a broken situation is a node/process that has multiple StaticTransformPublisher instances and hence multiple /tf_static publishers. For nodes subscribing online, that use case is fixed now with #1544. But rosbag record /tf_static and later rosbag play will only relatch the transforms message that was recorded last.

Anyway, I think all that makes sense, but this is the bit that I'm hazy on:

Allowing mixed latched and volatile (non-latching) publishers in a single node is error-prone. The singleton publication is either reporting itself as latching or not in the TCPROS connection header. Normal subscribers do not consider this flag and mixed publishers indeed behave as intended.

I agree. The issue cannot be fixed in a nice and sane way anyway without changing the TCPROS protocol. So if removing the check at least resolves the use case of rosbag play with mixed latching and non-latching publishers, then it is probably fine to do so. So 👍.

On the other hand, I still cannot think of a good use case where mixing latching and non-latching publishers on the same topic, even in different nodes, is actually required. If that would be strictly forbidden - and maybe even enforced by the ROS master - then there would be no problem with an inconsistently reported latching state in the connection header, and rosbag play would also not trigger the regression reported in #1966, because you cannot record a bag file with conflicting connection infos anymore. However, even with such a strict policy rosbag play does not correctly latch multiple messages originating from multiple publishers within the same node. Could you elaborate on your use case and how you ended up with that bag file?

@meyerj
Copy link
Contributor

meyerj commented Jul 8, 2020

It should also be noted that all above is only valid for roscpp. Other client libraries will behave differently.

For rospy @mgrrx proposed a patch in #146 (comment) (magazino@7c6f5b0), which originally motivated #1544. I still think that this commit should be applied to rospy to achieve at least a consistent behavior of rospy and roscpp nodes having multiple latched publishers (e.g. multiple tf2_ros.StaticTransformBroadcaster).

@mikepurvis
Copy link
Member Author

mikepurvis commented Jul 11, 2020

@meyerj Have implemented now the behaviour where the singleton is marked as latched if any of the instances are. Ideally this would have been just done in the isLatched accessor, but since that accessor was implemented in the header, it would have been inlined and thus broken ABI. So instead I leave the accessor as-is and update the existing member.

Let me know if this approach is acceptable to you.


Regarding the question of valid use-cases, I'm not sure that there are any either, and I've cleaned up the ones I'm aware of in our codebase, but we still have historical bags where a rogue node was publishing rosout as latched or whatever. We also have a situation where bags that had been processed for use with webviz included collapsing together the tf (unlatched) and tf_static (latched) topics, which is now a source of error.

At the end of the day though, I just don't like the idea of an arbitrary runtime incompatibility on the level of MD5 breakage that's buried in code and has to be navigated like this, especially when it's something that bites later on in development (when playing a bag, when combining nodelets, etc).

Copy link
Contributor

@meyerj meyerj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@meyerj Have implemented now the behaviour where the singleton is marked as latched if any of the instances are. Ideally this would have been just done in the isLatched accessor, but since that accessor was implemented in the header, it would have been inlined and thus broken ABI. So instead I leave the accessor as-is and update the existing member.

Let me know if this approach is acceptable to you.

Yes, thanks for the update. It is a reasonable choice to keep the existing latch_ flag up-to-date in addCallbacks() and removeCallbacks() instead of removing it. This likely happens less frequently than new incoming connections from subscribers, where isLatching() is checked for the connection header in transport_subscriber_link.cpp:114.

Strictly speaking latch_ must be protected by the callbacks_mutex_ now, because the flag is not immutable anymore after the construction of a Publication instance. Or it needs to be of type std::atomic<bool>, but that would break the ABI, too. I guess for now it is acceptable that isLatched() and isLatching() eventually report an outdated state due to cache coherency issues, because they do not reflect the full truth anyway if there are mixed latching and non-latching subscribers. But it would be possible to move the implementation to the source and to add a scoped lock of the callbacks_mutex_ without breaking the ABI. There is just no guarantee that code compiled against a previous version of the header did not inline the method and uses the implementation without the lock.

@mikepurvis
Copy link
Member Author

mikepurvis commented Jul 13, 2020

@meyerj Thanks for pointing that out; I've made the suggested change.

@dirk-thomas I think this is good to go in, when you're able to look.

@dirk-thomas
Copy link
Member

Thanks for the patch!

@dirk-thomas dirk-thomas merged commit 80cbe5b into noetic-devel Jul 20, 2020
@dirk-thomas dirk-thomas deleted the fix-1966 branch July 20, 2020 22:39
nim65s added a commit to nim65s/robotpkg that referenced this pull request Mar 12, 2021
Because DEPEND_ABI.ros-comm.noetic?= ros-comm>=1.15

1.15.9 (2020-10-16)
-------------------
* Fix deadlock when service connection is dropped (ros/ros_comm#2074)
* Update maintainers (ros/ros_comm#2075)
* Fix case where accessing cached parameters shuts down another node (ros/ros_comm#2068)
* Fix spelling (ros/ros_comm#2066)
* Fix Lost Wake Bug in ROSOutAppender (ros/ros_comm#2033)
* Fix compatibility issue with boost 1.73 and above (ros/ros_comm#2023)
* Gracefully stop recording upon SIGTERM and SIGINT (ros/ros_comm#2038)
* Use heapq.merge instead of custom merge sort code (ros/ros_comm#2017)
* Fix handling of single quotes in command arguments on Windows (ros/ros_comm#2051)
* Clearer error message (ros/ros_comm#2035)
* Ignore underscores when parsing literal numeric values for Python 3 compatibility (ros/ros_comm#2022)
* Clear cached URI for a node that has gone offline (ros/ros_comm#2010)
* Add skip_cache parameter to rosnode_ping() (ros/ros_comm#2009)
* Install advertisetest (ros/ros_comm#2046)
* Use range instead of xrange for Python 3 compatibility (ros/ros_comm#2013)
* Fix to address CVE-2020-16124 (ros/ros_comm#2065)
* Fix XmlRpcValue::_doubleFormat being unused (ros/ros_comm#2003)

1.15.8 (2020-07-23)
-------------------
* change is_async_connected to use epoll when available (ros/ros_comm#1983)
* allow mixing latched and unlatched publishers (ros/ros_comm#1991)
* remove not existing NodeProxy from rospy __all_\_ (ros/ros_comm#2007)
* fix typo in topics.py (ros/ros_comm#1977)
* fix bad relative import (still Python 2 style) (ros/ros_comm#1973)
* improve shutdown message with duplicate node name (ros/ros_comm#1992)
* remove dependency on rostopic from rostest package (ros/ros_comm#2002)
* fix missing reload() function in Python 3 (ros/ros_comm#1968)
* add latch param to throttle (ros/ros_comm#1944)
* add const versions of XmlRpcValue converting operators (ros/ros_comm#1978)

1.15.7 (2020-05-28)
-------------------
* fix Windows build break (ros/ros_comm#1961)
* fix NameError in launch error handling (ros/ros_comm#1965)

1.15.6 (2020-05-21)
-------------------
* fix a bug that using a destroyed connection object (ros/ros_comm#1950)

1.15.5 (2020-05-15)
-------------------
* check if async socket connect is success or failure before TransportTCP::read() and TransportTCP::write() (ros/ros_comm#1954)
* fix bug that connection drop signal related funtion throw a bad_weak exception (ros/ros_comm#1940)
* multiple latched publishers per process on the same topic (ros/ros_comm#1544)
* fix negative numbers in ros statistics (ros/ros_comm#1531)
* remove extra \n in ROS_DEBUG (ros/ros_comm#1925)
* add option to repeat latched messages at the start of bag splits (ros/ros_comm#1850)
* fix bag migration failures caused by typo in connection_header assignment (ros/ros_comm#1952)
* fix brief description comments after members (ros/ros_comm#1920)
* add --sigint-timeout and --sigterm-timeout parameters (ros/ros_comm#1937)
* roslaunch-check: search dir recursively (ros/ros_comm#1914)
* sort printed nodes by namespace alphabetically (ros/ros_comm#1934)
* remove pycrypto import (not used) (ros/ros_comm#1922)
* avoid infinite recursion in rosrun tab completion when rosbash is not installed (ros/ros_comm#1948)
* fix bare pointer in topic_tools::ShapeShifter (ros/ros_comm#1722)
* clear message queue on simtime jumping back (ros/ros_comm#1518)
* use undefined dynamic_lookup on macOS (ros/ros_comm#1923)
* check if enough FDs are free, instead counting the total free FDs (ros/ros_comm#1929)

1.15.4 (2020-03-19)
-------------------
* restrict boost dependencies to components used (ros/ros_comm#1871)
* add exception for ConnectionAbortedError (ros/ros_comm#1908)
* fix mac trying to use epoll instead of kqueue (ros/ros_comm#1907)
* fix AttributeError: __exit__ (ros/ros_comm#1915, regression from 1.14.4)

1.15.3 (2020-02-28)
-------------------
* remove Boost version check since Noetic only targets platforms with 1.67+ (ros/ros_comm#1903)

1.15.2 (2020-02-25)
-------------------
* export missing Boost dependency (ros/ros_comm#1898)
* add timestamp formatting for rosconsole (ros/ros_comm#1892)

1.15.1 (2020-02-24)
-------------------
* fix missing boost dependencies (ros/ros_comm#1895)
* use setuptools instead of distutils (ros/ros_comm#1870)
* increase time limit of advertisetest/publishtest.test to reduce flakyness (ros/ros_comm#1897)

1.15.0 (2020-02-21)
-------------------
* fix dictionary changed size during iteration (ros/ros_comm#1894)
* update test to pass with old and new yaml (ros/ros_comm#1893)

Packaging changes:
- removed patch-an, as there are no more boost version checks
- updated patch-ao
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[rosbag] Errors playing bags that contain latched and non-latched publishers of the same topic
3 participants