Update publisher/subscription matched count API documentation. #262
Conversation
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Writing tests for this portion of the API brings up an interesting question: what's a reasonable latency for pub/sub match events (within the same process, within the same host, within the same network)? We don't have (nor specify) a guaranteed upper bound to rely on.

Edit: Hmm, there's the graph guard condition.
Just to clarify, I'm looking for a way to specify within which time interval (in seconds, or in received graph events) the API is expected to reflect the state of the system in a given environment. Otherwise I can't tell which implementation behaves correctly and which doesn't.
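For context, here is a rough sketch (not from this PR) of what waiting on the node's graph guard condition looks like through the rmw wait-set API. The function name is hypothetical, error and null checks are elided, and the 1-second cap is purely illustrative:

```c
// Rough sketch: block until the node's graph guard condition fires or an
// arbitrary per-wait cap expires. Error/null checks elided for brevity.
#include <stddef.h>
#include <rmw/rmw.h>

static rmw_ret_t
wait_for_graph_event(const rmw_node_t * node, rmw_context_t * context)
{
  const rmw_guard_condition_t * graph_gc =
    rmw_node_get_graph_guard_condition(node);
  rmw_wait_set_t * wait_set = rmw_create_wait_set(context, 1u);

  // rmw_guard_conditions_t carries the guard conditions' data pointers.
  void * gc_storage[1] = {graph_gc->data};
  rmw_guard_conditions_t guard_conditions = {1u, gc_storage};

  rmw_time_t timeout = {1, 0};  // 1 second: illustrative, not normative
  rmw_ret_t ret = rmw_wait(
    NULL, &guard_conditions, NULL, NULL, NULL, wait_set, &timeout);

  rmw_destroy_wait_set(wait_set);
  return ret;  // RMW_RET_TIMEOUT if no graph event arrived within the cap
}
```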
rmw/include/rmw/rmw.h
* \param[in] publisher the publisher object to inspect
* \param[out] subscription_count the number of subscriptions matched
* \return `RMW_RET_OK` if successful, or
* \return `RMW_RET_INVALID_ARGUMENT` if either argument is null, or
* \return `RMW_RET_INCORRECT_RMW_IMPLEMENTATION` if node or publisher
Suggested change:
- * \return `RMW_RET_INCORRECT_RMW_IMPLEMENTATION` if node or publisher
+ * \return `RMW_RET_INCORRECT_RMW_IMPLEMENTATION` if publisher
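For reference, a minimal sketch of calling the documented API and acting on its return codes; the helper name is hypothetical, and `publisher` is assumed to be a valid publisher created elsewhere:

```c
// Minimal sketch of using rmw_publisher_count_matched_subscriptions();
// the helper is hypothetical and `publisher` is assumed valid.
#include <rmw/rmw.h>
#include <stdbool.h>
#include <stddef.h>

static bool
has_matched_subscriptions(const rmw_publisher_t * publisher)
{
  size_t subscription_count = 0u;
  rmw_ret_t ret =
    rmw_publisher_count_matched_subscriptions(publisher, &subscription_count);
  if (RMW_RET_OK != ret) {
    // Per the docblock: RMW_RET_INVALID_ARGUMENT if either argument is
    // null, or RMW_RET_INCORRECT_RMW_IMPLEMENTATION on a mismatch.
    return false;
  }
  return subscription_count > 0u;
}
```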
Or maybe I'm wrong. I guess my question is: why is this different from `rmw_subscription_count_matched_publishers`?
This seems like the right change to me. It's not possible, AFAIK, for the node referenced from within the publisher to not match the publisher, so checking only the publisher is fine, I think.
You're spot on. A leftover. See 1083ea1.
lgtm, with @clalancette's point addressed.
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Thanks for the prompt reviews, guys! What do we do about #262 (comment)? I need a bound for the tests in ros2/rmw_implementation#119, and I think it'd be good to make that explicit here.
I don't think we can put a latency on that which will be meaningful for all rmw implementations, either in terms of time or number of events. Why does that need to be part of the documentation?
In my mind, I cannot test for pub/sub matching if I don't have a way to wait for it to happen. If I use the graph guard condition and, say, a retry count, that count becomes an implicit requirement for the implementation. I think that requirement should be explicit (along with all the necessary constraints). Or we don't test pub/sub matching (which is a bit of a bummer).
I don't understand why we need a number in the docs. Why can't we have a retry loop with an empirically calculated number? That's what we do everywhere: when we send a message and check to see it is received, we have a wait period and/or a retry loop, but that's not documented, and I don't think it could be, given that it would be impacted by the underlying system. I think we shouldn't put numbers on things like this because they depend too much on the underlying hardware and middleware implementation.
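A sketch of the kind of empirically calibrated retry loop described above; the helper name, the 100 ms polling period, and the retry budget are all illustrative, not normative:

```c
// Poll the matched-subscription count until it reaches an expected value
// or an empirically tuned retry budget runs out.
#include <rmw/rmw.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>  // usleep()

static bool
wait_for_matched_subscriptions(
  const rmw_publisher_t * publisher, size_t expected, size_t max_retries)
{
  for (size_t attempt = 0u; attempt < max_retries; ++attempt) {
    size_t count = 0u;
    if (RMW_RET_OK !=
      rmw_publisher_count_matched_subscriptions(publisher, &count))
    {
      return false;  // propagate the failure to the test
    }
    if (count >= expected) {
      return true;  // matched within the empirical budget
    }
    usleep(100u * 1000u);  // 100 ms between polls, tuned empirically
  }
  return false;  // no match within the budget: the test should fail
}
```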
I know. And we have a bunch of conditionals in gtests, launch tests, pytests, and CMakeLists.txt to cope with the nitty-gritty of the implementations we have in our source tree. Now that I think about this again, perhaps we can mention the tests themselves, and document under what conditions they are expected to pass.
Hmm, wait, is that the case? Would it be OK if a middleware doesn't match publishers and subscriptions that don't stay around for more than, say, 5 minutes? I mean, that's fine, but then (a) generic RMW implementation testing is not possible and I have to separate these tests from the rest, and (b) we should add that to the docs! Like: check your RMW implementation's documentation to understand what to expect from this API.
Well, we can have large upper bounds, like "a subscription and publisher shall be matched within 2 minutes of their creation", but making that tighter would just make it hard to avoid flaky tests, and making it broader makes the number less and less useful while still not completely eliminating the chance that some hardware or new middleware implementation fails to meet the goal. So if we want to add them, that's fine, but I can't give you a number that isn't arbitrary myself. Sorry, I know that's not satisfying, but I've been dealing with this while trying to write requirements for rclcpp for Apex, and in my opinion there are no satisfying answers.
Yeah, this is a tough one. What's worse, skimming through RMW implementations, it doesn't seem like pub/sub matching triggers graph guard conditions at all -- which is curious. Perhaps the conclusion here is that the API has to improve.
If you're waiting for an event that happens asynchronously, it's impossible to specify an upper time bound. I don't see how the API could improve to avoid this issue; we already have a way of getting a notification, and that's the best thing we can do. The only way to write a reasonable test for this kind of thing is to put an upper bound based on previous experimentation, much bigger than the actually measured delays (e.g. 10x), so the test isn't flaky at all.
Having some form of notification mechanism for every single asynchronous side effect, and not just for a subset, would be an improvement (e.g. as I mention above, AFAICS pub/sub matching doesn't currently trigger the graph guard condition on any Tier 1 RMW implementation).
That's true generically, but performance expectations will join the mix eventually. And for good reason: no ROS-powered robot stack that I've worked with would tolerate ~10s delays for pub/sub matching, so I wouldn't deem that correct behavior.
While that's exactly what's going to happen next (:sweat_smile:), I disagree with the approach. It keeps expectations implicit. And I don't think we'd be happy ever-increasing timeouts just to get tests passing. IMHO, long term we have to make those expectations explicit.
I don't agree that large timeouts are bad. The tests are there to test that something works, eventually. If we want to know how quickly or reliably it works, that is, in my opinion, a performance test, and it should be run separately and with limits that are appropriate for the hardware it is being run on. For the same reason, I do not think we can document what the response time of an API should be, or its throughput or resource overhead.
That's a bug IMO.
As you commented, this is a performance problem, not an API specification problem. The API documentation doesn't have to say anything about how long an event can take to get triggered.
IMHO, the rmw API must not require sleeps at all.
Re-reading this thread, I think we have fundamentally different views of what to expect from these tests. I expect (hope for?) them to be the contract and verification process that any RMW implementation has to comply with and go through, respectively, and not just the tests that ensure our current code doesn't regress. Therefore, while in theory those are valid points, in practice we cannot test for correct behavior unless the definition of correct includes a time cap, and thus a performance expectation (well, we could wait without timeouts, or try to solve the halting problem 😁). Unless we close that door and switch to regression testing.

Measuring delays and making timeouts 10x those delays makes perfect sense if you're testing for regressions in a given implementation. Unfortunately, it doesn't look like we can do better than regression testing for the time being. I'll make it easy for those timeouts to be tuned, though, so that RMW authors can still make some use of these tests.
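One possible way to make those timeouts tunable, sketched under stated assumptions: the `RMW_TEST_TIMEOUT_SCALE` environment variable and the function name below are hypothetical, not an existing knob in any of these repositories.

```c
// Hypothetical pattern: scale an empirically chosen base timeout by an
// environment variable so RMW authors can adapt the tests to their stack.
#include <stdlib.h>

static unsigned int
scaled_timeout_ms(unsigned int base_ms)
{
  // RMW_TEST_TIMEOUT_SCALE is a made-up variable name for illustration.
  const char * scale = getenv("RMW_TEST_TIMEOUT_SCALE");
  if (scale != NULL && atoi(scale) > 0) {
    return base_ms * (unsigned int)atoi(scale);
  }
  return base_ms;  // default: the empirically measured budget
}
```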
Perhaps it is. I don't know what triggering the graph guard condition on every pub/sub match would do to wait-set performance. FWIW, I couldn't find anything in the documentation that'd suggest it should be one way or the other.
Agreed.
In my experience, the best way to deal with requirements like that is to be relative, rather than absolute. "A must happen before B can happen", "A must happen before the system attempts C", "The system shall not be considered started until A has happened", etc. Specific time limits are too dependent on the hardware, the underlying implementation, and the needs of the application (my application might be happy to wait 5 minutes, yours might not) to be worth it.
I 100% agree with the sentiment that we want the API documentation to be the enforceable contract. However, I don't think that contract must include specific time limits, given that we can't provide them in any meaningful way. Specifying it in terms of "if this has not happened, you can't do X" makes more sense to me. It's more stateful, but it's a stateful API anyway.
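As a concrete (hypothetical) illustration of that kind of relative, stateful requirement, a caller could gate an action on the matched count rather than on elapsed time; the helper name below is made up for the sketch:

```c
// Illustrates "A must happen before the system attempts C": refuse to
// publish until at least one subscription has matched, no time bound.
#include <stddef.h>
#include <rmw/rmw.h>

static rmw_ret_t
publish_if_matched(const rmw_publisher_t * publisher, const void * ros_message)
{
  size_t count = 0u;
  rmw_ret_t ret = rmw_publisher_count_matched_subscriptions(publisher, &count);
  if (RMW_RET_OK != ret) {
    return ret;
  }
  if (0u == count) {
    return RMW_RET_ERROR;  // precondition not met; the caller may retry
  }
  return rmw_publish(publisher, ros_message, NULL);  // NULL: no preallocation
}
```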
I agree with that. Still, we will have a finite timeout; a timeout that will be just large enough to get the tests passing. This was a very interesting discussion, but I will not push it any further. Thank you all for the time!
FYI #264
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Precisely what the title says.