Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update publisher/subscription matched count API documentation. #262

Merged
merged 2 commits into from
Aug 27, 2020

Conversation

hidmic
Copy link
Contributor

@hidmic hidmic commented Aug 20, 2020

Precisely what the title says.

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@hidmic
Copy link
Contributor Author

hidmic commented Aug 20, 2020

Writing tests for this portion of the API brings up an interesting question. What's a reasonable latency for pub/sub match events (within the same process, within the same host, within the same network)? We don't have (nor specify) a guaranteed upper bound to rely on , and there's no API to wait for these events.

Edit: Hmm, there's the graph guard condition. rcl tests wait repeatedly on it, though I suspect 9 times is arbitrary. Or maybe not, and within 10 graph events is that reasonable latency I'm after.

@hidmic hidmic requested review from wjwwood, clalancette and ivanpauno and removed request for wjwwood and clalancette August 20, 2020 14:22
@hidmic
Copy link
Contributor Author

hidmic commented Aug 20, 2020

Just to clarify, I'm looking for a way to specify within which time interval (in seconds or in received graph events) the API is expected to reflect the state of the system, in a given environment. Otherwise I can't tell which implementation behaves correctly and which doesn't.

* \param[in] publisher the publisher object to inspect
* \param[out] subscription_count the number of subscriptions matched
* \return `RMW_RET_OK` if successful, or
* \return `RMW_RET_INVALID_ARGUMENT` if either argument is null, or
* \return `RMW_RET_INCORRECT_RMW_IMPLEMENTATION` if node or publisher
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* \return `RMW_RET_INCORRECT_RMW_IMPLEMENTATION` if node or publisher
* \return `RMW_RET_INCORRECT_RMW_IMPLEMENTATION` if publisher

Or maybe I'm wrong. I guess my question is; why is this different than rmw_subscription_count_matched_publishers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the right change to me. It's not possible AFAIK for the node which is referenced from within the publisher to not match the publisher, so checking the publisher only is fine I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're spot on. A left over. See 1083ea1.

Copy link
Member

@wjwwood wjwwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, with @clalancette's point addressed.

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@hidmic
Copy link
Contributor Author

hidmic commented Aug 20, 2020

Thanks for the prompt reviews guys! What do we do about #262 (comment)? I need a boundary for tests in ros2/rmw_implementation#119 and I think it'd be good to make that explicit here.

@wjwwood
Copy link
Member

wjwwood commented Aug 20, 2020

I don't think we can put a latency on that which will be meaningful for all rmw implementations... Either in terms of time or number of events. Why does that need to be part of the documentation?

@hidmic
Copy link
Contributor Author

hidmic commented Aug 20, 2020

I don't think we can put a latency on that which will be meaningful for all rmw implementations... Either in terms of time or number of events. Why does that need to be part of the documentation?

In my mind, I cannot test for pub/sub matching if I don't have a way to wait for it to happen. If I use the graph guard condition and say, a retry count, that count becomes an implicit requirement for the implementation. I think that it should be explicit (along with all the necessary constraints, like for pub/sub within the same process or for pub/sub within the same host with < 10% CPU load).

Or not test pub/sub matching (which is a bit of a bummer).

@wjwwood
Copy link
Member

wjwwood commented Aug 20, 2020

I don't understand why we need a number in the docs. Why can we not have a retry loop with an empirically calculated number? That's what we do everywhere.

When we send a message and check to see it is received we have a wait period and/or a retry loop but that's not documented and I don't think it could be given that it would be impacted by the underlying system.

I think we shouldn't put numbers on things like this because it depends too much on the underlying hardware and middleware implementation.

@hidmic
Copy link
Contributor Author

hidmic commented Aug 20, 2020

I don't understand why we need a number in the docs. Why can we not have a retry loop with an empirically calculated number? That's what we do everywhere.

I know. And we have a bunch of conditionals in gtests, launch tests, pytests, and CMakeLists.txt to cope with the nitty-gritty of the implementations we have in our source tree.

Having test_rmw_implementation is good for RMW implementations coverage, but it also gives us the opportunity to refine rmw API specifications and have a test suite in place that checks if any given implementation reasonably meets those specifications (all for the same price). But to do that, API specifications have to be complete.

Now that I think about this again, perhaps we can mention the tests themselves, and document in what conditions test_rmw_implementation tests are valid.

and middleware implementation.

Hmm, wait, is that the case? Would it be OK if a middleware doesn't match publishers and subscriptions that don't stay around for more than, say, 5 minutes? I mean, that's fine, but then (a) generic RMW implementation testing is not possible, and I have to separate these from the rest, and (b) we should add that to the docs! Like, check your RMW implementation documentation to understand what to expect from this API.

@wjwwood
Copy link
Member

wjwwood commented Aug 20, 2020

Well we can have large upper bounds, like "a subscription and publisher shall be matched within 2 minutes of their creation", but to make that tighter would just make it hard to avoid flaky tests, and to make it broader makes it less and less useful of a number while still not completely eliminating the chance that some hardware or new middleware implementation not meeting the goal.

So if we want to add them that's fine, but I can't give you a number that isn't arbitrary myself. Sorry, I know that's not satisfying but I've been dealing with this while trying to write requirements for rclcpp for Apex and in my opinion there are no satisfying answers.

@hidmic
Copy link
Contributor Author

hidmic commented Aug 20, 2020

Yeah, this is a tough one. For worst, skimming through RMW implementations, it doesn't seem like pub/sub matching triggers graph guard conditions at all -- which is curious considering rcl tests appear to rely on that.

Perhaps, the conclusion here is that the rmw API is lacking the features to enable these tests, and, for the time being, we'll have to move forward without it. That is, loops around checks, with an arbitrary cap like we do elsewhere, and a big TODO for change in the future.

@ivanpauno
Copy link
Member

We don't have (nor specify) a guaranteed upper bound to rely on

If you're waiting to an event that happens asynchronously, it's impossible to specify an upper time bound.
It depends on the platform, on your network, on the rmw implementation, on everything.

I don't see how the API could improve to avoid this issue, we already have a way of getting a notification, that's the best thing we can do.

The only way to write a reasonable test for this kind of things is to put a random upper bound based on previous experimentation (much bigger that the actually measured delays (x10), so the test isn't flaky at all).
That upper bound only avoids unnecessary hangs, and it is not part of the rmw API specification (ideally, it should be large enough to work with currenltly unsupported platforms/rmw implementations, etc).

@hidmic
Copy link
Contributor Author

hidmic commented Aug 21, 2020

I don't see how the API could improve to avoid this issue, we already have a way of getting a notification, that's the best thing we can do.

Having some form of notification mechanism for every single asynchronous side-effect, and not just for a subset (e.g. as I mention above, AFAICS pub/sub matching doesn't currently trigger the graph guard condition on any Tier 1 RMW implementation).

If you're waiting to an event that happens asynchronously, it's impossible to specify an upper time bound.
It depends on the platform, on your network, on the rmw implementation, on everything.

That's true generically, but:

  • It's not pure asynchronous events we're dealing with. We have full control of what's executed and when it's executed (in a single thread, synchronizing threads/processes, take your pick).
  • Perhaps I didn't express myself correctly, but by upper bound I don't expect an specification that can hold ground for every single possible use case. We can tie them to specific tests and/or specific setups (like other engineering disciplines do).

Of course, performance expectations will join the mix. And for good reason. No ROS-powered robot stack that I've worked with would tolerate ~10s delays for pub/sub, so I wouldn't deem that as correct behavior.

The only way to write a reasonable test for this kind of things is to put a random upper bound based on previous experimentation (much bigger that the actually measured delays (x10), so the test isn't flaky at all).

While that's exactly what's going to happen next (:sweat_smile:), I disagree with the approach. It keeps expectations implicit. And I don't think we'd be happy ever increasing timeouts just to get tests passing.

IMHO long term we have to:

  • Have a documented set of tests and/or model platform(s) to complete API specs.
    • And it would be perfectly reasonable to have different specs for different setups and use cases. Perhaps not all RMW implementations can be expected to make the cut in every situation and that's fine.
  • Improve rmw API to sleep less and wait more.

@wjwwood
Copy link
Member

wjwwood commented Aug 21, 2020

I don't agree that large timeouts are bad. The tests are there to test that something works, eventually. If we want to know how quickly or reliably it works that is, in my opinion, a performance test and should be run separately and with limits that are appropriate for the hardware it is being run on.

For the same reason I do not think we can document what the response time of an API should be, or the throughput or resource overhead.

@ivanpauno
Copy link
Member

Having some form of notification mechanism for every single asynchronous side-effect, and not just for a subset (e.g. as I mention above, AFAICS pub/sub matching doesn't currently trigger the graph guard condition on any Tier 1 RMW implementation).

That's a bug IMO.
We could have more fine grained "graph" guard conditions, or we could have a way to pass a condition specifying what graph change we're waiting for (the last option is really tricky to implement).

Of course, performance expectations will join the mix. And for good reason. No ROS-powered robot stack that I've worked with would tolerate ~10s delays for pub/sub, so I wouldn't deem that as correct behavior.

As you commented, this is a performance problem, not an API specification problem.
How long an event takes to get triggered has to be measured in a performance test, not in an unit/integration/system test.

The API documentation doesn't have to say anything about how long an event can take to get triggered.

Improve rmw API to sleep less and wait more.

IMHO, rmw API must not require sleeps at all.
If we're requiring sleeps without providing an alternative wait mechanism, that's a bug and should be fixed.

@hidmic
Copy link
Contributor Author

hidmic commented Aug 21, 2020

Re-reading this thread, I think we have fundamentally different views of what to expect from rmw API documentation and tests.

I expect (hope for?) them to be the contract and verification process that any RMW implementation has to comply with and go through, respectively, and not just the tests that ensure our current code doesn't regress. Therefore, while in theory:

The tests are there to test that something works, eventually.

How long an event takes to get triggered has to be measured in a performance test, not in an unit/integration/system test.

are valid points, in practice we cannot test for correct behavior unless the definition of correct includes a time cap, and thus a performance expectation (well, we could wait w/o timeouts or try solve the halting problem 😁). Unless we close that door and we switch to regression testing. Measuring delays and making timeouts 10x those delays makes perfect sense if you're testing for regressions in a given implementation.

Unfortunately, it doesn't look like we can do better than regression testing for the time being. I'll make it easy for those timeouts to be tuned though, so that RMW authors can still make some use of these tests.

@hidmic
Copy link
Contributor Author

hidmic commented Aug 21, 2020

That's a bug IMO.

Perhaps it is. I don't know what triggering the graph guard condition on every pub/sub match would do to wait set performance. FWIW I couldn't find anything in documentation that'd suggest it should be this way or the other.

IMHO, rmw API must not require sleeps at all.

Agreed.

@gbiggs
Copy link
Member

gbiggs commented Aug 24, 2020

So if we want to add them that's fine, but I can't give you a number that isn't arbitrary myself. Sorry, I know that's not satisfying but I've been dealing with this while trying to write requirements for rclcpp for Apex and in my opinion there are no satisfying answers.

In my experience, the best way to deal with requirements like that is to be relative, rather than absolute. "A must happen before B can happen", "A must happen before the system attempts C", "The system shall not be considered started until A has happened", etc. Specific time limits are too dependent on the hardware, the underlying implementation, and the needs of the application (my application might be happy to wait 5 minutes, yours might not) to be worth it.

I expect (hope for?) them to be the contract and verification process that any RMW implementation has to comply with and go through, respectively, and not just the tests that ensure our current code doesn't regress.

I 100% agree with the sentiment that we want the API documentation to be the enforceable contract. However I don't think that contract must include specific time limits even where we can't provide them in any meaningful way. Specifying it in terms of "if this has not happened, you can't do X" makes more sense to me. It's more stateful, but it's a stateful API anyway.

@hidmic
Copy link
Contributor Author

hidmic commented Aug 24, 2020

Specific time limits are too dependent on the hardware, the underlying implementation, and the needs of the application (my application might be happy to wait 5 minutes, yours might not) to be worth it.

I agree with that. Still, we will have a finite timeout. A timeout that will just be enough to get rmw_fastrtps*_cpp, rmw_cyclonedds_cpp, and rmw_connext_cpp passing tests in our CI machines w/o flakes. And thus test results won't be generic.


This was a very interesting discussion, but I will not push it any further. Thank you all for the time!

@hidmic
Copy link
Contributor Author

hidmic commented Aug 24, 2020

FYI #264

@hidmic
Copy link
Contributor Author

hidmic commented Aug 26, 2020

CI up to test_rmw_implementation, against all Tier 1 RMW implementations:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@hidmic hidmic merged commit ef2caa5 into master Aug 27, 2020
@delete-merged-branch delete-merged-branch bot deleted the hidmic/update-pub-sub-count-docs branch August 27, 2020 15:55
ahcorde pushed a commit that referenced this pull request Oct 13, 2020
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
ahcorde pushed a commit that referenced this pull request Oct 13, 2020
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants