Rclpy wait for service #127
Conversation
this looks pretty good. I had a few comments/questions inline, mostly nitpicks.
This could benefit from another set of eyes as well
rclpy/rclpy/client.py
Outdated
""" | ||
Block until the service is available. | ||
|
||
:param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0 |
docstring should be updated to reflect that negative values wait forever
Fixed b68335b
rclpy/rclpy/graph_listener.py
Outdated
# Assumes lock is already held
if self._thread is None:
    self._thread = threading.Thread(target=self._runner)
    self._thread.daemon = True
Nit: you can "daemonize" a thread by passing the keyword argument `daemon=True` to the `Thread` constructor (same below)
Fixed ace6fe0
rclpy/rclpy/client.py
Outdated
:param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0
:type timeout_sec: float or None
:rtype: bool true if the service is available
Nit: use `:returns:` to describe what the function returns and `:rtype:` to say what type is returned:

:returns: True if the service is available, otherwise False
:rtype: bool

(Same for all function docblocks)
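Combined with the earlier note that negative timeouts block forever, the docblock would read roughly like this (a sketch, not the exact text of the fix commits):

```python
def wait_for_service(self, timeout_sec=None):
    """
    Block until the service is available.

    :param timeout_sec: Seconds to wait. Block forever if None or negative.
        Don't wait if 0.
    :type timeout_sec: float or None
    :returns: True if the service is available, otherwise False
    :rtype: bool
    """
```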
Fixed e37969d
rclpy/rclpy/wait_set.py
Outdated
    del self._ready_pointers[subscription_pointer]
    self._needs_building = True

def add_subscriptions(self, subscriptions):
Nit: move this function next to add_subscription as it's done below for timers
Done 0285da1
rclpy/rclpy/wait_set.py
Outdated
def remove_subscription(self, subscription_pointer):
    del self._subscriptions[subscription_pointer]
    del self._ready_pointers[subscription_pointer]
Should this check that `subscription_pointer` is a key of `_ready_pointers` before trying to delete it? (same multiple times below)
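If such a guard were wanted, `dict.pop()` with a default would avoid a `KeyError` without an explicit membership check (a sketch using the attribute names from the quoted diff):

```python
def remove_subscription(self, subscription_pointer):
    # pop() with a default never raises, even if the key was never added
    self._subscriptions.pop(subscription_pointer, None)
    self._ready_pointers.pop(subscription_pointer, None)
    self._needs_building = True
```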
Nvm, the key is always created when an entity is created. Though now I'm not sure what ensures that the `ready_pointers` entries not updated in https://github.com/ros2/rclpy/pull/127/files#diff-c7dc6d55f68505f4de26646b4541e2cfR151 (the ones that are not ready) are set to False in the existing `ready_pointers` dict
Oops, fixed + test updated to catch that in a8e54fc
@@ -0,0 +1,61 @@
# Copyright 2017 Open Source Robotics Foundation, Inc.
We will need to extend the timeout of the nose test, as several of these could take some time and the total test time will likely exceed 90 seconds in non-ideal cases
What are the non-ideal cases? How long do they take? I upped it by 30 seconds (+4ish seconds per new test) 72e8ea9.
The worst-case scenario really depends on the nature of the test; e.g. if a test has a wait_for_service(5s), in the worst case it will time out and take ~5 seconds. For reference, our service communication tests have a timeout of 30s. On my machine this takes between 90 and 100s, so I guess 120 is fine. We often revisit test timeouts later on, given that not all nodes have the same behavior. A good way to estimate the duration would be to run jobs on all platforms with this single test repeated multiple times
rclpy/test/test_client.py
Outdated
@classmethod
def setUpClass(cls):
    rclpy.init()
    cls.node = rclpy.create_node('TestClient', namespace='/')
Question: is the namespace necessary here, or should an empty namespace expand to `/` automatically?
Not necessary. This was a copy/paste error b9d015f
Looks like the referenced commit didn't modify this line
Oops. Second attempt: 0ef6a4d
rclpy/test/test_utilities.py
Outdated
self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(None))
self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(-1))
self.assertEqual(0, rclpy.utilities.timeout_sec_to_nsec(0))
self.assertEqual(1000000000, rclpy.utilities.timeout_sec_to_nsec(1))
Nit: Using the number in nanoseconds makes it harder to read. Could we use S_TO_NS in these tests?
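Presumably the suggestion amounts to something like this (assuming `S_TO_NS`, the seconds-to-nanoseconds factor, is importable from `rclpy.utilities` as the diff elsewhere in this PR suggests):

```python
import unittest

from rclpy.utilities import S_TO_NS, timeout_sec_to_nsec


class TestTimeoutConversion(unittest.TestCase):

    def test_timeout_sec_to_nsec(self):
        self.assertGreater(0, timeout_sec_to_nsec(None))
        self.assertGreater(0, timeout_sec_to_nsec(-1))
        self.assertEqual(0, timeout_sec_to_nsec(0))
        # 1 * S_TO_NS reads better than the literal 1000000000
        self.assertEqual(1 * S_TO_NS, timeout_sec_to_nsec(1))
```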
# See the License for the specific language governing permissions and
# limitations under the License.

import time
Nit: rcl timers are using steady time so I think the tests comparing times should use the same time source
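In Python that means `time.monotonic()` rather than `time.time()`; a self-contained sketch of the suggested pattern:

```python
import time

# time.time() can jump when the system clock is adjusted;
# time.monotonic() is a steady clock, matching what rcl timers use.
before = time.monotonic()
time.sleep(0.1)  # stand-in for the operation being timed
elapsed = time.monotonic() - before
assert elapsed >= 0.09  # allow a little scheduler slack
```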
Done c3ab486
rclpy/test/test_wait_set.py
Outdated
self.assertFalse(ws.is_ready(gc_pointer))

_rclpy.rclpy_trigger_guard_condition(gc_handle)
# TODO(sloretz) why does the next assertion fail with wait(0)?
hmmm good question, would be good to find out why
I've seen that there are some additional things that happen in rmw_fastrtps_cpp's `wait` when the requested timeout is 0, in particular: https://github.com/ros2/rmw_fastrtps/blob/a6ac02905c55118fb1da636f85c8197dde8ef58e/rmw_fastrtps_cpp/src/rmw_wait.cpp#L236
When I was pulling up that link I realised that it looks to be checking `hasTriggered` on the guard condition, while `getHasTriggered` that was called at L229 will have cleared the flag already. So I suspect that `rmw_wait` is incorrectly returning that it timed out even though the guard condition had triggered.
Then as a flow-on effect, it looks like `rcl_wait` will clear the guard conditions if `rmw_wait` timed out: https://github.com/ros2/rcl/blob/1fa4acac8ca42f61bd0f80c7d3ee5fba3d196e33/rcl/src/rcl/wait.c#L621, so this combination of things might explain the difference in behaviour with `wait(0)`.
Have you looked into it already @sloretz? Otherwise I'll continue the investigations
@dhood nice find! I hadn't looked into it
ros2/rmw_fastrtps#158 should let you change back to wait(0)
rclpy/rclpy/executors.py
Outdated
""" | ||
if timeout_sec is not None and timeout_sec < 0: | ||
timeout_sec = None | ||
# Wait for all work to complete | ||
if timeout_sec is None or timeout_sec >= 0: |
isn't this always evaluating to True? (given the previous 2 lines)
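Reduced to its essentials, the quoted logic shows why (a sketch):

```python
def normalize(timeout_sec):
    if timeout_sec is not None and timeout_sec < 0:
        timeout_sec = None
    # At this point timeout_sec is either None or >= 0, so the guard
    # `timeout_sec is None or timeout_sec >= 0` is always True
    # and can simply be dropped.
    return timeout_sec
```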
Good catch 6c77285
rclpy/src/rclpy/_rclpy.c
Outdated
    return NULL;
}
if (!PyCapsule_IsValid(pynode, NULL) || !PyCapsule_IsValid(pyclient, NULL)) {
    PyErr_Format(PyExc_ValueError, "Expected two PyCapsule as arguments");
This raises ValueError on an invalid capsule, while rclpy_get_graph_guard_condition raises TypeError
rclpy/rclpy/utilities.py
Outdated
if timeout_sec is None:
    # Block forever
    return -1
elif timeout_sec > -1.0 / S_TO_NS and timeout_sec < 1.0 / S_TO_NS:
Could you clarify the reasoning behind the rounding here?
I think it's going to trip users up because of how the behaviour changes if you just slightly change timeout values around the edges of +/-1ns. E.g. if I'm calculating `timeout_sec` values through some algorithm of my own, I'd expect both a -1.1ns timeout and a -0.9ns timeout to be translated to an indefinite wait. I would suggest just converting to nanoseconds and then checking if the result is negative.
For the +0.9ns case, it could be argued that this should cause a blocking wait of 1ns, but I don't think we need to be too picky about this since in theory the difference between waiting 0ns and 1ns shouldn't be too drastic (in practice, it seems from https://github.com/ros2/rclpy/pull/127/files#diff-5464c403a7a6a4bd5dc7f394ce03bfdbR48 that that's not the case right now). We should at least clarify in the docblock that it doesn't wait if the timeout is <1ns.
The goal was only to avoid == on a floating point number. The bound can be smaller (gtest `EXPECT_DOUBLE_EQ` uses 4 ULP). I chose 1ns because it's the smallest amount of time `rcl_wait` supports. How about changing it to 1 ULP on the negative side (block forever if timeout_sec < 0), and 0.5ns on the positive side (don't wait if 0 == round(timeout_sec * S_TO_NS))?
Not sure what I was thinking before, but in 8b2008c any negative number blocks, and a timeout less than 1ns doesn't block due to the integer cast.
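A sketch of the behaviour as described after 8b2008c (a hypothetical reimplementation, not the actual diff): any negative value and None block forever, and sub-nanosecond positive values truncate to 0 via the integer cast.

```python
S_TO_NS = 1000 * 1000 * 1000


def timeout_sec_to_nsec(timeout_sec):
    if timeout_sec is None or timeout_sec < 0:
        return -1  # negative sentinel: block forever
    # int() truncates, so anything below 1 ns becomes 0 (don't wait)
    return int(timeout_sec * S_TO_NS)


assert timeout_sec_to_nsec(None) == -1
assert timeout_sec_to_nsec(-0.9e-9) == -1  # any negative value blocks forever
assert timeout_sec_to_nsec(0.9e-9) == 0    # sub-nanosecond timeouts don't wait
assert timeout_sec_to_nsec(1.5) == 1500000000
```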
8b2008c matches what I'd expect now
cli.wait_for_service()
cli.call(GetParameters.Request())

ws.wait(5 * S_TO_NS)
I don't think this will be long enough for connext, we tend to use 20s in other service tests
Given that in this case the client and the service server are part of the same participant, I would expect the discovery to be very quick and the 5 sec timeout to be enough.
Talking briefly with @dhood offline, it looks like the current value should be good enough. We would still benefit from a repeated job on this package before merging to convince ourselves and catch potential flakiness before it hits the nightlies
@mikaelarguedas @dhood Repeated job in progress http://ci.ros2.org/job/ci_linux/3361/
It doesn't appear to be flaky on linux http://ci.ros2.org/job/ci_linux/3361/console#console-section-164. Do the tests take longer on OSX or Windows? I can start a repeated job on them too.
import time
import unittest

from rcl_interfaces.srv import GetParameters
rclcpp defines mocking interfaces to test this kind of thing rather than relying on existing rcl_interfaces. It may be worth doing the same here to avoid confusion, as an outsider would expect Parameters to be a thing that exists in rclpy when seeing this service being used
Mind if this goes in a follow-up PR? `test_callback_group.py`, `test_destruction.py`, and `test_node.py` would also benefit from this.
rclpy/rclpy/executors.py
Outdated
timers_ready = _rclpy.rclpy_get_ready_entities('timer', wait_set)
clients_ready = _rclpy.rclpy_get_ready_entities('client', wait_set)
services_ready = _rclpy.rclpy_get_ready_entities('service', wait_set)
wait_set = _WaitSet()
This can't rely on `__del__` being called to destroy the allocated C wait set. That might just never happen. Instead the code should explicitly call a cleanup function when it doesn't need `wait_set` anymore.
AFAIK an explicit method cannot be relied on unless it is implemented using the `with` or `try/finally` statements. I can add `__enter__()` and `__exit__()` methods.
I think `__del__` is OK here unless the C structure itself is very big. The cases where `__del__` won't be called look acceptable to me.

- `__del__` never called
  - the interpreter is shutting down
    - OS will free memory anyways when the interpreter terminates.
  - there is a chain of circular references, cyclic garbage collection is disabled, and one of the objects in the chain has a reference to an instance of WaitSet
    - It's not just the wait set, all the other objects in the chain are leaked too. The user must fix their circular reference bug if they don't want memory leaks.
- `__del__` not called for a long time
  - User messed with garbage collector settings
    - They've got bigger problems. If this isn't getting collected then a lot of other stuff isn't either.
  - Something keeping the reference count from hitting zero (ex: an exception was raised and the WaitSet is referenced by a local variable in one of the stack frames in `sys.last_traceback` and another exception hasn't been raised for a while)
    - Explicit method or not, the class instance and member variables will use memory until they are garbage collected (at which point `__del__` will be called unless the interpreter is shutting down). An explicit method can only free the C structure. I'm not sure how big the C structure is; if it's really big then it is worth freeing it sooner.
I am pretty sure that without fiddling with any settings the GC won't collect an instance "immediately" but might only do that after minutes. And if the user wants to ensure that the C structure is deleted there needs to be a way to do so (without triggering the GC). Therefore I think it is better to expect that the calling code invokes a cleanup function explicitly. If the caller fails to do so, the `__del__` method can then do it for them to be on the safe side.
Resolved by adding `WaitSet.destroy()` and having the executor use it via `with`: d8521cd
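The resulting shape, sketched with a stand-in for the C wait set (hypothetical; the real class holds a capsule from `_rclpy`):

```python
class WaitSet:
    def __init__(self):
        self._handle = object()  # stand-in for the rcl wait set capsule

    def destroy(self):
        # Explicit, deterministic cleanup of the C structure; idempotent
        self._handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.destroy()

    def __del__(self):
        # Safety net only: __del__ may run late or never, so callers
        # should prefer destroy() or the with statement
        self.destroy()


# Executor usage: the wait set is freed even if waiting raises
with WaitSet() as wait_set:
    pass  # build the set and wait on it here
```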
rclpy/rclpy/graph_listener.py
Outdated
from rclpy.wait_set import WaitSet as _WaitSet


class GraphListenerSingleton:
How is this singleton being destroyed in case of e.g. shutting down rclpy? When is the thread being stopped?
In d10ee26 it checks once per second whether rclpy has been shut down.
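The pattern is roughly a daemon thread that never blocks for more than a second, so it can notice shutdown promptly (a self-contained sketch; a `threading.Event` stands in for rclpy's shutdown state):

```python
import threading

shutdown = threading.Event()  # stands in for "rclpy was shut down"


def _runner():
    while not shutdown.is_set():
        # The real listener blocks on a wait set with a 1-second timeout,
        # re-checking for shutdown on every wakeup
        shutdown.wait(timeout=1.0)


thread = threading.Thread(target=_runner, daemon=True)
thread.start()
shutdown.set()  # simulate rclpy.shutdown()
thread.join()
```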
Force-pushed from 8b2008c to be25673
Not merging yet because it appears to push the 1000Hz timer test from flaky to failing 100% of the time on aarch64. The reason for this test failure is probably performance in wait_for_ready_callbacks().
~5% less overhead in wait_for_ready_callbacks()
rclpy/rclpy/executors.py
Outdated
# Process ready entities one node at a time
for node in nodes:
    for tmr in [t for t in timers if wait_set.is_ready(t.timer_pointer)]:
Shouldn't this list comprehension only consider the `timers` of the current `node`? Same for `subscriptions`, `clients` and `services` below.
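Presumably the fix filters on the node's own entities, along these lines (a runnable sketch with stub classes; the real attribute names may differ):

```python
class Timer:
    def __init__(self, pointer):
        self.timer_pointer = pointer


class Node:
    def __init__(self, timers):
        self.timers = timers


class WaitSetStub:
    def __init__(self, ready):
        self._ready = ready

    def is_ready(self, pointer):
        return pointer in self._ready


nodes = [Node([Timer(1), Timer(2)]), Node([Timer(3)])]
wait_set = WaitSetStub(ready={2, 3})

for node in nodes:
    # Consider only this node's timers, not every timer in the executor
    for tmr in [t for t in node.timers if wait_set.is_ready(t.timer_pointer)]:
        print('dispatching timer', tmr.timer_pointer)
```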
Fixed by 6352e86. Also refactored some code to a sub-generator.
@dirk-thomas replaced comprehensions with …
Improvements result in ~9% less overhead in wait_for_ready_callbacks().
Edit: Still not fast enough :( Build with 7ee5ead. Another try: http://ci.ros2.org/job/ci_linux-aarch64/668/
Edit: more stuff. I don't expect this to pass, but it will give me more info in the morning
Changed since review. Executor code moved to #140
A lot has been split off, changed, and merged since this PR. I'll close this, expecting to open smaller PRs with these features in the future.
This adds a blocking `wait_for_service()` to the client. It is implemented using a GraphListener which runs in its own thread, similar to rclcpp. This is needed to prevent wait_for_service() from deadlocking a single-threaded executor if someone calls it from a subscriber callback. There is a small amount of refactoring in the executor for the purpose of reusing code in the graph listener. Also, after talking with @mikaelarguedas, negative timeouts now block forever instead of returning immediately.
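Typical usage of the new API would look roughly like this (a sketch based on the PR's tests; exact signatures may have changed since):

```python
import rclpy
from rcl_interfaces.srv import GetParameters

rclpy.init()
node = rclpy.create_node('wait_for_service_example')
cli = node.create_client(GetParameters, 'get_parameters')

# Block up to 5 seconds; None or a negative timeout blocks forever
if cli.wait_for_service(timeout_sec=5.0):
    cli.call(GetParameters.Request())
else:
    print('service not available within timeout')
rclpy.shutdown()
```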
connects to #58