
Rclpy wait for service #127

Closed. Wants to merge 49 commits.

Conversation

@sloretz (Contributor) commented Oct 13, 2017:

This adds a blocking wait_for_service() to the client. It is implemented using a GraphListener which runs in its own thread, similar to rclcpp. This is needed to prevent wait_for_service() from deadlocking a single-threaded executor if someone calls it from a subscriber callback.

There is a small amount of refactoring in the executor for the purpose of reusing code in the graph listener. Also, after talking with @mikaelarguedas, negative timeouts now block forever instead of returning immediately.
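For context, a minimal usage sketch of what this enables (hypothetical node and service names; std_srvs.srv.Trigger is just a stand-in service type):

    import rclpy
    from std_srvs.srv import Trigger  # placeholder service type for illustration

    rclpy.init()
    node = rclpy.create_node('client_node')  # hypothetical node name
    client = node.create_client(Trigger, 'example_service')

    # Block up to 5 seconds waiting for a server to appear.
    # With this PR, a timeout of None (or any negative value) blocks forever.
    if client.wait_for_service(timeout_sec=5.0):
        print('service is available')
    else:
        print('timed out waiting for service')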

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

connects to #58

@sloretz sloretz added the in progress Actively being worked on (Kanban column) label Oct 13, 2017
@sloretz sloretz self-assigned this Oct 13, 2017
@sloretz sloretz added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Oct 13, 2017
@sloretz sloretz changed the title [WIP] Rclpy wait for service Rclpy wait for service Oct 13, 2017
@mikaelarguedas (Member) left a comment:

this looks pretty good. I had a few comments/questions inline, mostly nitpicks.
This could benefit from another set of eyes as well

"""
Block until the service is available.

:param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0
Member:

docstring should be updated to reflect that negative values wait forever

Contributor Author:

Fixed b68335b
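For reference, the updated wording presumably reads something like this (hypothetical phrasing consistent with the new semantics):

    :param timeout_sec: Seconds to wait. Block forever if None or negative. Don't wait if 0.
    :type timeout_sec: float or None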

# Assumes lock is already held
if self._thread is None:
    self._thread = threading.Thread(target=self._runner)
    self._thread.daemon = True
Member:

Nit: you can "daemonize" a thread by passing the keyword argument daemon=True to the Thread constructor (same below)
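i.e., roughly:

    self._thread = threading.Thread(target=self._runner, daemon=True)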

Contributor Author:

Fixed ace6fe0


:param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0
:type timeout_sec: float or None
:rtype: bool true if the service is available
Member:

Nit:
Use returns to describe what the function returns and rtype to say what type is returned

:returns:  True if the service is available, otherwise False
:rtype: bool 

(Same for all functions docblocks)

Contributor Author (@sloretz, Oct 17, 2017):

Fixed e37969d

    del self._ready_pointers[subscription_pointer]
    self._needs_building = True

def add_subscriptions(self, subscriptions):
Member:

Nit: move this function next to add_subscription as it's done below for timers

Contributor Author:

Done 0285da1


def remove_subscription(self, subscription_pointer):
    del self._subscriptions[subscription_pointer]
    del self._ready_pointers[subscription_pointer]
Member:

Should this check that subscription_pointer is a key of _ready_pointers before trying to delete it? (Same in multiple places below.)
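For illustration, a membership-safe delete would use dict.pop with a default (though, per the follow-up below, it turns out not to be needed):

    # Remove the entry if present; no KeyError when the key is absent.
    self._ready_pointers.pop(subscription_pointer, None)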

Member:

Never mind, the key is always created when an entity is created.
Though now I'm not sure what ensures that the ready_pointers not updated in https://github.com/ros2/rclpy/pull/127/files#diff-c7dc6d55f68505f4de26646b4541e2cfR151 (the ones that are not ready) are set to False in the existing ready_pointers dict.

Contributor Author:

Oops, fixed + test updated to catch that in a8e54fc

@@ -0,0 +1,61 @@
# Copyright 2017 Open Source Robotics Foundation, Inc.
Member:

We will need to extend the timeout of the nose test, as several of these could take some time and the total test time will likely exceed 90 seconds in non-ideal cases.

Contributor Author:

What are the non-ideal cases? How long do they take? I upped it by 30 seconds (+4ish seconds per new test) 72e8ea9.

Member:

The worst-case scenario really depends on the nature of the test; e.g., if a test has a wait_for_service(5s), the worst case is that it times out and takes ~5 seconds. For reference, our service communication tests have a timeout of 30s. On my machine this takes between 90 and 100s, so I guess 120 is fine. We often revisit test timeouts later on, given that not all nodes have the same behavior. A good way to estimate the duration would be to run jobs on all platforms with this single test repeated multiple times.

@classmethod
def setUpClass(cls):
    rclpy.init()
    cls.node = rclpy.create_node('TestClient', namespace='/')
Member:

Question: Is the namespace necessary here? Or should an empty namespace expand to / automatically?

Contributor Author:

Not necessary. This was a copy/paste error b9d015f

Member:

Looks like the referenced commit didn't modify this line.

Contributor Author:

Oops. Second attempt: 0ef6a4d

self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(None))
self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(-1))
self.assertEqual(0, rclpy.utilities.timeout_sec_to_nsec(0))
self.assertEqual(1000000000, rclpy.utilities.timeout_sec_to_nsec(1))
Member:

Nit: Using the number in nanoseconds makes it harder to read. Could we use S_TO_NS in these tests?
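i.e., something like this, where S_TO_NS is the nanoseconds-per-second constant already used elsewhere in the package:

    self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(None))
    self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(-1))
    self.assertEqual(0, rclpy.utilities.timeout_sec_to_nsec(0))
    self.assertEqual(1 * S_TO_NS, rclpy.utilities.timeout_sec_to_nsec(1))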


# See the License for the specific language governing permissions and
# limitations under the License.

import time
Member:

Nit: rcl timers use steady time, so I think the tests comparing times should use the same time source.
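In pure Python that would mean measuring elapsed time with a monotonic clock, for example (a sketch, not necessarily what the fix used):

    import time

    start = time.monotonic()  # steady clock, immune to wall-clock jumps
    # ... spin until the timer fires ...
    elapsed = time.monotonic() - start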

Contributor Author:

Done c3ab486

self.assertFalse(ws.is_ready(gc_pointer))

_rclpy.rclpy_trigger_guard_condition(gc_handle)
# TODO(sloretz) why does the next assertion fail with wait(0)?
Member:

hmmm good question, would be good to find out why

Member:

I've seen that there are some additional things that happen in rmw_fastrtps_cpp's wait when the requested timeout is 0, in particular: https://github.com/ros2/rmw_fastrtps/blob/a6ac02905c55118fb1da636f85c8197dde8ef58e/rmw_fastrtps_cpp/src/rmw_wait.cpp#L236

When I was pulling up that link I realised that it looks to be checking hasTriggered on the guard condition, while getHasTriggered that was called at L229 will have cleared the flag already. So I suspect that rmw_wait is incorrectly returning that it timed out even though the guard condition had triggered.

Then as a flow-on effect, it looks like rcl_wait will clear the guard conditions if rmw_wait timed out: https://github.com/ros2/rcl/blob/1fa4acac8ca42f61bd0f80c7d3ee5fba3d196e33/rcl/src/rcl/wait.c#L621, so this combination of things might explain the difference in behaviour with wait(0).

Have you looked into it already @sloretz? Otherwise I'll continue the investigations

Contributor Author:

@dhood nice find! I hadn't looked into it

Member:

ros2/rmw_fastrtps#158 should let you change back to wait(0)

Contributor Author:

Works on my machine. Thanks @dhood 385b18b

"""
if timeout_sec is not None and timeout_sec < 0:
timeout_sec = None
# Wait for all work to complete
if timeout_sec is None or timeout_sec >= 0:
Member:

Isn't this always evaluating to True? (Given the previous 2 lines.)

Contributor Author:

Good catch 6c77285
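The simplified flow presumably collapses to a single normalization step, roughly:

    if timeout_sec is not None and timeout_sec < 0:
        timeout_sec = None  # negative timeouts mean "block forever"
    # timeout_sec is now either None or >= 0; no second range check is
    # needed before waiting for all work to complete.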

        return NULL;
    }
    if (!PyCapsule_IsValid(pynode, NULL) || !PyCapsule_IsValid(pyclient, NULL)) {
        PyErr_Format(PyExc_ValueError, "Expected two PyCapsule as arguments");
Member:

This raises ValueError on an invalid capsule, while rclpy_get_graph_guard_condition raises TypeError.

Contributor Author:

Now they both check if PyCapsule_GetPointer returns NULL, which means they raise ValueError 1628cba. Pull request #129 does the same thing. This will need to be updated to use capsule names in whichever PR lands second.

if timeout_sec is None:
    # Block forever
    return -1
elif timeout_sec > -1.0 / S_TO_NS and timeout_sec < 1.0 / S_TO_NS:
Member:

Could you clarify the reasoning behind the rounding here?

I think it's going to trip users up because of how the behaviour changes if you just slightly change timeout values around the edges of +/-1ns. E.g. if I'm calculating timeout_sec values through some algorithm of my own, I'd expect both a -1.1ns timeout and a -0.9ns timeout to be translated to indefinite wait. I would suggest just converting to nanoseconds and then checking if the result is negative.

For the +0.9ns case, it could be argued that this should cause a blocking wait of 1ns, but I don't think we need to be too picky about this since in theory the difference between waiting 0ns and 1ns shouldn't be too drastic (in practice, it seems from https://github.com/ros2/rclpy/pull/127/files#diff-5464c403a7a6a4bd5dc7f394ce03bfdbR48 that that's not the case right now). We should at least clarify in the docblock that it doesn't wait if the timeout is <1ns.

Contributor Author:

The goal was only to avoid == on a floating point number. The bound can be smaller (gtest EXPECT_DOUBLE_EQ uses 4 ULP). I chose 1ns because it's the smallest amount of time rcl_wait supports. How about changing it to 1 ULP on the negative side (block forever if timeout_sec < 0), and 0.5ns on the positive side (don't wait if 0 == round(timeout_sec * S_TO_NS))?

Contributor Author:

Not sure what I was thinking before, but in 8b2008c any negative number blocks, and a timeout less than 1ns doesn't block due to the integer cast.

Member (@dhood, Oct 19, 2017):

8b2008c matches what I'd expect now
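Putting the agreed semantics together, the conversion plausibly ends up like this (a sketch based on the thread, not the literal commit):

    S_TO_NS = 1000 * 1000 * 1000

    def timeout_sec_to_nsec(timeout_sec):
        if timeout_sec is None or timeout_sec < 0:
            return -1  # block forever
        # int() truncation maps timeouts below 1 ns to 0 (don't wait)
        return int(float(timeout_sec) * S_TO_NS)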

cli.wait_for_service()
cli.call(GetParameters.Request())

ws.wait(5 * S_TO_NS)
Member:

I don't think this will be long enough for Connext; we tend to use 20s in other service tests.

Member:

Given that in this case the client and the service server are part of the same participant, I would expect the discovery to be very quick and the 5 sec timeout to be enough.
Talking briefly with @dhood offline, it looks like the current value should be good enough. We would still benefit from a repeated job on this package before merging, to convince ourselves and catch potential flakiness before it hits the nightlies.

Contributor Author (@sloretz, Oct 19, 2017):

It doesn't appear to be flaky on linux http://ci.ros2.org/job/ci_linux/3361/console#console-section-164. Do the tests take longer on OSX or Windows? I can start a repeated job on them too.

import time
import unittest

from rcl_interfaces.srv import GetParameters
Member:

rclcpp defines mock interfaces to test this kind of thing rather than relying on the existing rcl_interfaces. It may be worth doing the same here to avoid confusion, as an outsider seeing this service being used would expect Parameters to be a thing that exists in rclpy.

Contributor Author:

Mind if this goes in a follow-up PR? test_callback_group.py, test_destruction.py, and test_node.py would also benefit from this.

timers_ready = _rclpy.rclpy_get_ready_entities('timer', wait_set)
clients_ready = _rclpy.rclpy_get_ready_entities('client', wait_set)
services_ready = _rclpy.rclpy_get_ready_entities('service', wait_set)
wait_set = _WaitSet()
Member:

This can't rely on __del__ being called to destroy the allocated C wait set; that might just never happen. Instead, the code should explicitly call a cleanup function when it doesn't need the wait_set anymore.

Contributor Author:

AFAIK an explicit method cannot be relied on unless it is invoked via a with statement or try/finally. I can add __enter__() and __exit__() methods.

I think __del__ is OK here unless the C structure itself is very big. The cases where __del__ won't be called look acceptable to me.

  1. __del__ never called:
    a. The interpreter is shutting down: the OS will free the memory anyway when the interpreter terminates.
    b. There is a chain of circular references, cyclic garbage collection is disabled, and one of the objects in the chain holds a reference to a WaitSet instance: it's not just the wait set; all the other objects in the chain are leaked too. The user must fix their circular-reference bug if they don't want memory leaks.
  2. __del__ not called for a long time:
    a. The user messed with garbage collector settings: they've got bigger problems. If this isn't getting collected then a lot of other stuff isn't either.
    b. Something keeps the reference count from hitting zero (e.g. an exception was raised, the WaitSet is referenced by a local variable in one of the stack frames in sys.last_traceback, and another exception hasn't been raised for a while): explicit method or not, the class instance and member variables will use memory until they are garbage collected (at which point __del__ will be called unless the interpreter is shutting down). An explicit method can only free the C structure sooner. I'm not sure how big the C structure is; if it's really big then it is worth freeing it early.

Member:

I am pretty sure that, without fiddling with any settings, the GC won't collect an instance "immediately" but might only do so after minutes. And if the user wants to ensure that the C structure is deleted, there needs to be a way to do so (without triggering the GC). Therefore I think it is better to expect that the calling code invokes a cleanup function explicitly. If the caller fails to do so, the __del__ method can then do it for them to be on the safe side.

Contributor Author:

Resolved by adding WaitSet.destroy() and having the executor use it via a with statement: d8521cd
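The resulting pattern presumably looks something like this (a sketch; destroy() and the context-manager methods come from the comment above, while the C binding name is hypothetical):

    class WaitSet:
        def destroy(self):
            # Explicitly free the underlying rcl wait set; safe to call twice.
            if self._wait_set is not None:
                _rclpy.rclpy_destroy_wait_set(self._wait_set)  # hypothetical binding name
                self._wait_set = None

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.destroy()

        def __del__(self):
            # Safety net in case the caller never called destroy()
            self.destroy()

so the executor can do `with _WaitSet() as wait_set: ...` and get deterministic cleanup.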

from rclpy.wait_set import WaitSet as _WaitSet


class GraphListenerSingleton:
Member:

How is this singleton being destroyed in case of e.g. shutting down rclpy? When is the thread being stopped?

Contributor Author:

In d10ee26 it checks once per second whether rclpy has been shut down.
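i.e., the listener thread wakes up at least once per second so it can notice shutdown, roughly (helper name hypothetical):

    def _runner(self):
        while rclpy.ok():
            # Wait at most one second so rclpy shutdown is noticed promptly
            self._wait_set.wait(1 * S_TO_NS)
            self._process_graph_events()  # hypothetical helper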

@sloretz sloretz force-pushed the rclpy_wait_for_service branch from 8b2008c to be25673 Compare October 24, 2017 16:36
@sloretz (Contributor Author) commented Nov 1, 2017:

Not merging yet because it appears to push the 1000Hz timer test from flaky to failing 100% of the time on aarch64. Even with --retest-until-pass 100 the timer test never passed:
http://ci.ros2.org/job/ci_linux-aarch64/658/

The reason for this test failure is probably performance in spin_once. Nothing stands out using cProfile on x86_64. The largest amount of time (besides rclpy_wait) is spent doing list comprehensions, so I'll try reducing those first.

~5% less overhead in wait_for_ready_callbacks()

# Process ready entities one node at a time
for node in nodes:
    for tmr in [t for t in timers if wait_set.is_ready(t.timer_pointer)]:
Member:

Shouldn't this list comprehension only consider the timers of the current node?

Same for subscriptions, clients and services below.

Contributor Author:

Fixed by 6352e86
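Conceptually, each comprehension becomes scoped to the node currently being processed, roughly (attribute names assumed):

    # Process ready entities one node at a time
    for node in nodes:
        for tmr in [t for t in node.timers if wait_set.is_ready(t.timer_pointer)]:
            ...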

@sloretz (Contributor Author) commented Nov 1, 2017:

@dirk-thomas replaced comprehensions with list() 0dbf8c7

dirk-thomas previously approved these changes Nov 1, 2017
@sloretz (Contributor Author) commented Nov 1, 2017:

Improvements result in ~9% less overhead in wait_for_ready_callbacks on x86_64. I started another job to see if that's good enough to get the 1kHz timer test to pass on aarch64: http://ci.ros2.org/job/ci_linux-aarch64/661 http://ci.ros2.org/job/ci_linux-aarch64/663/

Edit: Still not fast enough :(

Build with 7ee5ead. Another try: http://ci.ros2.org/job/ci_linux-aarch64/668/

Edit: More stuff. I don't expect this to pass, but it will give me more info in the morning.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@sloretz sloretz added in progress Actively being worked on (Kanban column) and removed in review Waiting for review (Kanban column) labels Nov 2, 2017
@sloretz sloretz dismissed dirk-thomas’s stale review November 10, 2017 00:08

Changed since review. Executor code moved to #140

@dhood dhood mentioned this pull request Nov 27, 2017
@sloretz (Contributor Author) commented Jan 4, 2018:

A lot has been split off, changed, and merged since this PR was opened. I'll close it, expecting to open smaller PRs with these features in the future.

@sloretz sloretz closed this Jan 4, 2018
@sloretz sloretz removed the in progress Actively being worked on (Kanban column) label Jan 4, 2018
@sloretz sloretz deleted the rclpy_wait_for_service branch January 4, 2018 16:16