
Rclpy wait for service #127

Closed. Wants to merge 49 commits.

Conversation

@sloretz (Contributor) commented Oct 13, 2017:

This adds a blocking wait_for_service() to the client. It is implemented using a GraphListener which runs in its own thread, similar to rclcpp. This is needed to prevent wait_for_service() from deadlocking a single-threaded executor if someone calls it from a subscriber callback.

There is a small amount of refactoring in the executor for the purpose of reusing code in the graph listener. Also, after talking with @mikaelarguedas, negative timeouts now block forever instead of returning immediately.
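For context, a minimal usage sketch of what this enables (hypothetical node and service names; std_srvs.srv.Trigger is just a stand-in service type):

    import rclpy
    from std_srvs.srv import Trigger  # placeholder service type for illustration

    rclpy.init()
    node = rclpy.create_node('client_node')  # hypothetical node name
    client = node.create_client(Trigger, 'example_service')

    # Block up to 5 seconds waiting for a server to appear.
    # With this PR, a timeout of None (or any negative value) blocks forever.
    if client.wait_for_service(timeout_sec=5.0):
        print('service is available')
    else:
        print('timed out waiting for service')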

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

connects to #58

@sloretz sloretz added the in progress Actively being worked on (Kanban column) label Oct 13, 2017
@sloretz sloretz self-assigned this Oct 13, 2017
@sloretz sloretz added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Oct 13, 2017
@sloretz sloretz changed the title [WIP] Rclpy wait for service Rclpy wait for service Oct 13, 2017
@mikaelarguedas (Member) left a comment:

this looks pretty good. I had a few comments/questions inline, mostly nitpicks.
This could benefit from another set of eyes as well

"""
Block until the service is available.

:param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0
Member:

docstring should be updated to reflect that negative values wait forever

Contributor Author:

Fixed b68335b
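For reference, the updated wording presumably reads something like this (hypothetical phrasing consistent with the new semantics):

    :param timeout_sec: Seconds to wait. Block forever if None or negative. Don't wait if 0.
    :type timeout_sec: float or None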

# Assumes lock is already held
if self._thread is None:
    self._thread = threading.Thread(target=self._runner)
    self._thread.daemon = True
Member:

Nit: you can "daemonize" a thread by passing the keyword argument daemon=True to the Thread constructor (same below)
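i.e., roughly:

    self._thread = threading.Thread(target=self._runner, daemon=True)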

Contributor Author:

Fixed ace6fe0


:param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0
:type timeout_sec: float or None
:rtype: bool true if the service is available
Member:

Nit:
Use returns to describe what the function returns and rtype to say what type is returned

:returns:  True if the service is available, otherwise False
:rtype: bool 

(Same for all functions docblocks)

Contributor Author (@sloretz, Oct 17, 2017):

Fixed e37969d

    del self._ready_pointers[subscription_pointer]
    self._needs_building = True

def add_subscriptions(self, subscriptions):
Member:

Nit: move this function next to add_subscription as it's done below for timers

Contributor Author:

Done 0285da1


def remove_subscription(self, subscription_pointer):
    del self._subscriptions[subscription_pointer]
    del self._ready_pointers[subscription_pointer]
Member:

Should this check that subscription_pointer is a key of _ready_pointers before trying to delete it? (Same in multiple places below.)
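For illustration, a membership-safe delete would use dict.pop with a default (though, per the follow-up below, it turns out not to be needed):

    # Remove the entry if present; no KeyError when the key is absent.
    self._ready_pointers.pop(subscription_pointer, None)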

Member:

Never mind, the key is always created when an entity is created.
Though now I'm not sure what ensures that the ready_pointers not updated in https://github.com/ros2/rclpy/pull/127/files#diff-c7dc6d55f68505f4de26646b4541e2cfR151 (the ones that are not ready) are set to False in the existing ready_pointers dict.

Contributor Author:

Oops, fixed + test updated to catch that in a8e54fc

@@ -0,0 +1,61 @@
# Copyright 2017 Open Source Robotics Foundation, Inc.
Member:

We will need to extend the timeout of the nose test, as several of these could take some time and the total test time will likely exceed 90 seconds in non-ideal cases.

Contributor Author:

What are the non-ideal cases? How long do they take? I upped it by 30 seconds (+4ish seconds per new test) 72e8ea9.

Member:

The worst-case scenario really depends on the nature of the test; e.g., if a test has a wait_for_service(5s), the worst case is that it times out and takes ~5 seconds. For reference, our service communication tests have a timeout of 30s. On my machine this takes between 90 and 100s, so I guess 120 is fine. We often revisit test timeouts later on, given that not all nodes have the same behavior. A good way to estimate the duration would be to run jobs on all platforms with this single test repeated multiple times.

@classmethod
def setUpClass(cls):
    rclpy.init()
    cls.node = rclpy.create_node('TestClient', namespace='/')
Member:

Question: Is the namespace necessary here? Or should an empty namespace expand to / automatically?

Contributor Author:

Not necessary. This was a copy/paste error b9d015f

Member:

Looks like the referenced commit didn't modify this line.

Contributor Author:

Oops. Second attempt: 0ef6a4d

self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(None))
self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(-1))
self.assertEqual(0, rclpy.utilities.timeout_sec_to_nsec(0))
self.assertEqual(1000000000, rclpy.utilities.timeout_sec_to_nsec(1))
Member:

Nit: Using the number in nanoseconds makes it harder to read. Could we use S_TO_NS in these tests?
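i.e., something like this, where S_TO_NS is the nanoseconds-per-second constant already used elsewhere in the package:

    self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(None))
    self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(-1))
    self.assertEqual(0, rclpy.utilities.timeout_sec_to_nsec(0))
    self.assertEqual(1 * S_TO_NS, rclpy.utilities.timeout_sec_to_nsec(1))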


# See the License for the specific language governing permissions and
# limitations under the License.

import time
Member:

Nit: rcl timers use steady time, so I think the tests comparing times should use the same time source.
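In pure Python that would mean measuring elapsed time with a monotonic clock, for example (a sketch, not necessarily what the fix used):

    import time

    start = time.monotonic()  # steady clock, immune to wall-clock jumps
    # ... spin until the timer fires ...
    elapsed = time.monotonic() - start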

Contributor Author:

Done c3ab486

self.assertFalse(ws.is_ready(gc_pointer))

_rclpy.rclpy_trigger_guard_condition(gc_handle)
# TODO(sloretz) why does the next assertion fail with wait(0)?
Member:

hmmm good question, would be good to find out why

Member:

I've seen that there are some additional things that happen in rmw_fastrtps_cpp's wait when the requested timeout is 0, in particular: https://github.com/ros2/rmw_fastrtps/blob/a6ac02905c55118fb1da636f85c8197dde8ef58e/rmw_fastrtps_cpp/src/rmw_wait.cpp#L236

When I was pulling up that link I realised that it looks to be checking hasTriggered on the guard condition, while getHasTriggered that was called at L229 will have cleared the flag already. So I suspect that rmw_wait is incorrectly returning that it timed out even though the guard condition had triggered.

Then as a flow-on effect, it looks like rcl_wait will clear the guard conditions if rmw_wait timed out: https://github.com/ros2/rcl/blob/1fa4acac8ca42f61bd0f80c7d3ee5fba3d196e33/rcl/src/rcl/wait.c#L621, so this combination of things might explain the difference in behaviour with wait(0).

Have you looked into it already @sloretz? Otherwise I'll continue the investigations

Contributor Author:

@dhood nice find! I hadn't looked into it

Member:

ros2/rmw_fastrtps#158 should let you change back to wait(0)

Contributor Author:

Works on my machine. Thanks @dhood 385b18b

"""
if timeout_sec is not None and timeout_sec < 0:
timeout_sec = None
# Wait for all work to complete
if timeout_sec is None or timeout_sec >= 0:
Member:

Isn't this always evaluating to True? (Given the previous 2 lines.)

Contributor Author:

Good catch 6c77285
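The simplified flow presumably collapses to a single normalization step, roughly:

    if timeout_sec is not None and timeout_sec < 0:
        timeout_sec = None  # negative timeouts mean "block forever"
    # timeout_sec is now either None or >= 0; no second range check is
    # needed before waiting for all work to complete.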

        return NULL;
    }
    if (!PyCapsule_IsValid(pynode, NULL) || !PyCapsule_IsValid(pyclient, NULL)) {
        PyErr_Format(PyExc_ValueError, "Expected two PyCapsule as arguments");
Member:

This raises ValueError on an invalid capsule, while rclpy_get_graph_guard_condition raises TypeError.

Contributor Author:

Now they both check if PyCapsule_GetPointer returns NULL, which means they raise ValueError 1628cba. Pull request #129 does the same thing. This will need to be updated to use capsule names in whichever PR lands second.

if timeout_sec is None:
    # Block forever
    return -1
elif timeout_sec > -1.0 / S_TO_NS and timeout_sec < 1.0 / S_TO_NS:
Member:

Could you clarify the reasoning behind the rounding here?

I think it's going to trip users up because of how the behaviour changes if you just slightly change timeout values around the edges of +/-1ns. E.g. if I'm calculating timeout_sec values through some algorithm of my own, I'd expect both a -1.1ns timeout and a -0.9ns timeout to be translated to indefinite wait. I would suggest just converting to nanoseconds and then checking if the result is negative.

For the +0.9ns case, it could be argued that this should cause a blocking wait of 1ns, but I don't think we need to be too picky about this since in theory the difference between waiting 0ns and 1ns shouldn't be too drastic (in practice, it seems from https://github.com/ros2/rclpy/pull/127/files#diff-5464c403a7a6a4bd5dc7f394ce03bfdbR48 that that's not the case right now). We should at least clarify in the docblock that it doesn't wait if the timeout is <1ns.

Contributor Author:

The goal was only to avoid == on a floating point number. The bound can be smaller (gtest EXPECT_DOUBLE_EQ uses 4 ULP). I chose 1ns because it's the smallest amount of time rcl_wait supports. How about changing it to 1 ULP on the negative side (block forever if timeout_sec < 0), and 0.5ns on the positive side (don't wait if 0 == round(timeout_sec * S_TO_NS))?

Contributor Author:

Not sure what I was thinking before, but in 8b2008c any negative number blocks, and a timeout less than 1ns doesn't block due to the integer cast.

Member (@dhood, Oct 19, 2017):

8b2008c matches what I'd expect now
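Putting the agreed semantics together, the conversion plausibly ends up like this (a sketch based on the thread, not the literal commit):

    S_TO_NS = 1000 * 1000 * 1000

    def timeout_sec_to_nsec(timeout_sec):
        if timeout_sec is None or timeout_sec < 0:
            return -1  # block forever
        # int() truncation maps timeouts below 1 ns to 0 (don't wait)
        return int(float(timeout_sec) * S_TO_NS)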

cli.wait_for_service()
cli.call(GetParameters.Request())

ws.wait(5 * S_TO_NS)
Member:

I don't think this will be long enough for Connext; we tend to use 20s in other service tests.

Member:

Given that in this case the client and the service server are part of the same participant, I would expect the discovery to be very quick and the 5 sec timeout to be enough.
Talking briefly with @dhood offline, it looks like the current value should be good enough. We would still benefit from a repeated job on this package before merging, to convince ourselves and catch potential flakiness before it hits the nightlies.

Contributor Author (@sloretz, Oct 19, 2017):

It doesn't appear to be flaky on linux http://ci.ros2.org/job/ci_linux/3361/console#console-section-164. Do the tests take longer on OSX or Windows? I can start a repeated job on them too.

import time
import unittest

from rcl_interfaces.srv import GetParameters
Member:

rclcpp defines mock interfaces to test this kind of thing rather than relying on the existing rcl_interfaces. It may be worth doing the same here to avoid confusion, as an outsider seeing this service being used would expect Parameters to be a thing that exists in rclpy.

Contributor Author:

Mind if this goes in a follow-up PR? test_callback_group.py, test_destruction.py, and test_node.py would also benefit from this.

timers_ready = _rclpy.rclpy_get_ready_entities('timer', wait_set)
clients_ready = _rclpy.rclpy_get_ready_entities('client', wait_set)
services_ready = _rclpy.rclpy_get_ready_entities('service', wait_set)
wait_set = _WaitSet()
Member:

This can't rely on __del__ being called to destroy the allocated C wait set; that might just never happen. Instead, the code should explicitly call a cleanup function when it doesn't need the wait_set anymore.

Contributor Author:

AFAIK an explicit method cannot be relied on unless it is invoked via a with statement or try/finally. I can add __enter__() and __exit__() methods.

I think __del__ is OK here unless the C structure itself is very big. The cases where __del__ won't be called look acceptable to me.

  1. __del__ never called:
    a. The interpreter is shutting down: the OS will free the memory anyway when the interpreter terminates.
    b. There is a chain of circular references, cyclic garbage collection is disabled, and one of the objects in the chain holds a reference to a WaitSet instance: it's not just the wait set; all the other objects in the chain are leaked too. The user must fix their circular-reference bug if they don't want memory leaks.
  2. __del__ not called for a long time:
    a. The user messed with garbage collector settings: they've got bigger problems. If this isn't getting collected then a lot of other stuff isn't either.
    b. Something keeps the reference count from hitting zero (e.g. an exception was raised, the WaitSet is referenced by a local variable in one of the stack frames in sys.last_traceback, and another exception hasn't been raised for a while): explicit method or not, the class instance and member variables will use memory until they are garbage collected (at which point __del__ will be called unless the interpreter is shutting down). An explicit method can only free the C structure sooner. I'm not sure how big the C structure is; if it's really big then it is worth freeing it early.

Member:

I am pretty sure that, without fiddling with any settings, the GC won't collect an instance "immediately" but might only do so after minutes. And if the user wants to ensure that the C structure is deleted, there needs to be a way to do so (without triggering the GC). Therefore I think it is better to expect that the calling code invokes a cleanup function explicitly. If the caller fails to do so, the __del__ method can then do it for them to be on the safe side.

Contributor Author:

Resolved by adding WaitSet.destroy() and having the executor use it via a with statement: d8521cd
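The resulting pattern presumably looks something like this (a sketch; destroy() and the context-manager methods come from the comment above, while the C binding name is hypothetical):

    class WaitSet:
        def destroy(self):
            # Explicitly free the underlying rcl wait set; safe to call twice.
            if self._wait_set is not None:
                _rclpy.rclpy_destroy_wait_set(self._wait_set)  # hypothetical binding name
                self._wait_set = None

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.destroy()

        def __del__(self):
            # Safety net in case the caller never called destroy()
            self.destroy()

so the executor can do `with _WaitSet() as wait_set: ...` and get deterministic cleanup.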

from rclpy.wait_set import WaitSet as _WaitSet


class GraphListenerSingleton:
Member:

How is this singleton being destroyed in case of e.g. shutting down rclpy? When is the thread being stopped?

Contributor Author:

In d10ee26 it checks once per second whether rclpy has been shut down.
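i.e., the listener thread wakes up at least once per second so it can notice shutdown, roughly (helper name hypothetical):

    def _runner(self):
        while rclpy.ok():
            # Wait at most one second so rclpy shutdown is noticed promptly
            self._wait_set.wait(1 * S_TO_NS)
            self._process_graph_events()  # hypothetical helper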

@sloretz sloretz force-pushed the rclpy_wait_for_service branch from 8b2008c to be25673 Compare October 24, 2017 16:36
@sloretz (Contributor Author) commented Nov 1, 2017:

Not merging yet because it appears to push the 1000Hz timer test from flaky to failing 100% of the time on aarch64. Even with --retest-until-pass 100 the timer test never passed:
http://ci.ros2.org/job/ci_linux-aarch64/658/

The reason for this test failure is probably performance in spin_once. Nothing stands out using cProfile on x86_64. The largest amount of time (besides rclpy_wait) is spent doing list comprehensions, so I'll try reducing those first.

~5% less overhead in wait_for_ready_callbacks()

# Process ready entities one node at a time
for node in nodes:
    for tmr in [t for t in timers if wait_set.is_ready(t.timer_pointer)]:
Member:

Shouldn't this list comprehension only consider the timers of the current node?

Same for subscriptions, clients and services below.

Contributor Author:

Fixed by 6352e86
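Conceptually, each comprehension becomes scoped to the node currently being processed, roughly (attribute names assumed):

    # Process ready entities one node at a time
    for node in nodes:
        for tmr in [t for t in node.timers if wait_set.is_ready(t.timer_pointer)]:
            ...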

@sloretz (Contributor Author) commented Nov 1, 2017:

@dirk-thomas replaced comprehensions with list() 0dbf8c7

dirk-thomas previously approved these changes Nov 1, 2017
@sloretz (Contributor Author) commented Nov 1, 2017:

Improvements result in ~9% less overhead in wait_for_ready_callbacks on x86_64. I started another job to see if that's good enough to get the 1kHz timer test to pass on aarch64: http://ci.ros2.org/job/ci_linux-aarch64/661 http://ci.ros2.org/job/ci_linux-aarch64/663/

Edit: Still not fast enough :(

Build with 7ee5ead. Another try: http://ci.ros2.org/job/ci_linux-aarch64/668/

Edit: More stuff. I don't expect this to pass, but it will give me more info in the morning.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@sloretz sloretz added in progress Actively being worked on (Kanban column) and removed in review Waiting for review (Kanban column) labels Nov 2, 2017
@sloretz sloretz dismissed dirk-thomas’s stale review November 10, 2017 00:08

Changed since review. Executor code moved to #140

@dhood dhood mentioned this pull request Nov 27, 2017
@sloretz (Contributor Author) commented Jan 4, 2018:

A lot has been split off, changed, and merged since this PR was opened. I'll close it, expecting to open smaller PRs with these features in the future.

@sloretz sloretz closed this Jan 4, 2018
@sloretz sloretz removed the in progress Actively being worked on (Kanban column) label Jan 4, 2018
@sloretz sloretz deleted the rclpy_wait_for_service branch January 4, 2018 16:16