Skip to content
This repository has been archived by the owner on Dec 9, 2021. It is now read-only.

Change AllowMulticast to spdp #34

Merged
merged 1 commit into from
Aug 12, 2019

Conversation

rotu
Copy link
Contributor

@rotu rotu commented Jun 27, 2019

In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second.

@rotu rotu force-pushed the disable_data_multicast branch from af57b2b to 5e21732 Compare June 27, 2019 18:19
@rotu
Copy link
Contributor Author

rotu commented Jun 29, 2019

Point of clarification: this change does not prevent multiple listeners from subscribing to the same topic - it just makes OpenSplice use a single multicast stream for each. This could result in more bandwidth uses in some cases, but reduces latency by a large multiple when WiFi is involved. https://tools.ietf.org/id/draft-mcbride-mboned-wifi-mcast-problem-statement-01.html

The downside to this change is an increase in bandwidth use over a wired network when you have a topic with lots of subscribers.

@wjwwood
Copy link
Member

wjwwood commented Jul 11, 2019

As requested in our triage meeting, this is the issue where we discussed (at length, sorry about that) multicast for data transmission with Fast-RTPS:

ros2/rmw_fastrtps#80

Specifically see this comment:

ros2/rmw_fastrtps#81 (comment)

Which seems to imply Fast-RTPS changed it's default behavior in a similar fashion to this issue.

@wjwwood wjwwood removed their assignment Jul 11, 2019
@clalancette
Copy link
Contributor

The change is clearly XML, and valid for Opensplice, so it is fine from that perspective. It also seems to match what Fast-RTPS is doing, if I understand correctly.

@cwyark , could you maybe comment here and tell us what the pros and cons of this change are?

@cwyark
Copy link

cwyark commented Jul 17, 2019

@clalancette @rotu opensplice_cmake_module/config/ros_ospl.xml is better to keep for the general case. e.g wired network or inter-process. For some special case, e.g wifi multicasting, it is suggested to create your own .xml, by specifying a bash environment:

OSPL_URI=file://<path-to-your-custom-xml>

For example:

$> export OSPL_URI=file:///opt/ospl/HDE/x86_64.darwin10_clang/etc/config/ospl.xml
$> ros2 run demo_nodes_cpp talker  <- opensplice in this process will apply the custom .xml

By doing this, you can flexible tune opensplice performance at development stage. Even you can have different *.xml for different ROS2 nodes for a better system wild optimization.

@clalancette
Copy link
Contributor

@clalancette @rotu opensplice_cmake_module/config/ros_ospl.xml is better to keep for the general case.

I agree that ros_ospl.xml should be for the general case, but I think that in today's world, the general case has to include wireless.

Can you explain the downsides of using spdp on an ethernet network?

@rotu
Copy link
Contributor Author

rotu commented Jul 17, 2019

It was a shock to me that, out of the box, OpenSplice was choking on as few as 50 messages per second on a network connection capable of 300 Mbps. My response as a user was "OpenSplice is defective" not "OpenSplice is poorly tuned". It makes sense for the out-of-the-box settings to be written for the first usage of OpenSplice. As it stands, multicast on is a premature optimization with big drawbacks. It also makes sense that there should be documentation somewhere for how to tune it for best performance, including advice to turn multicast back on where appropriate.

@cwyark
Copy link

cwyark commented Jul 24, 2019

@clalancette if changing from

<AllowMulticast>true</AllowMulticast>

to

<AllowMulticast>spdp</AllowMulticast>

messages transporting between DataReaders and DataWriters will turn into unicast, but reserve the multicast ability to discovery each other.
For example, there are 1 DataWriter and 3 DataReader, and 1 DataWriter want to publish 1 message to these 3 DataReader.

if set AllowMulticast=true

  • At discovery stage: use IGMP join multicast group to search DataReaders.
  • At message transporting stage: DataWriter only need to send 1 multicast udp packet to multicast group (e.g. 239.255.0.x). And the router will help to forward the packets to these DataReaders.

if set AllowMulticast=spdp

  • At discovery stage: use IGMP join multicast group to search DataReaders.
  • At message transporting stage: DataWriter need to send 3 unicast udp packets to the 3 DataReaders. If there are 50 DataReaders, DataWriter need to send 50 udp packets.

Since wired network (e.g. ethernet switch) has ability to perform a real multicast, AllowMulticast=true will help reduce the total bandwidth of the network.
But there has no real multicast on WiFi, every multicast packet is still broadcast , so too many multicast packets between wireless devices will cause broadcast storming. (too many collision detection).

In conclusion,
AllowMulticast=true

  • Pros: reduce the bandwidth at wire network.
  • Cons: broadcast storming at wireless network.
    AllowMulticast=spdp
  • Pros: to prevent from broadcast storming at wireless network.
  • Cons: bandwidth will increase linearly when the number of DataReaders increase both in wired and wireless network.

@clalancette
Copy link
Contributor

@clalancette if changing from

<AllowMulticast>true</AllowMulticast>

to

<AllowMulticast>spdp</AllowMulticast>

@cwyark Thank you, that is a great explanation of the differences between them.

Given that SPDP is already the default in Fast-RTPS, then I think we should do the same for Opensplice. I'm going to run CI on this change and merge assuming it passes.

@clalancette
Copy link
Contributor

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@clalancette
Copy link
Contributor

So, a bunch of the CI warnings are known issues. However, there are enough failing tests with Opensplice that makes me suspect this change. For instance, https://ci.ros2.org/job/ci_linux/7643/testReport/junit/projectroot/test_pendulum__rmw_opensplice_cpp/test_pendulum__rmw_opensplice_cpp/ is known to be flaky on aarch64 (see ros2/build_farmer#133), but here it failed on amd64 as well. macOS also has a bunch of failures in the opensplice tests which don't seem to be present in the nightly (https://ci.ros2.org/view/nightly/job/nightly_osx_debug/1321/#showFailuresLink).

Therefore, I don't think we can merge this as-is. @rotu Can you take a look locally and see if you can reproduce/fix the failures? Thanks.

@rotu
Copy link
Contributor Author

rotu commented Jul 24, 2019

@clalancette
I can't repro the issue. Here's the output on my machine:

$ uname -a
Linux rhea 4.15.0-55-generic #60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ colcon test --packages-select-regex ".*pendulum.*"
Starting >>> pendulum_msgs
Finished <<< pendulum_msgs [0.06s]
Starting >>> pendulum_control
Finished <<< pendulum_control [10.3s]            

Summary: 2 packages finished [10.8s]
$ colcon test-result --verbose
Summary: 51 tests, 0 errors, 0 failures, 4 skipped

It looks like the test is failing in its setup, before it even gets to RMW stuff:

[pendulum_demo-2] Couldn't set scheduling priority and policy: Operation not permitted
[pendulum_demo-2] mlockall failed: Cannot allocate memory
[pendulum_demo-2] Couldn't lock all cached virtual memory.
[pendulum_demo-2] Pagefaults from reading pages not yet mapped into RAM will be recorded.

Also, this is mathematically impossible, so I'm guessing the test is broken in other ways:

[pendulum_logger-1] Mean latency: 4458224516038787409289673575093816669551304146650037671236339690561781237058564279746291212376660112936812899278369000229134709514347391053183640763362416971774278560649182152566312699581486525144027387385646723574136033038253156096752904762288198647808.000000 ns
[pendulum_logger-1] Min latency: 1835355439 ns
[pendulum_logger-1] Max latency: 1047684450 ns

@rotu
Copy link
Contributor Author

rotu commented Jul 24, 2019

Identified one issue: ros2/rosidl#395

@rotu
Copy link
Contributor Author

rotu commented Jul 24, 2019

Alright I'm able to sometimes produce a failure. I think the issue is that rttest is not typesafe threadsafe. Trying to update the process priority might cover up some of these errors, but rttest shares statistics objects between threads without locking, and there is potential for non-atomic changes to it (I think that's the weirdness with mean latency).

@rotu
Copy link
Contributor Author

rotu commented Aug 5, 2019

@clalancette Are you still concerned about the CI status?

@clalancette
Copy link
Contributor

Here's a new CI on Linux amd64, just to see how things go. If this goes well, I'll do another CI run on macOS as well (I'm reluctant to do it at the moment since our macOS capacity is low):

  • Linux Build Status

In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second.

Signed-off-by: Dan Rose <dan@digilabs.io>
@rotu rotu force-pushed the disable_data_multicast branch from 5e21732 to e4a5ba8 Compare August 6, 2019 18:41
@rotu
Copy link
Contributor Author

rotu commented Aug 6, 2019

@clalancette needed changes from 7b46150
Rebased and pushed. Please rerun CI.

@clalancette
Copy link
Contributor

  • Linux Build Status

@rotu
Copy link
Contributor Author

rotu commented Aug 6, 2019

Blech. This same darn test failing. Does this tend to pass reliably before this PR?

@clalancette
Copy link
Contributor

Blech. This same darn test failing. Does this tend to pass reliably before this PR?

So this is the thing; looking at https://ci.ros2.org/view/nightly/job/nightly_linux_debug/ , I don't see that test failing for at least the last 10 nightlies (I stopped looking after that). So it does suggest to me that this PR is causing it, but I don't really know why.

@rotu
Copy link
Contributor Author

rotu commented Aug 6, 2019

I'm also seeing mathematically impossible results. Standard deviation can't be higher than the maximum absolution deviation (max-min).

[pendulum_demo-2] - Min: 21879 ns
[pendulum_demo-2] - Max: 18772095 ns
[pendulum_demo-2] - Mean: 2.56384e+06 ns
[pendulum_demo-2] - Standard deviation: 1.35819e+08

@clalancette
Copy link
Contributor

I'm also seeing mathematically impossible results. Standard deviation can't higher than the maximum absolution deviation (max-min).

Yeah, those are clearly bogus (probably uninitialized memory). I took a quick look, and it wasn't obvious that it was uninitialized, but I didn't have time today to really dig into it.

@clalancette
Copy link
Contributor

Yeah, those are clearly bogus (probably uninitialized memory). I took a quick look, and it wasn't obvious that it was uninitialized, but I didn't have time today to really dig into it.

All right. I think that the mathematically impossible results are because of two different bugs in the pendulum stuff.

  1. There's some potential for uninitialized memory. https://github.com/ros2/demos/blob/master/pendulum_control/src/pendulum_demo.cpp#L249 is getting called at some fixed frequency, but the very first time it calls set_rtt_results_message, there's no guarantee that the results structure has any valid data filled out yet. I'll (eventually) open a PR to initialize results to zero.
  2. There's a type issue between the {min,max}_latency as reported by realtime_support, and how we send it over the network. In https://github.com/ros2/realtime_support/blob/master/rttest/include/rttest/rttest.h#L45 , min and max latency are defined as integers. However, when we go to publish them during the pendulum demo, they get shoved into uint64 (https://github.com/ros2/demos/blob/master/pendulum_msgs/msg/RttestResults.msg#L8). Having latency be able to be negative is a bit weird, but seems to be intended by https://github.com/ros2/realtime_support/blob/master/rttest/src/rttest.cpp#L224 . I have another local change to switch that to an int64.

With both of those fixes in place, I now get to a place where I sometimes see min_latency as 2^31-1 and max_latency as -2^31 on the very first iteration, which should be consistent even if it is somewhat weird. We probably still need to rethink what is going on there, since that really shouldn't report anything unless there is something to report.

I'll continue to poke at the bugs in the pendulum test, but I highly doubt the fixes will have any impact on these test failures.

@clalancette
Copy link
Contributor

See ros2/realtime_support#81 and ros2/demos#385 for some fixes to the pendulum_control demos.

@rotu
Copy link
Contributor Author

rotu commented Aug 10, 2019

Also, if you know a way to bring up the entire test history for "projectroot.test_pendulum__rmw_opensplice_cpp.test_pendulum__rmw_opensplice_cpp", I'd like to know. I don't understand Jenkins' interface at all

@clalancette
Copy link
Contributor

Hm, interesting that it is failing on CI jobs, but not on the nightlies. Maybe it is pretty flaky on amd64 as well, and the nightlies are just lucky? I'm not sure.

Also, if you know a way to bring up the entire test history for "projectroot.test_pendulum__rmw_opensplice_cpp.test_pendulum__rmw_opensplice_cpp", I'd like to know. I don't understand Jenkins' interface at all

The Jenkins UI is a mystery to me too, so I doubt I'll be a help here.

That all being said, we can find out for sure whether my changes in the pendulum_control code makes a difference here. I'm going to launch CI with these changes plus those changes to see what happens.

@clalancette
Copy link
Contributor

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@rotu
Copy link
Contributor Author

rotu commented Aug 12, 2019

I’m pleasantly surprised it passed! Judging by the output, it looks like aarch64’s failures are also due to uninitialized memory.

@clalancette
Copy link
Contributor

I’m pleasantly surprised it passed! Judging by the output, it looks like aarch64’s failures are also due to uninitialized memory.

That....shouldn't have happened with my latest patches. Those patches make it so that no data is published until it has been initialized at least once. That suggests that there is another uninitialized piece of memory in there, but I didn't see any of them when I was looking. Also, I didn't see those problems when I ran CI on my other patches, but maybe it is random.

In any case, I think it is clear that this PR isn't causing the problem (maybe just exacerbating it). I'm going to merge this one, then get my other fixes reviewed and merged. There is probably additional follow-up work to do on it then.

@clalancette
Copy link
Contributor

clalancette commented Aug 13, 2019

I had a dumb bug in my patches to fix the uninitialized memory. I'm re-running CI there (which will obviously include this change), but I'm fairly sure that will fix the problem now.

rotu added a commit to RoverRobotics-forks/rosidl_typesupport_opensplice that referenced this pull request Aug 30, 2019
In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second.

Signed-off-by: Dan Rose <dan@digilabs.io>
@rotu rotu mentioned this pull request Aug 30, 2019
clalancette pushed a commit that referenced this pull request Sep 3, 2019
In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second.

Signed-off-by: Dan Rose <dan@digilabs.io>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants