Change AllowMulticast to spdp #34
Conversation
Force-pushed from af57b2b to 5e21732.
Point of clarification: this change does not prevent multiple listeners from subscribing to the same topic; it just makes OpenSplice use a separate unicast stream for each subscriber instead of a single multicast stream. This can mean more bandwidth use in some cases, but it reduces latency by a large multiple when WiFi is involved: https://tools.ietf.org/id/draft-mcbride-mboned-wifi-mcast-problem-statement-01.html The downside to this change is an increase in bandwidth use over a wired network when you have a topic with many subscribers.
As requested in our triage meeting, this is the issue where we discussed (at length, sorry about that) multicast for data transmission with Fast-RTPS. Specifically, see this comment: ros2/rmw_fastrtps#81 (comment), which seems to imply Fast-RTPS changed its default behavior in a fashion similar to this issue.
The change is valid XML for Opensplice, so it is fine from that perspective. It also seems to match what Fast-RTPS is doing, if I understand correctly. @cwyark, could you maybe comment here and tell us what the pros and cons of this change are?
@clalancette @rotu It is better to keep opensplice_cmake_module/config/ros_ospl.xml as-is for the general case, e.g. a wired network or inter-process communication. For some special cases, e.g. WiFi multicasting, it is suggested to create your own .xml and point OpenSplice at it with an environment variable: OSPL_URI=file://<path-to-your-custom-xml>. For example:

```
$> export OSPL_URI=file:///opt/ospl/HDE/x86_64.darwin10_clang/etc/config/ospl.xml
$> ros2 run demo_nodes_cpp talker   # OpenSplice in this process will apply the custom .xml
```

By doing this, you can flexibly tune OpenSplice performance at the development stage. You can even have a different *.xml for different ROS 2 nodes for better system-wide optimization.
I agree. Can you explain the downsides of using spdp on an ethernet network?
It was a shock to me that, out of the box, OpenSplice was choking on as few as 50 messages per second over a network connection capable of 300 Mbps. My response as a user was "OpenSplice is defective", not "OpenSplice is poorly tuned". It makes sense for the out-of-the-box settings to be written for a first-time user of OpenSplice. As it stands, having multicast on by default is a premature optimization with big drawbacks. It also makes sense that there should be documentation somewhere on how to tune it for best performance, including advice to turn multicast back on where appropriate.
@clalancette Changing from <AllowMulticast>true</AllowMulticast> to <AllowMulticast>spdp</AllowMulticast> means that messages transported between DataReaders and DataWriters become unicast, while the multicast capability is kept for discovering each other.

If you set AllowMulticast=true, both discovery traffic and user data are sent over multicast. If you set AllowMulticast=spdp, only participant discovery (SPDP) uses multicast, and user data is sent over unicast.

Since a wired network (e.g. an ethernet switch) has the ability to perform real multicast, AllowMulticast=true will help reduce the total bandwidth of the network. In conclusion, spdp is the safer choice where multicast of user data performs poorly (e.g. WiFi), while true performs better on wired networks with many subscribers.
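For reference, the configuration change under discussion looks roughly like this (a sketch only: the surrounding elements are abbreviated and the exact nesting is approximate; the actual file in this repo is opensplice_cmake_module/config/ros_ospl.xml):

```xml
<!-- Sketch of the change; surrounding elements abbreviated, nesting approximate -->
<DDSI2Service>
  <General>
    <!-- was: <AllowMulticast>true</AllowMulticast> -->
    <AllowMulticast>spdp</AllowMulticast>
  </General>
</DDSI2Service>
```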
@cwyark Thank you, that is a great explanation of the differences between them. Given that SPDP is already the default in Fast-RTPS, I think we should do the same for Opensplice. I'm going to run CI on this change and merge assuming it passes.
So, a bunch of the CI warnings are known issues. However, there are enough failing tests with Opensplice to make me suspect this change. For instance, https://ci.ros2.org/job/ci_linux/7643/testReport/junit/projectroot/test_pendulum__rmw_opensplice_cpp/test_pendulum__rmw_opensplice_cpp/ is known to be flaky on aarch64 (see ros2/build_farmer#133), but here it failed on amd64 as well. macOS also has a bunch of failures in the opensplice tests which don't seem to be present in the nightly (https://ci.ros2.org/view/nightly/job/nightly_osx_debug/1321/#showFailuresLink). Therefore, I don't think we can merge this as-is. @rotu Can you take a look locally and see if you can reproduce/fix the failures? Thanks.
@clalancette
It looks like the test is failing in its setup, before it even gets to RMW stuff:
Also, this is mathematically impossible, so I'm guessing the test is broken in other ways:
Identified one issue: ros2/rosidl#395
Alright I'm able to sometimes produce a failure. I think the issue is that rttest is not |
@clalancette Are you still concerned about the CI status? |
In my testing, topics with reliable durability across a wifi network can only sustain about 30-60 messages per second (on a Ubiquiti AC Pro Router). Disabling multicast allows 800-1000 messages per second. Signed-off-by: Dan Rose <dan@digilabs.io>
Force-pushed from 5e21732 to e4a5ba8.
@clalancette This needed changes from 7b46150.
Blech. This same darn test is failing. Did it tend to pass reliably before this PR?
So this is the thing: looking at https://ci.ros2.org/view/nightly/job/nightly_linux_debug/ , I don't see that test failing for at least the last 10 nightlies (I stopped looking after that). So it does suggest to me that this PR is causing it, but I don't really know why.
I'm also seeing mathematically impossible results. The standard deviation can't be higher than the maximum absolute deviation (max - min):

[pendulum_demo-2] - Min: 21879 ns
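The bound being violated here can be sanity-checked numerically. This is an illustrative sketch with hypothetical latency samples (not taken from the PR logs): for any data set, the standard deviation is at most the maximum absolute deviation from the mean, which in turn is at most the range (max - min), so a reported std dev above the range indicates a bug in the statistics code.

```python
# Hypothetical latency samples in nanoseconds (illustrative values only).
import statistics

samples = [21879, 30000, 45000, 25000, 38000]

rng = max(samples) - min(samples)
mean = statistics.mean(samples)
max_abs_dev = max(abs(x - mean) for x in samples)
std = statistics.pstdev(samples)  # population standard deviation

# std <= max |x - mean| <= (max - min) holds for any data set.
assert std <= max_abs_dev <= rng
print(f"std={std:.0f} ns, max deviation={max_abs_dev:.0f} ns, range={rng} ns")
```

If a benchmark reports a standard deviation larger than its own min/max spread, the accumulator is being corrupted (e.g. by uninitialized memory), which matches the suspicion later in this thread.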
Yeah, those are clearly bogus (probably uninitialized memory). I took a quick look, and it wasn't obvious that it was uninitialized, but I didn't have time today to really dig into it.
All right. I think that the mathematically impossible results are because of two different bugs in the pendulum stuff.
With both of those fixes in place, I now get to a place where I sometimes still see failures. I'll continue to poke at the bugs in the pendulum test, but I highly doubt the fixes will have any impact on these test failures.
See ros2/realtime_support#81 and ros2/demos#385 for some fixes to the pendulum_control demos.
Thank you for looking into that, @clalancette. I'm trying to compare this to the test results where this last succeeded, but it looks like it doesn't run every build and NEARLY EVERY RUN is failing, even before I touched this code. I think this test is faulty.
Also, if you know a way to bring up the entire test history for "projectroot.test_pendulum__rmw_opensplice_cpp.test_pendulum__rmw_opensplice_cpp", I'd like to know. I don't understand Jenkins' interface at all.
Hm, interesting that it is failing on CI jobs, but not on the nightlies. Maybe it is pretty flaky on amd64 as well, and the nightlies are just lucky? I'm not sure.
The Jenkins UI is a mystery to me too, so I doubt I'll be of help there. That all being said, we can find out for sure whether my changes in the pendulum_control code make a difference here. I'm going to launch CI with these changes plus those changes to see what happens.
I’m pleasantly surprised it passed! Judging by the output, it looks like aarch64’s failures are also due to uninitialized memory.
That... shouldn't have happened with my latest patches. Those patches make it so that no data is published until it has been initialized at least once. That suggests that there is another uninitialized piece of memory in there, but I didn't see any when I was looking. Also, I didn't see those problems when I ran CI on my other patches, but maybe it is random. In any case, I think it is clear that this PR isn't causing the problem (maybe just exacerbating it). I'm going to merge this one, then get my other fixes reviewed and merged. There is probably additional follow-up work to do on it then.
I had a dumb bug in my patches to fix the uninitialized memory. I'm re-running CI there (which will obviously include this change), but I'm fairly sure that will fix the problem now.