Fast-RTPS cross vendor tests are failing frequently on Windows #246
Comments
Before FastRTPS v1.7.0 and the rmw_fastrtps changes that support it were integrated, they were tested against your CI ( link ). As far as I know, your CI checks communication between Fast RTPS and Connext. Do you know what changes were made in rmw_fastrtps after merging v1.7.0? |
We are trying to help investigate what the problem could be. While analyzing your CI jobs we found something we don't understand. We don't know how your CI jobs work internally, and surely there is a reason, but why did job 1024 fail while job 1025 succeeded the next day? They seem to use the same configuration, and there are no significant changes in the involved repositories. Is there something we are missing, such as some configuration that changes between nightly jobs? Thanks. |
While looking at last night's jobs I saw that 1042 also has communication issues between Connext and Fast-RTPS. These failures are logged in ros2/build_farmer#153 and look to have first appeared in 1035.
It's not a complete explanation, but our CI defaults to re-running failed tests up to 10 times to see if they'll pass. This is a mitigation against tests that may flake due to network or other variable conditions, but it also muddies the waters when trying to pinpoint the exact start of issues that don't occur every time. That the tests sometimes fail and sometimes don't suggests the issue is not reliably reproduced, although it has certainly become more reliable to reproduce on Windows, and possibly in debug configurations on Linux as well. |
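To make the retest behaviour described above concrete, here is a minimal sketch of a retest-until-pass loop. It is not the actual CI implementation; the names `retest_until_pass` and `make_flaky` are hypothetical and exist only for illustration.

```python
from typing import Callable, Tuple


def retest_until_pass(test: Callable[[], bool], max_runs: int = 10) -> Tuple[bool, int]:
    """Run `test` until it passes or `max_runs` attempts have been used.

    Returns (passed, attempts_used). Hypothetical helper, not the CI's code.
    """
    for attempt in range(1, max_runs + 1):
        if test():
            return True, attempt
    return False, max_runs


def make_flaky(failures_before_pass: int) -> Callable[[], bool]:
    """Build a stand-in test that fails a fixed number of times, then passes."""
    calls = {"count": 0}

    def test() -> bool:
        calls["count"] += 1
        return calls["count"] > failures_before_pass

    return test


# A test that fails twice is reported as passing when up to 10 runs are allowed.
passed_with_10, attempts_with_10 = retest_until_pass(make_flaky(2), max_runs=10)

# A test that fails five times is reported as failing when only 3 runs are allowed.
passed_with_3, attempts_with_3 = retest_until_pass(make_flaky(5), max_runs=3)
```

Only the final result is reported, so an intermittent failure is hidden whenever any one of the allowed runs passes.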
As I have stated here, it seems Connext is now too restrictive on guidPrefix values. Could you check with eProsima/Fast-DDS#353? |
Thanks for linking that @MiguelCompany. I've triggered a build of our communication tests with the retest-until-pass setting reduced from 10 to 3. Edit: Added a run on Linux in the Debug configuration, where we have sometimes seen failures as well (see ros2/build_farmer#153). |
There is warning output during cross-communication reported in ros2/demos#293 (the Fast-RTPS <-> Connext case is toward the end of the description). It doesn't appear that communication was inhibited, so those warnings may or may not be related. |
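As a rough illustration of why lowering the retest limit makes intermittent failures more visible, the sketch below estimates how often a flaky test would still be reported as failed. It assumes each run fails independently with the same probability, which may not hold for network-related flakiness, and the 0.5 failure rate is an illustrative value, not a measurement from these jobs.

```python
def reported_failure_probability(p_fail: float, max_runs: int) -> float:
    """Probability that all `max_runs` attempts fail, assuming independent runs."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if max_runs < 1:
        raise ValueError("max_runs must be at least 1")
    return p_fail ** max_runs


# Illustrative per-run failure rate (an assumption, not measured data).
p = 0.5
with_10_runs = reported_failure_probability(p, 10)  # 0.5 ** 10
with_3_runs = reported_failure_probability(p, 3)    # 0.5 ** 3
```

Under these assumptions, a test that fails half the time is reported as failed in about 0.1% of jobs with 10 runs, but in 12.5% of jobs with 3 runs.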
@nuclearsandwich @wjwwood On the Linux debug build, I see that Connext is failing to initialize the rcl node on some tests where Fast-RTPS is not involved. For instance, this test says
|
@nuclearsandwich @wjwwood FYI, eProsima/Fast-DDS#353 has been merged on master. |
Cross-vendor tests between Fast-RTPS and OpenSplice are also having issues: ros2/system_tests#322 Recent example: https://ci.ros2.org/view/nightly/job/nightly_win_extra_rmw_rel/196/ |
During the lead-up to Crystal, I tested cross-vendor support on my Windows 10 Virtual Machine, and did not find any issues. In one of the hangouts, @wjwwood put forward the hypothesis that FastRTPS, OpenSplice, and Connext may be choosing different network interfaces in order to do discovery or connectivity, leading the nodes to fail to discover each other. This is only a hypothesis and has not been validated or researched in any way. |
We fixed an issue with sending multicast on a Windows machine that has several network interfaces, some of them disconnected. We think that fix should resolve this issue as well. Can you confirm? |
v1.7.1 incorporates eProsima/Fast-DDS#394. Can you verify your nightly job works as expected? Thanks |
I think this can be closed now that ros2/ros2#814 has been merged |
Bug report
Required Info:
Steps to reproduce issue
In terminal A:
In terminal B:
Expected behavior
They communicate and the listener receives data from the talker.
Actual behavior
Nothing is received by the listener.
Additional information
This occurs when swapping which is using Fast-RTPS/Connext (talker vs listener), and is resolved if you use either Fast-RTPS or Connext on both sides.
This is likely the root cause of new failures in our test_communication tests, which do cross-vendor testing.
Screenshots: