Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if async connect() is success before read() or wirte() in TransportTCP #1202

Closed
wants to merge 1 commit into from

Conversation

bxwllzz
Copy link
Contributor

@bxwllzz bxwllzz commented Oct 25, 2017

I found a bug in transport_tcp.cpp when I using ROS in Windows Subsystem for Linux (WSL).
TransportTCP::read() did not check if aync socket is conneted. If socket is asynchronous, it may fails at recv() if its slow to connect (e.g. happens with wireless).
TransportTCP::read() and TransportTCP::write() needs to check if its connected or not if it's asynchronous.

Detailed debug log

Test Environment:

  • PC 1 (Windows Subsystem for Linux, wired): RViz
  • PC 2 (Ubuntu 16.04, wireless): roscore and other nodes
...
[DEBUG] [1508756684.267337700]: Retrying connection to [pc2:60159] for topic [/kinect/rgb/image_color]
[DEBUG] [1508756684.387001100]: Resolved publisher host [pc2] to [192.168.123.211] for socket [24]
[DEBUG] [1508756684.387432900]: setsockopt failed to set SO_KEEPALIVE on socket [24] [pc2:60159 on socket 24]
[DEBUG] [1508756684.387574800]: setsockopt failed to set TCP_KEEPCNT on socket [24] [pc2:60159 on socket 24]
[DEBUG] [1508756684.388021100]: Adding tcp socket [24] to pollset
[DEBUG] [1508756684.388515700]: Async connect() in progress to [pc2:60159] on socket [24]
[DEBUG] [1508756684.388849400]: recv() on socket [24] failed with error [传输端点尚未连接] (ENGLISH TRANSLATION: The transmission endpoint is not connected yet)
[DEBUG] [1508756684.389211800]: TCP socket [24] closed
[DEBUG] [1508756684.389618800]: Connection::drop(0)
[DEBUG] [1508756684.389992600]: Connection to publisher [TCPROS connection on port 3678 to [pc2:60159 on socket 24]] to topic [/kinect/rgb/image_color] dropped
...

Fix

I add function is_async_connected() to io.h. Given a socket which is still async connecting, this function will check if the connection is success, failed or still processing.

And I add a private member async_connected_ to TransportTCP.
Before every read() or write() on async socket, if async_connected_ is false, it will call is_async_connected() to check it. If it's connected, async_connected_ is set to true.

Test

WSL and Ubuntu

I have tested this fix in my multi-machine project using the patched ros_comm (https://github.com/bxwllzz/ros_comm/tree/kinetic-fix-transport-tcp)

  • PC 1 (WSL, wireless): ROS Kinetic, RViz
  • PC 2 (Ubuntu 16.04, wireless): ROS Kinetic, roscore and other nodes

Only on Ubuntu

  • PC (Ubuntu 16.04): ROS Kinetic, roscore, RViz and other nodes

test_roscpp

I have also run unit_tests by running catkin_make run_tests in ros_comm package for both WSL and Ubuntu. catkin_test_results build/test_results/test_roscpp/ reports no failures.

Maybe more tests about is_async_connected() is needed for native Windows. Because select() in WinSock is a little different from that in Unix. I have read documents about WinSock connect() select() and Unix connect() seriously to implemente is_async_connected().

If other WSL users wants to try this fix for ROS Kinetic ahead, please clone my branch kinetic-fix-transport-tcp to your catkin workspace.

@mikepurvis
Copy link
Member

This looks sane, and hasn't broken anything for us, but the usual caveats apply around ABI compatibility.

@mikepurvis
Copy link
Member

mikepurvis commented Nov 15, 2017

Dropping this for now because it doesn't merge cleanly with #1217, which Clearpath needs to have in the lunar-edge branch.

@gavanderhoorn
Copy link
Contributor

@mikepurvis wrote:

[..] which we definitely need.

'we'?

@mikepurvis
Copy link
Member

mikepurvis commented Nov 15, 2017

Sorry— the hitlist-labeled issues are patched onto the xxxx-edge branch by a semi-automated process, and that's the branch from which Clearpath creates our regular CI builds. However, it's starting to cause some confusion, so we may look at options for managing this process in a way that's more outside the Github infrastructure. I've edited the comment above to be less opaque.

@randaz81
Copy link

Can you please consider merging this before "future-release"?
I think this is critical for a safe operation on WSL (Windows 10 Pro, Build 16299)
.

@dirk-thomas
Copy link
Member

Since the patch breaks ABI compatibility it is not straigh forward to merge this into already released distros. I created #1270 to find a consensus on the general question since I want to avoid this discussion on each PR.

@bxwllzz
Copy link
Contributor Author

bxwllzz commented Dec 19, 2017

@mikepurvis To avoid breaking ABI compatibility, the async_connected_ field could be merged into the flag_ field as below:

class TransportTCP
{
...
   enum Flags
   {
     SYNCHRONOUS = 1 << 0,
     ASYNC_CONNECTED = 1 << 1,
   };
...
}

But I think flags_ is not suitable for recording if it is connected.

@mikepurvis mikepurvis changed the base branch from lunar-devel to melodic-devel April 21, 2018 21:04
@gavanderhoorn
Copy link
Contributor

@dirk-thomas @mikepurvis: just curious: is this PR on the list to get merged into Melodic?

Without this change, ROS1 on WSL is almost unusable.

@dirk-thomas
Copy link
Member

@bxwllzz Please consider to resolve the conflicts / rebase this to target the latest state of the target branch in order to be considered for the upcoming Noetic distro.

@dirk-thomas
Copy link
Member

@bxwllzz Closing due to no further reply. Please feel free to open a new PR against noetic-devel.

@gavanderhoorn
Copy link
Contributor

@dirk-thomas: is there any way to get this merged in Noetic after the initial release?

This is still an issue with ROS in WSL which effectively makes it impossible to use ROS on WSL 1.

@dirk-thomas
Copy link
Member

Since it seems to break ABI it is very unlikely to get released into an already released distro (see my previous comment: #1202 (comment)).

Anyone could try to rebase this patch and create a new PR targeting noetic-devel and I am happy to review and merge it. I am just spread way to thin to do it myself atm. It would just need to happen in the next days since the release date for Noetic is May 23rd (and a few days before that we won't do new releases of a core package like this anymore).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants