
Allow reuse address when firing up the ros2cli daemon. #947

Merged: 3 commits into rolling from clalancette/fix-flaky-node-strategy on Nov 20, 2024

Conversation

@clalancette (Contributor)

Especially in tests where the daemon is being repeatedly created and destroyed, it can take some time for the kernel to actually allow the address to be rebound (even after the old process has exited). This can lead to situations where we fail to spawn the daemon.

To fix this, set "allow_reuse_address" inside the LocalXMLRPCServer, which will set SO_REUSEADDR on the socket.
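For reference, here is a minimal sketch of what that setting does in Python's standard socketserver machinery. This is illustrative, not the PR diff; LocalXMLRPCServer is just the class name from the description above.

```python
from xmlrpc.server import SimpleXMLRPCServer


class LocalXMLRPCServer(SimpleXMLRPCServer):
    # When this class attribute is True, socketserver's server_bind() calls
    # setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) before bind(), so a port still
    # sitting in TIME_WAIT from a previous daemon can be rebound immediately.
    allow_reuse_address = True
```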

@clalancette (Contributor Author)

Pulls: #947
Gist: https://gist.githubusercontent.com/clalancette/c79668ce9bddc836dcdeae6fb121cc2a/raw/2342a02ab486617cbac78c10af6ae3bacbacd367/ros2.repos
BUILD args:
TEST args:
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14831

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@clalancette (Contributor Author)

After discussing this with @cottsay a bit, we shouldn't do this on Windows; SO_REUSEADDR with multiple sockets bound to the same port is basically undefined behavior there. See https://learn.microsoft.com/en-us/windows/win32/winsock/using-so-reuseaddr-and-so-exclusiveaddruse. So I'm going to change this PR to only apply the fix on Linux. That means we may still have the flaky test on Windows, but at least we'll fix it on Linux. We'll have to think about how to handle Windows separately.
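Concretely, the platform guard could look something like this (a sketch of the intent, not the literal diff):

```python
import platform
from xmlrpc.server import SimpleXMLRPCServer


class LocalXMLRPCServer(SimpleXMLRPCServer):
    # SO_REUSEADDR has well-defined semantics on Linux/Unix (allow rebinding a
    # port stuck in TIME_WAIT), but on Windows it can allow binding over a port
    # that is actively in use, so only enable it off Windows.
    allow_reuse_address = platform.system() != 'Windows'
```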

@clalancette force-pushed the clalancette/fix-flaky-node-strategy branch from 7a3bb44 to 7cb8b24 on November 18, 2024 15:49
@clalancette (Contributor Author)

And 7cb8b24 fixes it to do just that.

@clalancette (Contributor Author)

Pulls: #947
Gist: https://gist.githubusercontent.com/clalancette/56d982aa8bd47ed2aab98c1537166534/raw/2342a02ab486617cbac78c10af6ae3bacbacd367/ros2.repos
BUILD args:
TEST args:
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14832

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@ahcorde previously approved these changes Nov 18, 2024
Especially in tests where the daemon is being repeatedly
created and destroyed, it can take some time for the kernel
to actually allow the address to be rebound (even after the
old process has exited).  This can lead to some situations
where we fail to spawn the daemon.

To fix this, set "allow_reuse_address" inside the LocalXMLRPCServer,
which will set SO_REUSEADDR on the socket.

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
@clalancette force-pushed the clalancette/fix-flaky-node-strategy branch from 7cb8b24 to e8f06ed on November 18, 2024 15:56
@clalancette (Contributor Author)

Hmpf. After looking at this further, @cottsay found that this is reverting previous work: #622. So I'm marking this as a draft for now.

@clalancette marked this pull request as a draft on November 18, 2024 16:04
@fujitatomoya (Collaborator) left a comment


IMO, it would make more sense to enable allow_reuse_address as this PR suggests.

Currently the socket is set up with SO_LINGER and a timeout of 0, which causes it to send a TCP RST (reset) upon closure and terminate the connection immediately, discarding any unsent data.

SO_REUSEADDR allows binding to a port that is already in use (e.g., in the TIME_WAIT state). It doesn't terminate the TIME_WAIT state but bypasses the restriction, enabling rebinding.
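For context, a small illustration (not code from this repo) of the linger setting being described; with l_onoff=1 and l_linger=0, close() aborts the connection with RST instead of a graceful FIN:

```python
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# struct linger { l_onoff = 1, l_linger = 0 } (two ints on Linux): on close(),
# send RST and drop any unsent data instead of the normal FIN/ACK shutdown.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
```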

@cottsay (Member) commented Nov 18, 2024

> SO_REUSEADDR allows binding to a port that is already in use (e.g., in the TIME_WAIT state). It doesn't terminate the TIME_WAIT state but bypasses the restriction, enabling rebinding.

From what I've read, this is only the case for Windows. The flag works slightly differently on BSD sockets. From man socket(7):

> For AF_INET sockets this means that a socket may bind, except when there is an active listening socket bound to the address.

@clalancette (Contributor Author)

All right, with all of the latest pushes here, I got things to work far more reliably on both Linux and Windows. I'm going to explain a bit about how we got here.

What's the real problem we are trying to solve?

The original problem here is that running the test_strategy.py test in the nightly repeated jobs "sometimes" fails. You can see that for example in https://ci.ros2.org/view/nightly/job/nightly_linux-rhel_repeated/2098/testReport/junit/ros2cli.ros2cli.test/test_strategy/test_with_daemon_spawn_2_3_/ .

Why is that test failing?

What happens when that test fails is pretty simple: the daemon fails to spawn correctly. Because we have no checking around this in NodeStrategy, it is basically a silent failure until we try to verify that we've connected in the tests. Looking at spawn_daemon:

def spawn_daemon(args, timeout=None, debug=False):
    """
    Spawn daemon node if it's not running.

    To avoid TOCTOU races, this function instantiates
    the XMLRPC server (binding the socket in the process)
    and transfers it to the daemon process through pipes
    (sending the inheritable socket with it). In a sense,
    the socket functionally behaves as a mutex.

    :param args: `DaemonNode` arguments namespace.
    :param timeout: optional duration, in seconds, to wait
      until the daemon node is ready. Non-positive
      durations will result in an indefinite wait.
    :param debug: if `True`, the daemon process will output
      to the current `stdout` and `stderr` streams.
    :return: `True` if the daemon was spawned,
      `False` if it was already running.
    :raises: if it fails to spawn the daemon.
    """
    # Acquire socket by instantiating XMLRPC server.
    try:
        server = daemon.make_xmlrpc_server()
        server.socket.set_inheritable(True)
    except socket.error as e:
        if e.errno == errno.EADDRINUSE:
            # Failed to acquire socket
            # Daemon already running
            return False
        raise
    # During tab completion on the ros2 tooling, we can get here and attempt to spawn a daemon.
    # In that scenario, there may be open file descriptors that can prevent us from successfully
    # daemonizing, and instead cause the terminal to hang. Here we mark all file handles except
    # for 0, 1, 2, and the server socket as non-inheritable, which will cause daemonize() to close
    # those file descriptors. See https://github.com/ros2/ros2cli/issues/851 for more details.
    if platform.system() != 'Windows':
        # Some unices have a high soft_limit; read fdsize if available.
        fdlimit = None
        try:
            string_to_find = 'FDSize:'
            with open('/proc/self/status', 'r') as f:
                for line in f:
                    if line.startswith(string_to_find):
                        fdlimit = int(line.removeprefix(string_to_find).strip())
                        break
        except (FileNotFoundError, ValueError):
            pass
        # The soft limit might be quite high on some systems.
        if fdlimit is None:
            import resource
            fdlimit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
        for i in range(3, fdlimit):
            try:
                if i != server.socket.fileno() and os.get_inheritable(i):
                    os.set_inheritable(i, False)
            except OSError:
                # Just in case the file handles might be [3(closed), ..., 8(pipe handle), ...]
                continue
    # Transfer XMLRPC server to daemon (and the socket with it).
    try:
        tags = {
            'name': 'ros2-daemon', 'ros_domain_id': get_ros_domain_id(),
            'rmw_implementation': rclpy.get_rmw_implementation_identifier()}
        daemonize(
            functools.partial(daemon.serve, server),
            tags=tags, timeout=timeout, debug=debug)
    finally:
        server.server_close()
    return True
I believe the only real reason the daemon can silently fail to spawn is if we get EADDRINUSE on the socket. In turn, that happens because a socket from an "old" daemon that was/is shutting down may still not have been cleaned up by the kernel. In a bit more detail, the kernel can leave old listening sockets in the TCP TIME_WAIT state for a while after the socket was closed. In that situation, trying to create a "new" daemon will fail with EADDRINUSE (at least by default). This also explains why we mostly see this in the tests; they are the only places spawning and killing daemons in a loop. For most normal users, this won't be the case.
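The race is easy to demonstrate outside of ros2cli. A minimal, self-contained repro sketch (the port number is made up for illustration and is not the daemon's):

```python
import socket

ADDR = ('127.0.0.1', 11511)  # hypothetical port, illustration only

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(ADDR)
srv.listen(1)
cli = socket.create_connection(ADDR)
conn, _ = srv.accept()
# Closing the server side of the connection first leaves that endpoint
# (same ip:port as the listener) in TIME_WAIT after the peer also closes.
conn.close()
cli.close()
srv.close()

srv2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Without SO_REUSEADDR this can raise EADDRINUSE until TIME_WAIT expires.
    srv2.bind(ADDR)
except OSError as e:
    print('rebind failed:', e)
```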

How did we get to where the current code is?

There have been a few attempts to fix flakiness in the ros2 daemon in the past. These include #620, #622, and #652. They all changed things in various ways, but the key PR was #652, which made spawning the daemon a reliable operation.

#622 changed the sockets to add SO_LINGER with a zero timeout. That improved the situation, but did not totally solve it. It also has its own downsides, as SO_LINGER doesn't gracefully terminate connections; it just sends RST on the socket and terminates the connection.

What's the solution?

The solution here is in 3 parts, though one of the parts is platform-dependent. This PR implements all 3 parts (a short sketch of parts 1 and 2 follows the list):

  1. When the daemon is exiting cleanly, it should explicitly shut down the socket that it was using for the XMLRPC server. That cleanly shuts down the socket and tells the kernel it can start the cleanup. On its own, this does not completely solve the problem, but it reduces the amount of time things hang around waiting for the Python interpreter and/or the kernel to implicitly clean up.
  2. We should not specify SO_LINGER on the daemon sockets. As mentioned above, this is actually something of an anti-pattern and does not properly terminate connections with FIN (it just sends RST).
  3. We should specify SO_REUSEADDR, but only on Unix. On Unix, SO_REUSEADDR essentially means "allow binding to an address/port that is in TCP TIME_WAIT (but not one that is otherwise in use)". This is exactly the behavior we want. On Windows, SO_REUSEADDR causes undefined behavior, as it can cause a socket to bind even if something else is bound already. Because of that, we set SO_REUSEADDR on Unix, but not Windows.
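A rough sketch of parts 1 and 2 together (part 3 is the conditional allow_reuse_address sketched earlier in this thread); this is illustrative of the intent, not the literal diff, and the serve() name is hypothetical:

```python
def serve(server):
    """Hypothetical daemon serve loop, for illustration only."""
    try:
        server.serve_forever()
    finally:
        # Part 1: explicitly close the listening socket on a clean exit so the
        # kernel can start its cleanup immediately instead of waiting for
        # interpreter/process teardown.
        # Part 2: because SO_LINGER is no longer set, this close performs a
        # normal FIN shutdown rather than aborting the connection with RST.
        server.server_close()
```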

Finally, while testing here I had to add one bugfix to make things reliable on Windows: also catch ConnectionResetError. That arises because we can attempt to "connect" to a daemon that is in the process of shutting down; in that case, we should also consider the daemon not "connected".
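A hedged illustration of that extra handling (the real check lives elsewhere in ros2cli and is shaped differently; the function name and the introspection call are just examples):

```python
import xmlrpc.client


def daemon_is_connected(proxy: xmlrpc.client.ServerProxy) -> bool:
    try:
        # Any lightweight RPC works; listMethods() assumes the server
        # registered introspection functions.
        proxy.system.listMethods()
        return True
    except (ConnectionRefusedError, ConnectionResetError):
        # A daemon that is mid-shutdown can reset the connection; treat that
        # the same as "not connected".
        return False
```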

With all of the fixes in this PR, I'm no longer able to reproduce the flakiness in the server on either Linux or Windows. Next up I'm going to run regular CI on this, as well as an equivalent of the nightly repeated jobs to make sure things are happy.

@clalancette marked this pull request as ready for review on November 19, 2024 18:35
@clalancette (Contributor Author)

Pulls: #947
Gist: https://gist.githubusercontent.com/clalancette/e73e1a2dd42b49fe06f374ccbb538d51/raw/2342a02ab486617cbac78c10af6ae3bacbacd367/ros2.repos
BUILD args:
TEST args:
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14838

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@clalancette (Contributor Author)

Here are the repeated jobs:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@fujitatomoya (Collaborator) left a comment


@clalancette good study, thanks for sharing 👍

@clalancette merged commit e998af3 into rolling on Nov 20, 2024
3 checks passed
@clalancette deleted the clalancette/fix-flaky-node-strategy branch on November 20, 2024 02:15