
Allow reuse address when firing up the ros2cli daemon. #947

Merged: 3 commits into rolling from clalancette/fix-flaky-node-strategy on Nov 20, 2024

Conversation

@clalancette (Contributor)

Especially in tests where the daemon is being repeatedly created and destroyed, it can take some time for the kernel to actually allow the address to be rebound (even after the old process has exited). This can lead to situations where we fail to spawn the daemon.

To fix this, set "allow_reuse_address" inside the LocalXMLRPCServer, which will set SO_REUSEADDR on the socket.
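For reference, here is a minimal sketch of what that setting does in Python's standard socketserver machinery. This is illustrative, not the PR diff; LocalXMLRPCServer is just the class name from the description above.

```python
from xmlrpc.server import SimpleXMLRPCServer


class LocalXMLRPCServer(SimpleXMLRPCServer):
    # When this class attribute is True, socketserver's server_bind() calls
    # setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) before bind(), so a port still
    # sitting in TIME_WAIT from a previous daemon can be rebound immediately.
    allow_reuse_address = True
```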

@clalancette (Contributor Author)

Pulls: #947
Gist: https://gist.githubusercontent.com/clalancette/c79668ce9bddc836dcdeae6fb121cc2a/raw/2342a02ab486617cbac78c10af6ae3bacbacd367/ros2.repos
BUILD args:
TEST args:
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14831

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@clalancette (Contributor Author)

After discussing this with @cottsay a bit, we shouldn't do this on Windows; SO_REUSEADDR with multiple sockets bound to the same port is basically undefined behavior there. See https://learn.microsoft.com/en-us/windows/win32/winsock/using-so-reuseaddr-and-so-exclusiveaddruse. So I'm going to change this PR to only apply the fix on Linux. That means we may still have the flaky test on Windows, but at least we'll fix it on Linux. We'll have to think about how to handle Windows separately.
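Concretely, the platform guard could look something like this (a sketch of the intent, not the literal diff):

```python
import platform
from xmlrpc.server import SimpleXMLRPCServer


class LocalXMLRPCServer(SimpleXMLRPCServer):
    # SO_REUSEADDR has well-defined semantics on Linux/Unix (allow rebinding a
    # port stuck in TIME_WAIT), but on Windows it can allow binding over a port
    # that is actively in use, so only enable it off Windows.
    allow_reuse_address = platform.system() != 'Windows'
```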

@clalancette force-pushed the clalancette/fix-flaky-node-strategy branch from 7a3bb44 to 7cb8b24 on November 18, 2024 15:49
@clalancette (Contributor Author)

And 7cb8b24 fixes it to do just that.

@clalancette (Contributor Author)

Pulls: #947
Gist: https://gist.githubusercontent.com/clalancette/56d982aa8bd47ed2aab98c1537166534/raw/2342a02ab486617cbac78c10af6ae3bacbacd367/ros2.repos
BUILD args:
TEST args:
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14832

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@ahcorde previously approved these changes Nov 18, 2024
Especially in tests where the daemon is being repeatedly
created and destroyed, it can take some time for the kernel
to actually allow the address to be rebound (even after the
old process has exited).  This can lead to some situations
where we fail to spawn the daemon.

To fix this, set "allow_reuse_address" inside the LocalXMLRPCServer,
which will set SO_REUSEADDR on the socket.

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
@clalancette force-pushed the clalancette/fix-flaky-node-strategy branch from 7cb8b24 to e8f06ed on November 18, 2024 15:56
@clalancette (Contributor Author)

Hmpf. After looking at this further, @cottsay found that this is reverting previous work: #622. So I'm marking this as a draft for now.

@clalancette marked this pull request as a draft on November 18, 2024 16:04
@fujitatomoya (Collaborator) left a comment


IMO, it would make more sense to enable allow_reuse_address as this PR suggests.

Currently the socket is set up with SO_LINGER and a timeout of 0, which causes it to send a TCP RST (reset) upon closure and terminate the connection immediately, discarding any unsent data.

SO_REUSEADDR allows binding to a port that is already in use (e.g., in the TIME_WAIT state). It doesn't terminate the TIME_WAIT state but bypasses the restriction, enabling rebinding.
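For context, a small illustration (not code from this repo) of the linger setting being described; with l_onoff=1 and l_linger=0, close() aborts the connection with RST instead of a graceful FIN:

```python
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# struct linger { l_onoff = 1, l_linger = 0 } (two ints on Linux): on close(),
# send RST and drop any unsent data instead of the normal FIN/ACK shutdown.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
```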

@cottsay (Member) commented Nov 18, 2024

> SO_REUSEADDR allows binding to a port that is already in use (e.g., in the TIME_WAIT state). It doesn't terminate the TIME_WAIT state but bypasses the restriction, enabling rebinding.

From what I've read, this is only the case for Windows. The flag works slightly differently on BSD sockets. From man socket(7):

> For AF_INET sockets this means that a socket may bind, except when there is an active listening socket bound to the address.

@clalancette (Contributor Author)

All right, with all of the latest pushes here, I got things to work far more reliably on both Linux and Windows. I'm going to explain a bit about how we got here.

What's the real problem we are trying to solve?

The original problem here is that running the test_strategy.py test in the nightly repeated jobs "sometimes" fails. You can see that for example in https://ci.ros2.org/view/nightly/job/nightly_linux-rhel_repeated/2098/testReport/junit/ros2cli.ros2cli.test/test_strategy/test_with_daemon_spawn_2_3_/ .

Why is that test failing?

What happens when that test fails is pretty simple: the daemon fails to spawn correctly. Because we have no checking around this in NodeStrategy, it is basically a silent failure until we try to verify that we've connected in the tests. Looking at spawn_daemon:

def spawn_daemon(args, timeout=None, debug=False):
    """
    Spawn daemon node if it's not running.

    To avoid TOCTOU races, this function instantiates
    the XMLRPC server (binding the socket in the process)
    and transfers it to the daemon process through pipes
    (sending the inheritable socket with it). In a sense,
    the socket functionally behaves as a mutex.

    :param args: `DaemonNode` arguments namespace.
    :param timeout: optional duration, in seconds, to wait
      until the daemon node is ready. Non-positive
      durations will result in an indefinite wait.
    :param debug: if `True`, the daemon process will output
      to the current `stdout` and `stderr` streams.
    :return: `True` if the daemon was spawned,
      `False` if it was already running.
    :raises: if it fails to spawn the daemon.
    """
    # Acquire socket by instantiating XMLRPC server.
    try:
        server = daemon.make_xmlrpc_server()
        server.socket.set_inheritable(True)
    except socket.error as e:
        if e.errno == errno.EADDRINUSE:
            # Failed to acquire socket
            # Daemon already running
            return False
        raise
    # During tab completion on the ros2 tooling, we can get here and attempt to spawn a daemon.
    # In that scenario, there may be open file descriptors that can prevent us from successfully
    # daemonizing, and instead cause the terminal to hang. Here we mark all file handles except
    # for 0, 1, 2, and the server socket as non-inheritable, which will cause daemonize() to close
    # those file descriptors. See https://github.com/ros2/ros2cli/issues/851 for more details.
    if platform.system() != 'Windows':
        # Some unices have a high soft_limit; read fdsize if available.
        fdlimit = None
        try:
            string_to_find = 'FDSize:'
            with open('/proc/self/status', 'r') as f:
                for line in f:
                    if line.startswith(string_to_find):
                        fdlimit = int(line.removeprefix(string_to_find).strip())
                        break
        except (FileNotFoundError, ValueError):
            pass
        # The soft limit might be quite high on some systems.
        if fdlimit is None:
            import resource
            fdlimit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
        for i in range(3, fdlimit):
            try:
                if i != server.socket.fileno() and os.get_inheritable(i):
                    os.set_inheritable(i, False)
            except OSError:
                # Just in case the file handles might be [3(closed), ..., 8(pipe handle), ...]
                continue
    # Transfer XMLRPC server to daemon (and the socket with it).
    try:
        tags = {
            'name': 'ros2-daemon', 'ros_domain_id': get_ros_domain_id(),
            'rmw_implementation': rclpy.get_rmw_implementation_identifier()}
        daemonize(
            functools.partial(daemon.serve, server),
            tags=tags, timeout=timeout, debug=debug)
    finally:
        server.server_close()
    return True
I believe the only real reason the daemon can silently fail to spawn is if we get EADDRINUSE on the socket. In turn, that happens because a socket from an "old" daemon that was/is shutting down may still not have been cleaned up by the kernel. In a bit more detail, the kernel can leave old listening sockets in the TCP TIME_WAIT state for a while after the socket was closed. In that situation, trying to create a "new" daemon will fail with EADDRINUSE (at least by default). This also explains why we mostly see this in the tests; they are the only places spawning and killing daemons in a loop. For most normal users, this won't be the case.
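The race is easy to demonstrate outside of ros2cli. A minimal, self-contained repro sketch (the port number is made up for illustration and is not the daemon's):

```python
import socket

ADDR = ('127.0.0.1', 11511)  # hypothetical port, illustration only

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(ADDR)
srv.listen(1)
cli = socket.create_connection(ADDR)
conn, _ = srv.accept()
# Closing the server side of the connection first leaves that endpoint
# (same ip:port as the listener) in TIME_WAIT after the peer also closes.
conn.close()
cli.close()
srv.close()

srv2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Without SO_REUSEADDR this can raise EADDRINUSE until TIME_WAIT expires.
    srv2.bind(ADDR)
except OSError as e:
    print('rebind failed:', e)
```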

How did we get to where the current code is?

There have been a few attempts to fix flakiness in the ros2 daemon in the past. These include #620, #622, and #652. They all changed things in various ways, but the key PR was #652, which made spawning the daemon a reliable operation.

#622 changed the sockets to add SO_LINGER with a zero timeout. That improved the situation, but did not totally solve it. It also has its own downsides, as SO_LINGER doesn't gracefully terminate connections; it just sends RST on the socket and terminates the connection.

What's the solution?

The solution here is in 3 parts, though one of the parts is platform-dependent. This PR implements all 3 parts (a short sketch of parts 1 and 2 follows the list):

  1. When the daemon is exiting cleanly, it should explicitly shut down the socket that it was using for the XMLRPC server. That cleanly shuts down the socket and tells the kernel it can start the cleanup. On its own, this does not completely solve the problem, but it reduces the amount of time things hang around waiting for the Python interpreter and/or the kernel to implicitly clean up.
  2. We should not specify SO_LINGER on the daemon sockets. As mentioned above, this is actually something of an anti-pattern and does not properly terminate connections with FIN (it just sends RST).
  3. We should specify SO_REUSEADDR, but only on Unix. On Unix, SO_REUSEADDR essentially means "allow binding to an address/port that is in TCP TIME_WAIT (but not one that is otherwise in use)". This is exactly the behavior we want. On Windows, SO_REUSEADDR causes undefined behavior, as it can cause a socket to bind even if something else is bound already. Because of that, we set SO_REUSEADDR on Unix, but not Windows.
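A rough sketch of parts 1 and 2 together (part 3 is the conditional allow_reuse_address sketched earlier in this thread); this is illustrative of the intent, not the literal diff, and the serve() name is hypothetical:

```python
def serve(server):
    """Hypothetical daemon serve loop, for illustration only."""
    try:
        server.serve_forever()
    finally:
        # Part 1: explicitly close the listening socket on a clean exit so the
        # kernel can start its cleanup immediately instead of waiting for
        # interpreter/process teardown.
        # Part 2: because SO_LINGER is no longer set, this close performs a
        # normal FIN shutdown rather than aborting the connection with RST.
        server.server_close()
```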

Finally, while testing here I had to add one bugfix to make things reliable on Windows: also catch ConnectionResetError. That arises because we can attempt to "connect" to a daemon that is in the process of shutting down; in that case, we should also consider the daemon not "connected".
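A hedged illustration of that extra handling (the real check lives elsewhere in ros2cli and is shaped differently; the function name and the introspection call are just examples):

```python
import xmlrpc.client


def daemon_is_connected(proxy: xmlrpc.client.ServerProxy) -> bool:
    try:
        # Any lightweight RPC works; listMethods() assumes the server
        # registered introspection functions.
        proxy.system.listMethods()
        return True
    except (ConnectionRefusedError, ConnectionResetError):
        # A daemon that is mid-shutdown can reset the connection; treat that
        # the same as "not connected".
        return False
```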

With all of the fixes in this PR, I'm no longer able to reproduce the flakiness in the server on either Linux or Windows. Next up I'm going to run regular CI on this, as well as an equivalent of the nightly repeated jobs to make sure things are happy.

@clalancette marked this pull request as ready for review on November 19, 2024 18:35
@clalancette (Contributor Author)

Pulls: #947
Gist: https://gist.githubusercontent.com/clalancette/e73e1a2dd42b49fe06f374ccbb538d51/raw/2342a02ab486617cbac78c10af6ae3bacbacd367/ros2.repos
BUILD args:
TEST args:
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14838

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@clalancette (Contributor Author)

Here are the repeated jobs:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@fujitatomoya (Collaborator) left a comment


@clalancette good study, thanks for sharing 👍

@clalancette merged commit e998af3 into rolling on Nov 20, 2024
3 checks passed
@clalancette deleted the clalancette/fix-flaky-node-strategy branch on November 20, 2024 02:15