-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow reuse address when firing up the ros2cli daemon. #947
Conversation
Pulls: #947 |
After discussing this with @cottsay a bit, we shouldn't do this on Windows; SO_REUSEADDR with multiple ports is basically UB. See https://learn.microsoft.com/en-us/windows/win32/winsock/using-so-reuseaddr-and-so-exclusiveaddruse . So I'm going to fix this to only make this change on Linux. It does mean that on Windows we still may have the flaky test, but at least we'll fix it on Linux. We'll have to think about how to do this on Windows separately. |
7a3bb44
to
7cb8b24
Compare
And 7cb8b24 fixes it to do just that. |
Pulls: #947 |
Especially in tests where the daemon is being repeatedly created and destroyed, it can take some time for the kernel to actually allow the address to be rebinded (even after the old process has exited). This can lead to some situations where we fail to spawn the daemon. To fix this, set "allow_reuse_address" inside the LocalXMLRPCServer, which will set SO_REUSADDR on the socket. Signed-off-by: Chris Lalancette <clalancette@gmail.com>
7cb8b24
to
e8f06ed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, it would make more sense to enable allow_reuse_address
as this PR suggested?
the socket is set with SO_LINGER
with timeout 0
, that means it causes the socket to send a TCP RST (Reset) upon closure and terminates the connection immediately, discarding any unsent data.
SO_REUSEADDR
allows binding to a port that is already in use (e.g., in the TIME_WAIT
state). It doesn't terminate the TIME_WAIT
state but bypasses the restriction, enabling rebinding.
From what I've read, this is only the case for Windows. The flag works slightly differently on BSD sockets. From man socket(7):
|
Signed-off-by: Chris Lalancette <clalancette@gmail.com>
Signed-off-by: Chris Lalancette <clalancette@gmail.com>
All right, with all of the latest pushes here, I got things to work far more reliably on both Linux and Windows. I'm going to explain a bit about how we got here. What's the real problem we are trying to solve?The original problem here is that running the Why is that test failing?What happens when that test fails is pretty simple; the daemon fails to spawn correctly. Because we have no checking around this in ros2cli/ros2cli/ros2cli/node/daemon.py Lines 104 to 176 in 8766c33
How did we get to where the current code?There have been a few attempts to fix flakiness in the ros2 daemon in the past. These include #620 , #622 , and #652 . These all changed things in various ways, but the key PR was #652, which made spawning the daemon a reliable operation. #622 made some changes to change the sockets to add SO_LINGER with a zero timeout. That improved, but did not totally solve the situation. It also has its own downsides, as SO_LINGER doesn't gracefully terminate connections and instead just sends RST on the socket and terminates it. What's the solution?The solution here is in 3 parts, though one of the parts is platform-dependent. This PR implements all 3 parts:
Finally, while testing here I had to add in one bugfix to make things reliable on Windows, which is to also catch With all of the fixes in this PR, I'm no longer able to reproduce the flakiness in the server on either Linux or Windows. Next up I'm going to run regular CI on this, as well as an equivalent of the nightly repeated jobs to make sure things are happy. |
Pulls: #947 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clalancette good study, thanks for sharing 👍
Especially in tests where the daemon is being repeatedly created and destroyed, it can take some time for the kernel to actually allow the address to be rebinded (even after the old process has exited). This can lead to some situations where we fail to spawn the daemon.
To fix this, set "allow_reuse_address" inside the LocalXMLRPCServer, which will set SO_REUSADDR on the socket.