Skip to content

Conversation

@Frauschi
Copy link
Contributor

@Frauschi Frauschi commented Aug 25, 2023

The first commit adds support for the SO_REUSEADDR option in order to allow multiple sockets to be bound to the same local IP:Port combination.
The second commit then fixes a bug in the connect() calls for UDP sockets. With that, remote peer IP address and port are correctly set and can be properly parsed for incoming messages to find the correct socket for the incoming message.

I'm pretty sure that the proposed fixes are not perfectly fitting and are rather intended for a discussion on how to implement the fix in the correct way. The implementation of the SO_REUSEADDR option certainly misses some places where the option needs to be taken care off. I'm also pretty sure, that there is a better fix for the UDP connect() issue of the second commit, but I'm not familiar enough with the code base to come up with a better solution at the moment.

However, I'm looking forward to your feedback on the proposed changes and fixes. I'm of course willing to update the PR based on your feedback.

Fixes #61884

@rlubos
Copy link
Contributor

rlubos commented Aug 25, 2023

@Frauschi The connect() for UDP sockets was just fixed in a slightly different way in #61843 (with tests added to cover this case).

@jukkar
Copy link
Member

jukkar commented Aug 25, 2023

I think we need to add some changes to the net_context_bind() too so that we check the address/ports before allowing bind to succeed.

There is actually quite nice summary how SO_REUSEADDR should work in StackOverflow so adding a link here just in case
https://stackoverflow.com/questions/14388706/how-do-so-reuseaddr-and-so-reuseport-differ

@Frauschi
Copy link
Contributor Author

@Frauschi The connect() for UDP sockets was just fixed in a slightly different way in #61843 (with tests added to cover this case).

This indeed fixes the bug, thank you. I reverted the commit with my proposed (less elegant) fix and rebased to the current main branch. Therefore, we are only talking about the SO_REUSEADDR topic going further.

Comment on lines 214 to 218
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this looks like we are doing this too early. We are now ignoring all the checks done below. Better would be to check the reuseaddr if we have found a duplicate entry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are definitely right. I moved to the end after all other checks in my last commit

@Frauschi
Copy link
Contributor Author

I think we need to add some changes to the net_context_bind() too so that we check the address/ports before allowing bind to succeed.

Integrated in the last commit. While checking net_context_bind(), I also found another minor bug: the wrong protocol (AF_INET) is passed to check_used_port() instead of the actual protocol from the context. I fixed this right away, as this bug caused all checks of check_used_port anyway.

@Frauschi
Copy link
Contributor Author

There is actually quite nice summary how SO_REUSEADDR should work in StackOverflow so adding a link here just in case
https://stackoverflow.com/questions/14388706/how-do-so-reuseaddr-and-so-reuseport-differ

If I understand the article correctly, we have now (if I haven't missed another place to check the option) implemented the desired SO_REUSEADDR behavior. For SO_REUSEPORT, we would have to add further checks, that all other contexts also have the option set in order to have a successful bind() call. Should we implement that, too, or is SO_REUSEADDR enough atm? (for my use case it is actually).

@jukkar
Copy link
Member

jukkar commented Aug 25, 2023

For SO_REUSEPORT, we would have to add further checks, that all other contexts also have the option set in order to have a successful bind() call. Should we implement that, too, or is SO_REUSEADDR enough atm? (for my use case it is actually).

It makes actually sense to implement SO_REUSEPORT too, this could be a separate PR or this one. It would be great if you could add support for it.

BTW, please implement also unit tests that will make sure the option works as expected. You could place the tests to tests/net/socket/misc for example.

@Frauschi Frauschi changed the title net: sockets: fix for UDP sockets net: sockets: add support for SO_REUSEADDR and SO_REUSEPORT Aug 31, 2023
@Frauschi
Copy link
Contributor Author

The new commit now also adds support for SO_REUSEPORT and improves the SO_REUSEADDR implementation. A lot more changes were necessary to properly implement both options and all possibilities. I followed the explanation from the StackOverflow article and added unit tests for all identified situations.
The implementation now completely (hopefully) follows the BSD implementation and tries to follow the additional Linux options as much as possible. However, there are currently three limitions and therefore differences to the way Linux handles these options:

  • SO_REUSEADDR and SO_REUSEPORT are not "the same" for client sockets, as we do not have a trivial way so identify a socket as "client" during binding.
  • No prevention of "port hijacking"
  • No support for the load balancing stuff for incoming packets/connections

I think these limitations are ok, as an implementation would drastically increase the complexity for these two options and would change many more places in the code. The first one also can easily be worked around (just properly use SO_REUSEPORT instead of SO_REUSEADDR). For the other two, I am not sure if Zephyr even needs these features anyway.

On my way through the code I also fixed some bugs I found. Feel free to ask questions if some of the changes are unclear or seem useless / unrelated at first.

@jukkar
Copy link
Member

jukkar commented Sep 1, 2023

Please check the compliance issue(s).

Also the commits look weird, there are multiple commits with same text. I suggest you create four commits in this PR,

  1. one that introduce SO_REUSEADDR
  2. tests for SO_REUSEADDR
  3. then one that introduces SO_REUSEPORT
  4. and tests for SO_REUSEPORT

@Frauschi
Copy link
Contributor Author

Frauschi commented Sep 1, 2023

Please check the compliance issue(s).

Also the commits look weird, there are multiple commits with same text. I suggest you create four commits in this PR,

  1. one that introduce SO_REUSEADDR
  2. tests for SO_REUSEADDR
  3. then one that introduces SO_REUSEPORT
  4. and tests for SO_REUSEPORT

I changed the commits and hopefully fixed the missing compliance issues. I'm looking forward to your feedback of the changes.

@Frauschi
Copy link
Contributor Author

Frauschi commented Sep 1, 2023

I think the failing tests are not related to this PR

@Frauschi Frauschi requested review from jukkar and rlubos September 1, 2023 12:34
@rlubos
Copy link
Contributor

rlubos commented Sep 1, 2023

It seems that all failures are related to Thrift - @cfriedt any ideas?

@cfriedt
Copy link
Member

cfriedt commented Sep 1, 2023

It seems that all failures are related to Thrift - @cfriedt any ideas?

I believe these are TLS issues. I think there is a bit of an ongoing effort here, but I've been out of the loop for a couple of weeks. Cc @d3zd3z

@cfriedt
Copy link
Member

cfriedt commented Sep 1, 2023

It seems that all failures are related to Thrift - @cfriedt any ideas?

I believe these are TLS issues. I think there is a bit of an ongoing effort here, but I've been out of the loop for a couple of weeks. Cc @d3zd3z

I can make a PR this morning to disable TLS Transport tests until the issue is resolved. @d3zd3z - do you have an issue open already?

@rlubos
Copy link
Contributor

rlubos commented Sep 1, 2023

Actually, in case of this PR, it seems that the root cause may be different. Thrift seems to be SO_REUSEADDR user, so it now needs to enable NET_CONTEXT_REUSEADDR, the following made the test pass for me locally:

diff --git a/tests/lib/thrift/ThriftTest/prj.conf b/tests/lib/thrift/ThriftTest/prj.conf
index 4d2b145b13..7625b849d0 100755
--- a/tests/lib/thrift/ThriftTest/prj.conf
+++ b/tests/lib/thrift/ThriftTest/prj.conf
@@ -13,6 +13,7 @@ CONFIG_NET_SOCKETS=y
 CONFIG_NET_SOCKETPAIR=y
 CONFIG_HEAP_MEM_POOL_SIZE=16384
 CONFIG_EVENTFD=y
+CONFIG_NET_CONTEXT_REUSEADDR=y
 
 CONFIG_THRIFT=y

Copy link
Member

@cfriedt cfriedt Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest merging this testsuite with the net/socket/ip/ app.

It can still be a separate testsuite inside of the net/socket/ip/ test app.

ZTEST(net_resuseaddr, test_foo) {
..
}
...
ZTEST_SUITE(net_resuseaddr, ...);

They can be in separate files as well, which makes things a bit cleaner, and can have separate setup / teardown / before / after as well, if that helps.

ZTEST(net_resuseport, test_bar) {
..
}
ZTEST_SUITE(net_resuseport, ...);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I can't find the testsuite app net/socket/ip. Can you please specifiy where to put the tests?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - on mobile - tests/net/tcp / tests/net/udp/

@cfriedt
Copy link
Member

cfriedt commented Sep 1, 2023

Actually, in case of this PR, it seems that the root cause may be different. Thrift seems to be SO_REUSEADDR user, so it now needs to enable NET_CONTEXT_REUSEADDR, the following made the test pass for me locally:

diff --git a/tests/lib/thrift/ThriftTest/prj.conf b/tests/lib/thrift/ThriftTest/prj.conf
index 4d2b145b13..7625b849d0 100755
--- a/tests/lib/thrift/ThriftTest/prj.conf
+++ b/tests/lib/thrift/ThriftTest/prj.conf
@@ -13,6 +13,7 @@ CONFIG_NET_SOCKETS=y
 CONFIG_NET_SOCKETPAIR=y
 CONFIG_HEAP_MEM_POOL_SIZE=16384
 CONFIG_EVENTFD=y
+CONFIG_NET_CONTEXT_REUSEADDR=y
 
 CONFIG_THRIFT=y

Seems unnecessary. I think we should just always support it reuseaddr / reuseport.

@jukkar
Copy link
Member

jukkar commented Sep 4, 2023

Seems unnecessary. I think we should just always support it reuseaddr / reuseport.

Atm we have almost all socket options behind a Kconfig option, so we could do the same here. We could mark reuseadd/reuseport as y by default. Anyone wanting to squeeze extra bytes out from memory could turn the options off if the options are not needed.

@Frauschi
Copy link
Contributor Author

Frauschi commented Sep 4, 2023

Seems unnecessary. I think we should just always support it reuseaddr / reuseport.

Atm we have almost all socket options behind a Kconfig option, so we could do the same here. We could mark reuseadd/reuseport as y by default. Anyone wanting to squeeze extra bytes out from memory could turn the options off if the options are not needed.

I did exactly that in my last commit. Enabling both by default if TCP or UDP are enabled. But the user can still disable them explicitly. I think this is a reasonable way to go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change the commit subject of "net: tests: add tests for SO_REUSEADDR" to
tests: net: sockets: add tests for SO_REUSEADDR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here were a bit difficult to read. Are we just swapping the remote/local address check branches here, if true, why do we need to do it.

@Frauschi
Copy link
Contributor Author

Frauschi commented Sep 5, 2023

As both Kconfig options are now enabled by default, I think the failing tests are not related. I also cannot reproduce the failing tests on my side.

Copy link
Member

@jukkar jukkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for putting effort to the tests, they are very easy to read and understand.

Comment on lines +511 to +512
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might cause some issues in the future if some timing changes etc. Anyway, we can cross that bridge when we get there.

@jukkar
Copy link
Member

jukkar commented Sep 6, 2023

The twister error is this one

INFO    - --/__w/zephyr/zephyr/twister-out/qemu_x86/tests/net/conn_mgr_monitor/net.conn_mgr_dad/handler.log---
ERROR   - SeaBIOS (version zephyr-v1.0.0-0-g31d4e0e-dirty-20200714_234759-fv-az50-zephyr)
Booting from ROM..
*** Booting Zephyr OS build zephyr-v3.4.0-3292-g6304095e9c0d ***
Running TESTSUITE conn_mgr_monitor
===================================================================
START - test_DAD
 PASS - test_DAD in 0.154 seconds
===================================================================
START - test_cycle_ready_CC
E: Double Fault
E: EAX: 0x00010af5, EBX: 0x00000000, ECX: 0x000f4240, EDX: 0x00000000
E: ESI: 0x00000046, EDI: 0x00000000, EBP: 0x0013e014, ESP: 0x0013dfec
E: EFLAGS: 0x00000012 CS: 0x0008 CR3: 0x0015b000
E: call trace:
E: EIP: 0x001001b7
E:      0x001124e8 (0x10af5)
E:      0x00118ec4 (0xa)
E:      0x00118ff9 (0x0)
E:      0x0010659e (0x13e088)
E:      0x00106830 (0x106830)
E:      0x0010c51d (0x1551f8)
E:      0x0010c630 (0x10c630)
E:      0x0010f5cb (0x0)
E: >>> ZEPHYR FATAL ERROR 2: Stack overflow on CPU 0
E: Current thread: 0x12b120 (net_mgmt)
E: Halting system

which indicates stack overflow in net_mgmt thread and which indeed does not look related to these changes. But we have modified common code paths in this PR so that unrelated functionality might be affected. I try to replicate the issue.

@jukkar
Copy link
Member

jukkar commented Sep 6, 2023

I could replicate this by west build -b qemu_x86 -d build/conn_mgr_monitor -t run zephyr/tests/net/conn_mgr_monitor

This fixed the conn_mgr_monitor stack overflow issue

diff --git a/tests/net/conn_mgr_monitor/prj.conf b/tests/net/conn_mgr_monitor/prj.conf
index 1e60d8327e..f68f3f09fd 100644
--- a/tests/net/conn_mgr_monitor/prj.conf
+++ b/tests/net/conn_mgr_monitor/prj.conf
@@ -23,6 +23,7 @@ CONFIG_ZTEST_NEW_API=y
 CONFIG_NET_IF_MAX_IPV4_COUNT=6
 CONFIG_NET_IF_MAX_IPV6_COUNT=6
 CONFIG_TEST_USERSPACE=y
+CONFIG_NET_MGMT_EVENT_STACK_SIZE=1024
 
 # Enable for help debugging this test suite:
 # CONFIG_NET_LOG=y

Could you add a commit to this PR that increases the net_mgmt stack size?

@jukkar
Copy link
Member

jukkar commented Sep 6, 2023

There is actually already a PR that fixes the conn_mgr_monitor at #62317 so never mind fixing it here. Lets merge that one first and this PR should work too.

@jukkar
Copy link
Member

jukkar commented Sep 6, 2023

@Frauschi the stack size is now fixed in mainline, could you rebase against latest main and re-submit

This commit adds support for the SO_REUSEADDR option to be enabled for
a socket using setsockopt(). With this option, it is possible to bind
multiple sockets to the same local IP address / port combination, when
one of the IP address is unspecified (ANY_ADDR).

The implementation strictly follows the BSD implementation and tries to
follow the Linux implementation as close as possible. However, there is
one limitation: for client sockets, the Linux implementation of
SO_REUSEADDR behaves exactly like the one for SO_REUSEPORT and enables
multiple sockets to have exactly the same specific IP address / port
combination. This behavior is not possible with this implementation, as
there is no trivial way to identify a socket to be a client socket
during the bind() call. For this behavior, one has to use the
SO_REUSEPORT option in Zephyr.

There is also a new Kconfig to control this feature similar to other
socket options: CONFIG_NET_CONTEXT_REUSEADDR. This option is enabled by
default if TCP or UDP are enabled. However, it can still be disabled
explicitly.

Signed-off-by: Tobias Frauenschläger <t.frauenschlaeger@me.com>
This commit adds unit tests for the SO_REUSEADDR socket option to test
the different possible scenarios for the option to work correctly.

There is also a fix for the TCP testsuite to make a now failing test
pass again due to a TIME_WAIT socket.

Signed-off-by: Tobias Frauenschläger <t.frauenschlaeger@me.com>
This commits adds support for the SO_REUSEPORT socket option.

The implementation follows the behavior of BSD and tries to also follow
the specific additional features of linux with the following
limitations:
* SO_REUSEADDR and SO_REUSEPORT are not "the same" for client sockets,
  as we do not have a trivial way so identify a socket as "client"
  during binding. To get the Linux behavior, one has to use SO_REUSEPORT
  with Zephyr
* No prevention of "port hijacking"
* No support for the load balancing stuff for incoming
  packets/connections

There is also a new Kconfig option to control this feature, which is
enabled by default if TCP or UDP is enabled.

Signed-off-by: Tobias Frauenschläger <t.frauenschlaeger@me.com>
This commit adds unit tests for the SO_REUSEPORT sockets option. It also
fixes some bugs in other tests triggered by the new features.

Signed-off-by: Tobias Frauenschläger <t.frauenschlaeger@me.com>
@jukkar
Copy link
Member

jukkar commented Sep 6, 2023

@cfriedt @ssharks if you have time, please take a look.

Copy link
Member

@cfriedt cfriedt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for now - It might be better to eventually use these for NET_TCP and NET_UDP as an alternative to he default y if ...

select NET_CONTECT_REUSEADDR
select NET_CONTECT_REUSEPORT

@jukkar jukkar added this to the v3.5.0 milestone Sep 15, 2023
Copy link
Contributor

@rlubos rlubos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the nice test rework.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

net: upd sockets: incorrect behavior for multiple clients on a single server

6 participants