-
Notifications
You must be signed in to change notification settings - Fork 539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cannot ping to link-local ipv6 interface address of the switch. #774
Cannot ping to link-local ipv6 interface address of the switch. #774
Conversation
Fixes: 1. Packets destined to the switch's routing interface link-local ipv6 address are not coming to CPU. Hence the ping fails. Since all interfaces have the same link-local ipv6 address, all we need is a single ip2me /128 route corresponding to this address added in the hardware. We don't need fe80 ip2me route added to hardware for every interface. Hence the address overlap issue won't arise for the link-local interface address. 2. Fixed another issue as part of this PR. Where the link-local ipv6 neighbors are not learned via netlink by neighsync. As a result, we could not add an ipv6 route via link-local nexthop. Allow neighsync to learn the link-local neighbors too. Signed-off-by: kiran.kella@broadcom.com
…rankella/sonic-swss into ipv6_link_locall_addr_support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to add a VS test to check the route configuration with the expected switch mac!
orchagent/routeorch.cpp
Outdated
{ | ||
SWSS_LOG_ENTER(); | ||
|
||
string ipPrefix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to use consistent naming convention for variables. ip_prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed the comment.
orchagent/routeorch.cpp
Outdated
return ipPrefix; | ||
} | ||
|
||
void RouteOrch::addLinkLocalRouteToMe(sai_object_id_t vrf_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about passing linklocal ipPrefix
also to this function so that if it is different for different interfaces/VRFs, the API would still work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed the comment.
orchagent/routeorch.h
Outdated
/* Length of the Interface Id value in EUI64 format */ | ||
#define EUI64_INTF_ID_LEN 8 | ||
|
||
#define IP6_ADDR_STR_LEN 40 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use INET6_ADDRSTRLEN
instead of defining new one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed the comment.
{ | ||
SWSS_LOG_ERROR("Failed to create link local ipv6 host route %s to cpu, rv:%d", | ||
linklocal_prefix.getIp().to_string().c_str(), status); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw runtime_error here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed the comment.
Sure. Can you please point me to any documentation on how to add VS tests? |
|
||
IpPrefix linklocal_prefix = getLinkLocalEui64Addr(); | ||
|
||
addLinkLocalRouteToMe(gVirtualRouterId, linklocal_prefix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also add the fe80::/10. Probably with addSubnetRoute after you check it does the right thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Why do we need the fe80::/10 route? Is it to handle all packets destined to any link-local address(es) belonging to us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nikos-github, I am just thinking what are the other cases where we might need packets with DIP matching fe80::/10 link-local subnet prefix to come to CPU. I don't think we would need any other link-local packets other than fe80.../128 to come to CPU. Am I missing something here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the following deployment scenario. Leafs are provisioned but the link local is unknown. For servers to work irrespective of the leaf they are attached to, they can be configured with a default gateway in the fe80 range - eg fe80::1. Traffic hitting the switch will be dropped since fe80::/10 is not in the hw.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Verified that the fe80::/10 subnet route to CPU handles both the scenarios that this PR is raised for (pinging to link-local address and link-local address as nexthop).
Now, we don't need the fe80.../128 route to CPU to be added to hardware since fe80::/10 covers that too. Retained in the code, the function that returns our own interface link-local address with eui64 interface id, for later reference.
@prsunny Since we are no longer adding the fe80.../128 now with this change, i am not currently adding any VS test for the same. Hope that is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the subnet and ip2me are trapped to the same queue which is currently 600pps. To me, this is expected as we have separate traps for those control packets destined to me, which leaves this trap for packets like ICMP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the control packets like BGP, LACP, LLDP... and the ARP,NDP packets are going to the same queue (4). And the ARP, NDP packets are rate limited to 600pps.
Since, subnet and ip2me are trapped to the same queue and are rate limited to 600 pps, currently there does not seem to be any difference in both the routes.
Anyways, we might change the queue mapping between the subnet and ip2me traps in future. So, it is good to have the /128 route too in the hardware for the link-local address (like in case of global interface addresses). @nikos-github @prsunny , do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine to have both for reasons of functionality in case the underlying SAI implementation changes on how packets are rate limited for the subnet and the host entries, and for consistency purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nikos-github Retained /128 ip2me route too. Thanks.
return ip_prefix; | ||
} | ||
|
||
void RouteOrch::addLinkLocalRouteToMe(sai_object_id_t vrf_id, IpPrefix linklocal_prefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is identical with addIp2MeRoute (I think it is), please use addIp2MeRoute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is identical to addIp2MeRoute() function but that is a member function of IntfsOrch class. and we would need an IntfsOrch object instance to call it. Or else, make addIp2MeRoute() function a public static function of IntfOrch class. Or using an unsafe reinterpret_cast<> of 'this' pointer from RouteOrch to IntfsOrch to access addIp2MeRoute().
@kirankella , generally we add VS test as part of the same PR. Yes the intention is to check if linklocal |
… ipv6 packets to CPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if user would like to configure link local address explicitly, how does the change in routeorch coordinate with the processing in intforch?
@@ -84,6 +87,9 @@ class RouteOrch : public Orch, public Subject | |||
bool addRoute(IpPrefix, IpAddresses); | |||
bool removeRoute(IpPrefix); | |||
|
|||
std::string getLinkLocalEui64Addr(void); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this function called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now called in routeorch initializing code. Please check the latest code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jipanyang Could you please review the latest changes?
@nikos-github @prsunny @jipanyang , this PR code change is to address the case where we have to push the default auto configured link-local address to hardware (since intfsyncd is no longer detecting the default link-local from the stack). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look ok to me!
…ubnet route. Signed-off-by: Kiran Kella <kiran.kella@broadcom.com>
I'm fine with the current changes. Ideally I would have made addIp2MeRoute public. |
@kirankella Looks like this logic can only work in scenarios where the link-local address is the same for all interfaces in the switch. If i understand your point correctly, this PR doesn't solve the problem for typical scenarios in which users may want to define link-local-addresses in a more human-friendly fashion. For doing that you need to associate ip-addresses with l3-interfaces; your current approach of having a single system-wide ip2me route doesn't offer any flexibility, as users would need to restart the box after making any link-local-address change. I believe the fix in PR/437 (which i'm currently re-testing after latest SAI), provides a more generic solution to this problem. |
|
@rodnymolina You are right on the point that this PR is not about addressing the manually configured link-local addresses (CLI). Since we don't have the intfsyncd now, we have to push to the hardware, the only automatically-derived-system-wide-link-local-address ip2me route (the default config case) at the initialization time. This PR addresses that (bootup default case for having the fe80 /128 and fe80 /10 CPU routes). |
@kirankella My point is that we shouldn't hard-code something as dynamic and subject to change as an ip-address. Think about the possibility of a user wanting to to replace (not add) the default link-local address of all the interfaces. Or think about someone interested in disabling link-local-forwarding altogether. Having a hard-coded ip2me entry in hardware doesn't sound right to me. |
@rodnymolina The link-local IP that Kiran is using is derived from the mac. We don't support to my knowledge changing the mac on the host i/f without restarting swss. The user can change the mac of all the front panel ports but that doesn't invalidate Kiran's changes in this PR. |
There is use case link local autoconf is disabled: net.ipv6.conf.all.autoconf=0, and user would config link local IP address explicitly. Capability should be provided to do the same for swss. That could be provided as a SWSS startup option or incremental config support. |
@kirankella , thanks for the PR. can you also develop a ansible test in sonic-mgmt repo, so that non-broadcom asic platform can test? It looks like you have only tested this on broadcom platform. |
@jipanyang What you are describing, is not currently supported in sonic and is certainly outside the scope of this PR. It can be discussed in a separate PR or feature request if it actually makes sense to follow that setting for the host interface. |
@lguohan Sure. I will work on it. |
If the user were to do that, then there's no reason for the link-local-address derived from the 'original' system-mac to exist. Then why would you want to bring traffic to CPU for a DIP that it's not existing in the system/kernel? You'd want that traffic to be dropped in hw, right? But there's a more fundamental issue i'm struggling with while looking at this PR: how do we come up with a consistent interface-overlap-prevention logic if we don't have a common 'nexus' between 'interfaces' in the system and the 'routes' associated to those interfaces? By creating this global ip2me route (which is tied to no interface), we are breaking that implicit association, and this would definitely contribute to a more complex overlap-prevention logic, which would now has slightly different flavors: one for ipv4 and another one for ipv6. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the open issues related to this link local route behavior, I'd suggest not to make it a default.
We may add an entry in DEVICE_METADATA to explicitly enable the download of fixed linklocal to me routes. orchagent.sh may read the setting and start accordingly.
Or we go further to dynamically enable/disable the fixed link local to me routes. By default it should be disabled, this will avoid conflict with any user configured link local ip.
You are thinking only of the front panel ports. There's the mgmt port as well. Also the absence of ip2me won't cause a packet drop because the fe80::/10 also goes down to hw. |
Your suggestion is a no-op in the presence of the fe80::/10. Not sure what your goal is here but it can be discussed off this PR. |
@jipanyang , @rodnymolina , what is the conclusion on this based on the latest comments? |
@prsunny My understanding is that putting link-local-address derived from the 'original' system-mac and the fe80 /10 CPU routes into asic by default is the not the desired behavior, we should have a knob (either an orchchagent startup option or more preferably dynamic enable/disable config support) to avoid conflict with explicit link local configuration. |
@prsunny My changes in PR/437 (which are about to be updated), are taking into account that each link-local address in the system has been configured over a specific interface. So i can't see how my overlap-prevention-logic can coexist with the one in this PR. Perhaps we could do something like what mentioned by @jipanyang above: have a global knob that enables either one approach or the other, but both cannot be enabled at the same time. |
@prsunny @jipanyang The desired behavior is for the ping to go through. Right now the hw is dropping the packets and that's not acceptable. It's also a mismatch with the linux kernel behavior. By default, the link local subnet should be in the hw. Since the behavior is the same today for subnet and host link-local addresses, the ip2me is not needed and was merely here for completeness. Also Rodny's changes are not needed either and he only needs to make sure that further link-local prefixes do not make it into the hw. The overlapping addresses is a different issue after that. |
* Cannot ping to link-local ipv6 interface address of the switch. Fixes: 1. Packets destined to the switch's routing interface link-local ipv6 address are not coming to CPU. Hence the ping fails. Since all interfaces have the same link-local ipv6 address, all we need is a single ip2me /128 route corresponding to this address added in the hardware. We don't need fe80 ip2me route added to hardware for every interface. Hence the address overlap issue won't arise for the link-local interface address. 2. Fixed another issue as part of this PR. Where the link-local ipv6 neighbors are not learned via netlink by neighsync. As a result, we could not add an ipv6 route via link-local nexthop. Allow neighsync to learn the link-local neighbors too. Signed-off-by: kiran.kella@broadcom.com * Incremental change to the code changes. * Incremental change to the code changes. * Incorporated review comments. * Incorporated review comments. * Add fe80::/10 route to CPU to forward all locally destined link-local ipv6 packets to CPU. * Retain fe80.../128 ip2me route in the hardware along with fe80::/10 subnet route. Signed-off-by: Kiran Kella <kiran.kella@broadcom.com>
Knob needed by mlnx when doing firmware update on some platforms, which can exceed default 1 min timeout.
What I did
Following fixes were done:
1. Packets destined to the switch's routing interface link-local ipv6 address
are not coming to CPU. Hence the ping fails.
Since all interfaces have the same link-local ipv6 address, all we need is
a single ip2me /128 route corresponding to this address added in the hardware.
We don't need fe80 ip2me route added to hardware for every interface. Hence the
address overlap issue won't arise for the link-local interface address.
2. Fixed another issue as part of this PR.
Where the link-local ipv6 neighbors are not learned via netlink by neighsync.
As a result, we could not add an ipv6 route via link-local nexthop.
Allow neighsync to learn the link-local neighbors too. So that IPv6 routes using the neighbors link-
local address as nexthop are added to hardware.
Why I did it
For the IPv6 communication to work over the link-local addresses between peers (mgmt or protocols), and for the IPv6 data plane forwarding to work over link-local nexthops.
How I verified it
Connect DUTs back to back.
Ping the link-local address on the neighbor DUT interfaces.
root@sonic:/home/admin# ping6 fe80::ce37:abff:fe17:6c9a -I Ethernet4
PING fe80::ce37:abff:fe17:6c9a(fe80::ce37:abff:fe17:6c9a) from fe80::ce37:abff:fe17:6ce4%Ethernet4 Ethernet4: 56 data bytes
64 bytes from fe80::ce37:abff:fe17:6c9a%Ethernet4: icmp_seq=1 ttl=64 time=0.475 ms
64 bytes from fe80::ce37:abff:fe17:6c9a%Ethernet4: icmp_seq=2 ttl=64 time=0.244 ms
Check the route table contents for the link-local route.
admin@sonic:~$ bcmcmd 'l3 ip6route show'
l3 ip6route show
Unit 0, Total Number of IPv6 entries: 6144 (IPv6/64 4096, IPv6/128 2048)
Max number of ECMP paths 64
Free IPv6 entries available: 6132 (IPv6/64 4087, IPv6/128 2045)
VRF Net addr Next Hop Mac INTF MODID PORT PRIO CLASS HIT VLAN
0 0 fe80:0000:0000:0000:ce37:abff:fe17:6ce4/128 00:00:00:00:00:00 100003 0 0 0 1 y
1 0 3031:0000:0000:0000:0000:0000:0000:0002/128 00:00:00:00:00:00 100003 0 0 0 1 y
2 0 3044:0000:0000:0000:0000:0000:0000:0001/128 00:00:00:00:00:00 100003 0 0 0 1 y
...
Check the route table contents for the IPv6 route using link-local nexthop
root@sonic:/home/admin# show ipv6 route
Codes: K - kernel route, C - connected, S - static, R - RIPng,
O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
> - selected route, * - FIB route
C>* 3031::/64 is directly connected, Vlan900, 00:43:35
B>* 3044::/64 [200/0] via fe80::ce37:abff:fe17:6ce4, Vlan900, 00:34:03
root@sonic:/home/admin# bcmcmd 'l3 ip6route show'
l3 ip6route show
Unit 0, Total Number of IPv6 entries: 6144 (IPv6/64 4096, IPv6/128 2048)
Max number of ECMP paths 64
Free IPv6 entries available: 6131 (IPv6/64 4086, IPv6/128 2045)
VRF Net addr Next Hop Mac INTF MODID PORT PRIO CLASS HIT VLAN
.....
2562 0 3044:0000:0000:0000:0000:0000:0000:0000/64 00:00:00:00:00:00 100006 0 0 0 0 n
......
Check that the ping to the route destination over the link-local nexthop works.
root@sonic:/home/admin# ping6 3044::1
PING 3044::1(3044::1) 56 data bytes
64 bytes from 3044::1: icmp_seq=1 ttl=64 time=0.260 ms
64 bytes from 3044::1: icmp_seq=2 ttl=64 time=0.341 ms
Details if related