Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot ping to link-local ipv6 interface address of the switch. #774

Merged
merged 8 commits into from
Sep 19, 2019

Conversation

kirankella
Copy link
Contributor

@kirankella kirankella commented Jan 30, 2019

What I did
Following fixes were done:
1. Packets destined to the switch's routing interface link-local ipv6 address
are not coming to CPU. Hence the ping fails.
Since all interfaces have the same link-local ipv6 address, all we need is
a single ip2me /128 route corresponding to this address added in the hardware.
We don't need fe80 ip2me route added to hardware for every interface. Hence the
address overlap issue won't arise for the link-local interface address.
2. Fixed another issue as part of this PR.
Where the link-local ipv6 neighbors are not learned via netlink by neighsync.
As a result, we could not add an ipv6 route via link-local nexthop.
Allow neighsync to learn the link-local neighbors too. So that IPv6 routes using the neighbors link-
local address as nexthop are added to hardware.

Why I did it
For the IPv6 communication to work over the link-local addresses between peers (mgmt or protocols), and for the IPv6 data plane forwarding to work over link-local nexthops.

How I verified it
Connect DUTs back to back.
Ping the link-local address on the neighbor DUT interfaces.

root@sonic:/home/admin# ping6 fe80::ce37:abff:fe17:6c9a -I Ethernet4
PING fe80::ce37:abff:fe17:6c9a(fe80::ce37:abff:fe17:6c9a) from fe80::ce37:abff:fe17:6ce4%Ethernet4 Ethernet4: 56 data bytes
64 bytes from fe80::ce37:abff:fe17:6c9a%Ethernet4: icmp_seq=1 ttl=64 time=0.475 ms
64 bytes from fe80::ce37:abff:fe17:6c9a%Ethernet4: icmp_seq=2 ttl=64 time=0.244 ms

Check the route table contents for the link-local route.

admin@sonic:~$ bcmcmd 'l3 ip6route show'
l3 ip6route show
Unit 0, Total Number of IPv6 entries: 6144 (IPv6/64 4096, IPv6/128 2048)
Max number of ECMP paths 64
Free IPv6 entries available: 6132 (IPv6/64 4087, IPv6/128 2045)
VRF Net addr Next Hop Mac INTF MODID PORT PRIO CLASS HIT VLAN
0 0 fe80:0000:0000:0000:ce37:abff:fe17:6ce4/128 00:00:00:00:00:00 100003 0 0 0 1 y
1 0 3031:0000:0000:0000:0000:0000:0000:0002/128 00:00:00:00:00:00 100003 0 0 0 1 y
2 0 3044:0000:0000:0000:0000:0000:0000:0001/128 00:00:00:00:00:00 100003 0 0 0 1 y
...

Check the route table contents for the IPv6 route using link-local nexthop

root@sonic:/home/admin# show ipv6 route
Codes: K - kernel route, C - connected, S - static, R - RIPng,
O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
> - selected route, * - FIB route
C>* 3031::/64 is directly connected, Vlan900, 00:43:35
B>* 3044::/64 [200/0] via fe80::ce37:abff:fe17:6ce4, Vlan900, 00:34:03

root@sonic:/home/admin# bcmcmd 'l3 ip6route show'
l3 ip6route show
Unit 0, Total Number of IPv6 entries: 6144 (IPv6/64 4096, IPv6/128 2048)
Max number of ECMP paths 64
Free IPv6 entries available: 6131 (IPv6/64 4086, IPv6/128 2045)
VRF Net addr Next Hop Mac INTF MODID PORT PRIO CLASS HIT VLAN
.....
2562 0 3044:0000:0000:0000:0000:0000:0000:0000/64 00:00:00:00:00:00 100006 0 0 0 0 n
......

Check that the ping to the route destination over the link-local nexthop works.

root@sonic:/home/admin# ping6 3044::1
PING 3044::1(3044::1) 56 data bytes
64 bytes from 3044::1: icmp_seq=1 ttl=64 time=0.260 ms
64 bytes from 3044::1: icmp_seq=2 ttl=64 time=0.341 ms

Details if related

Fixes:
       1. Packets destined to the switch's routing interface link-local ipv6 address
       are not coming to CPU. Hence the ping fails.
       Since all interfaces have the same link-local ipv6 address, all we need is
       a single ip2me /128 route corresponding to this address added in the hardware.
       We don't need fe80 ip2me route added to hardware for every interface. Hence the
       address overlap issue won't arise for the link-local interface address.

       2. Fixed another issue as part of this PR.
       Where the link-local ipv6 neighbors are not learned via netlink by neighsync.
       As a result, we could not add an ipv6 route via link-local nexthop.
       Allow neighsync to learn the link-local neighbors too.

Signed-off-by: kiran.kella@broadcom.com
Copy link
Collaborator

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to add a VS test to check the route configuration with the expected switch mac!

{
SWSS_LOG_ENTER();

string ipPrefix;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to use consistent naming convention for variables. ip_prefix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comment.

return ipPrefix;
}

void RouteOrch::addLinkLocalRouteToMe(sai_object_id_t vrf_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about passing linklocal ipPrefix also to this function so that if it is different for different interfaces/VRFs, the API would still work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comment.

/* Length of the Interface Id value in EUI64 format */
#define EUI64_INTF_ID_LEN 8

#define IP6_ADDR_STR_LEN 40
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use INET6_ADDRSTRLEN instead of defining new one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comment.

{
SWSS_LOG_ERROR("Failed to create link local ipv6 host route %s to cpu, rv:%d",
linklocal_prefix.getIp().to_string().c_str(), status);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw runtime_error here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comment.

@kirankella
Copy link
Contributor Author

Suggest to add a VS test to check the route configuration with the expected switch mac!

Sure. Can you please point me to any documentation on how to add VS tests?
So, do we generally add the VS tests in a separate PR?
And the test you are referring to here is to check that the fe80../128 route's matches the switch MAC right?


IpPrefix linklocal_prefix = getLinkLocalEui64Addr();

addLinkLocalRouteToMe(gVirtualRouterId, linklocal_prefix);
Copy link
Contributor

@nikos-github nikos-github Jan 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also add the fe80::/10. Probably with addSubnetRoute after you check it does the right thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Why do we need the fe80::/10 route? Is it to handle all packets destined to any link-local address(es) belonging to us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikos-github, I am just thinking what are the other cases where we might need packets with DIP matching fe80::/10 link-local subnet prefix to come to CPU. I don't think we would need any other link-local packets other than fe80.../128 to come to CPU. Am I missing something here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the following deployment scenario. Leafs are provisioned but the link local is unknown. For servers to work irrespective of the leaf they are attached to, they can be configured with a default gateway in the fe80 range - eg fe80::1. Traffic hitting the switch will be dropped since fe80::/10 is not in the hw.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Verified that the fe80::/10 subnet route to CPU handles both the scenarios that this PR is raised for (pinging to link-local address and link-local address as nexthop).
Now, we don't need the fe80.../128 route to CPU to be added to hardware since fe80::/10 covers that too. Retained in the code, the function that returns our own interface link-local address with eui64 interface id, for later reference.
@prsunny Since we are no longer adding the fe80.../128 now with this change, i am not currently adding any VS test for the same. Hope that is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.
@lguohan @prsunny, In SONiC, I see that, by default the CoPP policy traps applied for ip2me routes rate limits the matching packets to 600 pps. Is this expected?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the subnet and ip2me are trapped to the same queue which is currently 600pps. To me, this is expected as we have separate traps for those control packets destined to me, which leaves this trap for packets like ICMP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the control packets like BGP, LACP, LLDP... and the ARP,NDP packets are going to the same queue (4). And the ARP, NDP packets are rate limited to 600pps.
Since, subnet and ip2me are trapped to the same queue and are rate limited to 600 pps, currently there does not seem to be any difference in both the routes.
Anyways, we might change the queue mapping between the subnet and ip2me traps in future. So, it is good to have the /128 route too in the hardware for the link-local address (like in case of global interface addresses). @nikos-github @prsunny , do you agree?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine to have both for reasons of functionality in case the underlying SAI implementation changes on how packets are rate limited for the subnet and the host entries, and for consistency purposes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikos-github Retained /128 ip2me route too. Thanks.

return ip_prefix;
}

void RouteOrch::addLinkLocalRouteToMe(sai_object_id_t vrf_id, IpPrefix linklocal_prefix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is identical with addIp2MeRoute (I think it is), please use addIp2MeRoute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is identical to addIp2MeRoute() function but that is a member function of IntfsOrch class. and we would need an IntfsOrch object instance to call it. Or else, make addIp2MeRoute() function a public static function of IntfOrch class. Or using an unsafe reinterpret_cast<> of 'this' pointer from RouteOrch to IntfsOrch to access addIp2MeRoute().

@prsunny
Copy link
Collaborator

prsunny commented Jan 31, 2019

@kirankella , generally we add VS test as part of the same PR. Yes the intention is to check if linklocal fe80../128 matches the switchmac in ASIC DB.

Copy link
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if user would like to configure link local address explicitly, how does the change in routeorch coordinate with the processing in intforch?

@@ -84,6 +87,9 @@ class RouteOrch : public Orch, public Subject
bool addRoute(IpPrefix, IpAddresses);
bool removeRoute(IpPrefix);

std::string getLinkLocalEui64Addr(void);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this function called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now called in routeorch initializing code. Please check the latest code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jipanyang Could you please review the latest changes?

@kirankella
Copy link
Contributor Author

What if user would like to configure link local address explicitly, how does the change in routeorch coordinate with the processing in intforch?

@nikos-github @prsunny @jipanyang , this PR code change is to address the case where we have to push the default auto configured link-local address to hardware (since intfsyncd is no longer detecting the default link-local from the stack).
The manually configured link-local address (using 'config interface Ethernet4 ip add fe80::22:1/64 for eg..) still goes to the hardware via the regular path (config-db -> orchagent/intfsorch -> sai) as /128 ip2me host route and /64 subnet route (We already know about the /64 subnet route conflict that happens if we try to add more than 1 link-local /64 addresses manually. This is a different overlap issue).

Copy link
Collaborator

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look ok to me!

…ubnet route.

Signed-off-by: Kiran Kella <kiran.kella@broadcom.com>
@nikos-github
Copy link
Contributor

I'm fine with the current changes. Ideally I would have made addIp2MeRoute public.

@rodnymolina
Copy link
Contributor

@kirankella Looks like this logic can only work in scenarios where the link-local address is the same for all interfaces in the switch. If i understand your point correctly, this PR doesn't solve the problem for typical scenarios in which users may want to define link-local-addresses in a more human-friendly fashion. For doing that you need to associate ip-addresses with l3-interfaces; your current approach of having a single system-wide ip2me route doesn't offer any flexibility, as users would need to restart the box after making any link-local-address change. I believe the fix in PR/437 (which i'm currently re-testing after latest SAI), provides a more generic solution to this problem.

@kirankella
Copy link
Contributor Author

@kirankella Looks like this logic can only work in scenarios where the link-local address is the same for all interfaces in the switch. If i understand your point correctly, this PR doesn't solve the problem for typical scenarios in which users may want to define link-local-addresses in a more human-friendly fashion. For doing that you need to associate ip-addresses with l3-interfaces; your current approach of having a single system-wide ip2me route doesn't offer any flexibility, as users would need to restart the box after making any link-local-address change. I believe the fix in PR/437 (which i'm currently re-testing after latest SAI), provides a more generic solution to this problem.

@kirankella kirankella closed this Feb 12, 2019
@kirankella kirankella reopened this Feb 12, 2019
@kirankella
Copy link
Contributor Author

@kirankella Looks like this logic can only work in scenarios where the link-local address is the same for all interfaces in the switch. If i understand your point correctly, this PR doesn't solve the problem for typical scenarios in which users may want to define link-local-addresses in a more human-friendly fashion. For doing that you need to associate ip-addresses with l3-interfaces; your current approach of having a single system-wide ip2me route doesn't offer any flexibility, as users would need to restart the box after making any link-local-address change. I believe the fix in PR/437 (which i'm currently re-testing after latest SAI), provides a more generic solution to this problem.

@rodnymolina You are right on the point that this PR is not about addressing the manually configured link-local addresses (CLI). Since we don't have the intfsyncd now, we have to push to the hardware, the only automatically-derived-system-wide-link-local-address ip2me route (the default config case) at the initialization time. This PR addresses that (bootup default case for having the fe80 /128 and fe80 /10 CPU routes).
Having said that, we can have more than 1 link-local addresses assigned to the interfaces (just like the global IPv6 addresses). And assigning a new link-local address doesn't have to delete the automatically derived (MAC based by Linux) link-local address. Also, there can be cases in future where we have a VRRPv3 Virtual IPv6 Link-local address that is assigned to the VRRP Master routing interfaces (another case where we have multiple link-local addresses on the interface).

@rodnymolina
Copy link
Contributor

@kirankella My point is that we shouldn't hard-code something as dynamic and subject to change as an ip-address. Think about the possibility of a user wanting to to replace (not add) the default link-local address of all the interfaces. Or think about someone interested in disabling link-local-forwarding altogether. Having a hard-coded ip2me entry in hardware doesn't sound right to me.

@nikos-github
Copy link
Contributor

@rodnymolina The link-local IP that Kiran is using is derived from the mac. We don't support to my knowledge changing the mac on the host i/f without restarting swss. The user can change the mac of all the front panel ports but that doesn't invalidate Kiran's changes in this PR.

@jipanyang
Copy link
Contributor

There is use case link local autoconf is disabled: net.ipv6.conf.all.autoconf=0, and user would config link local IP address explicitly. Capability should be provided to do the same for swss. That could be provided as a SWSS startup option or incremental config support.

@lguohan
Copy link
Contributor

lguohan commented Feb 13, 2019

@kirankella , thanks for the PR. can you also develop a ansible test in sonic-mgmt repo, so that non-broadcom asic platform can test? It looks like you have only tested this on broadcom platform.

@nikos-github
Copy link
Contributor

There is use case link local autoconf is disabled: net.ipv6.conf.all.autoconf=0, and user would config link local IP address explicitly. Capability should be provided to do the same for swss. That could be provided as a SWSS startup option or incremental config support.

@jipanyang What you are describing, is not currently supported in sonic and is certainly outside the scope of this PR. It can be discussed in a separate PR or feature request if it actually makes sense to follow that setting for the host interface.

@kirankella
Copy link
Contributor Author

@kirankella , thanks for the PR. can you also develop a ansible test in sonic-mgmt repo, so that non-broadcom asic platform can test? It looks like you have only tested this on broadcom platform.

@lguohan Sure. I will work on it.

@rodnymolina
Copy link
Contributor

@rodnymolina The link-local IP that Kiran is using is derived from the mac. We don't support to my knowledge changing the mac on the host i/f without restarting swss. The user can change the mac of all the front panel ports but that doesn't invalidate Kiran's changes in this PR.

If the user were to do that, then there's no reason for the link-local-address derived from the 'original' system-mac to exist. Then why would you want to bring traffic to CPU for a DIP that it's not existing in the system/kernel? You'd want that traffic to be dropped in hw, right?

But there's a more fundamental issue i'm struggling with while looking at this PR: how do we come up with a consistent interface-overlap-prevention logic if we don't have a common 'nexus' between 'interfaces' in the system and the 'routes' associated to those interfaces? By creating this global ip2me route (which is tied to no interface), we are breaking that implicit association, and this would definitely contribute to a more complex overlap-prevention logic, which would now has slightly different flavors: one for ipv4 and another one for ipv6.

Copy link
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the open issues related to this link local route behavior, I'd suggest not to make it a default.

We may add an entry in DEVICE_METADATA to explicitly enable the download of fixed linklocal to me routes. orchagent.sh may read the setting and start accordingly.

Or we go further to dynamically enable/disable the fixed link local to me routes. By default it should be disabled, this will avoid conflict with any user configured link local ip.

@nikos-github
Copy link
Contributor

@rodnymolina The link-local IP that Kiran is using is derived from the mac. We don't support to my knowledge changing the mac on the host i/f without restarting swss. The user can change the mac of all the front panel ports but that doesn't invalidate Kiran's changes in this PR.

If the user were to do that, then there's no reason for the link-local-address derived from the 'original' system-mac to exist. Then why would you want to bring traffic to CPU for a DIP that it's not existing in the system/kernel? You'd want that traffic to be dropped in hw, right?

But there's a more fundamental issue i'm struggling with while looking at this PR: how do we come up with a consistent interface-overlap-prevention logic if we don't have a common 'nexus' between 'interfaces' in the system and the 'routes' associated to those interfaces? By creating this global ip2me route (which is tied to no interface), we are breaking that implicit association, and this would definitely contribute to a more complex overlap-prevention logic, which would now has slightly different flavors: one for ipv4 and another one for ipv6.

You are thinking only of the front panel ports. There's the mgmt port as well. Also the absence of ip2me won't cause a packet drop because the fe80::/10 also goes down to hw.

@nikos-github
Copy link
Contributor

With the open issues related to this link local route behavior, I'd suggest not to make it a default.

We may add an entry in DEVICE_METADATA to explicitly enable the download of fixed linklocal to me routes. orchagent.sh may read the setting and start accordingly.

Or we go further to dynamically enable/disable the fixed link local to me routes. By default it should be disabled, this will avoid conflict with any user configured link local ip.

Your suggestion is a no-op in the presence of the fe80::/10. Not sure what your goal is here but it can be discussed off this PR.

@prsunny
Copy link
Collaborator

prsunny commented Feb 19, 2019

@jipanyang , @rodnymolina , what is the conclusion on this based on the latest comments?

@jipanyang
Copy link
Contributor

@prsunny My understanding is that putting link-local-address derived from the 'original' system-mac and the fe80 /10 CPU routes into asic by default is the not the desired behavior, we should have a knob (either an orchchagent startup option or more preferably dynamic enable/disable config support) to avoid conflict with explicit link local configuration.

@rodnymolina
Copy link
Contributor

@prsunny My changes in PR/437 (which are about to be updated), are taking into account that each link-local address in the system has been configured over a specific interface. So i can't see how my overlap-prevention-logic can coexist with the one in this PR. Perhaps we could do something like what mentioned by @jipanyang above: have a global knob that enables either one approach or the other, but both cannot be enabled at the same time.

@nikos-github
Copy link
Contributor

nikos-github commented Feb 20, 2019

@prsunny @jipanyang The desired behavior is for the ping to go through. Right now the hw is dropping the packets and that's not acceptable. It's also a mismatch with the linux kernel behavior. By default, the link local subnet should be in the hw. Since the behavior is the same today for subnet and host link-local addresses, the ip2me is not needed and was merely here for completeness. Also Rodny's changes are not needed either and he only needs to make sure that further link-local prefixes do not make it into the hw. The overlapping addresses is a different issue after that.

@yxieca yxieca merged commit 12c29b4 into sonic-net:master Sep 19, 2019
yxieca pushed a commit that referenced this pull request Sep 19, 2019
* Cannot ping to link-local ipv6 interface address of the switch.

Fixes:
       1. Packets destined to the switch's routing interface link-local ipv6 address
       are not coming to CPU. Hence the ping fails.
       Since all interfaces have the same link-local ipv6 address, all we need is
       a single ip2me /128 route corresponding to this address added in the hardware.
       We don't need fe80 ip2me route added to hardware for every interface. Hence the
       address overlap issue won't arise for the link-local interface address.

       2. Fixed another issue as part of this PR.
       Where the link-local ipv6 neighbors are not learned via netlink by neighsync.
       As a result, we could not add an ipv6 route via link-local nexthop.
       Allow neighsync to learn the link-local neighbors too.

Signed-off-by: kiran.kella@broadcom.com

* Incremental change to the code changes.

* Incremental change to the code changes.

* Incorporated review comments.

* Incorporated review comments.

* Add fe80::/10 route to CPU to forward all locally destined link-local ipv6 packets to CPU.

* Retain fe80.../128 ip2me route in the hardware along with fe80::/10 subnet route.

Signed-off-by: Kiran Kella <kiran.kella@broadcom.com>
oleksandrivantsiv pushed a commit to oleksandrivantsiv/sonic-swss that referenced this pull request Mar 1, 2023
Knob needed by mlnx when doing firmware update on some platforms, which can exceed default 1 min timeout.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants