Dynamically learned neighbors are not deleted from ASIC_DB when eBGP interfaces are shutdown and the neighbors are flushed #12442

vganesan-nokia · 2022-10-18T17:02:52Z

Description

After the eBGP neighbor interfaces (ports and port channels) are shut down, issuing the "sonic-clear arp" or "sonic-clear ndp" command does not clear the dynamically learned eBGP neighbors from ASIC_DB. The orchagent syslogs show "Failed to remove still referenced neighbor". This is unexpected and incorrect: bringing down the eBGP interfaces brings down the eBGP neighbors, so all routes attached to those neighbors are withdrawn and no route should still reference them. Flushing the neighbors clears them from the Linux kernel, but they still exist in ASIC_DB. (Please see "Additional Information" below for the root cause of this issue.)

Steps to reproduce the issue:

The following is one scenario in which the problem can be reproduced more frequently:

Establish eBGP sessions and advertise the default route (IPv4 or IPv6).
Make sure that the default routes exist in the kernel, APPL_DB and ASIC_DB as expected.
Shut down interfaces such that all the eBGP neighbors that advertised the default route are down.
Dump the routes in the kernel and make sure that the default route has only one next hop, on "eth0".
Clear the dynamically learned neighbors using the "sonic-clear ndp" or "sonic-clear arp" command.
Dump the neighbors in the Linux kernel using the "ip neigh show" command.
Dump the neighbors from ASIC_DB.

Describe the results you received:

Dynamically learned eBGP neighbors are removed from the Linux kernel as expected.
But some dynamically learned neighbors are still in ASIC_DB.

Describe the results you expected:

All the dynamically learned neighbors must be cleared from linux kernel, APPL_DB and ASIC_DB when the neighbors are flushed after shutting down the eBGP neighbor interfaces.
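One way to check this expectation is to compare the kernel neighbor table against the neighbor entries left in ASIC_DB. The following is a minimal sketch, assuming the `ip neigh show` output and the list of ASIC_DB keys have already been collected as text. The helper names and the sample addresses and OIDs are hypothetical, and the key layout (a `SAI_OBJECT_TYPE_NEIGHBOR_ENTRY` prefix followed by a JSON object with an `ip` field) is an assumption that should be confirmed against the image under test.

```python
import json

# Assumed key prefix for neighbor entries in ASIC_DB.
ASIC_NEIGH_PREFIX = "ASIC_STATE:SAI_OBJECT_TYPE_NEIGHBOR_ENTRY:"


def kernel_neighbors(ip_neigh_output):
    """Return the set of neighbor IPs found in `ip neigh show` output."""
    ips = set()
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Skip blank lines and entries the kernel has not resolved.
        if not fields or fields[-1] in ("FAILED", "INCOMPLETE"):
            continue
        ips.add(fields[0])
    return ips


def asic_neighbors(asic_db_keys):
    """Return the set of neighbor IPs found in a list of ASIC_DB keys."""
    ips = set()
    for key in asic_db_keys:
        if key.startswith(ASIC_NEIGH_PREFIX):
            entry = json.loads(key[len(ASIC_NEIGH_PREFIX):])
            ips.add(entry["ip"])
    return ips


def stale_asic_neighbors(ip_neigh_output, asic_db_keys):
    """Neighbors still programmed in ASIC_DB but gone from the kernel."""
    return sorted(asic_neighbors(asic_db_keys) - kernel_neighbors(ip_neigh_output))


# Hypothetical sample data: the kernel was flushed, ASIC_DB was not.
KERNEL_AFTER_FLUSH = (
    "192.168.1.1 dev eth0 lladdr 00:aa:bb:cc:dd:01 REACHABLE\n"
    "10.0.0.5 dev Ethernet4 FAILED\n"
)
ASIC_KEYS = [
    ASIC_NEIGH_PREFIX + '{"ip":"10.0.0.1","rif":"oid:0x6000000000001","switch_id":"oid:0x21000000000000"}',
    ASIC_NEIGH_PREFIX + '{"ip":"fc00::2","rif":"oid:0x6000000000002","switch_id":"oid:0x21000000000000"}',
    'ASIC_STATE:SAI_OBJECT_TYPE_ROUTE_ENTRY:{"dest":"0.0.0.0/0","switch_id":"oid:0x21000000000000","vr":"oid:0x3000000000001"}',
]

print(stale_asic_neighbors(KERNEL_AFTER_FLUSH, ASIC_KEYS))
```

An empty list would correspond to the expected result; any address in the list is a neighbor that survived the flush in ASIC_DB.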

Output of `show version`:

(paste your output here)

Output of `show techsupport`:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

The root cause of this problem is that some routes learned from eBGP neighbors are not cleared from APPL_DB when the eBGP neighbors go down. The affected routes are those that also have a next hop on interface "eth0". This consistently occurs for the default routes, for both IPv4 and IPv6, in any asic instance. Since the route is not deleted from APPL_DB (though it is deleted from the kernel), the neighbor ref_count is not 0 and hence the neighbor is not deleted from ASIC_DB.

All asic instances include the default route with a next hop on eth0 (with a large metric to avoid colliding with default routes learned via eBGP). The APPL_DB has this route with only the eBGP next hops, because routesync filters out kernel route updates that have a next hop on eth0 or docker0. Consequently, the next hop on eth0 is not included in the route entry in APPL_DB and ASIC_DB. When all the eBGP neighbor interfaces are shut down, all eBGP neighbors go down and the default route learned from them is withdrawn. The next hop on eth0 then becomes the only next hop for the default route. When the kernel sends this update, the routesync filter mentioned above prevents the route update from being sent to APPL_DB. Hence the APPL_DB is left with a stale route entry containing whatever next hops it had before the eth0 next hop became the only next hop. Since this route still references those next hops, they are not cleared when the neighbors are flushed.
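The sequence described above can be illustrated with a toy model. This is a sketch of the described behavior only, not the actual routesync or orchagent code: the class, its method names, and the sample addresses are all hypothetical, and the reference counting is simplified to one count per next-hop IP.

```python
# Interfaces whose next hops are filtered, as named in the description.
MGMT_INTERFACES = {"eth0", "docker0"}


class RouteSyncModel:
    """Toy model of the filtering behavior; not the real implementation."""

    def __init__(self):
        self.appl_db = {}         # prefix -> list of (nexthop_ip, ifname)
        self.neigh_refcount = {}  # nexthop_ip -> number of routes using it

    def _unref(self, prefix):
        # Drop the old entry for this prefix and release its next hops.
        for ip, _ in self.appl_db.pop(prefix, []):
            self.neigh_refcount[ip] -= 1

    def on_kernel_route_update(self, prefix, nexthops):
        # The filter: an update with a next hop on eth0/docker0 is dropped,
        # so the previous APPL_DB entry (if any) is left untouched.
        if any(ifname in MGMT_INTERFACES for _, ifname in nexthops):
            return
        self._unref(prefix)
        self.appl_db[prefix] = list(nexthops)
        for ip, _ in nexthops:
            self.neigh_refcount[ip] = self.neigh_refcount.get(ip, 0) + 1

    def can_remove_neighbor(self, ip):
        # A neighbor that is still referenced by a route cannot be removed.
        return self.neigh_refcount.get(ip, 0) == 0


model = RouteSyncModel()
# Default route learned via two eBGP neighbors.
model.on_kernel_route_update(
    "0.0.0.0/0", [("10.0.0.1", "Ethernet0"), ("10.0.0.3", "Ethernet4")]
)
# eBGP neighbors go down; the kernel now reports only the eth0 next hop.
model.on_kernel_route_update("0.0.0.0/0", [("192.168.1.1", "eth0")])

print(model.appl_db)                          # stale entry is still present
print(model.can_remove_neighbor("10.0.0.1"))  # flush is refused
```

In this model the second update is silently dropped, so the stale route keeps both eBGP next hops referenced, which matches the "still referenced neighbor" symptom.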

A possible solution to both of these issues, namely (1) the stale route entry in APPL_DB and ASIC_DB and (2) the stale neighbor entry in ASIC_DB, is to delete the route from APPL_DB when a kernel update is received with the next hop on "eth0" as the only next hop.
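The proposed handling could look roughly like the sketch below. It is an illustration of the idea only, under the assumption that APPL_DB can be modeled as a dictionary from prefix to next hops; the function name and return values are hypothetical and do not correspond to the real routesync code.

```python
# Interfaces whose next hops are filtered, as named in the description.
MGMT_INTERFACES = {"eth0", "docker0"}


def apply_kernel_route_update(appl_db, prefix, nexthops):
    """Apply one kernel route update to a dict standing in for APPL_DB.

    appl_db maps prefix -> list of (nexthop_ip, ifname).
    Returns "set", "deleted" or "ignored" to show what was done.
    """
    only_mgmt = bool(nexthops) and all(
        ifname in MGMT_INTERFACES for _, ifname in nexthops
    )
    if only_mgmt:
        # Proposed change: delete any stale entry instead of just returning.
        if prefix in appl_db:
            del appl_db[prefix]
            return "deleted"
        return "ignored"
    appl_db[prefix] = list(nexthops)
    return "set"


appl_db = {}
print(apply_kernel_route_update(appl_db, "::/0", [("fc00::2", "Ethernet0")]))
print(apply_kernel_route_update(appl_db, "::/0", [("fe80::1", "eth0")]))
print(appl_db)
```

Once the stale route is gone, nothing references the eBGP next hops any more, so a later neighbor flush can remove them.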


rlhui · 2022-10-24T22:38:43Z

@vganesan-nokia - in which release was this found, and on what kind of platform (e.g. chassis or pizza box)?

arlakshm · 2022-11-19T00:26:33Z

@vganesan-nokia,
I tried to repro the issue with the following steps but did not see the issue:

Shut down all eBGP sessions.
Check if the default route is removed from APP_DB and ASIC_DB.
Shut down the port to flush out the neighbor entries.

After these steps I see the default route is removed from APP_DB and ASIC_DB.
The neighbor entry is removed from APP_DB and CHASSIS_APP_DB.

I am testing on the latest 202205 image. Can you confirm whether you still see the problem and, if you do, confirm the steps to reproduce?

Signed-off-by: vedganes <veda.ganesan@nokia.com> The changes are for fixing the stale neighbor in the ASIC_DB and data path when eBGP neighbors are shut down and neighbors are flushed. The problem is described in issue: sonic-net/sonic-buildimage#12442. The root cause of this issue is that the route is not deleted from the ASIC_DB when the route's next hop is on the eth0 or docker0 interface. The solution is to delete the route entry from ASIC_DB, instead of just returning, when the route's next hop is on the interface eth0 or docker0.

mannytaheri · 2022-12-12T20:43:14Z

How_to_reproduce_PR_12442.txt

abdosi · 2022-12-13T18:40:10Z

@prsunny for visibility. Hitting the same issue on another chassis platform for the IPv6 default route.

abdosi · 2022-12-13T18:46:31Z

@vganesan-nokia, it looks like the issue affects only the default route, since that is the only route I am seeing it for. If that is the case, can we update the issue title and description to say so (only default routes are impacted)?

abdosi · 2022-12-13T18:47:20Z

Hit while running test_default_route.py::test_default_route_with_bgp_flap

vganesan-nokia · 2022-12-13T19:14:39Z

> @vganesan-nokia, it looks like the issue affects only the default route, since that is the only route I am seeing it for. If that is the case, can we update the issue title and description to say so (only default routes are impacted)?

@abdosi, this issue can happen for any route learned from eBGP neighbors when those neighbors go down, if that route has an additional next hop on eth0/docker0.

abdosi · 2022-12-13T19:19:15Z

Wondering, is there any route over eth0/docker0 other than the default route?


* [routesync] Fix for stale dynamic neighbor. The changes are for fixing the stale neighbor in the ASIC_DB and data path when eBGP neighbors are shut down and neighbors are flushed. The problem is described in issue: sonic-net/sonic-buildimage#12442. The root cause of this issue is that the route is not deleted from the ASIC_DB when the route's next hop is on the eth0 or docker0 interface. The solution is to delete the route entry from ASIC_DB, instead of just returning, when the route's next hop is on the interface eth0 or docker0. This commit fixes the warm restart unit test failure. When a route whose only next hop is on eth0 or docker0 is removed, and the route is the default route, orchagent sends an "add" of a black hole route to syncd, so the ASIC_DB gets an hset message. When this happens during warm restart, the unit test identifies this as an unwanted setting and fails. To fix this issue, the route delete is sent only if a warm restart is not in progress. This follows the same warm restart handling approach used for route delete in other places. Signed-off-by: vedganes <veda.ganesan@nokia.com>
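The warm restart guard described in this commit message can be pictured with the short sketch below. It only illustrates the stated rule (send the delete unless a warm restart is in progress); the function name, its parameters, and the callback are hypothetical, and whatever the real code does with the route during a warm restart is not modeled here.

```python
def forward_mgmt_only_route_delete(prefix, warm_restart_in_progress, send_delete):
    """Send a route delete for a route whose only next hop is on eth0/docker0.

    send_delete is a callable standing in for the write towards APPL_DB.
    Returns True if the delete was sent, False if it was held back.
    """
    if warm_restart_in_progress:
        # Sending the delete now would lead to an unexpected hset in
        # ASIC_DB during warm restart, so nothing is sent in this sketch.
        return False
    send_delete(prefix)
    return True


sent = []
print(forward_mgmt_only_route_delete("0.0.0.0/0", False, sent.append))
print(forward_mgmt_only_route_delete("::/0", True, sent.append))
print(sent)
```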

vganesan-nokia · 2023-02-01T20:20:16Z

Fixed by PR sonic-net/sonic-swss#2553


yuxuehong · 2024-07-04T09:19:38Z

When a route has multiple next hops, none of which is on eth0, and one of the next hops becomes invalid, FRR has a feature that resolves that next hop via the default route. zebra then updates the route with multiple next hops, one of which is via eth0, and we cannot handle this situation.
So, to handle multiple next hops that include eth0, should we update the non-eth0 next hops to APPL_DB?
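The question above amounts to filtering individual next hops rather than whole route updates. A minimal sketch of that idea follows; it is an illustration of the suggestion only, not behavior confirmed anywhere in this thread, and the function names and sample addresses are hypothetical.

```python
# Interfaces whose next hops would be stripped from the update.
MGMT_INTERFACES = {"eth0", "docker0"}


def split_nexthops(nexthops):
    """Split (nexthop_ip, ifname) pairs into data-plane and management lists."""
    data = [nh for nh in nexthops if nh[1] not in MGMT_INTERFACES]
    mgmt = [nh for nh in nexthops if nh[1] in MGMT_INTERFACES]
    return data, mgmt


def route_action(nexthops):
    """Decide what to write for a route update with mixed next hops."""
    data, _ = split_nexthops(nexthops)
    if data:
        # Keep the route, programmed with the non-eth0 next hops only.
        return ("set", data)
    # Nothing but eth0/docker0 is left, so the route is removed.
    return ("del", [])


print(route_action([("10.0.0.1", "Ethernet0"), ("192.168.1.1", "eth0")]))
print(route_action([("192.168.1.1", "eth0")]))
```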

rlhui added the Chassis 🤖 Modular chassis support label Nov 5, 2022

rlhui assigned arlakshm Nov 11, 2022

rlhui added the P0 Priority of the issue label Nov 12, 2022

vganesan-nokia mentioned this issue Nov 29, 2022

[routesync] Fix for stale dynamic neighbor sonic-net/sonic-swss#2553

Merged

rlhui added the Triaged this issue has been triaged label Jan 11, 2023

vganesan-nokia closed this as completed Feb 1, 2023
