Dynamically learned neighbors are not deleted from ASIC_DB when eBGP interfaces are shutdown and the neighbors are flushed #12442
Comments
@vganesan-nokia - in which release was this found, and on what kind of platform (e.g. chassis or pizza box)?
@vganesan-nokia,
After these steps I see that the default route is removed from APPL_DB and ASIC_DB. I am testing on the latest 202205 image. Can you confirm whether you still see the problem and, if so, confirm the steps to reproduce?
Signed-off-by: vedganes <veda.ganesan@nokia.com>
The changes fix the stale neighbor in ASIC_DB and in the data path when eBGP neighbors are shut down and the neighbors are flushed. The problem is described in issue sonic-net/sonic-buildimage#12442. The root cause is that the route is not deleted from ASIC_DB when the route's next hop is on the eth0 or docker0 interface. The solution is to delete the route entry from ASIC_DB, instead of just returning, when the route's next hop is on eth0 or docker0.
@prsunny for viz. Hitting the same issue on another chassis platform for the IPv6 default route.
@vganesan-nokia it looks like the issue occurs only with the default route, because that is the only route I am seeing it for. If that is the case, can we update the issue title and description to reflect that (only default routes are impacted)?
Hit while running |
@abdosi, this issue can happen for any route learned from eBGP neighbors when those neighbors go down, if that route has an additional next hop on eth0/docker0.
Wondering whether there is any other route over |
* [routesync] Fix for stale dynamic neighbor

The changes fix the stale neighbor in ASIC_DB and in the data path when eBGP neighbors are shut down and the neighbors are flushed. The problem is described in issue sonic-net/sonic-buildimage#12442. The root cause is that the route is not deleted from ASIC_DB when the route's next hop is on the eth0 or docker0 interface. The solution is to delete the route entry from ASIC_DB, instead of just returning, when the route's next hop is on eth0 or docker0.

This commit also fixes the warm restart unit test failure. When a route whose only next hop is on eth0 or docker0 is removed, and that route is the default route, orchagent sends an "add" for a black hole route to syncd, so ASIC_DB receives an hset message. When this happens during warm restart, the unit test identifies it as an unwanted setting and fails. To fix this, the route delete is sent only if a warm restart is not in progress. This follows the same warm restart handling approach used for route delete in other places.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
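The warm restart guard described in this commit message can be sketched as follows. This is only an illustration under stated assumptions: `delete_mgmt_only_route`, the dict standing in for APPL_DB, and the `warm_restart_in_progress` flag are hypothetical names, not the routesync code.

```python
def delete_mgmt_only_route(appl_db, prefix, warm_restart_in_progress):
    """Delete a route whose only next hop is on eth0/docker0.

    appl_db is a plain dict standing in for APPL_DB (an assumption made
    for illustration). Returns True if a delete was issued.
    """
    if warm_restart_in_progress:
        # During warm restart no delete is sent, so ASIC_DB does not see
        # the black hole "add" that the unit test flags as unwanted.
        return False
    appl_db.pop(prefix, None)
    return True


appl_db = {"0.0.0.0/0": [("10.0.0.1", "Ethernet0")]}
delete_mgmt_only_route(appl_db, "0.0.0.0/0", warm_restart_in_progress=True)
# The entry is untouched while the warm restart is in progress.
delete_mgmt_only_route(appl_db, "0.0.0.0/0", warm_restart_in_progress=False)
# The entry is removed once no warm restart is in progress.
```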
Fixed by PR sonic-net/sonic-swss#2553
When a route has multiple next hops, none of which is on eth0, FRR has a feature whereby a next hop that becomes invalid is resolved via the default route. zebra then updates the route with multiple next hops, one of them via eth0, so we cannot handle this situation.
Description
After the eBGP neighbor interfaces (ports and port channels) are shut down and the "sonic-clear arp" or "sonic-clear ndp" command is issued, the dynamically learned eBGP neighbors are not cleared from ASIC_DB. The orchagent syslogs show "Failed to remove still referenced neighbor". This is unexpected and incorrect: bringing down the eBGP interfaces brings down the eBGP neighbors, so all routes attached to those neighbors are withdrawn and no route should still reference the neighbors. Flushing the neighbors clears them from the Linux kernel, but they still exist in ASIC_DB. (Please see "Additional information" below for the root cause of this issue.)
Steps to reproduce the issue:
The following is one scenario in which the problem can be reproduced more frequently:
Describe the results you received:
Describe the results you expected:
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):
The root cause of this problem is that some routes learned from eBGP neighbors are not cleared from APPL_DB when the eBGP neighbors go down. The affected routes are those that include a next hop on interface "eth0". This consistently occurs for the default routes, both IPv4 and IPv6, in any asic instance. Since the route is not deleted from APPL_DB (though it is deleted from the kernel), the neighbor ref_count is not 0 and hence the neighbor is not deleted from ASIC_DB.
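The reference-count dependency can be illustrated with a toy model. This is a minimal sketch, not orchagent code: the `NeighborTable` class and its methods are hypothetical names chosen for illustration.

```python
class NeighborTable:
    """Toy model: a neighbor can be removed only when no route references it."""

    def __init__(self):
        # neighbor IP -> number of routes using it as a next hop
        self.ref_count = {}

    def add_neighbor(self, ip):
        self.ref_count.setdefault(ip, 0)

    def add_route_ref(self, ip):
        self.ref_count[ip] += 1

    def remove_route_ref(self, ip):
        self.ref_count[ip] -= 1

    def remove_neighbor(self, ip):
        """Return True if removed, False if the neighbor is still referenced."""
        if self.ref_count.get(ip, 0) > 0:
            return False  # corresponds to "Failed to remove still referenced neighbor"
        self.ref_count.pop(ip, None)
        return True


table = NeighborTable()
table.add_neighbor("10.0.0.1")
table.add_route_ref("10.0.0.1")      # the stale default route still uses this neighbor
table.remove_neighbor("10.0.0.1")    # refused: ref_count is 1
table.remove_route_ref("10.0.0.1")   # the route is finally deleted
table.remove_neighbor("10.0.0.1")    # now succeeds
```

As long as the stale route keeps its reference, flushing the neighbor in the kernel has no effect on the modeled table.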
All asic instances include a default route with a next hop on eth0 (with a large metric to avoid colliding with default routes learned via eBGP). APPL_DB has this route with only the eBGP next hops because routesync filters out kernel route updates that have a next hop on eth0 or docker0. Consequently, the next hop on eth0 is not included in the route entry in APPL_DB and ASIC_DB. When all the eBGP neighbor interfaces are shut down, all eBGP neighbors go down and the default route learned from them is withdrawn. The next hop on eth0 then becomes the only next hop for the default route. When the kernel sends this update, the routesync filter mentioned above prevents it from reaching APPL_DB. Hence APPL_DB is left with a stale route entry holding whatever next hops it had before the eth0 next hop became the only next hop. Since this route still references those next hops, they are not cleared when the neighbors are flushed.
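The sequence above can be reproduced with a small model of the filter. This is a sketch under assumptions: `sync_route_filtered`, the dict standing in for APPL_DB, and the example addresses and interface names are all hypothetical, and the real routesync logic is more involved.

```python
MGMT_INTERFACES = {"eth0", "docker0"}


def sync_route_filtered(appl_db, prefix, nexthops):
    """Model of the filter: drop any kernel update with a next hop on eth0/docker0.

    nexthops is a list of (gateway, ifname) tuples. Returns True if the
    update was written to appl_db, False if it was dropped.
    """
    if any(ifname in MGMT_INTERFACES for _gateway, ifname in nexthops):
        return False  # dropped: appl_db keeps whatever it had before
    appl_db[prefix] = list(nexthops)
    return True


appl_db = {}
# 1. Default route learned from two eBGP neighbors is written to APPL_DB.
sync_route_filtered(appl_db, "0.0.0.0/0",
                    [("10.0.0.1", "Ethernet0"), ("10.0.0.5", "PortChannel1")])
# 2. eBGP neighbors go down; the kernel now reports eth0 as the only next hop.
sync_route_filtered(appl_db, "0.0.0.0/0", [("192.168.0.1", "eth0")])
# 3. The update was dropped, so the entry still lists the two eBGP next hops.
print(appl_db["0.0.0.0/0"])
```

The stale entry after step 3 is what keeps the neighbor reference counts above zero.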
A possible solution to both of these issues, namely (1) the stale route entry in APPL_DB and ASIC_DB and (2) the stale neighbor entry in ASIC_DB, is to delete the route from APPL_DB when a kernel update is received with the next hop on "eth0" as the only next hop.
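A minimal sketch of the proposed behavior, assuming a dict in place of APPL_DB; `sync_route_with_delete` and its return strings are hypothetical. Updates that mix management and non-management next hops are left ignored here, which is an assumption of this sketch and not something the proposal specifies.

```python
MGMT_INTERFACES = {"eth0", "docker0"}


def sync_route_with_delete(appl_db, prefix, nexthops):
    """Delete the route when every next hop is on eth0/docker0.

    nexthops is a list of (gateway, ifname) tuples. Returns "deleted",
    "ignored", or "set" to show which branch was taken.
    """
    on_mgmt = [ifname in MGMT_INTERFACES for _gateway, ifname in nexthops]
    if nexthops and all(on_mgmt):
        # Only management next hops remain: remove the stale entry so the
        # old eBGP next hops are no longer referenced.
        appl_db.pop(prefix, None)
        return "deleted"
    if any(on_mgmt):
        # Mixed next hops: still filtered in this sketch (assumption).
        return "ignored"
    appl_db[prefix] = list(nexthops)
    return "set"


appl_db = {}
sync_route_with_delete(appl_db, "0.0.0.0/0", [("10.0.0.1", "Ethernet0")])
sync_route_with_delete(appl_db, "0.0.0.0/0", [("192.168.0.1", "eth0")])
# appl_db no longer contains "0.0.0.0/0", so nothing references 10.0.0.1.
```

With the route gone, the neighbor's reference count can drop to zero and a later neighbor flush can remove it.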