bgp 'next-hop-tracking' feature enabled causes port_toggle to fail more frequently #6146
Comments
Liat: Issue caused by #5600. @liat-grozovik to add more info.
@stephenxs please add a techsupport dump and more information about what is failing.
For most of the reproductions, the port toggle test failed. I suspect it's because it takes too much time to withdraw the routes.
sonic_dump_arc-switch1004_20201210_084105.tar.gz
Can you do a comparison test: one run with next-hop-tracking enabled, one without, and then compare the number of route entries in the sairedis log between these two runs? If your theory is correct, there will be many more route entries when the device has next-hop-tracking enabled vs. without.
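For reference, a minimal sketch of such a comparison (the pipe-separated `timestamp|op|object|attrs` record layout and the `SAI_OBJECT_TYPE_ROUTE_ENTRY` marker below are assumptions about the sairedis recording format, not verified against these dumps):

```python
# Sketch of the suggested comparison: count ROUTE_ENTRY operations in two
# sairedis.rec files (one run with next-hop-tracking enabled, one without).
import sys
from collections import Counter

def count_route_ops(path):
    counts = Counter()
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 3:
                continue
            op, obj = fields[1], fields[2]
            if "SAI_OBJECT_TYPE_ROUTE_ENTRY" in obj:
                counts[op] += 1  # 'c' create, 'r' remove, 's' set
    return counts

if __name__ == "__main__":
    # e.g. python count_route_ops.py sairedis.rec.nht sairedis.rec.no_nht
    for path in sys.argv[1:]:
        print(path, dict(count_route_ops(path)))
```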
sonic_dump_arc-switch1004_20201211_075414.tar.gz
Can you analyze the sairedis log and compare?
I'll analyze it a bit more.
Hi all, I did some analysis of this issue. The issue is probably "caused" by zebra deleting routes faster than before. The test flow is like:
Two kinds of requests go to the syncd queue: "shutdown port" requests and "remove route" requests (there are other requests too, but let's focus on these two). There are about 12000 routes. The best case is that all "shutdown port" requests are in front of the "remove route" requests; the worst case is that 12000 "remove route" requests are ahead of some "shutdown port" request. Since zebra now deletes routes faster, more "remove route" requests end up in front of the "shutdown port" requests, which causes the port_toggle test case to fail while waiting for all ports to shut down within 20 seconds. I ran the test a few times; here are the logs for 4 of them, 2 successful and 2 failed. Success 1, it takes 27 seconds for zebra to delete all routes:
Success 2, it takes 35 seconds for zebra to delete all routes:
Failed 1, it takes 24 seconds for zebra to delete all routes:
Failed 2, it takes 23 seconds for zebra to delete all routes:
So the port toggle failure is probably not a BGP issue but a timing issue. I would suggest enlarging the port toggle timeout when waiting for ports to go down.
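For anyone who wants to reproduce this analysis, here is a rough sketch of how the ordering could be quantified from a single sairedis.rec. The timestamp format, the "r"/"s" op codes, and the `SAI_PORT_ATTR_ADMIN_STATE=false` encoding of a port shutdown are assumptions about the recording layout, not verified against these dumps:

```python
# Sketch: quantify how many ROUTE_ENTRY removals were queued ahead of the
# last port admin-down in one sairedis.rec. Assumed record layout:
#   <timestamp>|<op>|<object>|<attrs...>, op in {c, r, s, g}.
import sys
from datetime import datetime

TS_FMT = "%Y-%m-%d.%H:%M:%S.%f"  # assumed sairedis.rec timestamp format

def parse(path):
    events = []
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 3:
                continue
            try:
                ts = datetime.strptime(fields[0], TS_FMT)
            except ValueError:
                continue  # header or malformed line
            events.append((ts, fields[1], "|".join(fields[2:])))
    return events

def report(path):
    events = parse(path)
    route_removes = [ts for ts, op, obj in events
                     if op == "r" and "SAI_OBJECT_TYPE_ROUTE_ENTRY" in obj]
    # Assumed encoding of a "shutdown port" request in the record stream:
    admin_downs = [ts for ts, op, obj in events
                   if op == "s" and "SAI_PORT_ATTR_ADMIN_STATE=false" in obj]
    if not route_removes or not admin_downs:
        print("%s: no matching records" % path)
        return
    last_down = max(admin_downs)
    ahead = sum(1 for ts in route_removes if ts <= last_down)
    span = (max(route_removes) - min(route_removes)).total_seconds()
    print("%s: %d route removals, %d queued before the last port admin-down, "
          "withdraw span %.1fs" % (path, len(route_removes), ahead, span))

if __name__ == "__main__":
    for p in sys.argv[1:]:
        report(p)
```

If the theory above is right, the failed runs should show a noticeably larger "queued before the last port admin-down" count than the successful ones.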
Description
Recently we noticed that the port_toggle issue can be observed more frequently than before. We did some tests on images from mid-September until now in the t1-lag topo and eventually found that it is very likely caused by the bgp 'next hop tracking' feature. With bgp next hop tracking disabled, the port toggle test seldom fails. The flow of the port toggle test is:
As there are a lot of routing entries learnt in t1-lag, in step 2 those entries will be withdrawn, and in step 4 they will be relearnt.
Is it possible that enabling this option makes learning/withdrawing routing entries take more time, and hence makes the port toggle test more likely to fail?
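One way to check this directly would be to time the withdraw from zebra's perspective around the port toggle. A rough sketch, assuming FRR's standard "show ip route summary" output (the "Totals" regex and the 5% threshold below are illustrative assumptions):

```python
# Rough timing probe: poll zebra's total route count around a port toggle.
import re
import subprocess
import time

def route_total():
    out = subprocess.run(["vtysh", "-c", "show ip route summary"],
                         capture_output=True, text=True).stdout
    m = re.search(r"Totals\s+(\d+)", out)
    return int(m.group(1)) if m else None

def wait_for(predicate, timeout=120, interval=1):
    """Return elapsed seconds once predicate() is true, or None on timeout."""
    start = time.time()
    while time.time() - start < timeout:
        if predicate():
            return time.time() - start
        time.sleep(interval)
    return None

baseline = route_total() or 0
# ... shut down the ports here, e.g. via the test harness ...
elapsed = wait_for(lambda: (route_total() or 0) < baseline * 0.05)
print("withdraw took %.1fs" % elapsed if elapsed is not None else "timed out")
```

Comparing the elapsed times with and without next hop tracking enabled would confirm or refute the suspicion, independently of the test's own timeout.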