
bgp 'next-hop-tracking' feature enabled causes port_toggle to fail more frequently #6146

Closed
stephenxs opened this issue Dec 7, 2020 · 9 comments


@stephenxs

Description
Recently we noticed that the port_toggle test fails more frequently than before. We ran tests on images from mid-September until now in the t1-lag topology and eventually found that the failures are very likely caused by the BGP 'next-hop tracking' feature. With BGP next-hop tracking disabled, the port_toggle test seldom fails.
The flow of the port_toggle test is as follows (a sketch of the flow is given after the list):

  1. shutdown all ports
  2. wait 20 seconds, verify whether all ports are down
  3. startup all ports
  4. wait 60 seconds, verify whether all ports are up.
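
For reference, a minimal pytest-style sketch of that flow. Note that `shutdown_all_ports`, `startup_all_ports`, and `all_ports_oper_status` are hypothetical helpers standing in for the real sonic-mgmt fixtures:

```python
import time

# NOTE: shutdown_all_ports, startup_all_ports and all_ports_oper_status are
# hypothetical helpers standing in for the real sonic-mgmt fixtures.

PORTS_DOWN_TIMEOUT = 20  # seconds to wait for all ports to go oper-down
PORTS_UP_TIMEOUT = 60    # seconds to wait for all ports to come oper-up


def wait_until(timeout, interval, condition):
    """Poll `condition` until it returns True or `timeout` expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False


def test_port_toggle(duthost, ports):
    shutdown_all_ports(duthost, ports)                    # step 1
    assert wait_until(                                    # step 2
        PORTS_DOWN_TIMEOUT, 1,
        lambda: all_ports_oper_status(duthost, ports, "down")), \
        "some ports are still up after shutdown"
    startup_all_ports(duthost, ports)                     # step 3
    assert wait_until(                                    # step 4
        PORTS_UP_TIMEOUT, 1,
        lambda: all_ports_oper_status(duthost, ports, "up")), \
        "some ports are still down after startup"
```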

As a lot of routing entries are learnt in t1-lag, those entries will be withdrawn in step 2 and relearnt in step 4.
Is it possible that enabling this option makes learning/withdrawing the routing entries take more time, hence making the port toggle more likely to fail?

@anshuv-mfst

Liat: Issue caused by #5600.

@liat-grozovik to add more info.

@liat-grozovik

@stephenxs please add a techsupport dump and more information about what is failing.
If I recall correctly, you referred to some kind of error flow in swss. If this is not the case, please elaborate.

@stephenxs

In most reproductions, the port toggle test failed. I suspect it is because withdrawing the routes takes too much time.
The delay between shutting down and starting up the ports was newly introduced. Before that, we sometimes observed a sairedis timeout during the startup phase, because so many requests accumulated in the sairedis queue that new requests could not be handled in time.
We seldom observe that now, because the test stops if some of the ports are not down after being shut down.

@stephenxs

stephenxs commented Dec 10, 2020

sonic_dump_arc-switch1004_20201210_084105.tar.gz
This is the dump with next-hop-tracking enabled.

@lguohan

lguohan commented Dec 10, 2020

Can you do a comparison test, one run with next-hop-tracking enabled and one without, and then compare the number of route entries in the sairedis log between the two runs? If your theory is correct, there will be many more route entries with next-hop-tracking enabled vs. without.
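
For example, a rough comparison could count the route-entry operations in each run's sairedis recording. A minimal sketch, assuming the pipe-separated sairedis.rec format with a single-character opcode in the second field ('c' = create, 'r' = remove) and the object key in the third; adjust the patterns if the actual recording format differs:

```python
import sys
from collections import Counter


def count_route_ops(path):
    """Count operations on SAI route entries in a sairedis recording."""
    ops = Counter()
    with open(path) as rec:
        for line in rec:
            fields = line.rstrip("\n").split("|")
            # Assumed layout: <timestamp>|<op>|SAI_OBJECT_TYPE_ROUTE_ENTRY:<key>|...
            if len(fields) >= 3 and fields[2].startswith("SAI_OBJECT_TYPE_ROUTE_ENTRY"):
                ops[fields[1]] += 1
    return ops


if __name__ == "__main__":
    # Usage: python count_route_ops.py sairedis.rec.with_nht sairedis.rec.without_nht
    for path in sys.argv[1:]:
        ops = count_route_ops(path)
        print(f"{path}: create={ops['c']} remove={ops['r']}")
```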

@stephenxs

sonic_dump_arc-switch1004_20201211_075414.tar.gz
This is the dump without next-hop-tracking.

@lguohan

lguohan commented Dec 11, 2020

Can you analyze the sairedis logs and compare?

@stephenxs

I’ll analyze it a bit more.
The failure is not a sairedis timeout but a port toggle test failure: the test verifies that, after all ports are shut down, every port goes down within the given time, and some of the ports did not.
Without the BGP commit, the test passed.

@Junchao-Mellanox

Junchao-Mellanox commented Dec 28, 2020

Hi all,

I did some analysis of this issue. The issue is probably "caused" by zebra deleting routes faster than before. The test flow is:

  1. User shuts down all ports from the CLI
  2. BGP detects ports going down and starts deleting routes
  3. User waits 20 seconds for all ports to go down

Two kinds of requests go to the syncd queue: "shutdown port" requests and "remove route" requests. (Of course there are other requests, but let's focus on these two.) There are about 12000 routes. So the best case is that all "shutdown port" requests are in front of the "remove route" requests; the worst case is that all 12000 "remove route" requests are ahead of some "shutdown port" request.

In this case, zebra deletes routes faster, so more "remove route" requests land in front of the "shutdown port" requests, which causes the port_toggle test case to fail while waiting for all ports to shut down within 20 seconds.
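
To illustrate the ordering effect, here is a toy FIFO simulation (not the real syncd code; the per-request cost and port count are made-up numbers): the more route-delete requests zebra manages to enqueue before the port-shutdown requests, the later the last port goes down.

```python
from collections import deque

# Assumed per-request processing time in syncd (illustrative only).
COST = 0.0015  # seconds


def last_port_down_time(route_deletes_ahead, num_ports=32, num_routes=12000):
    """Return when the last port-shutdown request finishes in a FIFO queue,
    given how many of the ~12000 route-delete requests were enqueued first."""
    queue = deque(["route_del"] * route_deletes_ahead
                  + ["port_down"] * num_ports
                  + ["route_del"] * (num_routes - route_deletes_ahead))
    clock, ports_left = 0.0, num_ports
    while ports_left:
        req = queue.popleft()
        clock += COST
        if req == "port_down":
            ports_left -= 1
    return clock


# Faster zebra -> more deletes ahead of the shutdowns -> later last port down.
for ahead in (0, 4000, 12000):
    print(f"{ahead:5d} deletes ahead: last port down at "
          f"{last_port_down_time(ahead):.1f}s")
```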

I ran the test a few times. Here are the logs for 4 of the runs, 2 successful and 2 failed.

Success 1, it takes 27 seconds for zebra to delete all routes:

```
2020-12-28.06:00:07.455280|ROUTE_TABLE:fc00::78/126|DEL
...
2020-12-28.06:00:34.530440|ROUTE_TABLE:192.186.88.128/25|DEL
```

Success 2, it takes 35 seconds for zebra to delete all routes:

```
2020-12-28.06:10:07.902919|ROUTE_TABLE:fc00::78/126|DEL
...
2020-12-28.06:10:42.298481|ROUTE_TABLE:193.10.72.128/25|DEL
```

Failed 1, it takes 24 seconds for zebra to delete all routes:

```
2020-12-28.05:12:47.351838|ROUTE_TABLE:fc00::78/126|DEL
...
2020-12-28.05:13:11.269318|ROUTE_TABLE:192.186.88.128/25|DEL
```

Failed 2, it takes 23 seconds for zebra to delete all routes:

```
2020-12-28.06:30:25.212260|ROUTE_TABLE:fc00::78/126|DEL
...
2020-12-28.06:30:48.186791|ROUTE_TABLE:20c0:ef08:0:80::/64|DEL
```
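
The spans above come from the first and last DEL timestamps. A small sketch of that calculation, assuming the `<YYYY-MM-DD.HH:MM:SS.ffffff>|ROUTE_TABLE:<prefix>|DEL` line format shown above (`route_table.log` is a hypothetical extract of such lines):

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d.%H:%M:%S.%f"


def route_delete_span(path):
    """Return seconds between the first and last ROUTE_TABLE DEL entry."""
    timestamps = []
    with open(path) as log:
        for line in log:
            fields = line.rstrip("\n").split("|")
            if (len(fields) >= 3
                    and fields[1].startswith("ROUTE_TABLE:")
                    and fields[2] == "DEL"):
                timestamps.append(datetime.strptime(fields[0], TS_FORMAT))
    if len(timestamps) < 2:
        return 0.0
    return (max(timestamps) - min(timestamps)).total_seconds()


print(f"zebra took {route_delete_span('route_table.log'):.0f}s to delete all routes")
```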

So the port toggle failure is probably not a BGP issue but a timing issue. I would suggest enlarging the port_toggle timeout for waiting for ports to go down.
