
bgp 'next-hop-tracking' feature enabled causes port_toggle to fail more frequently #6146

Closed
stephenxs opened this issue Dec 7, 2020 · 9 comments


@stephenxs

Description
Recently we noticed that the port_toggle test fails more frequently than before. We ran tests on images from mid-September until now in the t1-lag topology and eventually found that the failures are very likely caused by the BGP 'next-hop tracking' feature. With BGP next-hop tracking disabled, the port_toggle test seldom fails.
The flow of the port_toggle test is as follows (a sketch of the flow is given after the list):

  1. shutdown all ports
  2. wait 20 seconds, verify whether all ports are down
  3. startup all ports
  4. wait 60 seconds, verify whether all ports are up.
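
For reference, a minimal pytest-style sketch of that flow. Note that `shutdown_all_ports`, `startup_all_ports`, and `all_ports_oper_status` are hypothetical helpers standing in for the real sonic-mgmt fixtures:

```python
import time

# NOTE: shutdown_all_ports, startup_all_ports and all_ports_oper_status are
# hypothetical helpers standing in for the real sonic-mgmt fixtures.

PORTS_DOWN_TIMEOUT = 20  # seconds to wait for all ports to go oper-down
PORTS_UP_TIMEOUT = 60    # seconds to wait for all ports to come oper-up


def wait_until(timeout, interval, condition):
    """Poll `condition` until it returns True or `timeout` expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False


def test_port_toggle(duthost, ports):
    shutdown_all_ports(duthost, ports)                    # step 1
    assert wait_until(                                    # step 2
        PORTS_DOWN_TIMEOUT, 1,
        lambda: all_ports_oper_status(duthost, ports, "down")), \
        "some ports are still up after shutdown"
    startup_all_ports(duthost, ports)                     # step 3
    assert wait_until(                                    # step 4
        PORTS_UP_TIMEOUT, 1,
        lambda: all_ports_oper_status(duthost, ports, "up")), \
        "some ports are still down after startup"
```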

As a lot of routing entries are learnt in t1-lag, those entries will be withdrawn in step 2 and relearnt in step 4.
Is it possible that enabling this option makes learning/withdrawing the routing entries take more time, hence making the port toggle more likely to fail?

@anshuv-mfst

Liat: Issue caused by #5600.

@liat-grozovik to add more info.

@liat-grozovik

@stephenxs please add a techsupport dump and more information about what is failing.
If I recall correctly, you referred to some kind of error flow in swss. If this is not the case, please elaborate.

@stephenxs

In most reproductions, the port toggle test failed. I suspect it is because withdrawing the routes takes too much time.
The delay between shutting down and starting up the ports was newly introduced. Before that, we sometimes observed a sairedis timeout during the startup phase, because so many requests accumulated in the sairedis queue that new requests could not be handled in time.
We seldom observe that now, because the test stops if some of the ports are not down after being shut down.

@stephenxs

stephenxs commented Dec 10, 2020

sonic_dump_arc-switch1004_20201210_084105.tar.gz
This is the dump with next-hop-tracking enabled.

@lguohan

lguohan commented Dec 10, 2020

Can you do a comparison test, one run with next-hop-tracking enabled and one without, and then compare the number of route entries in the sairedis log between the two runs? If your theory is correct, there will be many more route entries with next-hop-tracking enabled vs. without.
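
For example, a rough comparison could count the route-entry operations in each run's sairedis recording. A minimal sketch, assuming the pipe-separated sairedis.rec format with a single-character opcode in the second field ('c' = create, 'r' = remove) and the object key in the third; adjust the patterns if the actual recording format differs:

```python
import sys
from collections import Counter


def count_route_ops(path):
    """Count operations on SAI route entries in a sairedis recording."""
    ops = Counter()
    with open(path) as rec:
        for line in rec:
            fields = line.rstrip("\n").split("|")
            # Assumed layout: <timestamp>|<op>|SAI_OBJECT_TYPE_ROUTE_ENTRY:<key>|...
            if len(fields) >= 3 and fields[2].startswith("SAI_OBJECT_TYPE_ROUTE_ENTRY"):
                ops[fields[1]] += 1
    return ops


if __name__ == "__main__":
    # Usage: python count_route_ops.py sairedis.rec.with_nht sairedis.rec.without_nht
    for path in sys.argv[1:]:
        ops = count_route_ops(path)
        print(f"{path}: create={ops['c']} remove={ops['r']}")
```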

@stephenxs

sonic_dump_arc-switch1004_20201211_075414.tar.gz
This is the dump without next-hop-tracking.

@lguohan

lguohan commented Dec 11, 2020

Can you analyze the sairedis logs and compare?

@stephenxs

I’ll analyze it a bit more.
The failure is not a sairedis timeout but a port toggle test failure: the test verifies that, after all ports are shut down, every port goes down within the given time, and some of the ports did not.
Without the BGP commit, the test passed.

@Junchao-Mellanox

Junchao-Mellanox commented Dec 28, 2020

Hi all,

I did some analysis of this issue. The issue is probably "caused" by zebra deleting routes faster than before. The test flow is:

  1. User shuts down all ports from the CLI
  2. BGP detects ports going down and starts deleting routes
  3. User waits 20 seconds for all ports to go down

Two kinds of requests go to the syncd queue: "shutdown port" requests and "remove route" requests. (Of course there are other requests, but let's focus on these two.) There are about 12000 routes. So the best case is that all "shutdown port" requests are in front of the "remove route" requests; the worst case is that all 12000 "remove route" requests are ahead of some "shutdown port" request.

In this case, zebra deletes routes faster, so more "remove route" requests land in front of the "shutdown port" requests, which causes the port_toggle test case to fail while waiting for all ports to shut down within 20 seconds.
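
To illustrate the ordering effect, here is a toy FIFO simulation (not the real syncd code; the per-request cost and port count are made-up numbers): the more route-delete requests zebra manages to enqueue before the port-shutdown requests, the later the last port goes down.

```python
from collections import deque

# Assumed per-request processing time in syncd (illustrative only).
COST = 0.0015  # seconds


def last_port_down_time(route_deletes_ahead, num_ports=32, num_routes=12000):
    """Return when the last port-shutdown request finishes in a FIFO queue,
    given how many of the ~12000 route-delete requests were enqueued first."""
    queue = deque(["route_del"] * route_deletes_ahead
                  + ["port_down"] * num_ports
                  + ["route_del"] * (num_routes - route_deletes_ahead))
    clock, ports_left = 0.0, num_ports
    while ports_left:
        req = queue.popleft()
        clock += COST
        if req == "port_down":
            ports_left -= 1
    return clock


# Faster zebra -> more deletes ahead of the shutdowns -> later last port down.
for ahead in (0, 4000, 12000):
    print(f"{ahead:5d} deletes ahead: last port down at "
          f"{last_port_down_time(ahead):.1f}s")
```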

I ran the test a few times. Here are the logs for 4 of the runs, 2 successful and 2 failed.

Success 1, it takes 27 seconds for zebra to delete all routes:

```
2020-12-28.06:00:07.455280|ROUTE_TABLE:fc00::78/126|DEL
...
2020-12-28.06:00:34.530440|ROUTE_TABLE:192.186.88.128/25|DEL
```

Success 2, it takes 35 seconds for zebra to delete all routes:

```
2020-12-28.06:10:07.902919|ROUTE_TABLE:fc00::78/126|DEL
...
2020-12-28.06:10:42.298481|ROUTE_TABLE:193.10.72.128/25|DEL
```

Failed 1, it takes 24 seconds for zebra to delete all routes:

```
2020-12-28.05:12:47.351838|ROUTE_TABLE:fc00::78/126|DEL
...
2020-12-28.05:13:11.269318|ROUTE_TABLE:192.186.88.128/25|DEL
```

Failed 2, it takes 23 seconds for zebra to delete all routes:

```
2020-12-28.06:30:25.212260|ROUTE_TABLE:fc00::78/126|DEL
...
2020-12-28.06:30:48.186791|ROUTE_TABLE:20c0:ef08:0:80::/64|DEL
```
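
The spans above come from the first and last DEL timestamps. A small sketch of that calculation, assuming the `<YYYY-MM-DD.HH:MM:SS.ffffff>|ROUTE_TABLE:<prefix>|DEL` line format shown above (`route_table.log` is a hypothetical extract of such lines):

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d.%H:%M:%S.%f"


def route_delete_span(path):
    """Return seconds between the first and last ROUTE_TABLE DEL entry."""
    timestamps = []
    with open(path) as log:
        for line in log:
            fields = line.rstrip("\n").split("|")
            if (len(fields) >= 3
                    and fields[1].startswith("ROUTE_TABLE:")
                    and fields[2] == "DEL"):
                timestamps.append(datetime.strptime(fields[0], TS_FORMAT))
    if len(timestamps) < 2:
        return 0.0
    return (max(timestamps) - min(timestamps)).total_seconds()


print(f"zebra took {route_delete_span('route_table.log'):.0f}s to delete all routes")
```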

So the port toggle failure is probably not a BGP issue but a timing issue. I would suggest enlarging the port_toggle timeout for waiting for ports to go down.
