Fix Voq chassis orchagent crash with 34K routes #3329
Draft
+132
−2
What I did
Don't add the remote system neighbor if the same neighbor exists.
Why I did it
The IMM has two ASICs, with two port channels on each ASIC and two member ports in each port channel.
An IP address is configured on each port channel and BGP is enabled. Neighbors and routes are learned on these port channels.
In the sonic-mgmt pc suite, the po-update test case removes the member ports from one of the port channels, removes the IP address configured on that port channel, creates a new port channel, and then adds the same member ports and the same IP address to the new port channel.
In the remote ASIC, orchagent tries to remove the neighbor and nexthop for the old port channel before routeOrch has removed all the routes learned on that port channel. Because routes are still pending, the old nexthop and neighbor are not removed. The neighbor and nexthop for the new port channel are then added. If the neighbor is learned on a remote system port in the remote ASIC, the nexthop is added with the inband port's alias as its alias, so the key (ip, alias) is the same for both the old and the new nexthop.
When the new nexthop is added, addNextHop calls hasNextHop to check whether a nexthop with the key (ip-address, alias) already exists. Since the old nexthop has not been removed yet, hasNextHop returns true, but the assert(!hasNextHop) does not trigger a crash. As a result, addNextHop replaces the old nexthop (with the old rif-id) with the new nexthop (with the new rif-id) in the nexthop map.
After all the routes learned on the old port channel are removed, the old neighbor and old nexthop are removed. Since the old nexthop's map entry was replaced by the new nexthop, orchagent actually deletes the new nexthop from SAI when it tries to delete the old one. When it then tries to remove the old neighbor, SAI returns an error, because the old neighbor is still referenced by the old nexthop in SAI. orchagent crashes when SAI returns this error.
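The overwrite described above can be illustrated with a minimal, self-contained sketch. It is not the orchagent code: `NextHopTable`, `addNextHopUnguarded`, `addNextHopGuarded`, the numeric ids, and the alias string are hypothetical stand-ins chosen only to show how a map keyed by (ip, alias) loses the old entry, and how skipping the add when the key already exists avoids that. Whether the actual change in this PR is structured this way is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; not the real orchagent types.
using NextHopKey = std::pair<std::string, std::string>;  // (ip, alias)
using NextHopId = std::uint64_t;                         // stands in for a SAI object id

struct NextHopTable {
    std::map<NextHopKey, NextHopId> entries;

    bool hasNextHop(const NextHopKey& key) const {
        return entries.count(key) != 0;
    }

    // Models the problematic path: nothing stops the add when the key
    // already exists, so the old id is silently overwritten.
    void addNextHopUnguarded(const NextHopKey& key, NextHopId id) {
        entries[key] = id;
    }

    // Models the guarded path: skip the add if the same key is present.
    // Returns true only when a new entry was inserted.
    bool addNextHopGuarded(const NextHopKey& key, NextHopId id) {
        if (hasNextHop(key)) {
            return false;
        }
        entries.emplace(key, id);
        return true;
    }

    // Returns the id that would be handed to SAI for removal,
    // or std::nullopt if the key is not in the table.
    std::optional<NextHopId> removeNextHop(const NextHopKey& key) {
        auto it = entries.find(key);
        if (it == entries.end()) {
            return std::nullopt;
        }
        NextHopId id = it->second;
        entries.erase(it);
        return id;
    }
};

// Usage example: the same key is used for the old and the new nexthop.
inline void demo() {
    const NextHopKey key{"10.0.0.1", "Ethernet-IB0"};  // illustrative values
    const NextHopId oldId = 100, newId = 200;

    NextHopTable unguarded;
    unguarded.addNextHopUnguarded(key, oldId);
    unguarded.addNextHopUnguarded(key, newId);   // overwrites oldId
    // The deferred "remove old nexthop" now picks up the new id.
    assert(unguarded.removeNextHop(key) == newId);

    NextHopTable guarded;
    guarded.addNextHopGuarded(key, oldId);
    assert(!guarded.addNextHopGuarded(key, newId));  // add is skipped
    // The deferred removal still targets the old id.
    assert(guarded.removeNextHop(key) == oldId);
}
```

In the unguarded table, the removal that was meant for the old nexthop returns the new id, which mirrors orchagent deleting the new nexthop from SAI while the old one still references the old neighbor.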
How I verified it
Ran the pc and voq test suites and verified that they pass.
Details if related