Fix the Orchagent crash seen during Port channel OC test cases (Issue#17665) #3042
The addNeighbor function adds the remote system neighbor against the remote system port and increments the reference count of the remote system port's RIF. However, when addNextHop adds the next hop, it adds it against the Inband port with the RIF ID of the remote system port, and so increases the RIF reference count of the Inband port instead of the remote system port. When the neighbor is removed in removeNeighbor, the ref count of the remote system port's RIF is decreased. When the next hop is removed in removeNextHop, the ref count of the remote system port's RIF is decreased again. So if the remote system port has both IPv4 and IPv6 configured, the ref count is incremented by 2 for the remote system port's RIF (IPv4 and IPv6 neighbors) and by 2 for the Inband port's RIF (IPv4 and IPv6 next hops), but it is decremented 4 times for the remote system port's RIF. As a result, sometimes, as soon as the IPv4 or IPv6 address is deleted, orchagent tries to delete the remote system port's RIF; since the SAI meta layer holds a different ref count, the call returns a failure and orchagent crashes. Signed-off-by: saksarav <sakthivadivu.saravanaraj@nokia.com>
This needs to be cherry-picked to 202205.
@prsunny please review this.
@judyjoseph for viz.
MSFT ADO: 26718860 @yxieca, @StormLiangMS, this is a bug fix that needs to be backported to 202205 and 202305.
Cherry-pick PR to 202311: #3043
Cherry-pick PR to 202205: #3044
@StormLiangMS I see this got approved for 202305, but I don't see that it was ever picked into 202305. Can you help take a look at what happened? Perhaps auto cherry-pick got disabled in 202305?
Cherry-pick PR to 202305: #3072
What I did
Modified the addNextHop function to pass the alias of the remote system port, instead of the Inband port, when the neighbor is a remote system neighbor.
Why I did it
Fix for sonic-net/sonic-buildimage#17665
The addNeighbor function adds the remote system neighbor against the remote system port and increments the reference count of the remote system port's RIF.
However, when addNextHop adds the next hop, it adds it against the Inband port with the RIF ID of the remote system port, and so increases the RIF reference
count of the Inband port instead of the remote system port. When the neighbor is removed in removeNeighbor, the ref count of the remote system port's RIF is decreased.
When the next hop is removed in removeNextHop, the ref count of the remote system port's RIF is decreased again. So if the remote system port has both IPv4 and IPv6 configured,
the ref count is incremented by 2 for the remote system port's RIF (IPv4 and IPv6 neighbors) and by 2 for the Inband port's RIF (IPv4 and IPv6 next hops),
but it is decremented 4 times for the remote system port's RIF. As a result, sometimes, as soon as the IPv4 or IPv6 address is deleted, orchagent tries to delete the
remote system port's RIF; since the SAI meta layer holds a different ref count, the call returns a failure and orchagent crashes.
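The bookkeeping described above can be replayed with a toy model. The sketch below is not the orchagent code: `RifRefCounts`, `simulate`, `Result`, and the two port aliases are hypothetical names introduced only for illustration, and the model assumes one neighbor and one next hop per address family. It only tracks which alias each increment and decrement lands on, before and after the change to addNextHop.

```cpp
#include <map>
#include <string>

// Toy per-alias RIF reference counter (hypothetical; stands in for the
// router-interface ref counts that orchagent keeps per port alias).
struct RifRefCounts {
    std::map<std::string, int> counts;
    void increase(const std::string& alias) { ++counts[alias]; }
    void decrease(const std::string& alias) { --counts[alias]; }
    int get(const std::string& alias) const {
        auto it = counts.find(alias);
        return it == counts.end() ? 0 : it->second;
    }
};

// Hypothetical aliases, chosen only for this example.
const std::string kRemotePort = "RemoteSystemPort";
const std::string kInbandPort = "InbandPort";

struct Result {
    int remoteAfterAdd;       // remote port RIF count after adding IPv4 + IPv6
    int inbandAfterAdd;       // Inband port RIF count after adding IPv4 + IPv6
    int remoteAfterV4Delete;  // remote port RIF count after removing IPv4 only
    int remoteAfterAllDelete; // remote port RIF count after removing both
    int inbandAfterAllDelete; // Inband port RIF count after removing both
};

// Replays the add/remove sequence from the description.
// fixedAddNextHop == false: addNextHop counts against the Inband port.
// fixedAddNextHop == true:  addNextHop counts against the remote system port.
Result simulate(bool fixedAddNextHop) {
    RifRefCounts rc;
    const std::string nextHopAlias = fixedAddNextHop ? kRemotePort : kInbandPort;

    for (int family = 0; family < 2; ++family) {  // IPv4, then IPv6
        rc.increase(kRemotePort);   // addNeighbor
        rc.increase(nextHopAlias);  // addNextHop
    }
    Result r{};
    r.remoteAfterAdd = rc.get(kRemotePort);
    r.inbandAfterAdd = rc.get(kInbandPort);

    // Remove the IPv4 neighbor and next hop: both decrements hit the remote port.
    rc.decrease(kRemotePort);  // removeNeighbor
    rc.decrease(kRemotePort);  // removeNextHop
    r.remoteAfterV4Delete = rc.get(kRemotePort);

    // Remove the IPv6 neighbor and next hop.
    rc.decrease(kRemotePort);  // removeNeighbor
    rc.decrease(kRemotePort);  // removeNextHop
    r.remoteAfterAllDelete = rc.get(kRemotePort);
    r.inbandAfterAllDelete = rc.get(kInbandPort);
    return r;
}
```

In this model, `simulate(false)` drives the remote system port's count to 0 after only the IPv4 entries are removed, while the IPv6 neighbor still exists, which corresponds to the premature RIF delete described above. `simulate(true)` leaves the count at 2 at that point and returns both counts to 0 once everything is removed.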
How I verified it
Ran the Port channel OC suites and db consistency OC suites multiple times and verified that the orchagent crash is not seen anymore.
Details if related