-
Notifications
You must be signed in to change notification settings - Fork 529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add failure notification for orchagent #1665
Conversation
This pull request fixes 2 alerts when merging 5de44a6 into 03a0e21 - view on LGTM.com fixed alerts:
|
This pull request fixes 2 alerts when merging eaed258 into 9e30abb - view on LGTM.com fixed alerts:
|
This pull request fixes 2 alerts when merging 625dfbb into 1951365 - view on LGTM.com fixed alerts:
|
orchagent/aclorch.cpp
Outdated
@@ -3596,6 +3596,7 @@ sai_status_t AclOrch::createDTelWatchListTables() | |||
if (status != SAI_STATUS_SUCCESS) | |||
{ | |||
SWSS_LOG_ERROR("Failed to create table %s", flowWLTable.description.c_str()); | |||
handleSaiCreateStatus(SAI_API_ACL, status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handleSaiCreateStatus [](start = 8, length = 21)
handleSaiCreateStatus [](start = 8, length = 21)
Check the return value? If successfully handled, you treat it as success? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few SAI calls that do not take any action in the case of SAI failures. For these ones, since our goal is to make it on-par with async mode at this stage, I am not checking the return value for now so that it has the same behavior as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's assume this function magically retry and fix the issue, so the return value is task_success
. You should return SAI_STATUS_SUCCESS
after this function.
In reply to: 595609272 [](ancestors = 595609272)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, updated the logic to check if the return value is task_success
, and proceed when it is.
orchagent/copporch.cpp
Outdated
@@ -125,7 +125,7 @@ void CoppOrch::initDefaultHostIntfTable() | |||
if (status != SAI_STATUS_SUCCESS) | |||
{ | |||
SWSS_LOG_ERROR("Failed to create default host interface table, rv:%d", status); | |||
throw "CoppOrch initialization failure"; | |||
handleSaiCreateStatus(SAI_API_HOSTIF, status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handleSaiCreateStatus [](start = 8, length = 21)
handleSaiCreateStatus [](start = 8, length = 21)
Do you want to keep old behavior even with your new handle function? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the old behavior (async mode), this throw should not even be called since syncd should already trigger an swss restart before it reaches this throw. To mimic the old behavior close enough, I think a method could be running this handling function before this throw and keep this throw regardless of the return value of the handling function. I updated the code as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You also need to check return value and conditional throw
.
In reply to: 595610233 [](ancestors = 595610233)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the logic to check return value before throwing.
orchagent/copporch.cpp
Outdated
@@ -489,6 +489,7 @@ task_process_status CoppOrch::processCoppRule(Consumer& consumer) | |||
if (sai_status != SAI_STATUS_SUCCESS) | |||
{ | |||
SWSS_LOG_ERROR("Failed to apply attribute:%d to trap group:%" PRIx64 ", name:%s, error:%d\n", trap_gr_attr.id, m_trap_group_map[trap_group_name], trap_group_name.c_str(), sai_status); | |||
handleSaiSetStatus(SAI_API_HOSTIF, sai_status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handleSaiSetStatus [](start = 20, length = 18)
handleSaiSetStatus [](start = 20, length = 18)
Check the return value? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few SAI calls that do not take any action in the case of SAI failures. For these ones, since our goal is to make it on-par with async mode at this stage, I am not checking the return value for now so that it has the same behavior as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check for the return value of the handling function and return only when it is not task_success
orchagent/dtelorch.cpp
Outdated
} | ||
else | ||
{ | ||
it++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it++; [](start = 24, length = 5)
I think you change the behavior. Old code will always erase(it)
no matter success or not. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the handling function always exits orchagent to achieve on-par handling with async mode. So this change will not be activated at the moment and the behavior should be the same as async mode. Once we have a better logic for the failure handling functions, we can fully control whether it should retry or not.
@@ -1097,7 +1098,7 @@ bool FdbOrch::addFdbEntry(const FdbEntry& entry, const string& port_name, | |||
SWSS_LOG_ERROR("Failed to create %s FDB %s in %s on %s, rv:%d", | |||
fdbData.type.c_str(), entry.mac.to_string().c_str(), | |||
vlan.m_alias.c_str(), port_name.c_str(), status); | |||
return false; //FIXME: it should be based on status. Some could be retried, some not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIXME [](start = 28, length = 5)
Did you truly fixed the FIXME
? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not, added the FIXME back.
@@ -1212,7 +1213,7 @@ bool FdbOrch::removeFdbEntry(const FdbEntry& entry, FdbOrigin origin) | |||
{ | |||
SWSS_LOG_ERROR("FdbOrch RemoveFDBEntry: Failed to remove FDB entry. mac=%s, bv_id=0x%" PRIx64, | |||
entry.mac.to_string().c_str(), entry.bv_id); | |||
return true; //FIXME: it should be based on status. Some could be retried. some not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIXME [](start = 23, length = 5)
The same. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the FIXME back.
orchagent/intfsorch.cpp
Outdated
@@ -1056,7 +1061,7 @@ bool IntfsOrch::addRouterIntfs(sai_object_id_t vrf_id, Port &port) | |||
{ | |||
SWSS_LOG_ERROR("Failed to create router interface %s, rv:%d", | |||
port.m_alias.c_str(), status); | |||
throw runtime_error("Failed to create router interface."); | |||
return handleSaiCreateStatus(SAI_API_ROUTER_INTERFACE, status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handleSaiCreateStatus [](start = 15, length = 21)
handleSaiCreateStatus [](start = 15, length = 21)
To exactly mimic old behavior, you can derive the handleSaiCreateStatus function in this class, and throw
inside the function. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the old behavior (async mode), this throw should not even be called since syncd should already trigger an swss restart before it reaches this throw. To mimic the old behavior close enough, I think a method could be running this handling function before this throw and keep this throw regardless of the return value of the handling function. I updated the code as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check for the return value of the handling function and throw exception only when it did not successfully handle the failure.
What I did Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch. Why I did it Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
What I did Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch. Why I did it Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
What I did Backport SAI failure handling related commits into the 202012 branch. The following is a list of backported commits: 941875a Deactivate mirror session only when session status is true in updateLagMember (#1666) be12482 Ignore ALREADY_EXIST error in FDB creation (#1815) c9c1aa2 Add failure handling for SAI get operations (#1768) 47b4276 [BufferOrch] Don't call SAI API for BUFFER_POOL/PROFILE handling in case the op is DEL and the SAI OID is NULL (#1786) db9238f Add failure notification for orchagent (#1665) fc8e43f [synchronous mode] Add failure notification for SAI failures in synchronous mode (#1596) Why I did it 202012 image needs to include failure handling mechanism for enough notification in the presence of SAI failures.
What I did Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch. Why I did it Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
What I did
Add SAI failure handling functions in aclorch, bufferorch, copporch, dtelorch, fdborch, fgnhgorch, intfsorch, mirrororch, natorch, policerorch, macsecorch, portsorch, qosorch, sfloworch, switchorch, tunneldecaporch, vrforch.
Why I did it
Failure notification mechanism to ensure enough notifications in the presence of SAI failures and avoid running switches with unhandled failures.
How I verified it
Details if related