[fast-reboot] APPLY_VIEW failure in handleSaiCreateStatus causing orchagent crash #7798
Comments
@prsunny please take a look
Please attach the full syslog and sairedis.rec from that encounter. Also note that the error code returned is SAI_STATUS_ITEM_ALREADY_EXISTS, so OA could take notice of that for the FDB entry and simply skip creation since the entry already exists; maybe some "SET" attributes could be explicitly set again.
Thanks Kamil! The older logs are attached, and here I am attaching the latest logs as well. Also, this issue is intermittent: I have also seen successful fast-reboot test results.
From the first logs:
The first line is where the FDB entry is learned by HW; the second is where OA tries to explicitly create the FDB entry.
From the newly attached logs, same scenario:
MAC 72:06:00:01:00:09 is learned, and about 1.5 seconds later OA tries to create it and fails since it already exists. 1.5 seconds seems a long time here compared to the previous 0.02 seconds.
I selected a couple of events here from sairedis.rec. The switch is created at 36 seconds, and the first operations start at 46. At 52 the ports are coming up, and also at 52 bridge ports are created for multiple ports with SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW, so those bridge ports will learn new FDB entries as they appear on the ports, since no FDB entry is currently present on the switch. We can see that the first FDB entry is learned at 53, then another at 54, and also at 54 OA tries to create an FDB entry with MAC 72:06:00:01:00:09 and fails, since that entry already exists and was learned 1.5 seconds before. There are 2 ways to mitigate this:
Because OA fails and exits on this error, we don't have the full configuration recorded here, but from create switch at 17:51:36 to the error at 17:51:54, 18 seconds elapse, which is well under the 30 seconds expected for the fast-boot scenario. To actually see how long it takes, we would need a log from a successful run.
PS. From sairedis.rec, right after switch_created we receive many port state change notifications with SAI_PORT_OPER_STATUS_DOWN (not sure if we had this before). Create switch here takes about 10 seconds, so maybe those notifications have time to build up, but this is actually a minor issue.
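The timeline reasoning above can be checked mechanically. This is only a minimal sketch, assuming the relevant events have already been pulled out of sairedis.rec by hand into (time, kind, mac) tuples. The tuple layout, the "learned"/"create" labels, and the helper names are made up for illustration and are not the sairedis.rec format; the sub-second values are placeholders chosen to match the "about 1.5 seconds" gap.

```python
from datetime import datetime


def parse_ts(text):
    """Parse an HH:MM:SS or HH:MM:SS.fraction wall-clock time."""
    fmt = "%H:%M:%S.%f" if "." in text else "%H:%M:%S"
    return datetime.strptime(text, fmt)


def elapsed_seconds(start_text, end_text):
    """Seconds between two wall-clock times on the same day."""
    return (parse_ts(end_text) - parse_ts(start_text)).total_seconds()


def find_learn_create_races(events):
    """Return (mac, gap_seconds) for every explicit create of a MAC
    that was already learned by hardware earlier in the event list.

    events: iterable of (time_text, kind, mac), kind is "learned" or "create".
    """
    learned_at = {}
    races = []
    for time_text, kind, mac in events:
        ts = parse_ts(time_text)
        if kind == "learned":
            # Remember only the first time the MAC was learned.
            learned_at.setdefault(mac, ts)
        elif kind == "create" and mac in learned_at:
            gap = (ts - learned_at[mac]).total_seconds()
            races.append((mac, gap))
    return races


# Hand-extracted, illustrative events (hypothetical sub-second values).
EVENTS = [
    ("17:51:52.5", "learned", "72:06:00:01:00:09"),
    ("17:51:54.0", "create", "72:06:00:01:00:09"),
]

if __name__ == "__main__":
    print(find_learn_create_races(EVENTS))
    # Create switch to error, using the times quoted in the comment above.
    print(elapsed_seconds("17:51:36", "17:51:54"))
```

Running the same check over a successful run would show whether the learn/create gap is always there or only appears in the failing runs.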
I think either of @kcudnik's suggestions is good. But I want to point out:
Ad 1. That's a good question; maybe we have been lucky? If I remember correctly, @vaibhavhd was saying that this error has been present for some time, and I pointed out in a previous post that it would be nice to have a history of this test run so we could correlate this with the PR that might have caused this issue to start.
Ad 2. What do you mean by a second create call? Each FDB entry is created only once explicitly by OA, and it fails since the metadata reported that the entry exists, and this is correct behavior.
PS. There is one scenario here that we can hit; since notifications are asynchronous, this can happen:
In async mode this scenario will crash syncd; in sync mode you will get an already_exists error.
What I did
Ignore ALREADY_EXIST error in FDB creation. Fix: sonic-net/sonic-buildimage#7798
Why I did it
In FDB creation, there are scenarios where the hardware learns an FDB entry before orchagent. In such cases, the FDB SAI creation would report the status of SAI_STATUS_ITEM_ALREADY_EXISTS, and orchagent should ignore the error and treat it as if the entry was explicitly created.
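The idea behind this change can be illustrated with a small simulation. This is a sketch under stated assumptions, not the orchagent code: FakeSwitch, add_fdb_entry, and the port names are hypothetical stand-ins, and only the status names come from SAI. It shows a create call that reports "already exists" for a hardware-learned MAC, and a caller that treats that status as success instead of aborting.

```python
from enum import Enum


class SaiStatus(Enum):
    SUCCESS = "SAI_STATUS_SUCCESS"
    ITEM_ALREADY_EXISTS = "SAI_STATUS_ITEM_ALREADY_EXISTS"
    FAILURE = "SAI_STATUS_FAILURE"


class FakeSwitch:
    """Hypothetical stand-in for the FDB table; not a real SAI binding."""

    def __init__(self):
        self.fdb = {}

    def learn(self, mac, port):
        # Models hardware learning a MAC on a bridge port.
        self.fdb[mac] = port

    def create_fdb_entry(self, mac, port):
        # Models an explicit create, which is rejected if the MAC is present.
        if mac in self.fdb:
            return SaiStatus.ITEM_ALREADY_EXISTS
        self.fdb[mac] = port
        return SaiStatus.SUCCESS


def add_fdb_entry(switch, mac, port):
    """Create an FDB entry, tolerating entries the hardware learned first."""
    status = switch.create_fdb_entry(mac, port)
    if status is SaiStatus.SUCCESS:
        return True
    if status is SaiStatus.ITEM_ALREADY_EXISTS:
        # Hardware won the race: treat the entry as if it was created explicitly.
        return True
    # Any other status is still treated as a real failure.
    raise RuntimeError(f"FDB create failed for {mac}: {status.value}")


if __name__ == "__main__":
    sw = FakeSwitch()
    sw.learn("72:06:00:01:00:09", "Ethernet0")  # hardware learns first
    print(add_fdb_entry(sw, "72:06:00:01:00:09", "Ethernet0"))
```

The sketch does not re-apply any attributes on the existing entry; whether some "SET" attributes need to be set again after such a collision, as raised earlier in the thread, is left open here.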
Description
After fast-reboot, during the APPLY_VIEW stage, orchagent crashed and swss kept restarting. From syslog, the error reported is:
ERR swss#orchagent: :- handleSaiCreateStatus: Encountered failure in create operation, exiting orchagent, SAI API: SAI_API_FDB, status: SAI_STATUS_ITEM_ALREADY_EXISTS
Steps to reproduce the issue:
"FAILED:dut:DUT hasn't started to work for 300 seconds",
Describe the results you received:
syslog
sairedis logs
Describe the results you expected:
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):
test_advanced_reboot (1).log