-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warmboot reconciliation fails when zero buffer profiles are attached to some queues #899
Comments
i will need to deeper read your analysis, but if you say that there are 8 buffers in asic db after warm boot and then there are 10 in the temp view, then this means after warm boot OA is setting extra 2 buffers for some reason. please provide sairedis.rec and syslog from both those scenarios |
github_logs.zip |
thank you, sorry but i didn't have time yet to dig through this issue, i will spent some time on weekend on this and get back with results on monday |
I was able to reproduce this locally. I started doing some analysis, and found that number of created buffer profiles is in both cases is 8. So where is that 10 coming from ? from here:
After warm boot, before calling APPLY_VIEW, OA query (get) buffer profile on SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE and on SAI_QUEUE_ATTR_BUFFER_PROFILE_ID and decides to set new profiles on those attributes. Since new 6 profiles are already created, then 2 belong to previous build (from get), and since after query OA wants to changed them, it creates 2 new profiles. And in total it's 10 of them. Small problem, happens here, since on first cold boot 8 profiles are created, then on second boot again 8 are created, which should be matched, but by doing GET operation on those profiles, OA brings 2 profiles from previous VIEW. This gives 10 of them. What's happens in comparison logic is that those 2 buffer profiles, have the same VID as in current view and temporary VIEW, this is normal, but comparison logic if see 2 of exact same VID's it marks object as MATCHED, which means this object exists in both VIEWS, this can happen in 2 ways:
But so far we didn't have case that this matched object would need to be removed from current view as in this case right here. A am thinking how to make a fix for this. |
I came to realization that this previous buffer profile, when doing GET is acknowledged by OA that it exists, for example, same scenario with VLAN members, if OA query vlan members for default vlan, where each bridge port belongs by default, then after warm boot they are not removed (those vlan members) since OA acknowledges that they are there. And if it will not remove it, that buffer profile will be hanging unattached anywhere, until next warm boot, when it will be removed since OA after query will return first buffer profile. Actually cold boot scenario in comparison logic is a little bit different, this vlan member example applies to not remove vlans, because they were discovered in first cold boot scenario, but buffer profile was not on first cold boot. This get's a little bit complicated since same scenario we can apply to splitted ports, new 4 ports instead of 1 also were not in cold boot, and they should still exists and not be automatically removed by comparison logic. This is a little bit complicated, since only GET should qualify object for remove? let say GET got value of that buffer profile from previous boot, now we only have VID for it, so should comparison logic remove that object? (it still have reference count on that particular QUEUE/IPG, until new temporary view will release that object from current view). But what if we use that reference and assign it somewhere else? Or if we SET some attribute on that buffer profile not checking other attributes. How to distinguish those scenarios, and decide whether this object should be remove or not? Rather not, let say, that OA later on decides to use that buffer profile, query it and assign some values, and reuse it, but if that object was removed by comparison logic, then this would crash since no longer that object would exist. On issue itself, as you noted, since OA did only GET on that buffer profile, only VID is registered, and no attributes of this object exists in temporary view, but object is fully created on current view. So if no attributes, comparison logic tries to bring that object to "default" state, and hence it needs to know what was original state of the attribute, but that's not possible when attribute is mandatory_on_create, since those attribute don't have previous state. As a workaround here we can make a copy of existing attributes to temporary view (non oids) if object is empty, and assume that was just possible GET scenario. If the object would be "unmatched" from current/temp view, comparison logic would find best candidate and match good parameters to it and that would lead to assign new VID to existing object, there is a problem since OA, still have that old VID and "may" use it somewhere. This would lead to a crash, since OA would have 10 references of VIDs, and only 8 ware created. This also opens another problem, let say we have this scenario:
PS. this applies to all objects, not only buffer profile |
I will add a workaround for that case, but it would be much better if OA would skip query before apply view :P |
I found another issue, since that buffer profile have pool id assigned to it SAI_BUFFER_PROFILE_ATTR_POOL_ID, then that poll id also needs to be transferred to temporary view :( and as i suspected, this poll is not processed:
i'm not sure how to fix this yet, but for now the quickest workaround would be on OA side, not to do GET on queues/ipgs for buffer profile, just to assign it. |
Check issue sonic-net#899 for details.
Check issue sonic-net#899 for details.
Hey, i created possible fix for this issue: #906 please cherrypick that and test if it will solve the issue with buffer profile. And i track down that this GET responsible is this file: orchagent/pfcactionhandler.cpp
and
interestingly enough this was added my @marian-pritsak in 2017, it's wired why this surfaced just no. |
* [syncd] WIP buffer profile GET * Add object status serialize method * Add buffer profile get test * [syncd] Allow AsicView return dummy created object * Update aspell * [syncd] Transfer not processed objects to temp view Check issue #899 for details. * [tests] Add missing recording file for tests
…et#906) * [syncd] WIP buffer profile GET * Add object status serialize method * Add buffer profile get test * [syncd] Allow AsicView return dummy created object * Update aspell * [syncd] Transfer not processed objects to temp view Check issue sonic-net#899 for details. * [tests] Add missing recording file for tests
…et#906) * [syncd] WIP buffer profile GET * Add object status serialize method * Add buffer profile get test * [syncd] Allow AsicView return dummy created object * Update aspell * [syncd] Transfer not processed objects to temp view Check issue sonic-net#899 for details. * [tests] Add missing recording file for tests
Description:
Two extra zero buffer profiles are created in the 'temp' asic view if warmboot is executed while some queues have zero buffer profiles attached to them. The VIDs of these two extra zero buffer profiles in temp asic view match those in the 'current' asic view. However the attribute list in the temp asic view is empty for these matching VIDs and hence the comparison logic during warmboot reconciliation inside performObjectSetTransition fails with the errors mentioned below. We ran into this issue while running the PFC WD warmboot pytest, specifically the second sub-test (https://github.com/Azure/sonic-mgmt/blob/master/tests/pfcwd/test_pfcwd_warm_reboot.py#L25)
Steps to reproduce:
Execute the second scenario in the pfc watchdog warmboot test on platform that uses 'zero buffer profile' model to handle PFC storms.
To reproduce manually, perform the following steps:
(a) Enable PFC WD on all target port/queue.
(b) Send PFC storm to target port/queue and verify PFC storm is detected and mitigation action is executed.
(c) While PFC storm is continued to be sent, perform warmboot.
(d) Reconciliation will fail.
Describe the results you got:
From the syslog below, we notice that current asic view has 8 buffer profiles (6 from config_db and 2 zero buffer profiles). Temp asic view however has 10 buffer profiles.
Jul 26 22:37:00.333718 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP on current view 5 is different than on temporary view: 1
Jul 26 22:37:00.333718 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_POLICER on current view 3 is different than on temporary view: 0
Jul 26 22:37:00.333747 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_BUFFER_PROFILE on current view 8 is different than on temporary view: 10
Jul 26 22:37:00.333812 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP on current view 12 is different than on temporary view: 1
Jul 26 22:37:00.337738 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count is different on both view, there will be ASIC OPERATIONS!
Jul 26 22:37:00.407638 sonic-dut NOTICE syncd#syncd: :- processObjectForViewTransition: processing: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007a8
Jul 26 22:37:00.407638 sonic-dut NOTICE syncd#syncd: :- findCurrentBestMatch: found best match for SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x190000000007a8 since object status is MATCHED
Jul 26 22:37:00.407656 sonic-dut NOTICE syncd#syncd: :- processObjectForViewTransition: found best match SAI_OBJECT_TYPE_BUFFER_PROFILE: current: oid:0x190000000007a8 temporary: oid:0x190000000007a8
Jul 26 22:37:00.407656 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: first pass (curr): attr SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE
Jul 26 22:37:00.407672 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: performing default on existing object VID oid:0x190000000007a8: SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE: SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYN AMIC, we need default dependency TREE, FIXME
Jul 26 22:37:00.407672 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: Skipping create only attr on matched object: SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE:SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
Jul 26 22:37:00.407690 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: first pass (curr): attr SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH
Jul 26 22:37:00.407690 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: performing default on existing object VID oid:0x190000000007a8: SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH: -8, we need default dependency TRE E, FIXME
Jul 26 22:37:00.407704 sonic-dut ERR syncd#syncd: :- performObjectSetTransition: current attribute is mandatory on create, crate and set, and object MATCHED, FIXME oid:0x190000000007a8 SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_T H:-8
Jul 26 22:37:00.407704 sonic-dut ERR syncd#syncd: :- processObjectForViewTransition: performObjectSetTransition on MATCHED object (oid:0x190000000007a8) FAILED! bug?
Jul 26 22:37:00.407718 sonic-dut NOTICE syncd#syncd: :- applyViewTransition: comparison logic took 0.068258 sec
Jul 26 22:37:00.481019 sonic-dut ERR syncd#syncd: :- applyView: Exception: :- processObjectForViewTransition: performObjectSetTransition on MATCHED object (oid:0x190000000007a8) FAILED! bug?
Jul 26 22:37:00.491945 sonic-dut NOTICE syncd#syncd: :- applyView: apply took 1.044873 sec
Jul 26 22:37:00.492476 sonic-dut ERR swss#orchagent: :- syncd_apply_view: Failed to notify syncd APPLY_VIEW -1
0x190000000007a8 and 0x190000000007aa are the VIDs of the zero buffer profile that we see both in 'current' and 'temp' asic view. However, the temp version has NULL attribute list which is not handled in the object transition code.
Additional information you deem important
Following is from redis-cli output.
Before warmboot:
127.0.0.1:6379[1]> keys BUFFER_PROFILE
After warmboot:
127.0.0.1:6379[1]> keys BUFFER_PROFILE
However, the temp view VID has empty attribute list which is not handled in performObjectSetTransition
127.0.0.1:6379[1]> hgetall TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007a8
127.0.0.1:6379[1]> hgetall TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007aa
Debug Analysis:
On further debugging we observed the following (from syslog):
a. There are 2 ‘extra’ buffer profiles created in the ‘temp’ view with empty attribute list. We are not able to determine why /how they got created. However, the VID of those ‘temp’ objects match the VID of the ‘current’ objects (zero buffer profiles). Hence, they are marked as ‘MATCHED’.
b. However, in performObjectSetTransition, the temp object with matched VID of the zero buffer profile is parsed and since the attribute list is empty, it is never put in the ‘processed attr list’. Now, while parsing the attributes of the ‘current’ object (with VID of the zero buffer profile), we encounter failure because one of the attributes of the buffer profile (SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH) is ‘mandatory on create’ and ‘create and set’.
Relevant code snippet from performObjectSetTransition:
Sonic version
SONiC Software Version: SONiC.202012.Innovium.0-dirty-20210523.005101
Distribution: Debian 10.9
Kernel: 4.19.0-12-2-amd64
Build commit: d7b4c62f
Build date: Sun May 23 07:58:46 UTC 2021
Platform: x86_64-cel_midstone-r0
HwSKU: Midstone-200i
ASIC: innovium
ASIC Count: 1
The text was updated successfully, but these errors were encountered: