Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warmboot reconciliation fails when zero buffer profiles are attached to some queues #899

Closed
svsivm opened this issue Aug 16, 2021 · 8 comments · Fixed by #906
Closed

Warmboot reconciliation fails when zero buffer profiles are attached to some queues #899

svsivm opened this issue Aug 16, 2021 · 8 comments · Fixed by #906
Assignees

Comments

@svsivm
Copy link

svsivm commented Aug 16, 2021

Description:
Two extra zero buffer profiles are created in the 'temp' asic view if warmboot is executed while some queues have zero buffer profiles attached to them. The VIDs of these two extra zero buffer profiles in temp asic view match those in the 'current' asic view. However the attribute list in the temp asic view is empty for these matching VIDs and hence the comparison logic during warmboot reconciliation inside performObjectSetTransition fails with the errors mentioned below. We ran into this issue while running the PFC WD warmboot pytest, specifically the second sub-test (https://github.com/Azure/sonic-mgmt/blob/master/tests/pfcwd/test_pfcwd_warm_reboot.py#L25)

Steps to reproduce:
Execute the second scenario in the pfc watchdog warmboot test on platform that uses 'zero buffer profile' model to handle PFC storms.

To reproduce manually, perform the following steps:
(a) Enable PFC WD on all target port/queue.
(b) Send PFC storm to target port/queue and verify PFC storm is detected and mitigation action is executed.
(c) While PFC storm is continued to be sent, perform warmboot.
(d) Reconciliation will fail.

Describe the results you got:
From the syslog below, we notice that current asic view has 8 buffer profiles (6 from config_db and 2 zero buffer profiles). Temp asic view however has 10 buffer profiles.

Jul 26 22:37:00.333718 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP on current view 5 is different than on temporary view: 1
Jul 26 22:37:00.333718 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_POLICER on current view 3 is different than on temporary view: 0
Jul 26 22:37:00.333747 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_BUFFER_PROFILE on current view 8 is different than on temporary view: 10
Jul 26 22:37:00.333812 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP on current view 12 is different than on temporary view: 1
Jul 26 22:37:00.337738 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count is different on both view, there will be ASIC OPERATIONS!
Jul 26 22:37:00.407638 sonic-dut NOTICE syncd#syncd: :- processObjectForViewTransition: processing: SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007a8
Jul 26 22:37:00.407638 sonic-dut NOTICE syncd#syncd: :- findCurrentBestMatch: found best match for SAI_OBJECT_TYPE_BUFFER_PROFILE oid:0x190000000007a8 since object status is MATCHED
Jul 26 22:37:00.407656 sonic-dut NOTICE syncd#syncd: :- processObjectForViewTransition: found best match SAI_OBJECT_TYPE_BUFFER_PROFILE: current: oid:0x190000000007a8 temporary: oid:0x190000000007a8
Jul 26 22:37:00.407656 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: first pass (curr): attr SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE
Jul 26 22:37:00.407672 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: performing default on existing object VID oid:0x190000000007a8: SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE: SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYN AMIC, we need default dependency TREE, FIXME
Jul 26 22:37:00.407672 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: Skipping create only attr on matched object: SAI_BUFFER_PROFILE_ATTR_THRESHOLD_MODE:SAI_BUFFER_PROFILE_THRESHOLD_MODE_DYNAMIC
Jul 26 22:37:00.407690 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: first pass (curr): attr SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH
Jul 26 22:37:00.407690 sonic-dut NOTICE syncd#syncd: :- performObjectSetTransition: performing default on existing object VID oid:0x190000000007a8: SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH: -8, we need default dependency TRE E, FIXME
Jul 26 22:37:00.407704 sonic-dut ERR syncd#syncd: :- performObjectSetTransition: current attribute is mandatory on create, crate and set, and object MATCHED, FIXME oid:0x190000000007a8 SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_T H:-8
Jul 26 22:37:00.407704 sonic-dut ERR syncd#syncd: :- processObjectForViewTransition: performObjectSetTransition on MATCHED object (oid:0x190000000007a8) FAILED! bug?
Jul 26 22:37:00.407718 sonic-dut NOTICE syncd#syncd: :- applyViewTransition: comparison logic took 0.068258 sec
Jul 26 22:37:00.481019 sonic-dut ERR syncd#syncd: :- applyView: Exception: :- processObjectForViewTransition: performObjectSetTransition on MATCHED object (oid:0x190000000007a8) FAILED! bug?
Jul 26 22:37:00.491945 sonic-dut NOTICE syncd#syncd: :- applyView: apply took 1.044873 sec
Jul 26 22:37:00.492476 sonic-dut ERR swss#orchagent: :- syncd_apply_view: Failed to notify syncd APPLY_VIEW -1

0x190000000007a8 and 0x190000000007aa are the VIDs of the zero buffer profile that we see both in 'current' and 'temp' asic view. However, the temp version has NULL attribute list which is not handled in the object transition code.

Additional information you deem important
Following is from redis-cli output.
Before warmboot:
127.0.0.1:6379[1]> keys BUFFER_PROFILE

  1. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006f1"
  2. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007a8" <<<<< Zero buffer profile
  3. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006ee"
  4. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006ed"
  5. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006ec"
  6. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007aa" <<<<< Zero buffer profile
  7. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006f0"
  8. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006ef"

After warmboot:
127.0.0.1:6379[1]> keys BUFFER_PROFILE

  1. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007d3"
  2. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006ef"
  3. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006ed"
  4. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x1900000000087c"
  5. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007d0"
  6. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007d5"
  7. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007d2"
  8. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007a8" <<<<< Matching zero buffer profile VID in temp view
  9. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006ee"
  10. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x1900000000087a"
  11. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007aa" <<<<< Matching zero buffer profile VID in temp view
  12. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007d1"
  13. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006ec"
  14. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006f0"
  15. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000006f1"
  16. "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007d4"
  17. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007aa"
  18. "ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007a8"

However, the temp view VID has empty attribute list which is not handled in performObjectSetTransition
127.0.0.1:6379[1]> hgetall TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007a8

  1. "NULL"
  2. "NULL"
    127.0.0.1:6379[1]> hgetall TEMP_ASIC_STATE:SAI_OBJECT_TYPE_BUFFER_PROFILE:oid:0x190000000007aa
  3. "NULL"
  4. "NULL"

Debug Analysis:
On further debugging we observed the following (from syslog):

a. There are 2 ‘extra’ buffer profiles created in the ‘temp’ view with empty attribute list. We are not able to determine why /how they got created. However, the VID of those ‘temp’ objects match the VID of the ‘current’ objects (zero buffer profiles). Hence, they are marked as ‘MATCHED’.

b. However, in performObjectSetTransition, the temp object with matched VID of the zero buffer profile is parsed and since the attribute list is empty, it is never put in the ‘processed attr list’. Now, while parsing the attributes of the ‘current’ object (with VID of the zero buffer profile), we encounter failure because one of the attributes of the buffer profile (SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH) is ‘mandatory on create’ and ‘create and set’.

Relevant code snippet from performObjectSetTransition:

for (auto &at: temporaryObj->getAllAttributes())
    {
        auto &temporaryAttr = at.second;

        SWSS_LOG_NOTICE("first pass (temp): attr %s", temporaryAttr->getStrAttrId().c_str());

        const auto meta = temporaryAttr->getAttrMetadata();

        const sai_attribute_t &attr = *temporaryAttr->getSaiAttr();

        processedAttributes.insert(attr.id); // mark attr id as processed <<<<<< Since temp object attr list is empty, this list is empty too.
……
   }

    for (auto &ac: currentBestMatch->getAllAttributes())
    {
        auto &currentAttr = ac.second;

        const auto &meta = currentAttr->getAttrMetadata();

        const sai_attribute_t &attr = *currentAttr->getSaiAttr();

        if (processedAttributes.find(attr.id) != processedAttributes.end())   <<<<<<<<< This is false for the current zero buffer profile object.
        {
            /*
             * This attribute was processed in previous temporary attributes processing so skip it here.
             */

            continue;
        }
……..
……..
                SWSS_LOG_ERROR("current attribute is mandatory on create, crate and set, and object MATCHED, FIXME %s %s:%s",
                        currentBestMatch->m_str_object_id.c_str(),
                        meta->attridname,
                        currentAttr->getStrAttrValue().c_str());
}

Sonic version
SONiC Software Version: SONiC.202012.Innovium.0-dirty-20210523.005101
Distribution: Debian 10.9
Kernel: 4.19.0-12-2-amd64
Build commit: d7b4c62f
Build date: Sun May 23 07:58:46 UTC 2021
Platform: x86_64-cel_midstone-r0
HwSKU: Midstone-200i
ASIC: innovium
ASIC Count: 1

@kcudnik
Copy link
Collaborator

kcudnik commented Aug 16, 2021

i will need to deeper read your analysis, but if you say that there are 8 buffers in asic db after warm boot and then there are 10 in the temp view, then this means after warm boot OA is setting extra 2 buffers for some reason.

please provide sairedis.rec and syslog from both those scenarios

@kcudnik kcudnik self-assigned this Aug 17, 2021
@svsivm
Copy link
Author

svsivm commented Aug 20, 2021

github_logs.zip
Please find the syslog and the sairedis logs for before and after warmboot scenarios attached.

@kcudnik
Copy link
Collaborator

kcudnik commented Aug 20, 2021

thank you, sorry but i didn't have time yet to dig through this issue, i will spent some time on weekend on this and get back with results on monday

@kcudnik
Copy link
Collaborator

kcudnik commented Aug 22, 2021

I was able to reproduce this locally. I started doing some analysis, and found that number of created buffer profiles is in both cases is 8. So where is that 10 coming from ? from here:

2021-08-17.04:19:31.376976|g|SAI_OBJECT_TYPE_INGRESS_PRIORITY_GROUP:oid:0x1a0000000000b9|SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE=oid:0x0
2021-08-17.04:19:31.379199|G|SAI_STATUS_SUCCESS|SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE=oid:0x1900000000044a
2021-08-17.04:19:31.389772|G|SAI_STATUS_SUCCESS|SAI_QUEUE_ATTR_BUFFER_PROFILE_ID=oid:0x19000000000448
2021-08-17.04:19:31.389809|s|SAI_OBJECT_TYPE_QUEUE:oid:0x150000000001bb|SAI_QUEUE_ATTR_BUFFER_PROFILE_ID=oid:0x1900000000051a
2021-08-17.04:19:31.415208|a|APPLY_VIEW

After warm boot, before calling APPLY_VIEW, OA query (get) buffer profile on SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE and on SAI_QUEUE_ATTR_BUFFER_PROFILE_ID and decides to set new profiles on those attributes. Since new 6 profiles are already created, then 2 belong to previous build (from get), and since after query OA wants to changed them, it creates 2 new profiles. And in total it's 10 of them.

Small problem, happens here, since on first cold boot 8 profiles are created, then on second boot again 8 are created, which should be matched, but by doing GET operation on those profiles, OA brings 2 profiles from previous VIEW. This gives 10 of them. What's happens in comparison logic is that those 2 buffer profiles, have the same VID as in current view and temporary VIEW, this is normal, but comparison logic if see 2 of exact same VID's it marks object as MATCHED, which means this object exists in both VIEWS, this can happen in 2 ways:

  • OA did GET operation and get returned VID of existing object
  • syncd moved current view objects to temporary view (for example, if OA didn't query QUEUESs, syncd moves QUEUEs to temporary view, since we don't want them to be removed, just because OA didn't explicitly query for them. This is actually a special case, this goes for queues, ingress priority groups and scheduler groups + discovered vids on switch create)

But so far we didn't have case that this matched object would need to be removed from current view as in this case right here.

A am thinking how to make a fix for this.

@kcudnik
Copy link
Collaborator

kcudnik commented Aug 22, 2021

I came to realization that this previous buffer profile, when doing GET is acknowledged by OA that it exists, for example, same scenario with VLAN members, if OA query vlan members for default vlan, where each bridge port belongs by default, then after warm boot they are not removed (those vlan members) since OA acknowledges that they are there. And if it will not remove it, that buffer profile will be hanging unattached anywhere, until next warm boot, when it will be removed since OA after query will return first buffer profile.

Actually cold boot scenario in comparison logic is a little bit different, this vlan member example applies to not remove vlans, because they were discovered in first cold boot scenario, but buffer profile was not on first cold boot. This get's a little bit complicated since same scenario we can apply to splitted ports, new 4 ports instead of 1 also were not in cold boot, and they should still exists and not be automatically removed by comparison logic.

This is a little bit complicated, since only GET should qualify object for remove? let say GET got value of that buffer profile from previous boot, now we only have VID for it, so should comparison logic remove that object? (it still have reference count on that particular QUEUE/IPG, until new temporary view will release that object from current view). But what if we use that reference and assign it somewhere else? Or if we SET some attribute on that buffer profile not checking other attributes. How to distinguish those scenarios, and decide whether this object should be remove or not?

Rather not, let say, that OA later on decides to use that buffer profile, query it and assign some values, and reuse it, but if that object was removed by comparison logic, then this would crash since no longer that object would exist.

On issue itself, as you noted, since OA did only GET on that buffer profile, only VID is registered, and no attributes of this object exists in temporary view, but object is fully created on current view. So if no attributes, comparison logic tries to bring that object to "default" state, and hence it needs to know what was original state of the attribute, but that's not possible when attribute is mandatory_on_create, since those attribute don't have previous state.

As a workaround here we can make a copy of existing attributes to temporary view (non oids) if object is empty, and assume that was just possible GET scenario. If the object would be "unmatched" from current/temp view, comparison logic would find best candidate and match good parameters to it and that would lead to assign new VID to existing object, there is a problem since OA, still have that old VID and "may" use it somewhere. This would lead to a crash, since OA would have 10 references of VIDs, and only 8 ware created.

This also opens another problem, let say we have this scenario:

  1. cold boot
    • OA create buffer profile (assigns some attributes different than default) and assign it to QUEUE
  2. warm boot
    • OA GET previous buffer profile, and assign new one.
    • now since OA have previous profile reference, query some of it attributes, and decides to leave them since they are fine
    • OA perform apply view and syncd too (now what should happen to buffer profile, should we get attributes to default state?)
    • OA assigns previous buffer profile (with assumed attributes) but syncd brought those attributes to default state
    • OA now have inconsistent knowledge about buffer profile parameters

PS. this applies to all objects, not only buffer profile

@kcudnik
Copy link
Collaborator

kcudnik commented Aug 22, 2021

I will add a workaround for that case, but it would be much better if OA would skip query before apply view :P

@kcudnik
Copy link
Collaborator

kcudnik commented Aug 22, 2021

I found another issue, since that buffer profile have pool id assigned to it SAI_BUFFER_PROFILE_ATTR_POOL_ID, then that poll id also needs to be transferred to temporary view :(

and as i suspected, this poll is not processed:

ltsyncd          : checkObjectsStatus: object was not processed: SAI_OBJECT_TYPE_BUFFER_POOL oid:0x180000000005a7, status: 0 (ref: 1)
ltsyncd          : checkObjectsStatus: 1 objects were not processed
ltsyncd          : applyView: Exception: e:- checkObjectsStatus: 1 objects were not processed
ltsyncd          : applyView: apply took 0.210514 sec
ltsaiplayer       : performNotifySyncd: response status SAI_STATUS_SUCCESS is different than syncd status SAI_STATUS_FAILURE

i'm not sure how to fix this yet, but for now the quickest workaround would be on OA side, not to do GET on queues/ipgs for buffer profile, just to assign it.

kcudnik added a commit to kcudnik/sonic-sairedis that referenced this issue Aug 25, 2021
kcudnik added a commit to kcudnik/sonic-sairedis that referenced this issue Aug 25, 2021
@kcudnik
Copy link
Collaborator

kcudnik commented Aug 25, 2021

Hey, i created possible fix for this issue: #906 please cherrypick that and test if it will solve the issue with buffer profile.

And i track down that this GET responsible is this file: orchagent/pfcactionhandler.cpp

538     attr.id = SAI_INGRESS_PRIORITY_GROUP_ATTR_BUFFER_PROFILE;
539
540     // Get PG's buffer profile
541     status = sai_buffer_api->get_ingress_priority_group_attribute(pg, 1, &attr);

and

521     attr.id = SAI_QUEUE_ATTR_BUFFER_PROFILE_ID;
522     attr.value.oid = ZeroBufferProfile::getZeroBufferProfile(false);
523
524     // Set our zero buffer profile
525     status = sai_queue_api->set_queue_attribute(queue, &attr);

interestingly enough this was added my @marian-pritsak in 2017, it's wired why this surfaced just no.
@marian-pritsak can you explain why we do first GET and the SET on queue or ipg, and if it's not matched we are doing SET operation? this seems to be causing some trouble since we only have a "GET" oid without internals

kcudnik added a commit that referenced this issue Sep 4, 2021
* [syncd] WIP buffer profile GET

* Add object status serialize method

* Add buffer profile get test

* [syncd] Allow AsicView return dummy created object

* Update aspell

* [syncd] Transfer not processed objects to temp view

Check issue #899
for details.

* [tests] Add missing recording file for tests
kcudnik added a commit to kcudnik/sonic-sairedis that referenced this issue Sep 22, 2021
…et#906)

* [syncd] WIP buffer profile GET

* Add object status serialize method

* Add buffer profile get test

* [syncd] Allow AsicView return dummy created object

* Update aspell

* [syncd] Transfer not processed objects to temp view

Check issue sonic-net#899
for details.

* [tests] Add missing recording file for tests
pettershao-ragilenetworks pushed a commit to pettershao-ragilenetworks/sonic-sairedis that referenced this issue Nov 18, 2022
…et#906)

* [syncd] WIP buffer profile GET

* Add object status serialize method

* Add buffer profile get test

* [syncd] Allow AsicView return dummy created object

* Update aspell

* [syncd] Transfer not processed objects to temp view

Check issue sonic-net#899
for details.

* [tests] Add missing recording file for tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants