Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix adding flex counter to wrong context #1421

Merged
merged 8 commits into from
Oct 23, 2024

Conversation

byu343
Copy link
Contributor

@byu343 byu343 commented Sep 18, 2024

The counters for syncd (switch chip) were attempted to be added to gbsyncd (gearbox phys), and vice versa.
This issue is introduced by #1362 When setting the redis attribute
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP and
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER, the operation is applied to every contexts (both syncd and gbsyncd). However, the counters to initialize could only exist in one context.

The fix is to check that the target switch id exists in a specific context, for the set operation of any redis type attribute; if not, skip the set on that context.

Before the fix, we see the error below in syslog
ERR gbsyncd#syncd: :- processFlexCounterEvent: port VID oid:0x1000000000002, was not found (probably port was removed/splitted) and will remove from counters now

After the fix, log like below is printed on info level:
INFO swss#orchagent: :- containsSwitch: context phys failed to find switch oid:0x21000000000000

This change is needed by the SONiC release 202405 and later.

Copy link
Collaborator

@kcudnik kcudnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meet code cover requirements

lib/Sai.cpp Outdated
Comment on lines 234 to 239
if (objectType == SAI_OBJECT_TYPE_SWITCH && objectId != SAI_NULL_OBJECT_ID &&
(attr->id == SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER ||
attr->id == SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP))
{
if (!kvp.second->m_redisSai->containsSwitch(objectId))
{
continue;
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will not guarantee that the context is actually gbsyncd or syncd, no what basis this decision is made here ? from my perspective therre should be some other attribute passed to indicated which syncd it is or based on switch created attributes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When setting attribute SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER or SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP using sai_switch_api, the switch Id is set to gSwitchId or gearbox_oid, e.g., https://github.com/sonic-net/sonic-swss/blob/master/orchagent/saihelper.cpp#L878.

Before my change, code around here was looping through all the context using "for (auto& kvp: m_contextMap)" and setting the attributes to all contexts. My change is to check if the switch id passed through could match the context. Assumption is that gearbox_oid is only available in the gbsyncd context, switch chip oid is only available in the syncd context, which are always the case from my understanding.

There is some description in PR #1362:

Gearbox flex counter database
Pass the Phy OID, an OID of a SAI switch object in syntax, when calling the SAI set API to set the extended attributes. By doing so, the SAI redis objects can choose in which context the SAI API call should be invoked and the corresponding gearbox syncd docker container will handle it.

From my understanding, my PR here is to implement "choose in which context the SAI API call should be invoke based on OID of a SAI switch object".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your assumption is not enough, since you are assuming those 2 attributes will be passed only on grbsyncd OID, but if you pass other OID for other switch it will also execute it, sairedis should know internally whether those attributes belong to grsyncd, maybe based on create_switch

anyway, if you know outside the sairedis whether those attrbiutes should be passed or not, just don't pass them for non grbsyncd, and let that api loop throu all oids

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused about the comment. Let me explain the original issue without this change. Could you please take a look and give some idea on where to change to fix the issue if you still think the current location is not appropriate?

There are two contexts: (A) is syncd for the ASIC switch chip, and (B) is gbsyncd for multiple gearbox chips.

When the system is enabling counter for switch chip of context (A), it executes

sai_switch_api->set_switch_attribute(gSwitchId, &attr);

in https://github.com/sonic-net/sonic-swss/blob/master/orchagent/saihelper.cpp#L878
with SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP in the attr.

Then because of

for (auto& kvp: m_contextMap)

in https://github.com/sonic-net/sonic-sairedis/blob/master/lib/Sai.cpp#L232,
it loops through both context (A) and context (B) and runs the set operation on both. The parameters in the original set_switch_attribute is for ASIC switch chip, thus only relevant to context (A). As it also runs on context (B) we saw the errors like below in syslog that we want to get rid of (the port VID is actually in context (A))

ERR gbsyncd#syncd: :- processFlexCounterEvent: port VID oid:0x1000000000002, was not found (probably port was removed/splitted) and will remove from counters now

Then I proposed that instead of running on both, probably it should only run on context (A), and that is through checking the objectId from set_switch_attribute belongs to which context. Also the TODO line in https://github.com/sonic-net/sonic-sairedis/blob/master/lib/Sai.cpp#L227-L228 seems also implying we should run set on only one context under certain situation.

Above example is for enabling counter on ASIC switch chip of context (A). Meanwhile, when system is enabling counters on gearbox chips of context (B), we found similar error from syncd complaining the port VID is not found in syncd (actually it belongs to gbsyncd).

ERR syncd#syncd: :- processFlexCounterEvent: port VID oid:0x1010000000001, was not found (probably port was removed/splitted) and will remove from counters now

For you comments

anyway, if you know outside the sairedis whether those attributes should be passed or not, just don't pass them for non gbsyncd, and let that api loop through all oids

it seems hard for "don't pass them for non gbsyncd" with the existing code, as that executing on all contexts is done by sairedis: https://github.com/sonic-net/sonic-sairedis/blob/master/lib/Sai.cpp#L232. Note the loop is only for redis attribute: https://github.com/sonic-net/sonic-sairedis/blob/master/lib/Sai.cpp#L204

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i get the problem fully now, and easiest solution for that is this: https://github.com/sonic-net/sonic-sairedis/blob/master/lib/Sai.cpp#L227, apply only for specific context based on switch ID, instead of hard codding specific attributes, and apply to all contexts if switchid is nullobjectid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest change, the switch id check has been applied to the set operation of all redis attributes when the switch ID is not null.

@byu343 byu343 force-pushed the flex-counter-context branch 2 times, most recently from 8941ffc to 0266f34 Compare October 3, 2024 05:33
@byu343 byu343 force-pushed the flex-counter-context branch from 0266f34 to 28eb780 Compare October 3, 2024 17:02
byu343 added 4 commits October 8, 2024 09:39
The counters for syncd (switch chip) were attempted to be added to
gbsyncd (gearbox phys), and vice versa.
This issue is introduced by
sonic-net#1362
When setting the redis attribute
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP and
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER, the operation is applied to every
contexts (both syncd and gbsyncd). However, the counters to initialize
could only exist in one context.

The fix is to check that the target switch id exists in the context; if
not, skip the operation.
Normally, setting SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP and
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER should provide a valid switch id
and the setting is only applied to the context with the switch id.
Fall back to loop all contexts if SAI_NULL_OBJECT_ID is provided.
Extend the context check to all redis attrs
@byu343 byu343 force-pushed the flex-counter-context branch from 6aeb709 to 2281261 Compare October 8, 2024 16:39
@yaqiangz
Copy link

Hi @byu343 seems merging master branch is required

@StormLiangMS
Copy link

/azp run Azure.sonic-sairedis

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Copy link

/azp run Azure.sonic-sairedis

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@yaqiangz
Copy link

Hi @kcudnik, are we good to merge this PR?

@kcudnik kcudnik merged commit b8a8856 into sonic-net:master Oct 23, 2024
18 checks passed
@yaqiangz
Copy link

Hi @bingwang-ms could you please help to approve the backport request?

@bingwang-ms
Copy link
Contributor

@liushilongbuaa Could you check the auto cherry-pick? Seems it's broken for this repo. Thanks

yaqiangz pushed a commit to yaqiangz/sonic-sairedis that referenced this pull request Oct 28, 2024
The counters for syncd (switch chip) were attempted to be added to gbsyncd (gearbox phys), and vice versa.
This issue is introduced by sonic-net#1362 When setting the redis attribute
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP and
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER, the operation is applied to every contexts (both syncd and gbsyncd). However, the counters to initialize could only exist in one context.

The fix is to check that the target switch id exists in a specific context, for the set operation of any redis type attribute; if not, skip the set on that context.

Before the fix, we see the error below in syslog
ERR gbsyncd#syncd: :- processFlexCounterEvent: port VID oid:0x1000000000002, was not found (probably port was removed/splitted) and will remove from counters now

After the fix, log like below is printed on info level:
INFO swss#orchagent: :- containsSwitch: context phys failed to find switch oid:0x21000000000000

This change is needed by the SONiC release 202405 and later.
bingwang-ms pushed a commit that referenced this pull request Oct 28, 2024
The counters for syncd (switch chip) were attempted to be added to gbsyncd (gearbox phys), and vice versa.
This issue is introduced by #1362 When setting the redis attribute
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP and
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER, the operation is applied to every contexts (both syncd and gbsyncd). However, the counters to initialize could only exist in one context.

The fix is to check that the target switch id exists in a specific context, for the set operation of any redis type attribute; if not, skip the set on that context.

Before the fix, we see the error below in syslog
ERR gbsyncd#syncd: :- processFlexCounterEvent: port VID oid:0x1000000000002, was not found (probably port was removed/splitted) and will remove from counters now

After the fix, log like below is printed on info level:
INFO swss#orchagent: :- containsSwitch: context phys failed to find switch oid:0x21000000000000

This change is needed by the SONiC release 202405 and later.

Co-authored-by: byu343 <byu@arista.com>
shiraez pushed a commit to Marvell-switching/sonic-sairedis that referenced this pull request Dec 12, 2024
The counters for syncd (switch chip) were attempted to be added to gbsyncd (gearbox phys), and vice versa.
This issue is introduced by sonic-net#1362 When setting the redis attribute
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP and
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER, the operation is applied to every contexts (both syncd and gbsyncd). However, the counters to initialize could only exist in one context.

The fix is to check that the target switch id exists in a specific context, for the set operation of any redis type attribute; if not, skip the set on that context.

Before the fix, we see the error below in syslog
ERR gbsyncd#syncd: :- processFlexCounterEvent: port VID oid:0x1000000000002, was not found (probably port was removed/splitted) and will remove from counters now

After the fix, log like below is printed on info level:
INFO swss#orchagent: :- containsSwitch: context phys failed to find switch oid:0x21000000000000

This change is needed by the SONiC release 202405 and later.
shiraez pushed a commit to Marvell-switching/sonic-sairedis that referenced this pull request Dec 12, 2024
The counters for syncd (switch chip) were attempted to be added to gbsyncd (gearbox phys), and vice versa.
This issue is introduced by sonic-net#1362 When setting the redis attribute
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER_GROUP and
SAI_REDIS_SWITCH_ATTR_FLEX_COUNTER, the operation is applied to every contexts (both syncd and gbsyncd). However, the counters to initialize could only exist in one context.

The fix is to check that the target switch id exists in a specific context, for the set operation of any redis type attribute; if not, skip the set on that context.

Before the fix, we see the error below in syslog
ERR gbsyncd#syncd: :- processFlexCounterEvent: port VID oid:0x1000000000002, was not found (probably port was removed/splitted) and will remove from counters now

After the fix, log like below is printed on info level:
INFO swss#orchagent: :- containsSwitch: context phys failed to find switch oid:0x21000000000000

This change is needed by the SONiC release 202405 and later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants