Fix flex counter out-of-order issue by notifying counter operations using SelectableChannel #3076
Please fix build issues
Hi @kcudnik
Will review
As mentioned in the description, the PR depends on sairedis PR (sonic-net/sonic-sairedis#1362) merged a week ago.
It fetched
Rebased to resolve the conflict.
extern bool gTraditionalFlexCounter;

vector<sai_object_id_t> gGearboxOids;

unique_ptr<DBConnector> gFlexCounterDb;
unique_ptr<ProducerTable> gFlexCounterGroupTable;
unique_ptr<ProducerTable> gFlexCounterTable;
unique_ptr<DBConnector> gGearBoxFlexCounterDb;
unique_ptr<ProducerTable> gGearBoxFlexCounterGroupTable;
unique_ptr<ProducerTable> gGearBoxFlexCounterTable;
These all seem related, so they could be moved into a struct/class, but leaving them as they are seems acceptable for now, since OA needs a ground-up refactoring to move all global state into proper classes/namespaces.
I believe this will work, but some of the changes I proposed would be desirable.
@@ -261,6 +272,20 @@ void initSaiApi()
    sai_log_set(SAI_API_TWAMP, SAI_LOG_LEVEL_NOTICE);
}

void initFlexCounterTables()
yea, looks like a constructor task for proposed class :D
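As a rough illustration of the refactoring suggested in this thread, the related globals could be grouped into one class whose constructor does the setup. This is only a sketch under stated assumptions: `DBConnector` and `ProducerTable` below are minimal stand-ins rather than the real swss-common classes, and the class name `FlexCounterTables` and the table name strings are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Minimal stand-ins for the swss-common classes; the real ones differ.
struct DBConnector
{
    explicit DBConnector(std::string dbName) : name(std::move(dbName)) {}
    std::string name;
};

struct ProducerTable
{
    ProducerTable(DBConnector *dbConn, std::string tableName)
        : db(dbConn), table(std::move(tableName)) {}
    DBConnector *db;
    std::string table;
};

// Hypothetical class bundling one DB connection with its two producer tables.
class FlexCounterTables
{
public:
    explicit FlexCounterTables(const std::string &dbName)
        : m_db(std::make_unique<DBConnector>(dbName)),
          m_groupTable(std::make_unique<ProducerTable>(m_db.get(), "FLEX_COUNTER_GROUP_TABLE")),
          m_counterTable(std::make_unique<ProducerTable>(m_db.get(), "FLEX_COUNTER_TABLE"))
    {
        assert(!dbName.empty());
    }

    ProducerTable &groupTable() { return *m_groupTable; }
    ProducerTable &counterTable() { return *m_counterTable; }

private:
    // Declared first so it is constructed before the tables that point at it.
    std::unique_ptr<DBConnector> m_db;
    std::unique_ptr<ProducerTable> m_groupTable;
    std::unique_ptr<ProducerTable> m_counterTable;
};
```

With such a class, the regular and gearbox variants would be two instances constructed with different database names instead of six separate globals.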
@@ -261,6 +272,20 @@ void initSaiApi()
    sai_log_set(SAI_API_TWAMP, SAI_LOG_LEVEL_NOTICE);
}

void initFlexCounterTables()
Should this be static inline?
No need. It is called from outside this compilation unit.
orchagent/saihelper.cpp
Outdated
    }
    else
    {
        sai_s8_list.list = nullptr;
Either use initSaiRedisCounterEmptyParameter or remove that function.
Fixed
{
    if (sai_switch_api == nullptr)
    {
        // This can happen during destruction of the orchagent daemon.
Why would the syncd operation counter be notifying syncd when OA is being destroyed? sai_switch_api should be destroyed last. The only scenario I can imagine is that SAI API initialization failed and sai_switch_api is still NULL, but in that case no other operations should be performed and OA should quit. If the switch API is NULL you can do nothing: it means the sairedis library initialization failed, so you will not send any notification anywhere. Such an initialization failure should be logged with critical priority, not just error (not this log).
This is for the mock test scenario only.
In the production scenario, the check if (sai_switch_api == nullptr) won't be true.
static inline void operateFlexCounterDbSingleField(std::vector<FieldValueTuple> &fvTuples,
                                                   const string &field, const string &value)
{
    if (!field.empty() && !value.empty())
Is this guard here for a reason, or can OA actually populate an empty counter at some point? If either of these is empty, it should be logged as an ERROR.
This is also for the mock test.
The common logic to install a plugin is:
- load the Lua plugin and generate its SHA code
- install the Lua plugin into the flex counter group as {plugin name: plugin SHA code}
In the mock test scenario, loading the Lua plugin fails, which leaves the SHA code as an empty string. But we still want the Lua plugin to be installed into the flex counter group. By doing so, we can verify whether the plugin name is correct and meet the coverage requirement.
In the production scenario, we check whether either is empty and, if so, do not install the plugin. This is just a protection and cannot happen.
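The guard discussed in this thread can be sketched as a small self-contained helper. Only the signature and the `if` condition come from the diff above; the body of the branch, the `bool` return value, and the field names in the usage note are assumptions added for illustration, and `FieldValueTuple` is declared here as a stand-in string pair.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the swss-common type of the same name (assumed to be a string pair).
using FieldValueTuple = std::pair<std::string, std::string>;

// Appends (field, value) only when both are non-empty and reports whether it did.
// The bool return is an illustrative addition; the helper in the PR returns void.
static inline bool operateFlexCounterDbSingleField(std::vector<FieldValueTuple> &fvTuples,
                                                   const std::string &field,
                                                   const std::string &value)
{
    if (!field.empty() && !value.empty())
    {
        fvTuples.emplace_back(field, value);
        return true;
    }
    // The reviewer suggests logging this case as an error.
    return false;
}
```

For example, calling it with a hypothetical pair such as `("POLL_INTERVAL", "1000")` appends one tuple, while a call with an empty value leaves the vector unchanged.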
    if (gTraditionalFlexCounter)
    {
        operateFlexCounterGroupDatabase(group, poll_interval, stats_mode, plugin_name, plugins, operation, is_gearbox);
        return;
    }
Normally in OOP, such cases are resolved properly by having an abstract base class and a pointer to that class in OA. Based on the OA command line (whether we want the traditional counters or the new ones), just one of the two objects is created and assigned to the pointer. Then all operations are the same (add a field to a counter, add a counter, remove a counter, etc.), and nowhere in the code is a gTraditionalFlexCounter check needed. I saw it is checked in multiple places, and the old and new counters provide the same functionality in the end. I'm aware that at this stage this conversion may not be possible without refactoring OA; it should have been designed this way from the start.
Yes, look at all the functions below: they all contain the same check, and adding new code could confuse an unaware programmer, so moving this into two classes, one new and one old, would be a great solution.
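A minimal sketch of the design described in these two comments might look like the following. None of these names exist in the PR: `FlexCounterBackend`, `TraditionalFlexCounter`, `SaiRedisFlexCounter`, and `makeFlexCounterBackend` are hypothetical, and the methods return a string only so the chosen path is observable in a toy example.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical common interface for both counter mechanisms.
class FlexCounterBackend
{
public:
    virtual ~FlexCounterBackend() = default;
    virtual std::string setGroupPollInterval(const std::string &group,
                                             const std::string &interval) = 0;
};

// Old path: would write to the FLEX_COUNTER_GROUP table in FLEX_COUNTER_DB.
class TraditionalFlexCounter : public FlexCounterBackend
{
public:
    std::string setGroupPollInterval(const std::string &group,
                                     const std::string &interval) override
    {
        return "db:" + group + ":" + interval;
    }
};

// New path: would notify through an attribute on the SAI switch object.
class SaiRedisFlexCounter : public FlexCounterBackend
{
public:
    std::string setGroupPollInterval(const std::string &group,
                                     const std::string &interval) override
    {
        return "sai:" + group + ":" + interval;
    }
};

// The command-line choice is consulted once, here, instead of in every helper.
std::unique_ptr<FlexCounterBackend> makeFlexCounterBackend(bool traditional)
{
    assert(sizeof(bool) > 0); // placeholder precondition; real code would validate options
    if (traditional)
    {
        return std::make_unique<TraditionalFlexCounter>();
    }
    return std::make_unique<SaiRedisFlexCounter>();
}
```

Callers would hold a single `std::unique_ptr<FlexCounterBackend>` and never test a global flag such as `gTraditionalFlexCounter` themselves.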
I added some potential refactoring suggestions that would be desirable, and I approve of the current solution, but I don't have permission to approve the PR; you need approval from @qiluo-msft or @prsunny.
@qiluo-msft @prsunny, can you please approve and merge?
…nter polling and create counter group SAI switch API is invoked to enable counter polling and create counter group CLI option to choose whether the new infra should be used Signed-off-by: Stephen Sun <stephens@nvidia.com>
What I did
Fix flex counter out-of-order issue by notifying counter operations using SelectableChannel
Depends on sonic-net/sonic-sairedis#1362
Why I did it
Currently, the operations on SAI objects and their counters (if any) are triggered through different channels, which introduces race conditions:
- SAI object operations are notified through the SelectableChannel,
- counter operations are notified through the FLEX_COUNTER and FLEX_COUNTER_GROUP tables in the FLEX_COUNTER_DB.
syncd can receive events in the wrong order. For example, if it receives the destruction of an object first and the request to stop counter polling on that object second, it can poll counters for a non-existent object, which causes errors in the vendor SAI.
The new solution is to extend the SAI redis attributes on the SAI_SWITCH_OBJECT to notify counter polling. As a result, all objects and their counters are notified through a unified channel, the SelectableChannel.
How I verified it
Manual test
Regression test
Mock test
Details if related
- [...] ProducerTable) are initialized during orchagent initialization, before any flex counter operations.
- [...] m_flexCounterTable and m_flexCounterGroupTable)
- [...] switch OID.
- [...] ConsumerTable, ProducerTable mechanism to communicate between OA and sairedis. However, it works for P2P scenarios only. There is logic for ConsumerTable to consume the update once a gearbox syncd sees it, leaving all the remaining gearbox syncd daemons seeing nothing. As a result, for each update there is only one gearbox syncd that sees and handles it.
- [...] VIRTORID table, which is a workaround (WA) for the counter out-of-order issue. Now that the issue has been fixed in the new approach, the check is no longer needed.
Performance analysis
The counter operations are handled in the same thread in both the new and old solutions.
In swss, the counter operation was asynchronous in the old solution and is synchronous now, which can introduce a bit more latency. However, as the number of counter operations is small, no performance degradation is observed.
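The ordering problem that motivates the synchronous, single-channel design can be illustrated with a toy model. This is not syncd or sairedis code: `ToySyncd`, `Event`, and the two `run...` functions are hypothetical, and the "split" case shows just one interleaving that two independent channels allow.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <string>

enum class Op { CreateObject, RemoveObject, StartPolling, StopPolling };

struct Event
{
    Op op;
    std::string oid;
};

// Toy consumer: tracks live and polled objects and counts invalid transitions.
struct ToySyncd
{
    std::set<std::string> objects;
    std::set<std::string> polled;
    int errors = 0;

    void handle(const Event &e)
    {
        switch (e.op)
        {
        case Op::CreateObject:
            objects.insert(e.oid);
            break;
        case Op::StartPolling:
            // Polling an object that does not exist is an error.
            if (objects.count(e.oid) == 0) ++errors;
            polled.insert(e.oid);
            break;
        case Op::StopPolling:
            polled.erase(e.oid);
            break;
        case Op::RemoveObject:
            // Removing an object that is still being polled is an error.
            if (polled.count(e.oid) != 0) ++errors;
            objects.erase(e.oid);
            break;
        }
    }
};

// One FIFO: events are handled in the order the producer sent them.
int runUnified(const std::deque<Event> &channel)
{
    ToySyncd syncd;
    for (const auto &e : channel) syncd.handle(e);
    return syncd.errors;
}

// Two FIFOs: order holds within each channel but not across them.
// Draining the object channel first is one interleaving the consumer may see.
int runSplitObjectFirst(const std::deque<Event> &objectChannel,
                        const std::deque<Event> &counterChannel)
{
    ToySyncd syncd;
    for (const auto &e : objectChannel) syncd.handle(e);
    for (const auto &e : counterChannel) syncd.handle(e);
    return syncd.errors;
}
```

Feeding the sequence create, start polling, stop polling, remove through `runUnified` yields no errors, while splitting the same four events across two channels and draining the object channel first makes the consumer start polling an object that no longer exists.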