Add counters enabling redesign document #918
doc/counters_enabling_redesign.md
### Ports Daemon

enable_counters.py script will be refactored to be running as a daemon.
this should be merged into the orchagent
please also add the config db schema and yang schema for this feature.
@lguohan
Updated the design according to your request.
regarding the config db schema & yang model - why do we need to add config db schema? we don't add anything new to DB - it's all internal.
If you mean the delay_status attributes, we already have schema & yang for them.
https://github.com/Azure/sonic-swss-common/blob/master/common/schema.h#L225
https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/yang-models/sonic-flex_counter.yang#L49
I think we need to add something in the config db to set the delay time, for example 300 seconds, so that users can specify how long they want to delay.
@lguohan
This PR is about moving from the sleep mechanism to an event-driven one.
I agree that adding an option of how much delay the user wants is good, but I don't think it's related to this design.
If you want, our team will take this feature, but in another design and plan.
what do you think?
Will this change increase the time cost of reconciliation after fast/warm reboot since the counter is expected to be enabled earlier after this change?
@vaibhavhd please review from the fast/warm reboot perspective.
@vaibhavhd any feedback on warm/fast boot? we wish to close the coding part and would like to get your feedback soon please.
For fast/warm reboot -
More for fast-reboot: @shlomibitton with this new enable_counters re-design, do we still need Since enable_counters.py script is now not needed, and the delay mechanism is local to sonic-net/sonic-buildimage#8500
@vaibhavhd Yes, we don't need it anymore.
doc/counters_enabling_redesign.md
## Requirements

Remove enable_counters.py script and merge the logic into FlexCountersOrch.
FlexCountersOrch will wait for system to be up using events from APP DB, then enable the counters using the logic written in enable_counters.py.
> FlexCountersOrch will wait for system to be up

Please define/specify what is meant by `system to be up`.
Done
doc/counters_enabling_redesign.md
- The daemon will also wait for a timer in order to be able to enable counters even if one of the ports is not stable.

If after 3 minutes (180 seconds), the counters were not enabled yet, enable counters.
What is the offset for `After 3 minutes`? Is it reboot? Please specify that.
Also, in the present design we wait for 3m if the uptime was less than 5mins. I think you are covering this.
However we also have a wait for 1m in cases other than post-reboot. Do you not want to cover this scenario, and why not?
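The delay rule described in this comment can be sketched as follows. This is only an illustration, not the code in enable_counters.py: the function name `legacy_delay_seconds` and the constant names are hypothetical, and the thresholds are taken from the comment above (3 minutes when uptime is below 5 minutes, 1 minute otherwise).

```python
# Hypothetical names; thresholds come from the review comment above.
RECENT_BOOT_UPTIME_SEC = 5 * 60   # uptime below this counts as "just rebooted"
POST_REBOOT_DELAY_SEC = 3 * 60    # wait applied shortly after a reboot
DEFAULT_DELAY_SEC = 60            # wait applied in all other cases


def legacy_delay_seconds(uptime_sec: float) -> int:
    """Return how long the sleep-based design waits before enabling counters."""
    if uptime_sec < RECENT_BOOT_UPTIME_SEC:
        return POST_REBOOT_DELAY_SEC
    return DEFAULT_DELAY_SEC
```

Under these assumptions, a redesign that keeps only the 3-minute timer would need to state explicitly what happens in the second branch.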
Fixed
The daemon will create a SelectableTimer and wait for it to expire.

When the doTask(SelectableTimer*) function is triggered, it will check if the counters are already enabled.
If not, enable them.
Chances of race-condition here? What happens if counters enabling is in progress and the `SelectableTimer` is also triggered?
Also, after the enabling of counters, what happens to the new data structure defined in `FlexCountersOrch`?
There can be two cases to be covered here - event-driven mechanism (new design) and timeout (old design).
> Chances of race-condition here? What happens if counters enabling is in progress and the `SelectableTimer` is also triggered?

If counters enabling is in progress, it means the daemon has already set a boolean to True, so when the timer expires, it will check the boolean and will not enable the counters again.

> Also, after the enabling of counters, what happens to the new data structure defined in `FlexCountersOrch`? There can be two cases to be covered here - event-driven mechanism (new design) and timeout (old design).

The data structure will be in use only for the event-driven mechanism.
If we get to a point where we need to enable the counters, we will enable the counters for all ports, as in the previous design.
Note: We are using the same behavior as in the enable_counters.py script, but here we will trigger the counters enabling when all ports and LAGs are in their expected state.
If the system is not stable, we will fall back to the previous design - wait for the timer to expire.
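A minimal sketch of the behavior described in this reply, for illustration only. The class `CounterEnabler`, its method names, and the `enable_all` callback are hypothetical and do not correspond to actual FlexCountersOrch code. The sketch assumes the port/LAG event handler and the timer handler run on the same thread, so a plain boolean is enough as a guard; a multi-threaded variant would need a lock.

```python
from typing import Callable, Dict, Iterable


class CounterEnabler:
    """Hypothetical stand-in for the enabling logic discussed in this thread."""

    def __init__(self, expected: Iterable[str], enable_all: Callable[[], None]) -> None:
        # Readiness per port/LAG name; nothing is ready until an event says so.
        self._ready: Dict[str, bool] = {name: False for name in expected}
        self._enable_all = enable_all
        self._enabled = False  # the boolean guard described in the reply

    def _enable_once(self) -> bool:
        """Enable counters for all ports unless that already happened."""
        if self._enabled:
            return False
        # Set the flag before doing the work so a later trigger is a no-op.
        self._enabled = True
        self._enable_all()
        return True

    def on_port_event(self, name: str, in_expected_state: bool) -> bool:
        """Event-driven path: enable once every tracked port/LAG is as expected."""
        if name in self._ready:
            self._ready[name] = in_expected_state
        if all(self._ready.values()):
            return self._enable_once()
        return False

    def on_timer_expired(self) -> bool:
        """Fallback path: enable regardless of port state when the timer fires."""
        return self._enable_once()
```

Both paths go through `_enable_once`, so whichever trigger arrives second returns `False` without enabling the counters a second time.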
@vaibhavhd @shlomibitton
When the doTask(SelectableTimer*) function is triggered, it will check if the counters are already enabled.
If not, enable them.
Please include a section for fast/cold/warm reboot and dynamic port breakout scenarios.
@noaOrMlnx: 8498931 to 8837dc2
Following an offline discussion, we will improve the fastboot flow and won't move forward with this HLD, so it is closed.
enable_counters.py script sleeps for 3 minutes before enabling the counters.
The purpose of this design is to replace the sleep mechanism with an event-driven one.
After the change, the script logic will be in flexCounterOrch, which will wait for the system to be stable and only then enable the counters.