SONiC FM (Fault Manager) HLD #1527
base: master
Mistakenly added right under SONiC/ instead of SONiC/doc
Generic Fault Management Infra document
Enhanced the HLD with the following: updated workflows; added Fault-Action Policy Table sample.
Added section describing all the steps in the block diagram.
Added FM use-cases table. Added Revision as 1.0 (as this revision is an Initial Draft for External review)
Updated the Revision number for Initial Draft (for review)
Added Fault's end-to-end WorkFlow sequence section.
Perhaps you can add special handling to avoid endless reboots and shutdowns.
{
"type" : "TEMPERATURE_EXCEEDED",
"severity" : "CRITICAL",
"action" : ["syslog", "obfl", "reload"]
I am not sure obfl is supported by all vendors, so you might want to rename it to a generic term such as "platform-log". Same comment for other places in the doc where obfl is mentioned.
Are we planning to store faults in a separate table with action performed on them? This will be helpful to know the faults over time in the switch.
- Action may range from logging (disk, OBFL flash, etc.) to reload/shutdown.
- Taking action would either be by itself (i.e. in its own micro-service) or by delegating it to the action's owner
7. Tabulate event entry (along with action taken) for book-keeping purposes
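The "take action, then tabulate" steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `LOCAL_HANDLERS`, `do_syslog`, `handle_fault`, the `delegate` callback, and the record fields are hypothetical names, not the HLD's actual design or EventDB schema.

```python
from datetime import datetime, timezone

def do_syslog(fault):
    # Hypothetical local handler; a real one would write to syslog.
    return f"syslog: {fault['type']}"

# Actions the fault manager performs itself (assumed set, for illustration).
LOCAL_HANDLERS = {"syslog": do_syslog}

def handle_fault(fault, actions, delegate):
    """Run each action locally if a handler exists, else delegate it.

    Returns a book-keeping record describing what was done.
    """
    taken = []
    for action in actions:
        handler = LOCAL_HANDLERS.get(action)
        if handler is not None:
            handler(fault)
            taken.append((action, "local"))
        else:
            # Hand the action to its owner (e.g. a reboot utility).
            delegate(action, fault)
            taken.append((action, "delegated"))
    return {
        "type": fault["type"],
        "severity": fault["severity"],
        "actions_taken": taken,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping the record separate from the handlers means the same entry can be stored wherever the final design decides to keep fault history.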
Would be useful to show the Alarm/fault entry schema as represented in EventDB.
# Fault's End-to-End WorkFlow Sequence
The following workflow depicts the end-to-end fault (event) flow from fault generation to fault handling.
![Fault Management (FM) Workflow sequence](https://github.com/shyam77git/SONiC/assets/69485234/2b453a1b-6e14-48c6-bf61-ab978e62a3bf)
Would be useful to mention some examples of processes/daemons which act as FDR.
{
"chassis": {
Please add a config-db schema for action configuration for the faults, and a SONiC YANG model.
We need to consider the VS platform as well. Maybe, by default, no generic "fault_action_policy.json" is populated; platform files can provide the default actions, and the user can override them if required.
Are you planning to have fault-manager enable/disable config knob as well? Global config knob would be useful to disable all actions.
- https://github.com/sonic-net/sonic-buildimage/tree/master/src/sonic-yang-models/yang-models
- sonic-events-swss.yang
- sonic-events-host.yang
- sonic-events-bgp.yang etc.
Are we planning to take any actions for the events (legacy ones) via fault manager?
3) Analyze them (in a generic way) against the above-mentioned Policy Table
4) Take action based on the lookup/match in Policy Table
5) Action could either be generic or platform specific
Please add an "out of scope" section to mention controller-driven fault manager, FM in chassis, etc.
{
"type": "FANS MISSING",
"severity": "CRITICAL",
"action" : ["syslog", "obfl", "shutdown"]
Can this action also be to take "tech-support" or to execute some script? E.g., in case of a critical event, the user may want to log all the states for later analysis.
{
"chassis": {
"name": "PID or HWSKU",
Why do you need this PID or HWSKU? PID may be changing dynamically, do you want to provide the config knob at the process level granularity?
"type" : "CUSTOM_EVPROFILE_CHANGE",
"severity" : "MAJOR",
"action" : ["syslog"]
syslog is the default action, correct? Do we need it as part of the fault-manager action?
1. Formulate platform/HWSKU specific Fault-Action Policy Table (json or yaml file)
- There would be generic (default) table if none provided by platform
- A platform supplied file would override the default one
2. Introduce a new micro-service (fault_manager) at host (Linux Kernel)
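The override rule in step 1 (a platform-supplied file replaces the generic default) could look roughly like this. It is a minimal sketch assuming JSON files and whole-file replacement rather than per-entry merging; `load_policy` and both path arguments are hypothetical names, and the real file locations are not given in this excerpt.

```python
import json
import os

def load_policy(platform_path, default_path):
    """Return the platform policy if its file exists, otherwise the default.

    Assumption: the platform file replaces the default as a whole;
    entries from the two files are not merged.
    """
    path = platform_path if os.path.isfile(platform_path) else default_path
    with open(path) as f:
        return json.load(f)
```

Whole-file replacement keeps the behavior easy to reason about; a merge strategy would need rules for conflicting entries, which the excerpt does not define.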
What is the plan? Is the fault manager a dedicated docker container or service, or is it colocated with the EventD docker container?
5. Analyze them against Fault-Action Policy Table (file)
- Take fault_type and fault_severity as input from the fetched event and perform lookup
on these fields in Fault-Action Policy Table to determine the action(s) needed
6. Handle the fault (i.e. take action) based on action(s) specified in Fault-Action Policy Table
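The lookup in step 5 can be sketched as a dictionary keyed on (fault_type, fault_severity). The three entries are the samples quoted in this review; the flat list layout, the `build_index` and `lookup_actions` helpers, and the empty-list result for an unmatched fault are assumptions, since the full table schema is not shown here.

```python
# Sample entries as quoted in the review; the flat list layout is an assumption.
POLICY = [
    {"type": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL",
     "action": ["syslog", "obfl", "reload"]},
    {"type": "FANS MISSING", "severity": "CRITICAL",
     "action": ["syslog", "obfl", "shutdown"]},
    {"type": "CUSTOM_EVPROFILE_CHANGE", "severity": "MAJOR",
     "action": ["syslog"]},
]

def build_index(entries):
    """Index policy entries by (type, severity) for constant-time lookup."""
    return {(e["type"], e["severity"]): list(e["action"]) for e in entries}

def lookup_actions(index, fault_type, fault_severity):
    """Return the configured actions, or an empty list when nothing matches."""
    return index.get((fault_type, fault_severity), [])
```

For example, `lookup_actions(build_index(POLICY), "FANS MISSING", "CRITICAL")` yields the syslog, obfl, and shutdown actions from the second entry.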
Can external controllers override the fault manager policies/actions?
202405 release fork date is coming, can you please accelerate the code PR review and merge the PR by end of 5/30? Thanks.
@shyam77git can you please update the PR Description with the list of the Code PRs?
@liat-grozovik will help to follow-up with the reviewers, if no update, will defer it to future release
HLD is not approved, move to backlog
code PRs (corresponding to this FM HLD PR)
Fault Manager daemon (faultmgrd): sonic-platform-daemons: sonic-net/sonic-platform-daemons#421
Reboot: sonic-utilities repo: sonic-net/sonic-utilities#3154
Basic Information (context)
Any failure (or an error) impacting a system/chassis or a sub-system is regarded as a fault.
Faults are broadly classified into SW (Software) and HW (Hardware) faults.
They may occur at any of the following stages of system's functioning:
Present State
In SONiC, a fault is represented as an Event or an Alarm.
SONiC has an Event Framework (described in its own HLD) that lets an event detector publish its events to the eventD redisDB.
However, there is no Fault Manager/Handler that can take the needed, platform-specified action(s) to recover the system from the generated fault.
Need for this feature
This feature aims at adding a generic FM (Fault Management) Infrastructure which can do the following:
Action could either be generic or platform specific
Benefits
The platform-supplied 'Fault-Action Policy Table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover from the fault. It can either go with the recommended action (provided by the fault source/detector) or override it with the system-level one.
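One way to read the "recommended versus system-level" choice is sketched below. The precedence rule (policy actions win when present, otherwise the detector's recommendation is used) and the `resolve_actions` name are assumptions made for illustration; the description does not state how the choice is made.

```python
def resolve_actions(recommended, policy_actions):
    """Prefer the policy's system-level actions; otherwise keep the
    detector's recommended actions (assumed precedence)."""
    return list(policy_actions) if policy_actions else list(recommended)
```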