SONiC FM (Fault Mgmt) infrastructure -Base version #421
base: master
This adds a generic FM infrastructure to SONiC for fault analysis and handling. It broadly comprises the following three entities:
1. Fault (event) publisher daemon, which formulates certain fault events and populates them into EVENT_TABLE in redisDB.
2. Fault manager daemon, which gets events from redisDB, parses them against the schema (sonic-event.yang), and looks up the fault type and severity in the fault_policy.json file to determine the fault action.
3. fault_policy.json file, which comprises generic and platform-specific F-A (Fault-Action) blocks, i.e. for a particular fault type and severity, which action(s) are needed to recover the system from the fault.
It abstracts platform/HWSKU fault-handling nuances from the open source NOS (e.g. SONiC).
Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
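The lookup step described in the commit message can be sketched as follows. This is only an illustration under stated assumptions: the JSON layout (the `generic` and `platform` sections, the `type`, `severity` and `actions` keys), the action names, and the rule that a platform-specific block takes precedence over a generic one are all hypothetical and are not taken from the actual fault_policy.json in this PR.

```python
import json

# Hypothetical policy layout; the real fault_policy.json schema is defined
# by the PR and the HLD, not by this sketch.
POLICY_JSON = """
{
  "generic": [
    {"type": "TEMPERATURE", "severity": "CRITICAL", "actions": ["syslog", "reboot"]},
    {"type": "TEMPERATURE", "severity": "MINOR", "actions": ["syslog"]}
  ],
  "platform": [
    {"type": "TEMPERATURE", "severity": "CRITICAL", "actions": ["syslog", "shutdown"]}
  ]
}
"""


def build_index(policy):
    """Map (fault type, severity) to a list of actions.

    Assumption: platform blocks are loaded after generic blocks, so a
    platform-specific entry replaces a generic one with the same key.
    """
    index = {}
    for section in ("generic", "platform"):
        for block in policy.get(section, []):
            index[(block["type"], block["severity"])] = block["actions"]
    return index


def lookup_actions(index, fault_type, severity, default=("syslog",)):
    """Return the actions for a fault, or a default when no F-A block matches."""
    return list(index.get((fault_type, severity), default))


policy_index = build_index(json.loads(POLICY_JSON))
print(lookup_actions(policy_index, "TEMPERATURE", "CRITICAL"))
```

Indexing the blocks once by (type, severity) keeps each lookup to a single dictionary access; whether the real daemon does this, or scans the file per event, is not stated in the PR.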
The following UT scenarios were validated. UT logs (fault injection, publishing, and storing in redisDB; processing from redisDB to take the needed action(s)):
@shyam77git can you update the link to HLD in the PR description?
@Junchao-Mellanox @keboliu can you review?
- Added faultmgrd micro-service and timer service
- Added faultpubd micro-service and timer service
Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
Added HLD PR link at the top of the description section.
- Determined reboot cause from the fault entry
- Passed the reboot cause as an argument to system 'reboot' invocation
- Updated the mechanism to fetch chassis type (fixed or modular)
- Removed faultpubd micro-service and moved it out, as the SONiC FM HLD focuses on faultmgrd
Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
- Determined reboot cause from the fault entry
- Passed the reboot cause as an argument to system 'reboot' invocation
- Updated the mechanism to fetch chassis type (fixed or modular)
- Removed faultpubd micro-service and moved it out, as the SONiC FM HLD focuses on Fault Management via faultmgrd
Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
- Changed redisDB interface to state DB at global (host)
- Updated FM communication with redis DB to a subscriber model, listening to the DB's EVENT_TABLE SET and DEL operations
- Misc cleanup
Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
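A minimal sketch of how a subscriber might react to the two EVENT_TABLE operations named in this commit. It models only the dispatch step: the delivery mechanism (for example, a subscriber on the state DB) is left out, and the function name, key format, field names and return values are hypothetical rather than taken from faultmgrd.

```python
def handle_event_op(active_faults, key, op, fields):
    """Update the in-memory view of active faults for one table notification.

    Assumption: SET announces a new or updated fault entry that should be
    analyzed, and DEL announces that the entry was removed.
    """
    if op == "SET":
        active_faults[key] = dict(fields)
        return "analyze"
    if op == "DEL":
        active_faults.pop(key, None)  # tolerate DEL for an unknown key
        return "clear"
    return "ignore"


# Notifications as (key, operation, field-value pairs); contents are made up.
notifications = [
    ("fault|1", "SET", {"type": "TEMPERATURE", "severity": "CRITICAL"}),
    ("fault|2", "SET", {"type": "FAN", "severity": "MINOR"}),
    ("fault|1", "DEL", {}),
]

active = {}
results = [handle_event_op(active, key, op, fields)
           for key, op, fields in notifications]
print(results, sorted(active))
```

Reacting to notifications avoids polling the table on a timer, which appears to be the point of moving to the subscriber model.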
I suggest having this change in sonic-buildimage rather than in this repo, as the Fault manager is intended to run on the host as a host service.
HLD PR: sonic-net/SONiC#1527
Additional relevant code PR: sonic-net/sonic-utilities#3154
Summary
This adds a generic FM infrastructure to SONiC for fault analysis and handling.
It broadly comprises three entities: a fault (event) publisher daemon, a fault manager daemon, and the fault_policy.json file.
Description
Note: faultpubd functionality is part of a separate PR, as it is needed until the following prerequisite is satisfied.
Prerequisite: Certain code PRs (esp. those committing sonic-event.yang and publishing events to redisDB) are yet to be committed. Refer to sonic-net/SONiC#1409
It abstracts platform/HWSKU fault handling from the open source NOS (i.e. SONiC).
Motivation and Context
The objective of the FM HLD and this code PR is two-fold:
a) Not every SONiC NOS deployment has an external controller to take action when a fault occurs. In that case, SONiC (with its underlying platform) is expected to take the required action to recover the system/chassis from the fault.
b) The platform-supplied 'Fault-Action Policy table' has a holistic, system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover the system from the fault. It can either go with the recommended action (provided by the FDR, i.e. the fault source/detector) or override it with the system-level one.
The Fault Manager module serves the purpose of taking the necessary action(s) to log and handle the faults.
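The choice in (b) between the recommended action and a system-level override can be sketched as below. This is an assumption-laden illustration: the function name, the use of `None` to mean "the policy table has no entry", and the action strings are hypothetical and do not come from the PR.

```python
def resolve_action(recommended_action, policy_action=None):
    """Pick the action to execute for a fault.

    Assumption: when the platform policy table supplies an action, it
    overrides the one recommended by the fault source/detector; otherwise
    the recommendation is used as-is.
    """
    if policy_action is not None:
        return policy_action
    return recommended_action


# The detector recommends a reboot, but the platform policy prefers a
# less disruptive action for this fault.
print(resolve_action("reboot", "power-cycle-linecard"))
# No policy entry: fall back to the detector's recommendation.
print(resolve_action("reboot"))
```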
It's a new (infra) feature, planned for 202405.
Not planned for any double commit.
How Has This Been Tested?
Please refer to attached logs where:
Additional Information (Optional)