
SONiC FM (Fault Mgmt) infrastructure -Base version #421

Open
wants to merge 5 commits into base: master


@shyam77git shyam77git commented Jan 10, 2024

HLD PR: sonic-net/SONiC#1527

Additional relevant code PR: sonic-net/sonic-utilities#3154

Summary

This adds a generic FM (Fault Management) infrastructure to SONiC for fault analysis and handling.
It broadly comprises the three entities described below.

Description

  1. Faults (Events) publisher daemon (faultpubd) is spawned as a micro-service on the Linux host. It keeps running and provides the following with respect to system events (faults, alarms):
    • The overall workflow is generic (common to all events)
    • Formulates certain fault events
    • Subscribes to the event framework to receive events
    • Runs an event_receive worker thread which periodically looks for live incoming events on the ZMQ channel
      • On reception of an event, parses it against sonic-events*.yang and publishes (adds) it as a new EVENT_TABLE entry in redisDB
    • This event can then be consumed from redisDB by services/apps (such as faultmgrd) for event handling

Note: faultpubd functionality is part of a separate PR, as it is needed only until the following prerequisite is satisfied.
Prerequisite: Certain code PRs (esp. committing sonic-event.yang and publishing events to redisDB) are yet to be committed. Refer to sonic-net/SONiC#1409

  2. Faults manager daemon (faultmgrd) runs as a micro-service on the Linux host and performs the following tasks:
    • registers with and listens to redisDB
    • receives events and parses them against the schema (sonic-event.yang)
    • performs a lookup for fault type & severity in the fault_policy.json file to determine the fault action
  3. The fault_policy.json file comprises generic and platform-specific F-A (Fault-Action) blocks, i.e. for a particular fault type & severity, which action(s) are needed (to recover the system from the fault).
    It abstracts platform/HWSKU fault handling from the open source NOS (i.e. SONiC).
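The lookup step described for faultmgrd can be sketched roughly as follows. This is only an illustration: the actual fault_policy.json schema is not shown in this PR, so the "generic"/"platform" section names, the field names, the fault type and action strings, and the rule that a platform block overrides a generic one are all assumptions. The fallback to the detector's recommended action is likewise assumed.

```python
import json

# Hypothetical policy layout (assumption): the real fault_policy.json
# schema is not part of this description.
POLICY_JSON = """
{
  "generic": [
    {"type": "TEMPERATURE_EXCEEDED", "severity": "MAJOR", "actions": ["syslog"]},
    {"type": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL", "actions": ["syslog", "reboot"]}
  ],
  "platform": [
    {"type": "TEMPERATURE_EXCEEDED", "severity": "MAJOR", "actions": ["syslog", "fan-speed-max"]}
  ]
}
"""


def build_index(policy):
    """Index F-A blocks by (type, severity).

    Assumption: platform blocks are loaded last so they override generic ones.
    """
    index = {}
    for section in ("generic", "platform"):
        for block in policy.get(section, []):
            index[(block["type"], block["severity"])] = block["actions"]
    return index


def lookup_actions(index, fault_type, severity, recommended=None):
    """Return the policy actions for a fault.

    Assumption: if the policy has no matching block, fall back to the
    action recommended by the fault source/detector, if any.
    """
    actions = index.get((fault_type, severity))
    if actions is not None:
        return actions
    return [recommended] if recommended else []


index = build_index(json.loads(POLICY_JSON))
```

Keying the index on the (type, severity) pair keeps each lookup a single dictionary access, and loading the platform section last is one simple way to let a platform-supplied block take precedence.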

Motivation and Context

The objective of producing the FM HLD and this code PR is two-fold:
a) Not every SONiC NOS deployment has an external controller to take action when a fault occurs. In that case, SONiC (with its underlying platform) is expected to take the required action to recover the system/chassis from the fault.
b) The platform-supplied 'Fault-Action Policy table' has a holistic/system-level view of the platform (chassis/board/HWSKU) and can gauge the right action required to recover the system from the fault. It can either go with the recommended action (provided by the FDR - fault source/detector) or override it with the system-level one.

The Fault Manager module serves the purpose of taking the necessary action(s) to log and handle faults.
It's a new (infra) feature and is planned for 202405.
It is not planned for any double commit.

How Has This Been Tested?

Please refer to attached logs where:

  1. Fault producer (faultpubd) injects and produces faults (as events)
  2. Fault Manager (faultmgrd) receives/reads them (from the DB) and parses/analyzes them against the fault-action policy table to determine the needed actions (to recover from the fault)

Additional Information (Optional)

     - This adds a generic FM infrastructure to SONiC
       for fault analysis and handling. Broadly
       comprising the following three entities:
       1. Faults (Events) publisher daemon which
          formulates certain fault events and
          populates them to EVENT_TABLE in redisDB
       2. Faults manager daemon which gets events
          from redisDB, parses them against schema
          (sonic-event.yang), performs lookup for
          fault type & severity in fault_policy.json
          file to determine fault action
       3. fault_policy.json file comprises generic
          and platform specific F-A (Fault-Action)
          blocks i.e. for a particular fault type &
          severity, which action(s) are needed
          (to recover the system from the fault).
          It abstracts platform/HWSKU fault handling
          nuances from the open source NOS (e.g. SONiC)

Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
@bmridul bmridul self-requested a review January 11, 2024 19:26
shyam77git (Author) commented Jan 16, 2024

Following UT scenarios validated:
UT scenario A:
Injected temperature sensor fault (temp exceeded) by instrumenting thermalctld, published it via chassis redisDB EVENT_TABLE entry, received and processed it (based on fault_action policy table).
UT scenario B:
Injected events-monit (resource) 'monit periodic test', published it via chassis redisDB EVENT_TABLE entry, received and processed it (based on fault_action policy table).

UT logs (Fault injection, publishing, storing in redisDB; processing from redisDB to take needed action(s)):
FM-thermal-fault-injection.txt
FM-UT-logs.txt
FM-redisDB-faults-events.txt
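The inject, publish, and store flow exercised in these UT scenarios can be sketched in a few lines. This is a minimal in-memory illustration, not the code under test: a `queue.Queue` stands in for the event channel, a plain dict stands in for the redisDB EVENT_TABLE, and the `EVENT_TABLE|<seq>` key format, event field names, and fault type strings are all hypothetical.

```python
import queue
import threading

# In-memory stand-ins (assumptions): a Queue for the event channel and
# a dict for the redisDB EVENT_TABLE.
channel = queue.Queue()
event_table = {}
table_lock = threading.Lock()
STOP = object()  # sentinel that tells the worker thread to exit


def event_receive_worker():
    """Drain the channel and add each event as a new EVENT_TABLE entry."""
    seq = 0
    while True:
        event = channel.get()
        if event is STOP:
            break
        seq += 1
        with table_lock:
            # Hypothetical key format: table name plus a sequence number.
            event_table[f"EVENT_TABLE|{seq}"] = event


def inject_fault(fault_type, severity, source):
    """Simulate an instrumented daemon raising a fault as an event."""
    channel.put({"type": fault_type, "severity": severity, "source": source})


worker = threading.Thread(target=event_receive_worker)
worker.start()
inject_fault("TEMPERATURE_EXCEEDED", "CRITICAL", "thermalctld")  # scenario A
inject_fault("MONIT_PERIODIC_TEST", "MINOR", "monit")            # scenario B
channel.put(STOP)
worker.join()
```

After `worker.join()` returns, `event_table` holds one entry per injected fault in injection order, which is the state a consumer such as faultmgrd would then read and process.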

prgeor (Collaborator) commented Jan 23, 2024

@shyam77git can you update the link to HLD in the PR description?

prgeor (Collaborator) commented Jan 23, 2024

@Junchao-Mellanox @keboliu can you review?

    - Added faultmgrd micro-service and timer service
    - Added faultpubd micro-service and timer service

Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
shyam77git (Author) commented

Quoting @prgeor: "@shyam77git can you update the link to HLD in the PR description?"

Added HLD PR link at the top of the description section.

@shyam77git shyam77git marked this pull request as ready for review February 7, 2024 03:24
    - Determined reboot cause from the fault entry
    - Passed the reboot cause as an argument to
      system 'reboot' invocation
    - Updated the mechanism to fetch chassis type
      (fixed or modular)
    - Removed faultpubd micro-service and moved it
      out, as sonic FM HLD focuses on faultmgrd

Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
   - Determined reboot cause from the fault entry
   - Passed the reboot cause as an argument to
     system 'reboot' invocation
   - Updated the mechanism to fetch chassis type
     (fixed or modular)
   - Removed faultpubd micro-service and moved it
     out, as sonic FM HLD focuses on the Fault
     Management via faultmgrd

Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
    - Changed redisDB interface to state DB
      at global (host)
    - Updated FM communication with redis DB
      to subscriber model and listening to
      DB's EVENT_TABLE SET and DEL operations
    - Misc cleanup

Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
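The subscriber model mentioned in this commit, reacting to EVENT_TABLE SET and DEL operations, might be structured along these lines. The sketch covers only the dispatch logic and leaves out the actual redis subscription. The `(operation, key, fields)` notification shape, the handler names, and the interpretation of DEL as "fault cleared" are assumptions, not taken from the PR code.

```python
# Hypothetical notification shape (assumption): (operation, key, fields).
active_faults = {}


def on_set(key, fields):
    """A fault entry was added or updated: track it and report it for handling."""
    active_faults[key] = fields
    fault_type = fields.get("type", "unknown")
    severity = fields.get("severity", "unknown")
    return f"handle {fault_type} ({severity})"


def on_del(key, _fields):
    """A fault entry was removed: stop tracking it (assumed to mean 'cleared')."""
    existed = active_faults.pop(key, None) is not None
    return f"cleared {key}" if existed else f"ignored {key}"


HANDLERS = {"SET": on_set, "DEL": on_del}


def dispatch(notification):
    """Route one table notification to the handler for its operation."""
    op, key, fields = notification
    handler = HANDLERS.get(op)
    if handler is None:
        return f"unsupported operation {op}"
    return handler(key, fields)


log = [dispatch(n) for n in [
    ("SET", "EVENT_TABLE|1", {"type": "TEMPERATURE_EXCEEDED", "severity": "CRITICAL"}),
    ("DEL", "EVENT_TABLE|1", {}),
    ("DEL", "EVENT_TABLE|1", {}),  # duplicate delete is ignored
]]
```

Keeping the handlers in a small table makes it straightforward to reject unexpected operations explicitly and to tolerate a DEL for a key that is no longer tracked.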
prgeor (Collaborator) commented Feb 15, 2024

@shyam77git

Quoting the PR description:
Faults manager daemon (faultmgrd) runs as a micro-service at Linux host and performs following tasks:
    • registers and listens to redisDB
    • receives events and parses them against schema (sonic-event.yang)
    • perform lookup for fault type & severity in fault_policy.json file to determine fault action

I suggest having this change in sonic-buildimage rather than in this repo, as the fault manager is intended to run on the host as a host service.
