Handle service start-limit-hit failure event case in sysmonitor #16009

Open
sg893052 wants to merge 1 commit into base: 202211


sg893052
Contributor

@sg893052 sg893052 commented Aug 1, 2023

Why I did it

To fix #15935

Work item tracking
  • Microsoft ADO (number only):

How I did it

Issue #15935 occurs when a service encounters a start-limit-hit failure. This change handles that failure result from the sysbus event in the sysmonitor processing code, so the service is reported as down with the reason start-limit-hit.
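
For readers unfamiliar with the failure mode, here is a minimal sketch of the idea, not the code in this PR. It assumes the event handler receives a unit's ActiveState and Result properties as a plain dict; the function name classify_unit and the dict shape are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: names are illustrative and not taken from sysmonitor.

def classify_unit(props):
    """Return (service_status, down_reason) from a dict of systemd unit properties.

    Assumes `props` holds the unit's ActiveState and Result values, for example
    as delivered by a property-change event on the system bus.
    """
    active_state = props.get("ActiveState", "")
    result = props.get("Result", "")

    # A service that exceeded its start rate limit is left in a failed state
    # with Result=start-limit-hit; report it as Down with that reason.
    if result == "start-limit-hit":
        return "Down", "start-limit-hit"
    if active_state == "active":
        return "OK", "-"
    if active_state == "failed":
        return "Down", result or "failed"
    # Any other state (activating, inactive, ...) is treated as not up here.
    return "Down", active_state or "unknown"
```

Checking the Result value before ActiveState is a design choice of this sketch: it keeps the down reason specific even if the state string alone would only say "failed".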

How to verify it

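Simulate a start-limit-hit failure for a service, then run the show system-health sysready-status CLI command. The affected service should be listed as Down with the Down-Reason start-limit-hit.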

[#]show system-health sysready-status
System is not ready - one or more services are not up

Service-Name                      Service-Status    App-Ready-Status    Down-Reason
--------------------------------  ----------------  ------------------  ---------------
as7326-56x-pddf-platform-monitor  OK                OK                  -
auditd                            OK                OK                  -
bgp                               OK                OK                  -
caclmgrd                          OK                OK                  -
config-chassisdb                  OK                OK                  -
config-setup                      OK                OK                  -
containerd                        OK                OK                  -
cron                              OK                OK                  -
database                          OK                OK                  -
determine-reboot-cause            OK                OK                  -
docker                            OK                OK                  -
eventd                            OK                OK                  -
kdump-tools                       OK                OK                  -
lldp                              OK                OK                  -
mgmt-framework                    OK                OK                  -
netfilter-persistent              OK                OK                  -
ntp                               OK                OK                  -
opennsl-modules                   OK                OK                  -
pddf-platform-init                OK                OK                  -
pmon                              OK                OK                  -
procdockerstatsd                  OK                OK                  -
radv                              OK                OK                  -
ras-mc-ctl                        OK                OK                  -
rasdaemon                         OK                OK                  -
rsyslog                           OK                OK                  -
smartmontools                     OK                OK                  -
snmp                              OK                OK                  -
sonic-hostservice                 OK                OK                  -
ssh                               OK                OK                  -
swss                              OK                OK                  -
syncd                             OK                OK                  -
sysstat                           OK                OK                  -
teamd                             OK                OK                  -
telemetry                         Down              Down                start-limit-hit
[#]
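
To check the result without reading the table by eye, the output above could be filtered with a small script. This is only a sketch: it assumes the columns are whitespace-separated exactly as shown, and the helper name down_services is hypothetical.

```python
def down_services(cli_output):
    """Return {service_name: down_reason} for rows whose Service-Status is not OK.

    Assumes the table layout shown above: a dashed separator line followed by
    rows of four whitespace-separated columns.
    """
    down = {}
    in_table = False
    for line in cli_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            # Rows start after the dashed separator under the header.
            in_table = True
            continue
        if not in_table:
            continue
        fields = stripped.split()
        if len(fields) < 4:
            # Skip blank lines and the trailing prompt.
            continue
        name, status = fields[0], fields[1]
        reason = " ".join(fields[3:])
        if status != "OK":
            down[name] = reason
    return down


sample = """System is not ready - one or more services are not up

Service-Name  Service-Status  App-Ready-Status  Down-Reason
------------  --------------  ----------------  ---------------
swss          OK              OK                -
telemetry     Down            Down              start-limit-hit
[#]
"""

print(down_services(sample))  # {'telemetry': 'start-limit-hit'}
```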

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@sg893052 sg893052 requested a review from lguohan as a code owner August 1, 2023 09:52
@sg893052
Contributor Author

sg893052 commented Aug 3, 2023

The build failed for the following reason, which is unrelated to the submitted change. Requesting that an authorized person trigger a rebuild, as the submitter does not have permission to do so.

2023-08-01T12:18:17.7072456Z Operation failed with exception: Exception('Test plan id: 64c8e5f08317667c692289be, status: FAILED, result: EXECUTING, Elapsed 3727 seconds. Check https://elastictest.org/scheduler/testplan/64c8e5f08317667c692289be for test plan status')
2023-08-01T12:18:17.7226762Z ##[error]Bash exited with code '3'.
2023-08-01T12:18:17.7252380Z ##[section]Finishing: Run test

@liat-grozovik
Collaborator

liat-grozovik commented Aug 15, 2023

@sg893052 when can you get the checks passing, please?

@liat-grozovik
Collaborator

Also, I see the fix is on 202211, where we also found it not working. Can you confirm the issue is not present on 202305 and above? Otherwise we need a fix there as well.

@dgsudharsan
Collaborator

@sg893052 The fix needs to go to master, 202205, and 202305 as well. The workflow is to raise the PR against master first and then cherry-pick it to the other branches. Can you please raise a PR against master?

@sg893052
Contributor Author

sg893052 commented Aug 16, 2023

@sg893052 The fix needs to go to master, 202205, and 202305 as well. The workflow is to raise the PR against master first and then cherry-pick it to the other branches. Can you please raise a PR against master?

Sure, I shall raise a PR against master.

PR created for master -> #16174
