[System-ready] System-ready status sometimes is not reflecting the correct status #15935

Closed
dgsudharsan opened this issue Jul 21, 2023 · 10 comments · Fixed by #16174

@dgsudharsan
Collaborator

Description

The system-ready status does not reflect the actual state of the system. In the example below, telemetry is shown as OK even though the app has exited, and the overall system status reads "not ready" although every service is listed as OK.

show system-health sysready-status
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
auditd                  OK                OK                  -
bgp                     OK                OK                  -
caclmgrd                OK                OK                  -
config-chassisdb        OK                OK                  -
config-setup            OK                OK                  -
containerd              OK                OK                  -
cron                    OK                OK                  -
database                OK                OK                  -
determine-reboot-cause  OK                OK                  -
doai                    OK                OK                  -
docker                  OK                OK                  -
eventd                  OK                OK                  -
hw-management           OK                OK                  -
hw-management-tc        OK                OK                  -
kdump-tools             OK                OK                  -
lldp                    OK                OK                  -
mgmt-framework          OK                OK                  -
netfilter-persistent    OK                OK                  -
ntp                     OK                OK                  -
pmon                    OK                OK                  -
procdockerstatsd        OK                OK                  -
radv                    OK                OK                  -
ras-mc-ctl              OK                OK                  -
rsync                   OK                OK                  -
rsyslog                 OK                OK                  -
smartmontools           OK                OK                  -
snmp                    OK                OK                  -
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
telemetry               OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -

This deviates from the output of show system-health detail:

show system-health detail
System status summary

  System status LED  red
  Services:
    Status: Not OK
    Not Running: container_checker, telemetry
  Hardware:
    Status: Not OK
    Reasons: PSU 2 is out of power
             psu2_fan1 is missing

System services and devices monitor list

Name                        Status    Type
--------------------------  --------  ----------
container_checker           Not OK    Program
telemetry                   Not OK    Service
sonic                       OK        System
rsyslog                     OK        Process
root-overlay                OK        Filesystem
var-log                     OK        Filesystem
routeCheck                  OK        Program
diskCheck                   OK        Program
vnetRouteCheck              OK        Program
memory_check                OK        Program
container_memory_telemetry  OK        Program
container_memory_snmp       OK        Program
container_eventd            OK        Program
database:redis              OK        Process
teamd:teammgrd              OK        Process
teamd:teamsyncd             OK        Process
teamd:tlm_teamd             OK        Process
bgp:zebra                   OK        Process
bgp:staticd                 OK        Process
bgp:bgpd                    OK        Process
bgp:fpmsyncd                OK        Process
bgp:bgpcfgd                 OK        Process
swss:orchagent              OK        Process
swss:portsyncd              OK        Process
swss:neighsyncd             OK        Process
swss:fdbsyncd               OK        Process
swss:vlanmgrd               OK        Process
swss:intfmgrd               OK        Process
swss:portmgrd               OK        Process
swss:buffermgrd             OK        Process
swss:vrfmgrd                OK        Process
swss:nbrmgrd                OK        Process
swss:vxlanmgrd              OK        Process
swss:coppmgrd               OK        Process
swss:tunnelmgrd             OK        Process
eventd:eventd               OK        Process
syncd:syncd                 OK        Process
what-just-happened:wjhd     OK        Process
snmp:snmpd                  OK        Process
snmp:snmp-subagent          OK        Process
lldp:lldpd                  OK        Process
lldp:lldp-syncd             OK        Process
lldp:lldpmgrd               OK        Process
psu2_fan1                   Not OK    Fan
PSU 2                       Not OK    PSU
ASIC                        OK        ASIC
fan1                        OK        Fan
fan2                        OK        Fan
fan3                        OK        Fan
fan4                        OK        Fan
fan5                        OK        Fan
fan6                        OK        Fan
fan7                        OK        Fan
fan8                        OK        Fan
psu1_fan1                   OK        Fan
PSU 1                       OK        PSU

System services and devices ignore list

Name         Status    Type
-----------  --------  ------
psu.voltage  Ignored   Device

root@qa-eth-vt01-3-2700a0:~# service telemetry status
● telemetry.service - Telemetry container
     Loaded: loaded (/lib/systemd/system/telemetry.service; static)
    Drop-In: /etc/systemd/system/telemetry.service.d
             └─auto_restart.conf
     Active: failed (Result: start-limit-hit) since Thu 2023-07-20 10:27:57 UTC; 1 day 8h ago
    Process: 13759 ExecStartPre=/usr/local/bin/telemetry.sh start (code=exited, status=0/SUCCESS)
    Process: 13834 ExecStart=/usr/local/bin/telemetry.sh wait (code=exited, status=0/SUCCESS)
    Process: 13939 ExecStop=/usr/local/bin/telemetry.sh stop (code=exited, status=0/SUCCESS)
   Main PID: 13834 (code=exited, status=0/SUCCESS)

Jul 20 10:27:57 qa-eth-vt01-3-2700a0 systemd[1]: telemetry.service: Scheduled restart job, restart counter is at 4.
Jul 20 10:27:57 qa-eth-vt01-3-2700a0 systemd[1]: Stopped Telemetry container.
Jul 20 10:27:57 qa-eth-vt01-3-2700a0 systemd[1]: telemetry.service: Start request repeated too quickly.
Jul 20 10:27:57 qa-eth-vt01-3-2700a0 systemd[1]: telemetry.service: Failed with result 'start-limit-hit'.
Jul 20 10:27:57 qa-eth-vt01-3-2700a0 systemd[1]: Failed to start Telemetry container.
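
The failed state shown above can also be checked programmatically. The following is a minimal sketch, assuming the input text comes from `systemctl show -p ActiveState -p Result telemetry.service`, which prints `Key=Value` lines. The helper names `parse_unit_properties` and `unit_is_down` are hypothetical and are not part of SONiC; treating `inactive` as down is an assumption made for this illustration only.

```python
def parse_unit_properties(text):
    """Parse `Key=Value` lines as printed by `systemctl show -p ...`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def unit_is_down(props):
    """Assumption: a unit counts as down when it is failed or inactive."""
    return props.get("ActiveState") in ("failed", "inactive")


# Values matching the telemetry.service status shown above.
sample = "ActiveState=failed\nResult=start-limit-hit\n"
props = parse_unit_properties(sample)
print(unit_is_down(props), props.get("Result"))
```

A check like this reports telemetry as down for the state captured above, which is the result sysready-status would be expected to show as well.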

Steps to reproduce the issue:

  1. Boot a system on which some services have exited
  2. Check show system-health sysready-status
  3. Compare with show system-health detail.

Describe the results you received:

show system-health sysready-status shows conflicting status.

Describe the results you expected:

show system-health sysready-status should align with actual status of the system
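
The conflict described here (summary line says "not ready" while every row is OK) can be detected mechanically from the command output. The following is a rough sketch, assuming the text layout shown in this issue; `parse_sysready` and `is_inconsistent` are hypothetical helper names, not existing SONiC utilities.

```python
def parse_sysready(output):
    """Return (system_ready, rows) from `show system-health sysready-status` text.

    rows is a list of (service, service_status, app_ready_status, down_reason).
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    system_ready = bool(lines) and lines[0].strip() == "System is ready"
    rows = []
    for ln in lines[1:]:
        if ln.startswith("Service-Name") or ln.startswith("---"):
            continue  # skip the header and the separator line
        fields = ln.split(None, 3)  # the down reason may contain spaces
        if len(fields) == 4:
            rows.append(tuple(f.strip() for f in fields))
    return system_ready, rows


def is_inconsistent(system_ready, rows):
    """True when the summary says 'not ready' but every listed row is OK/OK."""
    all_ok = bool(rows) and all(svc == "OK" and app == "OK" for _, svc, app, _ in rows)
    return (not system_ready) and all_ok


sample = """System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
swss                    OK                OK                  -
telemetry               OK                OK                  -
"""
ready, rows = parse_sysready(sample)
print(ready, len(rows), is_inconsistent(ready, rows))
```

Running such a check against the output captured in this issue would flag it as inconsistent, since no row explains why the system is reported as not ready.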

Output of show version:

show version

SONiC Software Version: SONiC.202211_1_RC2.27-edc5679a4_Internal
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: edc5679a4
Build date: Fri Jul 14 09:49:33 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-241

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1649X04248
Model Number: MSN2700-CS2F
Hardware Revision: Not Specified
Uptime: 18:39:09 up 1 day,  8:17,  3 users,  load average: 0.64, 0.84, 0.71
Date: Fri 21 Jul 2023 18:39:09

Docker images:
REPOSITORY                                         TAG                                  IMAGE ID       SIZE
docker-syncd-mlnx                                  202211_1_RC2.27-edc5679a4_Internal   626ccc18b4bb   950MB
docker-syncd-mlnx                                  latest                               626ccc18b4bb   950MB
docker-platform-monitor                            202211_1_RC2.27-edc5679a4_Internal   8ead5c216e3f   950MB
docker-platform-monitor                            latest                               8ead5c216e3f   950MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.5.3-202211-15                      89cd8d80e8ce   432MB
docker-orchagent                                   202211_1_RC2.27-edc5679a4_Internal   4802e7c0500d   475MB
docker-orchagent                                   latest                               4802e7c0500d   475MB
docker-fpm-frr                                     202211_1_RC2.27-edc5679a4_Internal   1373022720e8   485MB
docker-fpm-frr                                     latest                               1373022720e8   485MB
docker-teamd                                       202211_1_RC2.27-edc5679a4_Internal   cfeed9c5a7f5   456MB
docker-teamd                                       latest                               cfeed9c5a7f5   456MB
docker-macsec                                      latest                               9f4fb5ceebee   458MB
docker-snmp                                        202211_1_RC2.27-edc5679a4_Internal   4c4b76aafff4   484MB
docker-snmp                                        latest                               4c4b76aafff4   484MB
docker-sonic-telemetry                             202211_1_RC2.27-edc5679a4_Internal   49d8614ea65f   738MB
docker-sonic-telemetry                             latest                               49d8614ea65f   738MB
docker-eventd                                      202211_1_RC2.27-edc5679a4_Internal   e5ca436db367   439MB
docker-eventd                                      latest                               e5ca436db367   439MB
docker-dhcp-relay                                  latest                               d3059d734f06   449MB
docker-router-advertiser                           202211_1_RC2.27-edc5679a4_Internal   94153ba6dcea   439MB
docker-router-advertiser                           latest                               94153ba6dcea   439MB
docker-lldp                                        202211_1_RC2.27-edc5679a4_Internal   c3b8fc719046   481MB
docker-lldp                                        latest                               c3b8fc719046   481MB
docker-sonic-p4rt                                  202211_1_RC2.27-edc5679a4_Internal   d8b42c56fff5   521MB
docker-sonic-p4rt                                  latest                               d8b42c56fff5   521MB
docker-mux                                         202211_1_RC2.27-edc5679a4_Internal   a1a8add892f7   488MB
docker-mux                                         latest                               a1a8add892f7   488MB
docker-database                                    202211_1_RC2.27-edc5679a4_Internal   043a84e34958   439MB
docker-database                                    latest                               043a84e34958   439MB
docker-sonic-mgmt-framework                        202211_1_RC2.27-edc5679a4_Internal   3e0c9556058e   553MB
docker-sonic-mgmt-framework                        latest                               3e0c9556058e   553MB
docker-sflow                                       202211_1_RC2.27-edc5679a4_Internal   ba0d71d0e65c   422MB
docker-sflow                                       latest                               ba0d71d0e65c   422MB
docker-nat                                         202211_1_RC2.27-edc5679a4_Internal   0ac282c30f26   424MB
docker-nat                                         latest                               0ac282c30f26   424MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doai        1.0.0-master-internal-25             475e4a384e19   201MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_qa-eth-vt01-3-2700a0_20230721_183209.tar.gz

@dgsudharsan
Collaborator Author

@adyeung @sg893052 Can you please investigate this issue?

@adyeung
Collaborator

adyeung commented Jul 27, 2023

sg893052 is taking a look, will share findings

@dgsudharsan
Collaborator Author

dgsudharsan commented Jan 29, 2024

@Praveen-Brcm @adyeung @sg893052

The issue now occurs intermittently. Here is the output from one of our devices:

root@qa-eth-vt02-5-2700a1:~# show system-health sysready-status
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
auditd                  OK                OK                  -
bgp                     OK                OK                  -
caclmgrd                OK                OK                  -
config-chassisdb        OK                OK                  -
config-setup            OK                OK                  -
containerd              OK                OK                  -
cron                    OK                OK                  -
database                OK                OK                  -
determine-reboot-cause  OK                OK                  -
dhcp_relay              OK                OK                  -
docker                  OK                OK                  -
eventd                  OK                OK                  -
hw-management           OK                OK                  -
hw-management-tc        OK                OK                  -
kdump-tools             OK                OK                  -
lldp                    OK                OK                  -
mgmt-framework          OK                OK                  -
netfilter-persistent    OK                OK                  -
ntp                     OK                OK                  -
nv-syncd-shared         OK                OK                  -
pmon                    OK                OK                  -
procdockerstatsd        OK                OK                  -
process-reboot-cause    OK                OK                  -
radv                    OK                OK                  -
ras-mc-ctl              OK                OK                  -
rsyslog                 OK                OK                  -
smartmontools           OK                OK                  -
snmp                    OK                OK                  -
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
telemetry               OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -

Attaching STATE_DB and syslog here

STATE_DB.json
syslog.gz

@dgsudharsan
Collaborator Author

@Praveen-Brcm @adyeung @sg893052 Any update on this?

@sg893052
Contributor

sg893052 commented Feb 22, 2024

I will look into this and get back in a couple of days. Please let me know whether I can try this with the latest master https://sonic-build.azurewebsites.net/ui/sonic/pipelines/138/builds?branchName=master and whether just loading the image is enough to trigger the issue.

@sg893052
Contributor

@dgsudharsan I couldn't reproduce the issue. Please let me know if there is a specific repro scenario or if the issue is platform specific.

[#]show version

SONiC Software Version: SONiC.master.485343-54c1a4963
SONiC OS Version: 12
Distribution: Debian 12.4
Kernel: 6.1.0-11-2-amd64
Build commit: 54c1a4963
Build date: Sun Feb 25 13:07:27 UTC 2024
Built by: AzDevOps@vmss-soni0035TT

Platform: x86_64-accton
HwSKU: Accton
ASIC: broadcom
ASIC Count: 1
Serial Number: 781664X1924004
Model Number: FP3ZZ7664020A
Hardware Revision: N/A
Uptime: 01:53:58 up 5 min,  1 user,  load average: 0.48, 0.69, 0.38
Date: Fri 10 Nov 2023 01:53:58

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-syncd-brcm             latest                    97e6b1f98246   716MB
docker-syncd-brcm             master.485343-54c1a4963   97e6b1f98246   716MB
docker-gbsyncd-broncos        latest                    298844092c0a   353MB
docker-gbsyncd-broncos        master.485343-54c1a4963   298844092c0a   353MB
docker-gbsyncd-credo          latest                    dd7699e9fa4f   325MB
docker-gbsyncd-credo          master.485343-54c1a4963   dd7699e9fa4f   325MB
docker-teamd                  latest                    80f778a31d27   327MB
docker-teamd                  master.485343-54c1a4963   80f778a31d27   327MB
docker-database               latest                    fec2da43868b   308MB
docker-database               master.485343-54c1a4963   fec2da43868b   308MB
docker-router-advertiser      latest                    db13c7a69436   300MB
docker-router-advertiser      master.485343-54c1a4963   db13c7a69436   300MB
docker-dhcp-relay             latest                    1ecbb8d7fa42   312MB
docker-orchagent              latest                    eabb3a4c21ca   342MB
docker-orchagent              master.485343-54c1a4963   eabb3a4c21ca   342MB
docker-macsec                 latest                    065b8259bd50   333MB
docker-fpm-frr                latest                    82e50c8f9046   362MB
docker-fpm-frr                master.485343-54c1a4963   82e50c8f9046   362MB
docker-eventd                 latest                    475212315ec6   302MB
docker-eventd                 master.485343-54c1a4963   475212315ec6   302MB
docker-nat                    latest                    a000c6cf6081   333MB
docker-nat                    master.485343-54c1a4963   a000c6cf6081   333MB
docker-sflow                  latest                    f7e722e72500   332MB
docker-sflow                  master.485343-54c1a4963   f7e722e72500   332MB
docker-platform-monitor       latest                    6d1fabdbdbcd   424MB
docker-platform-monitor       master.485343-54c1a4963   6d1fabdbdbcd   424MB
docker-snmp                   latest                    d1f75a87e1ef   342MB
docker-snmp                   master.485343-54c1a4963   d1f75a87e1ef   342MB
docker-mux                    latest                    02583a1d0e07   351MB
docker-mux                    master.485343-54c1a4963   02583a1d0e07   351MB
docker-lldp                   latest                    ead5fd746954   345MB
docker-lldp                   master.485343-54c1a4963   ead5fd746954   345MB
docker-sonic-gnmi             latest                    4c66356653f1   391MB
docker-sonic-gnmi             master.485343-54c1a4963   4c66356653f1   391MB
docker-sonic-mgmt-framework   latest                    8fe76ae4247a   387MB
docker-sonic-mgmt-framework   master.485343-54c1a4963   8fe76ae4247a   387MB

[#]
[#]show system-health sysready-status
System is ready

Service-Name                  Service-Status    App-Ready-Status    Down-Reason
----------------------------  ----------------  ------------------  -------------
as7816-pddf-platform-monitor  OK                OK                  -
auditd                        OK                OK                  -
bgp                           OK                OK                  -
caclmgrd                      OK                OK                  -
config-chassisdb              OK                OK                  -
config-setup                  OK                OK                  -
containerd                    OK                OK                  -
cron                          OK                OK                  -
database                      OK                OK                  -
docker                        OK                OK                  -
e2scrub_reap                  OK                OK                  -
eventd                        OK                OK                  -
gnmi                          OK                OK                  -
lldp                          OK                OK                  -
mgmt-framework                OK                OK                  -
netfilter-persistent          OK                OK                  -
ntpsec                        OK                OK                  -
opennsl-modules               OK                OK                  -
pddf-platform-init            OK                OK                  -
pmon                          OK                OK                  -
procdockerstatsd              OK                OK                  -
radv                          OK                OK                  -
ras-mc-ctl                    OK                OK                  -
rsyslog                       OK                OK                  -
smartmontools                 OK                OK                  -
snmp                          OK                OK                  -
ssh                           OK                OK                  -
swss                          OK                OK                  -
syncd                         OK                OK                  -
sysstat                       OK                OK                  -
teamd                         OK                OK                  -
[#]
[#]

@dgsudharsan
Collaborator Author

@sg893052 Is sysready affected by an absent or faulty PSU? I found the following in the state_db dump. If that is the case, can't sysready highlight what the issue is?

  "SYSTEM_HEALTH_INFO": {
    "expireat": 1706532628.9363012,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "PSU 1": "PSU 1 is out of power",
      "psu1_fan1": "psu1_fan1 is missing",
      "summary": "Not OK"
    }
  },
  "SYSTEM_READY|SYSTEM_STATE": {
    "expireat": 1706532628.933142,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "Status": "DOWN"
    }
  },

@dgsudharsan
Collaborator Author

In another repro, I found that the SYSTEM_READY|SYSTEM_STATE table is missing from state_db even though all app ready statuses are present. Do we know what could be the root cause of this?
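
Both observations (a DOWN state alongside PSU-related health entries, and a missing SYSTEM_READY|SYSTEM_STATE key) can be checked against a STATE_DB dump. The following is a small sketch, assuming the dump is a JSON object shaped like the excerpt above, with each key holding a `value` hash; `summarize_state_db` is a hypothetical helper name.

```python
import json


def summarize_state_db(dump):
    """Summarize the system-ready and health entries of a STATE_DB dump dict."""
    ready = dump.get("SYSTEM_READY|SYSTEM_STATE")
    health = dump.get("SYSTEM_HEALTH_INFO", {}).get("value", {})
    return {
        # False reproduces the "table is missing" case described above.
        "system_state_present": ready is not None,
        "system_status": ready.get("value", {}).get("Status") if ready else None,
        "health_summary": health.get("summary"),
        # Every field other than 'summary' is a per-device reason string.
        "health_reasons": {k: v for k, v in health.items() if k != "summary"},
    }


# Sample taken from the excerpt quoted earlier in this thread.
sample = json.loads("""
{
  "SYSTEM_HEALTH_INFO": {
    "type": "hash",
    "value": {
      "PSU 1": "PSU 1 is out of power",
      "psu1_fan1": "psu1_fan1 is missing",
      "summary": "Not OK"
    }
  },
  "SYSTEM_READY|SYSTEM_STATE": {
    "type": "hash",
    "value": {"Status": "DOWN"}
  }
}
""")
summary = summarize_state_db(sample)
print(summary)
```

For a real dump, the dictionary would be loaded with `json.load()` from the attached STATE_DB.json instead of the inline sample.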

@dgsudharsan
Collaborator Author

I was able to reproduce the problem and captured a backtrace:

Mar 13 21:06:25.810087 sonic NOTICE healthd: System is ready
Mar 13 21:11:49.586645 sonic NOTICE healthd: System is not ready - one or more services are not up
Mar 13 21:12:29.471696 sonic NOTICE healthd: Caught SIGTERM - exiting...
Mar 13 21:12:29.477763 sonic NOTICE healthd: Caught SIGTERM - exiting...
Mar 13 21:12:29.509937 sonic NOTICE healthd: message repeated 2 times: [ Caught SIGTERM - exiting...]
Mar 13 21:12:29.543645 sonic INFO healthd[6053]: ERROR:dbus.connection:Exception in handler for D-Bus signal:
Mar 13 21:12:29.543764 sonic INFO healthd[6053]: Traceback (most recent call last):
Mar 13 21:12:29.543827 sonic INFO healthd[6053]: File "/usr/lib/python3/dist-packages/dbus/connection.py", line 232, in maybe_handle_message
Mar 13 21:12:29.543884 sonic INFO healthd[6053]: self._handler(*args, **kwargs)
Mar 13 21:12:29.543949 sonic INFO healthd[6053]: File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 80, in on_job_removed
Mar 13 21:12:29.544003 sonic INFO healthd[6053]: self.task_notify(msg)
Mar 13 21:12:29.544059 sonic INFO healthd[6053]: File "/usr/local/lib/python3.9/dist-packages/health_checker/sysmonitor.py", line 108, in task_notify
Mar 13 21:12:29.544114 sonic INFO healthd[6053]: self.task_queue.put(msg)
Mar 13 21:12:29.544174 sonic INFO healthd[6053]: File "", line 2, in put
Mar 13 21:12:29.544228 sonic INFO healthd[6053]: File "/usr/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
Mar 13 21:12:29.544280 sonic INFO healthd[6053]: kind, result = conn.recv()
Mar 13 21:12:29.544332 sonic INFO healthd[6053]: File "/usr/lib/python3.9/multiprocessing/connection.py", line 255, in recv
Mar 13 21:12:29.544384 sonic INFO healthd[6053]: buf = self._recv_bytes()
Mar 13 21:12:29.544437 sonic INFO healthd[6053]: File "/usr/lib/python3.9/multiprocessing/connection.py", line 419, in _recv_bytes
Mar 13 21:12:29.544488 sonic INFO healthd[6053]: buf = self._recv(4)
Mar 13 21:12:29.544548 sonic INFO healthd[6053]: File "/usr/lib/python3.9/multiprocessing/connection.py", line 384, in _recv
Mar 13 21:12:29.544601 sonic INFO healthd[6053]: chunk = read(handle, remaining)
Mar 13 21:12:29.544652 sonic INFO healthd[6053]: ConnectionResetError: [Errno 104] Connection reset by peer
Mar 13 21:12:30.511605 sonic NOTICE healthd: Caught SIGTERM - exiting...
Mar 13 21:13:20.013595 qa-eth-vt05-3-2700a1 NOTICE healthd[5398]: Starting up...
Mar 13 21:16:19.364421 qa-eth-vt05-3-2700a1 NOTICE healthd: System is ready
Mar 13 22:17:16.972479 qa-eth-vt05-3-2700a1 NOTICE healthd: System is not ready - one or more services are not up
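
The traceback shows `task_queue.put()` raising `ConnectionResetError` inside the D-Bus `on_job_removed` handler after healthd had caught SIGTERM, which suggests the manager process backing the queue had already gone away. The following is a minimal sketch of one way to make such a notification tolerate a closed queue. `safe_notify`, `ClosedQueue`, and `ListQueue` are hypothetical names used only for illustration, and this is not necessarily what the merged fix does.

```python
def safe_notify(task_queue, msg):
    """Put msg on the queue; return False instead of raising when the
    queue's backing connection is already gone (for example during shutdown)."""
    try:
        task_queue.put(msg)
        return True
    except (OSError, EOFError):
        # ConnectionResetError and BrokenPipeError are subclasses of OSError.
        return False


class ClosedQueue:
    """Stand-in for a manager-backed queue whose server process has exited."""

    def put(self, msg):
        raise ConnectionResetError(104, "Connection reset by peer")


class ListQueue:
    """Stand-in for a healthy queue; it records what was put."""

    def __init__(self):
        self.items = []

    def put(self, msg):
        self.items.append(msg)


healthy = ListQueue()
print(safe_notify(healthy, {"unit": "telemetry.service"}))
print(safe_notify(ClosedQueue(), {"unit": "telemetry.service"}))
```

Swallowing the error only avoids the unhandled exception in the signal handler; it does not by itself explain the stale status, so it is best read as a defensive measure rather than a root-cause fix.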

@dgsudharsan
Collaborator Author

Closing this issue as the fix is merged
