Add Process Reboot Cause Service as Upholds of Database Service #18772

xincunli-sonic
Contributor

Why I did it

Addressing this issue: Fixed determine/process reboot-cause service dependency

Work item tracking
  • Microsoft ADO (number only): 26209780

How I did it

Add Upholds for process-reboot-cause.service
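
To illustrate the step, here is a minimal Python sketch that appends an `Upholds=` line to the `[Unit]` section of a unit file's text. The helper name `add_upholds` and the `SAMPLE_UNIT` text are hypothetical and exist only for this example; the sketch edits a string and does not touch systemd or the actual unit template changed by this PR.

```python
SAMPLE_UNIT = (
    "[Unit]\n"
    "Description=Database container\n"
    "After=docker.service\n"
    "\n"
    "[Service]\n"
    "User=root\n"
)


def add_upholds(unit_text: str, target: str) -> str:
    """Return unit_text with 'Upholds=<target>' added at the end of [Unit].

    Hypothetical helper for illustration; it only manipulates text.
    """
    directive = f"Upholds={target}"
    lines = unit_text.splitlines()

    # Do nothing if the directive is already present (idempotent).
    if any(line.strip() == directive for line in lines):
        return unit_text

    # Locate the [Unit] section header.
    start = next(
        (i for i, line in enumerate(lines) if line.strip() == "[Unit]"), None
    )
    if start is None:
        raise ValueError("no [Unit] section found")

    # The section ends at the next '[...]' header or at end of file.
    end = len(lines)
    for i in range(start + 1, len(lines)):
        stripped = lines[i].strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            end = i
            break

    # Step back over trailing blank lines so the directive joins the section body.
    while end > start + 1 and not lines[end - 1].strip():
        end -= 1

    lines.insert(end, directive)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(add_upholds(SAMPLE_UNIT, "process-reboot-cause.service"))
```

Placing the directive in `[Unit]` matters because `Upholds=` is a unit-level dependency setting, not a `[Service]` option.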

How to verify it

  1. Before the change
admin@str3-msn4600c-acs-05:~$ show reboot-cause history 
Name                 Cause                                              Time                             User    Comment
-------------------  -------------------------------------------------  -------------------------------  ------  ---------
2024_04_23_19_09_23  reboot                                             Tue 23 Apr 2024 07:07:23 PM UTC  admin   N/A
2024_04_23_07_02_00  Unknown (First boot of SONiC version 20231110.09)  N/A                              N/A     N/A
2024_04_23_06_48_50  reboot                                             Tue 23 Apr 2024 06:48:02 AM UTC  admin   N/A
2024_04_23_06_43_13  reboot                                             Tue 23 Apr 2024 06:42:25 AM UTC  admin   N/A
2024_04_23_06_08_24  Unknown                                            N/A                              N/A     N/A
2024_04_22_21_29_03  reboot                                             Mon 22 Apr 2024 09:28:16 PM UTC  admin   N/A
2024_04_22_21_22_52  reboot                                             Mon 22 Apr 2024 09:22:05 PM UTC  admin   N/A
2024_04_22_21_16_52  reboot                                             Mon 22 Apr 2024 09:16:05 PM UTC  admin   N/A
2024_04_22_21_10_47  Watchdog                                           N/A                              N/A     Unknown
2024_04_22_21_04_44  soft-reboot                                        Mon 22 Apr 2024 09:04:24 PM UTC  admin   N/A
  2. Add Upholds in database
admin@str3-msn4600c-acs-05:~$ systemctl cat database.service 
# /lib/systemd/system/database.service
[Unit]
Description=Database container

Wants=database-chassis.service
After=database-chassis.service
Requires=docker.service
After=docker.service
After=rc-local.service
Upholds=process-reboot-cause.service
StartLimitIntervalSec=1200
StartLimitBurst=3

[Service]
User=root
ExecStartPre=/usr/local/bin/database.sh start
ExecStart=/usr/local/bin/database.sh wait
ExecStop=/usr/local/bin/database.sh stop
RestartSec=30

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/database.service.d/auto_restart.conf
[Service]
Restart=always
  3. Stop database
admin@str3-msn4600c-acs-05:~$ docker stop database 
database

admin@str3-msn4600c-acs-05:~$ docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED       STATUS                     PORTS     NAMES
1de2b92d4c59   docker-snmp:latest                   "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           snmp
ba6a737312f9   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           mgmt-framework
8a4389c45bdf   docker-lldp:latest                   "/usr/bin/docker-lld…"   2 hours ago   Up 2 hours                           lldp
90a875824801   docker-sonic-gnmi:latest             "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           gnmi
ce19ec771ac5   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours                           pmon
c456c8548ee6   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   2 hours ago   Up 2 hours                           radv
c521123a450c   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           syncd
03b4d87822fa   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours                           bgp
226ae26a0614   docker-teamd:latest                  "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           teamd
bc689d1e75c5   docker-orchagent:latest              "/usr/bin/docker-ini…"   2 hours ago   Up 2 hours                           swss
5cdc86b679f8   docker-eventd:latest                 "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                           eventd
34fe36f3428f   docker-database:latest               "/usr/local/bin/dock…"   2 hours ago   Exited (0) 6 seconds ago             database

admin@str3-msn4600c-acs-05:~$ show reboot-cause history
Traceback (most recent call last):
  File "/usr/local/bin/show", line 5, in <module>
    from show.main import cli
  File "/usr/local/lib/python3.11/dist-packages/show/main.py", line 325, in <module>
    if is_gearbox_configured():
       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/show/main.py", line 266, in is_gearbox_configured
    app_db.connect(app_db.APPL_DB)
  File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 1986, in connect
    return _swsscommon.SonicV2Connector_Native_connect(self, db_name, retry_on)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Unable to connect to redis: Cannot assign requested address
admin@str3-msn4600c-acs-05:~$ systemctl status process-reboot-cause.
Unit process-reboot-cause..service could not be found.
admin@str3-msn4600c-acs-05:~$ systemctl status process-reboot-cause.service 
× process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: failed (Result: start-limit-hit) since Tue 2024-04-23 21:27:53 UTC; 41s ago
   Duration: 62ms
TriggeredBy: ● process-reboot-cause.timer
    Process: 83673 ExecStart=/usr/local/bin/process-reboot-cause (code=exited, status=0/SUCCESS)
   Main PID: 83673 (code=exited, status=0/SUCCESS)
  4. Restart database
admin@str3-msn4600c-acs-05:~$ docker start database 
database
admin@str3-msn4600c-acs-05:~$ docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED       STATUS          PORTS     NAMES
1de2b92d4c59   docker-snmp:latest                   "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                snmp
ba6a737312f9   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                mgmt-framework
8a4389c45bdf   docker-lldp:latest                   "/usr/bin/docker-lld…"   2 hours ago   Up 2 hours                lldp
90a875824801   docker-sonic-gnmi:latest             "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                gnmi
ce19ec771ac5   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours                pmon
c456c8548ee6   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   2 hours ago   Up 2 hours                radv
03b4d87822fa   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   2 hours ago   Up 2 hours                bgp
226ae26a0614   docker-teamd:latest                  "/usr/local/bin/supe…"   2 hours ago   Up 2 hours                teamd
bc689d1e75c5   docker-orchagent:latest              "/usr/bin/docker-ini…"   2 hours ago   Up 2 hours                swss
34fe36f3428f   docker-database:latest               "/usr/local/bin/dock…"   2 hours ago   Up 22 seconds             database
admin@str3-msn4600c-acs-05:~$ show reboot-cause history
Name                 Cause                                              Time                             User    Comment
-------------------  -------------------------------------------------  -------------------------------  ------  ---------
2024_04_23_19_09_23  reboot                                             Tue 23 Apr 2024 07:07:23 PM UTC  admin   N/A
2024_04_23_07_02_00  Unknown (First boot of SONiC version 20231110.09)  N/A                              N/A     N/A
2024_04_23_06_48_50  reboot                                             Tue 23 Apr 2024 06:48:02 AM UTC  admin   N/A
2024_04_23_06_43_13  reboot                                             Tue 23 Apr 2024 06:42:25 AM UTC  admin   N/A
2024_04_23_06_08_24  Unknown                                            N/A                              N/A     N/A
2024_04_22_21_29_03  reboot                                             Mon 22 Apr 2024 09:28:16 PM UTC  admin   N/A
2024_04_22_21_22_52  reboot                                             Mon 22 Apr 2024 09:22:05 PM UTC  admin   N/A
2024_04_22_21_16_52  reboot                                             Mon 22 Apr 2024 09:16:05 PM UTC  admin   N/A
2024_04_22_21_10_47  Watchdog                                           N/A                              N/A     Unknown
2024_04_22_21_04_44  soft-reboot                                        Mon 22 Apr 2024 09:04:24 PM UTC  admin   N/A

admin@str3-msn4600c-acs-05:~$ systemctl status process-reboot-cause.service 
× process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: failed (Result: start-limit-hit) since Tue 2024-04-23 21:27:53 UTC; 1min 32s ago
   Duration: 62ms
TriggeredBy: ● process-reboot-cause.timer
    Process: 83673 ExecStart=/usr/local/bin/process-reboot-cause (code=exited, status=0/SUCCESS)
   Main PID: 83673 (code=exited, status=0/SUCCESS)
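
The `Active:` line in the status output above can also be checked programmatically. This is a minimal sketch, assuming the line keeps the format shown in the transcript; `parse_active_line` and `ActiveState` are hypothetical names, not part of SONiC or systemd.

```python
import re
from typing import NamedTuple, Optional


class ActiveState(NamedTuple):
    state: str             # e.g. "failed", "active", "inactive"
    detail: Optional[str]  # e.g. "start-limit-hit", "running", "dead"


# Matches lines such as:
#   Active: failed (Result: start-limit-hit) since ...
#   Active: active (running) since ...
_ACTIVE_RE = re.compile(
    r"^\s*Active:\s+(\S+)(?:\s+\((?:Result:\s*)?([^)]+)\))?",
    re.MULTILINE,
)


def parse_active_line(status_text: str) -> Optional[ActiveState]:
    """Extract the state and parenthesised detail from `systemctl status` text.

    Returns None when no 'Active:' line is present.
    """
    match = _ACTIVE_RE.search(status_text)
    if match is None:
        return None
    return ActiveState(state=match.group(1), detail=match.group(2))


if __name__ == "__main__":
    sample = (
        "     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)\n"
        "     Active: failed (Result: start-limit-hit) since Tue 2024-04-23 21:27:53 UTC; 41s ago\n"
    )
    print(parse_active_line(sample))
```

For scripting against a live system, `systemctl show -p ActiveState,Result process-reboot-cause.service` is likely a more robust source than the human-readable status text.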

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305
  • 202405
  • 202411

@xincunli-sonic xincunli-sonic requested a review from lguohan as a code owner April 23, 2024 21:47
@prgeor
Contributor

prgeor commented Apr 23, 2024

@anamehra please review for T2 chassis

@xincunli-sonic
Contributor Author

@saiarcot895 Would you mind reviewing this change? Upholds was used incorrectly in this PR: sonic-net/sonic-host-services#100

@anamehra
Contributor

anamehra commented May 1, 2024

@anamehra please review for T2 chassis

Hi @prgeor, has this change been tested on multi-asic?

@anamehra
Contributor

anamehra commented May 2, 2024

@prgeor, I am seeing the following error when I add this Upholds line to the database.service file on an LC. I did not build a fresh image; I edited the file on a router and rebooted, which should not make any difference.

root@sfd-t2-lc2:/home/cisco# systemctl status process-reboot-cause.service
× process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: failed (Result: start-limit-hit) since Thu 2024-05-02 20:49:45 UTC; 5s ago
   Duration: 48ms
TriggeredBy: ● process-reboot-cause.timer
    Process: 1609294 ExecStart=/usr/local/bin/process-reboot-cause (code=exited, status=0/SUCCESS)
   Main PID: 1609294 (code=exited, status=0/SUCCESS)

May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
May 02 20:49:45 sfd-t2-lc2 systemd[1]: process-reboot-cause.service: Unit needs to be started because active unit database.service upholds it, but not starting since we tried this too often recently. Will retry later.

Could you please try this on a multi-asic system at your end?
Thanks
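
One possible reading of the log above: the unit exits successfully within milliseconds, `Upholds=` makes systemd request a new start for as long as database.service is active, and the repeated requests run into the start rate limit. The sketch below models that interaction with a simplified fixed-window counter. The limit values (5 starts per 10 seconds) are systemd's documented defaults and are an assumption here, since the thread does not show this unit's actual `StartLimitBurst=` and `StartLimitIntervalSec=` settings. The class and function names are hypothetical.

```python
from typing import Optional, Tuple


class StartRateLimit:
    """Simplified model of a start rate limit: at most `burst` starts per window."""

    def __init__(self, interval_s: float = 10.0, burst: int = 5) -> None:
        self.interval_s = interval_s
        self.burst = burst
        self.window_start: Optional[float] = None
        self.count = 0

    def allow(self, now: float) -> bool:
        # Open a new window on the first request or once the old one has expired.
        if self.window_start is None or now - self.window_start > self.interval_s:
            self.window_start = now
            self.count = 1
            return True
        # Inside the window: allow until the burst budget is used up.
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def simulate_upheld_unit(
    run_time_s: float, total_time_s: float, limit: StartRateLimit
) -> Tuple[int, Optional[float]]:
    """Simulate a unit that exits after run_time_s and is re-requested at once.

    Returns (number of successful starts, time of the first refused start or None).
    """
    now = 0.0
    starts = 0
    while now < total_time_s:
        if not limit.allow(now):
            return starts, now  # this is where 'start-limit-hit' would appear
        starts += 1
        now += run_time_s  # the unit runs, exits, and is requested again
    return starts, None


if __name__ == "__main__":
    # A unit that lives about 50 ms, similar to the 'Duration: 48ms' shown above.
    print(simulate_upheld_unit(0.05, 10.0, StartRateLimit()))
```

In this model a unit that exits after about 50 ms uses up its start budget almost immediately, while one that stays up for a few seconds per run never reaches the limit.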

@anamehra
Contributor

Hi @prgeor, any input on my comment above? Thanks.

@anamehra
Contributor

anamehra commented Jun 5, 2024

@abdosi, for your visibility.
This is needed for chassis.

@abdosi
Contributor

abdosi commented Jun 11, 2024

@xincunli-sonic / @prgeor, this change is not working for chassis. After making this change as described by @anamehra, we see the issue below after an LC reboot. I feel this is not straightforward to fix for multi-asic, as there are multiple database services.

Can we merge PR #17406 for master/202405? It has been tested on 202305 and 202205 images and looks like a stable fix.

@anamehra, I wonder whether this issue is caused by the timer unit of process-reboot-cause. In your PR #17406 the timer service appears to have been removed. Do we need to do the same in the context of this PR?

admin@str2-xxxx-lc1-2:/var/log$ sudo systemctl status process-reboot-cause.service
× process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: failed (Result: start-limit-hit) since Tue 2024-06-11 07:25:50 UTC; 4s ago
   Duration: 57ms
TriggeredBy: ● process-reboot-cause.timer
    Process: 2868400 ExecStart=/usr/local/bin/process-reboot-cause (code=exited, status=0/SUCCESS)
   Main PID: 2868400 (code=exited, status=0/SUCCESS)

Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Start request repeated too quickly.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Failed with result 'start-limit-hit'.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: Failed to start process-reboot-cause.service - Retrieve the reboot cause from the history files and save them to StateDB.
Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: Unit needs to be started because active unit database@2.service upholds it, but not starting since we tried this too often recently. Will retry later.
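
The last journal line names the unit whose `Upholds=` triggered the start request: `database@2.service` here, versus `database.service` in the log posted on May 2. On a multi-asic system there can be several such database instances. Below is a small sketch for pulling those unit names out of journal text, assuming the message wording shown above; `upholding_units` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches the systemd message shown in the logs above, for example:
#   process-reboot-cause.service: Unit needs to be started because active unit
#   database@2.service upholds it, but not starting since ...
_UPHOLDS_RE = re.compile(
    r"(\S+): Unit needs to be started because active unit (\S+) upholds it"
)


def upholding_units(journal_text: str) -> Counter:
    """Count (upholder, upheld) unit pairs mentioned in journal text."""
    counts: Counter = Counter()
    for match in _UPHOLDS_RE.finditer(journal_text):
        upheld, upholder = match.group(1), match.group(2)
        counts[(upholder, upheld)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Jun 11 07:25:50 str2-8800-lc1-2 systemd[1]: process-reboot-cause.service: "
        "Unit needs to be started because active unit database@2.service upholds it, "
        "but not starting since we tried this too often recently. Will retry later.\n"
    )
    for (upholder, upheld), n in upholding_units(sample).items():
        print(f"{upholder} upholds {upheld}: {n} deferred start(s)")
```

Grouping by upholder makes it easier to see whether every database instance, or only some of them, is requesting the start.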

@anamehra
Contributor

Hi @abdosi, removing the timer does not help either.

@arlakshm
Contributor

@xincunli-sonic, can you please resolve the comments on this PR? Also, please confirm whether these changes will work on multi-asic platforms.

@prgeor
Contributor

prgeor commented Jul 18, 2024

@anamehra @abdosi, can you tell me whether you are testing this change on a master image for multi-asic? Which SONiC version are you testing on the chassis platform?

@abdosi
Contributor

abdosi commented Oct 23, 2024

This change does not work on multi-asic/chassis systems. We took another approach to fix this, which has also been merged in master, so I am closing this PR.

@abdosi abdosi closed this Oct 23, 2024