Add support to make determine/process reboot-cause services restartable #86

Merged
merged 2 commits into from
Nov 21, 2023

Conversation

anamehra

@anamehra anamehra commented Nov 18, 2023

Signed-off-by: anamehra anamehra@cisco.com

Why I did it

Fixes sonic-net/sonic-buildimage#16990

This PR can be merged independently. The PR (sonic-net/sonic-buildimage#17220) will need this host-services PR to be merged and released.

MSFT ADO: 25892864

  1. The determine-reboot-cause and process-reboot-cause services do not start if the database service fails to start on the first attempt. Even if the database service succeeds on a later attempt, these reboot-cause services are not started.

  2. The process-reboot-cause service does not restart when the docker or database service restarts, which leads to an empty reboot-cause history.

  3. deploy-mg from sonic-mgmt also triggers a docker service restart, which causes the issue described in item 2 above. The docker restart also triggers determine-reboot-cause to restart, which creates an additional reboot-cause file in the history and modifies the last reboot cause.

This PR, along with sonic-buildimage PR 17220, fixes these issues by making both services start once their dependency is met after an earlier dependency failure, making both services restart when the database service restarts, and preventing duplicate processing of the last reboot reason.

How I did it

  1. Modified the systemd unit files to make the determine-reboot-cause and process-reboot-cause services restartable when the database service restarts.
  2. On a restart, the determine-reboot-cause service should not create a new reboot-cause entry in the database. Added a check that distinguishes a first start from a restart and skips the entry in the restart case.
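
The first-start versus restart check in item 2 can be sketched with a marker file. This is a minimal illustration, not the PR's actual code: the helper name `record_once` and the marker file name are hypothetical, and the sketch assumes the real marker lives somewhere that is cleared on reboot, so that each boot records its cause exactly once.

```python
import os
import tempfile


def record_once(marker_path, record_fn):
    """Call record_fn only if marker_path does not exist yet.

    Returns True if record_fn ran, False if the marker showed that the
    reboot cause had already been processed (the restart case).
    """
    if os.path.exists(marker_path):
        # Restart case: skip creating a duplicate reboot-cause entry.
        return False
    record_fn()
    # Create the empty marker only after recording succeeded.
    with open(marker_path, "w"):
        pass
    return True


if __name__ == "__main__":
    entries = []
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical marker name, used only for this demonstration.
        marker = os.path.join(tmp, "reboot-cause-processed")
        print(record_once(marker, lambda: entries.append("cause")))  # first start
        print(record_once(marker, lambda: entries.append("cause")))  # restart
    print(len(entries))
```

In this sketch the second call is skipped, so only one entry is recorded no matter how many times the service is restarted within the same boot.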

How to verify it

On a single-ASIC pizza box:

  1. Install the image and check the reboot-cause history.
  2. Restart the database service and verify that the determine-reboot-cause and process-reboot-cause services also restart. Verify that reboot-cause shows correct data and that no new entry is created for the restart.

On a chassis:

  1. Install the image and check the reboot-cause history.
  2. Restart the database service and verify that the determine-reboot-cause and process-reboot-cause services also restart. Verify that reboot-cause shows correct data and that no new entry is created for the restart.
  3. Reboot the LC. On the Supervisor, stop the database-chassis service.
    Let the database service on the LC fail the first time. determine-reboot-cause and process-reboot-cause will fail to start due to the dependency failure.
    Start database-chassis on the Supervisor. The database service on the LC should now start successfully.
    Verify that determine-reboot-cause and process-reboot-cause also start.
    Verify the show reboot-cause history output.

@anamehra

Hi @abdosi , @gechiang , please review. Thanks

@gechiang

@anamehra , the PR test is failing:

=================================== FAILURES ===================================
_ TestDetermineRebootCause.test_determine_reboot_cause_main_with_reboot_cause_dir _

self = <tests.determine-reboot-cause_test.TestDetermineRebootCause object at 0x7f253c76de20>

    @mock.patch('determine_reboot_cause.REBOOT_CAUSE_DIR', os.path.join(os.getcwd(), REBOOT_CAUSE_DIR))
    @mock.patch('determine_reboot_cause.REBOOT_CAUSE_HISTORY_DIR', os.path.join(os.getcwd(), 'host/reboot-cause/history/'))
    @mock.patch('determine_reboot_cause.PREVIOUS_REBOOT_CAUSE_FILE', os.path.join(os.getcwd(), 'host/reboot-cause/previous-reboot-cause.json'))
    @mock.patch('determine_reboot_cause.REBOOT_CAUSE_FILE', os.path.join(os.getcwd(),'host/reboot-cause/reboot-cause.txt'))
    def test_determine_reboot_cause_main_with_reboot_cause_dir(self):
        with mock.patch("os.geteuid", return_value=0):
>           determine_reboot_cause.main()

tests/determine-reboot-cause_test.py:199: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def main():
        # Configure logger to log all messages INFO level and higher
        sonic_logger.set_min_log_priority_info()
    
        sonic_logger.log_info("Starting up...")
    
        if not os.geteuid() == 0:
            sonic_logger.log_error("User {} does not have permission to execute".format(pwd.getpwuid(os.getuid()).pw_name))
            sys.exit("This utility must be run as root")
    
        if os.path.exists(REBOOT_PROCESSED_FILE):
            sonic_logger.log_info("User {} : reboot-cause already processed. Nothing to do. Exiting...".format(pwd.getpwuid(os.getuid()).pw_name))
>           sys.exit(0)
E           SystemExit: 0

scripts/determine-reboot-cause:224: SystemExit

Please fix.
Thanks!
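
The failure above comes from `main()` exiting early whenever the path in `REBOOT_PROCESSED_FILE` exists. One way a test could cope with that is to point the constant at a path that is guaranteed to be missing, or to assert the early exit explicitly. The sketch below is only an illustration under stated assumptions: `main()` here is a simplified stand-in for `determine_reboot_cause.main()`, the constant's value and the helper names are hypothetical, and the fix actually applied in the follow-up commit is not shown in this conversation.

```python
import os
import sys
import tempfile
from unittest import mock

# Stand-in for the module-level constant seen in the traceback.
# The value is hypothetical; the real path is not shown in this conversation.
REBOOT_PROCESSED_FILE = os.path.join(tempfile.gettempdir(), "example-reboot-processed")


def main():
    """Simplified stand-in for determine_reboot_cause.main()."""
    if os.path.exists(REBOOT_PROCESSED_FILE):
        sys.exit(0)
    return "processed"


def run_main_with_clean_marker():
    """Run main() with the marker constant patched to a path that does not exist."""
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "no-such-marker")
        with mock.patch.dict(globals(), {"REBOOT_PROCESSED_FILE": missing}):
            return main()


def exit_code_when_marker_exists():
    """Run main() with an existing marker and return the SystemExit code."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "marker")
        open(marker, "w").close()
        with mock.patch.dict(globals(), {"REBOOT_PROCESSED_FILE": marker}):
            try:
                main()
            except SystemExit as exc:
                return exc.code
    return None


if __name__ == "__main__":
    print(run_main_with_clean_marker())
    print(exit_code_when_marker_exists())
```

In the real test module the equivalent patch would presumably follow the existing decorators, for example `@mock.patch('determine_reboot_cause.REBOOT_PROCESSED_FILE', ...)`, rather than patching `globals()` as this self-contained sketch does.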

Signed-off-by: anamehra <anamehra@cisco.com>
@gechiang

gechiang commented Nov 20, 2023

@anamehra , my understanding is that this PR needs to go in before the sonic-buildimage PR, correct? Or is it safe to merge both PRs independently? Please clarify, as we don't want to merge in the wrong order and cause regressions. Thanks!

@anamehra

This PR can go independently.

The PR in sonic-buildimage will need this PR.

@gechiang

Thanks for the clarification.
@prgeor , Can you help review/approve this change?
Thanks!

@@ -218,6 +220,10 @@ def main():
sonic_logger.log_error("User {} does not have permission to execute".format(pwd.getpwuid(os.getuid()).pw_name))
sys.exit("This utility must be run as root")

if os.path.exists(REBOOT_PROCESSED_FILE):

@anamehra can you add RemainAfterExit=true in the determine-reboot-cause.service, that should ensure systemd starts this service only once (unless someone manually starts the service, not the point of this PR)

@gechiang gechiang merged commit 5dcd1e5 into sonic-net:master Nov 21, 2023
5 checks passed
@gechiang

@StormLiangMS , @yxieca , MSFT ADO: 25892864. Please help review/approve for 202305 and 202205 branches. Thanks!

@anamehra

@StormLiangMS , @yxieca , MSFT ADO: 25892864. Please help review/approve for 202305 and 202205 branches. Thanks!

202205 will require a manual PR because sonic-host-services is not a submodule on that branch but part of the sonic-buildimage repo. I will raise a PR for 202205 once sonic-net/sonic-buildimage#17220 is merged.

@prgeor

prgeor commented Nov 22, 2023

@anamehra @gechiang please revert this PR and instead add this unit service file change to skip running the determine-reboot-cause service again after boot-up.

image

gechiang added a commit that referenced this pull request Nov 22, 2023
@anamehra

@prgeor , process-reboot-cause has a dependency on this service. If we add a conditional check, that dependency won't be met, and process-reboot-cause won't run. Let me look into modifying the process-reboot-cause unit file to handle this.

@anamehra anamehra deleted the anamehra/reboot_cause_restart branch November 28, 2023 19:10
anamehra added a commit to anamehra/sonic-host-services that referenced this pull request Nov 28, 2023
gechiang pushed a commit that referenced this pull request Nov 28, 2023