[SmartSwitch] Extend reboot script for rebooting SmartSwitch #3566

Open
wants to merge 46 commits into
base: master
@vvolam vvolam commented Oct 3, 2024

What I did

Extended the reboot script to handle SmartSwitch cases: it can now reboot the entire SmartSwitch or a specific DPU.

How I did it

Implemented changes according to https://github.com/sonic-net/SONiC/blob/605c3a56ac2717dbbb638433e7bb13054fc05a31/doc/smart-switch/reboot/reboot-hld.md

How to verify it

  • Verified the script on a non-SmartSwitch device and found no regressions. The script also reports an error if the user passes any of the new SmartSwitch-related parameters.
  • Verified on an NVIDIA SmartSwitch.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

@vvolam vvolam force-pushed the ss-reboot branch 2 times, most recently from 460146c to c72fbc0 Compare October 3, 2024 19:36
@vvolam vvolam force-pushed the ss-reboot branch 3 times, most recently from 8746356 to d6fc624 Compare October 23, 2024 20:33
@vvolam vvolam changed the title Extend reboot script for rebooting SmartSwitch [SmartSwitch] Extend reboot script for rebooting SmartSwitch Nov 4, 2024
@vvolam vvolam marked this pull request as ready for review November 4, 2024 19:39
@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@gpunathilell
Contributor

2025-01-28 21:11:44 - Error: Failed to send reboot status command to DPU dpu1
/usr/local/bin/reboot_smartswitch_helper: line 126: [: null: integer expression expected

Error seen with latest changes

@gpunathilell
Contributor

There are also two issues that need to be addressed: the provisioning for the DPU and the GNMI container configuration on the switch. Both configurations are required so that the gnoi_client command can be executed from the switch.
GNMI command execution fails with an EOF error because the GNMI container is shut down during pre-shutdown. This has to be handled as per the HLD: make sure the GNMI container is not shut down on the DPU.

@KrisNey-MSFT

Dependent upon PMON via Cisco, and a few Issues filed.

@gpunathilell
Contributor

For the PCIe-related changes being made, the DPU reboot needs to be handled differently in the single-DPU reboot and system reboot scenarios (ModuleBase.MODULE_REBOOT_DPU vs. ModuleBase.MODULE_REBOOT_SMARTSWITCH). For MODULE_REBOOT_SMARTSWITCH we just schedule the DPU reboot and exit from the reboot call, i.e. PCIe is not back up after the reboot is executed, so we can not remove the STATE_DB entry in case of ModuleBase.MODULE_REBOOT_SMARTSWITCH (or perform a rescan). The entry deletion therefore has to be skipped in case of a system reboot.
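
The distinction described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the two constants are local stand-ins for the ModuleBase values named in the comment, and `trigger_reboot`, `remove_state_db_entry`, and `pcie_rescan` are hypothetical hooks supplied by the caller. The order of cleanup and rescan is also an assumption.

```python
# Stand-ins for the constants named in the comment above. The real ones live
# on ModuleBase; the string values used here are assumptions.
MODULE_REBOOT_DPU = "DPU"
MODULE_REBOOT_SMARTSWITCH = "SMARTSWITCH"


def reboot_dpu(reboot_type, trigger_reboot, remove_state_db_entry, pcie_rescan):
    """Reboot a DPU and run the PCIe cleanup only for a single-DPU reboot.

    All three callables are hypothetical hooks. Returns True if the cleanup
    ran and False if it was skipped.
    """
    if reboot_type not in (MODULE_REBOOT_DPU, MODULE_REBOOT_SMARTSWITCH):
        raise ValueError(f"unsupported reboot type: {reboot_type!r}")

    trigger_reboot()

    if reboot_type == MODULE_REBOOT_SMARTSWITCH:
        # System reboot: the DPU reboot is only scheduled, so PCIe is not
        # back up when this call returns. Keep the STATE_DB entry and do
        # not rescan.
        return False

    # Single-DPU reboot: trigger_reboot is assumed to block until the DPU
    # is back, so the entry can be removed and the bus rescanned.
    remove_state_db_entry()
    pcie_rescan()
    return True
```

With this shape, a system reboot calls only `trigger_reboot`, while a single-DPU reboot also runs the two cleanup hooks.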

@vvolam
Author

vvolam commented Feb 4, 2025

2025-01-28 21:11:44 - Error: Failed to send reboot status command to DPU dpu1
/usr/local/bin/reboot_smartswitch_helper: line 126: [: null: integer expression expected

Error seen with latest changes

This happens because dpu_halt_services_timeout is not present in platform.json. I have updated the script to use a default timeout in such a case. Thank you!
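
The fallback described here could look something like the sketch below. It is written in Python for illustration, whereas the helper that produced the error is a shell script. The default of 60 seconds, the function name, and the assumption that the key sits at the top level of platform.json are all hypothetical. The point is that a missing or null value must never reach an integer comparison.

```python
import json

# Hypothetical default in seconds; the value chosen in the PR may differ.
DEFAULT_DPU_HALT_SERVICES_TIMEOUT = 60


def get_dpu_halt_services_timeout(platform_json_text,
                                  default=DEFAULT_DPU_HALT_SERVICES_TIMEOUT):
    """Return the timeout from platform.json text, or the default.

    Assumes the key is a top-level entry; a missing key, a null value, or
    anything that is not a positive integer falls back to the default.
    """
    try:
        data = json.loads(platform_json_text)
    except json.JSONDecodeError:
        return default

    value = data.get("dpu_halt_services_timeout") if isinstance(data, dict) else None
    if value is None or isinstance(value, bool):
        return default

    try:
        timeout = int(value)
    except (TypeError, ValueError):
        # Covers strings such as "null" and non-scalar values.
        return default

    return timeout if timeout > 0 else default
```

For example, `get_dpu_halt_services_timeout("{}")` yields the default rather than a null that would break a later numeric test.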

@vvolam
Author

vvolam commented Feb 4, 2025

not back up after the reboot is executed, so we can not remove the STATE_DB entry in case of ModuleBase.MODULE_R

This is done. Thank you!

@vvolam vvolam requested review from qiluo-msft and gpunathilell and removed request for gpunathilell February 5, 2025 04:52
@vvolam vvolam requested a review from ganglyu February 5, 2025 17:27
8 participants