DRAFT: Independent DPU Upgrade HLD #1906

Open: hdwhdw wants to merge 9 commits into master

@hdwhdw hdwhdw commented Jan 28, 2025

This PR contains a draft HLD for DPU Independent Upgrade.

What we did:
Added support for independent SmartSwitch DPU upgrade.

Why we did it:
To support managing the DPU SONiC version independently in a SmartSwitch.

PRs and States:

  • PRs for generic switch upgrade.

| Repo | PR title | State |
|---|---|---|
| sonic-gnmi | GNOI Implementation for OS.Verify | |
| sonic-gnmi | GNOI Implementation for OS.Activate | |
| sonic-gnmi | GNOI Implementation for System.SetPackage | |
| sonic-host-services | Implementation for ImageService.List | |
| sonic-host-services | Add ImageService.set_next_boot for GNOI Activate OS | |
| sonic-host-services | DBUS API for GNOI System.SetPackage | |

@mssonicbld (Collaborator)

/azp run

No pipelines are associated with this pull request.

@KrisNey-MSFT

Hi @hdwhdw - are you looking for a Reviewer for this PR?

@hdwhdw hdwhdw marked this pull request as ready for review February 7, 2025 21:33
@hdwhdw (Author) commented Feb 7, 2025

@KrisNey-MSFT yes please. Thank you :)

### 2. Scope

This document describes the high-level design of the sequence for independently upgrading a SmartSwitch DPU through the gNOI API, with minimal impact on the other DPUs and the NPU.

Contributor:

Explicitly define dependencies and any ordering constraints to prevent unexpected failures.

Author:

Added. Let me know if there are any other dependencies.

* 'System.SetPackage'
* 'OS.Activate'
* 'Containerz.Deploy'
* Rollback:

Contributor:

Rollback

What happens if rollback fails or the previous image is corrupted?

Author:

Added a sentence for that. If Activate fails, we should SetPackage (install) the previous image and Activate again.
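
The fallback described in this reply could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the HLD's implementation: `DpuClient`, `RpcError`, `ManualInterventionRequired`, `FakeClient`, and the `rollback` helper are all hypothetical names, and the two client methods merely stand in for the `System.SetPackage` and `OS.Activate` RPCs without modelling their real request messages.

```python
from typing import Protocol


class RpcError(Exception):
    """Hypothetical error raised by the client wrapper when an RPC fails."""


class ManualInterventionRequired(Exception):
    """Raised when the rollback cannot be completed automatically."""


class DpuClient(Protocol):
    """Hypothetical wrapper around the RPCs named in this thread."""

    def set_package(self, image: str) -> None: ...  # stands in for System.SetPackage
    def activate(self, version: str) -> None: ...   # stands in for OS.Activate


def rollback(client: DpuClient, old_version: str, old_image: str) -> str:
    """Restore the old image; return which path succeeded."""
    # First attempt: the old image is still installed, so just activate it.
    try:
        client.activate(old_version)
        return "activated"
    except RpcError:
        pass
    # Fallback: reinstall the previous image, then activate again.
    try:
        client.set_package(old_image)
        client.activate(old_version)
        return "reinstalled"
    except RpcError as err:
        # Both attempts failed: leave the decision to an operator.
        raise ManualInterventionRequired("rollback failed twice") from err


class FakeClient:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self, failing_activations=0):
        self.failing_activations = failing_activations
        self.calls = []

    def set_package(self, image):
        self.calls.append(f"SetPackage({image})")

    def activate(self, version):
        self.calls.append(f"Activate({version})")
        if self.failing_activations > 0:
            self.failing_activations -= 1
            raise RpcError("activate failed")
```

With `FakeClient(failing_activations=1)`, the first activation fails and the helper reinstalls the old image before activating again. If the second activation also fails, the sketch raises rather than retrying, which matches the case where the system is left for manual intervention.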

3. DPU and NPU image compatibility: The upgrade process assumes that the DPU and NPU images are compatible with each other. The client is responsible for ensuring that the images are compatible.
4. Eliminating human intervention: The upgrade process may require human intervention to resolve issues that cannot be handled automatically. In particular, when both the upgrade and the rollback fail, the system may be left in an inconsistent state that requires manual intervention.

### 6. Architecture Degn

Contributor:

Architecture Design

* 'Containerz.Deploy'
* Rollback:
  * Roll back the new SONiC image on the DPU: the client issues 'OS.Activate' with the old SONiC image.
  * Roll back the new offloaded container images on the NPU: the client issues 'Containerz.RemoveImage' with the old container images.

Contributor:

Why does rolling back the new offloaded image issue RemoveImage with the old image?
