-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add reset annotation and CRI command #13
Conversation
73388ac
to
eff8a6f
Compare
Oh I didn't even know this existed haha. For PRs to Kevins forks you'll have to @ us manually as cplat doesn't get auto tagged, I should've brought this up. @ambarve @kevpar @katiewasnothere @anmaxvl @msscotb |
Adds tests for resetting tasks and containers explicitly with CRI plugin API, and implicitly using annotations and start/stop commands. PR relies on accompanying CRI PR (kevpar/cri#13) being merged. Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
1e33171
to
f63c437
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
This commit supports restarting containers and pods using CRI: kevpar/cri#13 This PR allows the service to remove tasks from a pods workloadTasks map after the task and associated execs have been shut down during in a delete task request, allowing for proper deletion of the task and freeing up associated resources when received by the service. Namely, this frees up the deleted task's ID, so that new tasks can be created with that same ID after the original task has been deleted (ie, so a task can be restarted within a running pod). A DeleteTask function was added to the shimPod interface to implement most of this functionality. Additionally, the service, in deleteInternal, resets its internal reference to the init task (shimPod or shimTask) reference, taskOrPod, if the delete is issued for the init task, as a marker that the service is no longer operational and to prevent future operations from occurring. Signed-off-by: Hamza El-Saawy hamzaelsaawy@microsoft.com
rpc ResetPodSandbox(ResetPodSandboxRequest) returns (ResetPodSandboxResponse) {} | ||
// ResetContainer resets a stopped container back to the created state, keeping | ||
// its scratch space untouched. | ||
// This call is idempotent, and must not return an error if the container is already | ||
// in the created state. | ||
rpc ResetContainer(ResetContainerRequest) returns (ResetContainerResponse) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need these new API calls if we can accomplish what we want in a call to StartContainer
or RunPodSandbox
?
// | | +----+----+ | | | ||
// | | | | | | ||
// | | Reset | Stop/Exit | | | ||
// | | | | | | ||
// | | +----v----+ | | | ||
// | | | <---------+ +----v----+ | ||
// | +------+ NOTREADY| | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow what the change to this chart is doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a transition from the not ready state (stopped) back to ready (running).
It does look pretty confusing though, since the Ready --> Ready
transition by way of PortForward
convolutes the new arrow.
Added annotation (`io.microsoft.cri.allowreset`) to allow run (for pods) and start (for containers) to reset previously created pods or containers (respectively). Additionally, there is are CRI grpc commands to reset the state of stopped pods and containers. Reset will cause a stopped pod to run again and start its sandbox/pause container, and recreate its network namespace, as well as update the `IP`, `NetNSPath`, `CNIResult`, and `Container` fields. For containers, it will reset its state back to `CREATED`, recreating IO pipes and other resources as needed. In all cases, the scratch VHD (`sandbox.vhdx`) will be reused, preserving the disk state of the containers. Resetting a pod will reset its containers as well. Resetting state was an easier approach than adding in a `CREATED` state to pods, so that they would have similar creation and start semantics to containers. The latter approach involved changing how most CRI pods commands functioned. Added combination `LoadNetNS` and `Remove` function for HCN namespaces: function skips duplicate `hcn.GetNamespaceByID` and error checking. Combined namespace removal with recreation, as they are likely always happen together. Rather than flushing uVM file buffers on shutdown to ensure VHD logs SCSI unmounting for container scratch space from the uVM and removes the corresponding junction points, code now removes old uVM scratch space/snapshot and creates a new one when resetting hypervisor WCOW pods. Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
state := sandbox.Status.Get().State | ||
switch state { | ||
case sandboxstore.StateReady: | ||
entity.Debugf("sandbox is already running") | ||
return nil | ||
case sandboxstore.StateNotReady: | ||
default: | ||
return errors.Errorf("sandbox %q is in invalid state %q", id, state) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I know we don't have a corresponding "Created" state for pods, but does this mean we have no way of telling if a pod has actually run already and exited when we go to reset it? Could that be a problem for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ish: on pod "creation" (start), the underlying task is automatically started and run, and the pod goes into the ready state.
So, if a pod exists and is not ready, the assumption is that either it was previously running (ready) and was stopped or was loaded (after a containerd/CRI restart) in a stopped or unknown state. From what I can tell, CRI guarantees that pods that fail to start their tasks when created are deleted automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple comments but otherwise looks good to me
Adds tests for resetting tasks and containers explicitly with CRI plugin API, and implicitly using annotations and start/stop commands. PR relies on accompanying CRI PR (kevpar/cri#13) being merged. Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
Adds tests for resetting tasks and containers explicitly with CRI plugin API, and implicitly using annotations and start/stop commands. PR relies on accompanying CRI PR (kevpar/cri#13) being merged. Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
* Tests for task and sandbox reset/restart Adds tests for resetting tasks and containers explicitly with CRI plugin API, and implicitly using annotations and start/stop commands. PR relies on accompanying CRI PR (kevpar/cri#13) being merged. Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com> * PR: wrappers, annotation, comments Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
This PR supports reusing a container's scratch space by allowing pods and containers to be reset.
Containerd allows containers to be started and stopped repeatedly, but CRI explicitly enforces a state machine for containers (
pkg/store/container/status.go
) and pods (pkg/store/sandbox/status.go
). Resetting keeps as much of a pod's or container's existing resources allocated and reused (namely the scratch space VHD) and recreate closed resources, as needed.Two new gPRC commands were added to the CRI plugin to reset a pod back to the
READY
state fromNOTREADY
and a container back toCREATED
from theEXITED
state, respectively.Additionally, a reset annotation (
io.microsoft.cri.enablereset
) was added to allow run (or start) to work onNOTREADY
(orEXITED
) pods (or containers).Resetting a pod will restart its sandbox/pause container, recreate the HCN namespace and volatile directories, and reset the containers within it.
Resetting a container will change its state back to
CREATED
, and recreate IO pipes and directories, and update the Linux namespaces (if applicable).In all cases, the scratch vhd (
sandbox.vhdx
) will be reused, preserving the disk state of the containers.It is possible to reset a container and start it again without having to the stop the pod it is in.
Resetting state pod state back to
READY
was an easier and approach than adding in aCREATED
state to them to mirror the container state machine. The latter approach involved changing how the CRI plugin handled and tracked pods internally, and possibly risk corrupting state it persisted on disk.Associated PRs:
Currently, after a container is stopped, CRI issues a task delete command to the shim, but runhcsshim does not delete the task ID from the underlying
pod
objects internal state, even after the task itself is removed. This causes the pod to track a phantom task that no longer exists, and prevents the pod from creating a task with the same ID as a prior task that was removed.The PR to fix it is: microsoft/hcsshim#1271.
For LCOW, the Linux gcs prematurely deletes the underlying runc container when a stop task request is issued, which causes the subsequent delete task request to fail, since it cannot find the container. That prevents the above issue, with removing the task from the shim's internal state, from occuring, as the delete task request errors half-way through.
The PR to fix it is: microsoft/hcsshim#1272
Tests of end to end reset functionality and persisting state across restarts are in draft PR, waiting on this and associated PRs to merged: microsoft/hcsshim#1273
Signed-off-by: Hamza El-Saawy hamzaelsaawy@microsoft.com