Skip to content

Commit

Permalink
Rerun CLI commands doc (#2330)
Browse files Browse the repository at this point in the history
  • Loading branch information
taneyland authored Jul 21, 2022
1 parent 62b8269 commit b5c6f64
Showing 1 changed file with 114 additions and 0 deletions.
114 changes: 114 additions & 0 deletions designs/rerun-cli-commands.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
# Rerun CLI commands after failure

## Introduction
**Problem:**
Most of the errors returned by the CLI cluster commands are either transient or fixable with manual intervention.
However, the CLI commands are not idempotent.
This makes impossible to rerun commands after they have already failed, even if the root cause of the issues has already been solved.
This makes the experience very disruptive, specially for the upgrade cluster command, which could leave the cluster in an irrecoverable state without the necessary manual cleanup.
This "cleanup" is not documented and it heavily dependents on the internal CLI's implementation.
For certain scenarios, it is not even possible, requiring users to destroy and recreate their clusters from scratch.

## Solution
The proposed solution is to implement a "checkpoint" capability.
This will serve as the first step to a larger story of improving the customer experience with upgrading clusters and troubleshooting in general.
The CLI will keep a registry of all completed steps for a workflow and the necessary data to restore its state.
If the `upgrade` command fails and the error is manually fixable, the user can simply rerun the same command after implementing the fix.
When a command is rerun after an error, the CLI will skip the completed steps, restoring the state of the program and proceeding with the upgrade from there.
The checkpoint data would be stored in the `<clusterName>/generated` folder as a file named `<clusterName>-checkpoint.yaml`.

Example checkpoint file:

```yaml
completedTasks:
bootstrap-cluster-init:
checkpoint:
ExistingManagement: false
KubeconfigFile: test/generated/test.kind.kubeconfig
Name: test
ensure-etcd-capi-components-exist:
checkpoint: null
install-capi:
checkpoint: null
pause-controllers-reconcile:
checkpoint: null
setup-and-validate:
checkpoint: null
update-secrets:
checkpoint: null
upgrade-core-components:
checkpoint:
components: []
upgrade-needed:
checkpoint: null
```
---
We will add two methods, `Restore()` and `Checkpoint()`.
The logic in both methods will vary depending on what is needed from each task.

```go
type Task interface {
Run(ctx context.Context, commandContext *CommandContext) Task
Name() string
Checkpoint() TaskCheckpoint
Restore(ctx context.Context, commandContext *CommandContext, checkpoint TaskCheckpoint) (Task, error)
}
```

The `Checkpoint()` method will return the information necessary to successfully restore the task if the operation fails.
This information will then be retained in the checkpoint file.

The `Restore()` method will use the information from the checkpoint file to restore the state.
Again, this will vary depending on what's needed for that specific task.
Once the restoration succeeds for a task, we will return the next task (essentially skipping the task that was restored).

---
When a task finishes with no errors, we will add it as a completed task in the checkpoint file with the necessary information (if any) to allow the command to rerun successfully.

```go
if commandContext.OriginalError == nil {
// add task to completedTasks in checkpoint file
checkpointInfo.taskCompleted(task.Name(), task.Checkpoint())
}
```

When the user re-runs the command, we will check if the task exists as a completed task in the checkpoint file.
If so, we will restore any necessary information from the checkpoint file and set the next task.

```go
for task != nil {
if completedTask, ok := checkpointInfo.CompletedTasks[task.Name()]; ok {
// Restore any necessary information from the checkpoint file & set task to the next task
nextTask, err := task.Restore(ctx, commandContext, completedTask)
...
continue
}
logger.V(4).Info("Task start", "task_name", task.Name())
...
...
}
```

### Other notable changes:
* If the CLI fails to install CAPI objects to the bootstrap cluster, the CLI currently deletes the bootstrap cluster.
This will obviously cause problems if we were to rerun the command at that point.
Instead of deleting the bootstrap cluster if `InstallCAPITask` fails, it will proceed to `CollectDiagnosticsTask`.
From there, the user can re-run the command to attempt installing the CAPI components on the previously created bootstrap cluster.


* If the CLI fails when moving CAPI management, things start to get a bit tricky.
We could move CAPI management back to the source cluster if it fails, but this operation can be risky & potentially leave the cluster in an undesirable state.
For now, we will retry moving the CAPI objects if the task fails the first time.
A deeper dive would be required to decide on a better way to roll back those changes if moving CAPI fails.
An idea for a future improvement is to backup the CAPI objects associated with the move using the `clusterctl backup` command. See [here](https://github.com/kubernetes-sigs/cluster-api/blob/main/cmd/clusterctl/cmd/backup.go).


* The only task that we should always run is the `setup-and-validate` task.
Skipping this task would leave a blind spot for misconfigurations from the user input.

### Future Work

* Rolling back changes if moving CAPI management fails
* Add more validations to ensure each task's purpose was successful & left the cluster in a healthy state before skipping
* Building on top of this framework to further improve user experience

0 comments on commit b5c6f64

Please sign in to comment.