
Improve handling of errors during PVC cleanup #879

Merged: 6 commits merged into devfile:main on Jul 4, 2022

Conversation

amisevsk (Collaborator)

What does this PR do?

While working on a fix for #877, I began by attempting to unify how workspace status is managed in the finalize function (rather than having multiple places where it may be updated). In doing so, I came across a bug where every time finalize was called, it would first set the on-cluster workspace's status to Terminating, then potentially update it to Errored if PVC cleanup encountered an error. This turns out to be the underlying cause of #845.

See the commit messages in this PR for more information; in summary, this PR:

  1. Reuses the same trick from the reconcile function for handling workspace status: we defer a function that sets the workspace's status, so there is only one status update per call to finalize (see the sketch after this list). This resolves #877 (Handling of cleanup job errors should be improved) and avoids reintroducing #845 (DWO continues to reconcile workspace when common PVC cleanup job fails), since we are no longer updating the workspace constantly.
  2. Changes the finalize function to continue attempting to finalize terminating workspaces even if they are in an errored state (partially undoing the fix for #845), in order to resolve #877.
  3. Adds a minor fix to avoid logging a strange error I've been seeing lately, where we attempt to update the status of a workspace after it has been garbage collected (need to verify this works fully) -- see #878 (StorageError sometimes logged when deleting a DevWorkspace).
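
A minimal sketch of the deferred status-update pattern from item 1, with illustrative type and method names (currentStatus is the only name taken from the actual commits; everything else here is a hypothetical stand-in for the real DWO types):

package sketch

import (
	"context"
	"fmt"
)

// Illustrative stand-ins for the real DevWorkspace API types and reconciler.
type DevWorkspace struct{ Name string }

type DevWorkspaceReconciler struct{}

// currentStatus mirrors the struct reference that the finalize function passes
// around and updates as it goes.
type currentStatus struct {
	phase   string // e.g. "Terminating" or "Error"
	message string
}

// updateStatusOnCluster stands in for the one call that writes the workspace's
// status back to the cluster (the real code goes through the status client).
func (r *DevWorkspaceReconciler) updateStatusOnCluster(ctx context.Context, w *DevWorkspace, s *currentStatus) {
	fmt.Printf("workspace %s: phase=%s message=%q\n", w.Name, s.phase, s.message)
}

// cleanupPVC stands in for syncing the PVC cleanup job and checking its result.
func (r *DevWorkspaceReconciler) cleanupPVC(ctx context.Context, w *DevWorkspace) error {
	return nil
}

// finalize shows the pattern: the status is written exactly once per call, in
// the deferred function, no matter which path the cleanup takes.
func (r *DevWorkspaceReconciler) finalize(ctx context.Context, w *DevWorkspace) error {
	status := &currentStatus{phase: "Terminating"}
	defer func() { r.updateStatusOnCluster(ctx, w, status) }()

	if err := r.cleanupPVC(ctx, w); err != nil {
		// Record the error in memory; the deferred call persists it in one update.
		status.phase = "Error"
		status.message = err.Error()
	}
	return nil
}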

I think this PR may also incidentally resolve the main piece of #873 but this needs more verification. In my brief testing, I've found:

  • The errored workspace from #873 (Errored workspaces aren't removed after PVC cleanup failure) is still not removed, as no reconcile is triggered, but it will be removed the next time it is reconciled (you can trigger a reconcile by e.g. adding an annotation to the workspace).
  • However, in one case the cluster was still not left in a clean state: the errored pods were still present even after the job was removed (not sure why).

What issues does this PR fix or reference?

Closes: #877
Needs verification: #873

Is it tested? How?

  1. Verify that #845 (DWO continues to reconcile workspace when common PVC cleanup job fails) is not reintroduced, via the reproducer in that issue.

  2. Test that #877 (Handling of cleanup job errors should be improved) is resolved. (This isn't a perfect test, but it approximates the situation that causes #877.)

  3. Check whether this PR also resolves #873 (Errored workspaces aren't removed after PVC cleanup failure) by running the reproduction steps there.

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

amisevsk added 3 commits June 27, 2022 20:44
Rework how we update workspace status to avoid multiple calls that
update the workspace status. Previously, every time we enter the
finalize function, we would set the status to terminating, and then
potentially set it to errored.

Instead, we use the same deferred-function handling of the workspace status as
in the main reconcile function -- defer a function that updates the status, and
pass around a currentStatus struct reference that gets updated along the way.
This moves all calls that update the workspace status into one place.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
With the update to terminating workspace status handling (see previous
commit), it turns out it's no longer necessary to stop reconciling
workspaces once their job encounters an error. The previous bug where
DWO would enter a tight loop reconciling failed workspaces turns out to
be due to each finalize call updating the workspace to Terminating
status, then to Errored status, triggering a new reconcile.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
When the last finalizer on a DevWorkspace is cleared, the cluster may
garbage-collect that workspace immediately, which can result in the
object being deleted before we reach the point of attempting to update
the workspace's status. This results in logging a harmless but confusing
error, so we don't try to update the status for terminating workspaces
with no finalizers.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
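
A minimal sketch of the guard described in the commit above (hypothetical helper name; the real check inspects the DevWorkspace's deletion timestamp and remaining finalizers):

// shouldUpdateStatus is an illustrative helper: skip the status update for a
// terminating workspace with no finalizers left, since the cluster may
// garbage-collect it before the update lands, producing the confusing error.
func shouldUpdateStatus(isTerminating bool, finalizers []string) bool {
	return !(isTerminating && len(finalizers) == 0)
}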
openshift-ci bot added the lgtm label Jun 28, 2022
@ibuziuk mentioned this pull request Jun 28, 2022
dkwon17 (Collaborator) commented Jun 28, 2022

I'm having trouble verifying step 1:

After checking out this PR and following the reproduction steps from #845, I see the "Failed to clean up DevWorkspace storage" error logged three times instead of once in the devworkspace-controller-manager-* pod:
(screenshot of controller logs)

Pods (failing cleanup pods are to be expected):
(screenshot of pod list)

I changed the container args in pkg/provision/storage/cleanup.go as explained in the reproduction steps, and created this image: quay.io/dkwon17/devworkspace-controller:cleanup

Any ideas why I see "Failed to clean up DevWorkspace storage" three times instead of once?

amisevsk (Collaborator, Author) commented Jun 28, 2022

Seeing the message multiple times is expected. You'll see that message logged every time an event occurs on any workspace-related object, and it can be triggered by e.g. editing annotations on the pod/PVC/job/DevWorkspace/etc.

The key point is that after occurring a few times, it stops -- DWO is not reconciling that object in a loop (i.e. DWO is not re-queuing its own reconciles).

As for an explanation why you might see it three times:

  1. The first time we detect an error, we log that line and set the workspace to the errored status.
  2. Updating the workspace status is an event that triggers a new reconcile, during which we see the error case is still present and log the line again.
  3. Not sure what's triggering the third one -- it could be something random on the cluster or something strange we're doing.

In general, it's not an ideal situation, but it is how controllers generally have to work (they have to redo all their work on every event for a given object). I'll add a commit to suppress the log line if the workspace's state is already errored, though -- that will make things a little cleaner.

For more clarity, if we were to never reconcile a DevWorkspace again after it reached an errored state, that would directly trigger issues like #877 and #873.

The issue arises when each reconcile triggers another reconcile with no path out of the errored state, causing the controller to enter a tight loop checking a state that is not changing. Previously, what happened on each finalize call was:

  1. Set workspace status to terminating -> if this is a change it will trigger a reconcile
  2. Find error, set workspace status to errored -> This will always trigger a reconcile since we set it to terminating earlier

With the changes, we only ever set the status once per finalize (to Terminating or Errored), meaning that if the workspace is already in an errored state, no change is applied and no new reconcile is triggered (sketched below).
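
Roughly, the before/after difference looks like this (pseudocode-level sketch with hypothetical names, not the actual DWO code):

// Before: each finalize call made two unconditional status writes --
//   setPhase("Terminating") // a change whenever the workspace was Errored -> new reconcile
//   setPhase("Error")       // always a change after the line above -> another reconcile
// After: a single write, skipped when nothing changed, so an already-Errored
// workspace produces no new watch event and no new reconcile.
func needsStatusUpdate(onClusterPhase, desiredPhase string) bool {
	return onClusterPhase != desiredPhase
}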

dkwon17 (Collaborator) commented Jun 29, 2022

I see, thanks for the explanation @amisevsk

I was able to verify that this PR handles test item 2 (#877), but I had trouble verifying item 3 (#873).

After following the steps from #873 (comment), is step 6. (running oc delete dw --all) supposed to properly delete theia-next? In my case, theia-next-2 is deleted, but theia-next is still stuck in the errored state:

$ oc get dw
NAME         DEVWORKSPACE ID             PHASE   INFO
theia-next   workspace90f29c91547e48d4   Error   DevWorkspace PVC cleanup job failed: see logs for job "cleanup-workspace90f29c91547e48d4" for details

AObuchow (Collaborator)

@amisevsk I tested this PR but got some weird results when verifying #845 isn't reproduced. For some reason, on OpenShift 4.10, the cleanup job pod was created thousands of times:
(screenshot of cleanup job pods)

The errors which were getting logged were similar to the following:

{"level":"error","ts":1656541871.0967922,"logger":"controllers.DevWorkspace","msg":"Failed to clean up DevWorkspace storage","Request.Namespace":"devworkspace-controller","Request.Name":"theia-next","devworkspace_id":"workspacec2f66501846142e8","error":"Failed to sync cleanup job with cluster: Job.batch \"cleanup-workspacec2f66501846142e8\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{\"controller.devfile.io/creator\":\"\", \"controller.devfile.io/devworkspace_id\":\"workspacec2f66501846142e8\", \"controller.devfile.io/devworkspace_name\":\"theia-next\"}: `selector` does not match template `labels`, spec.selector: Invalid value: \"null\": field is immutable, spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\"controller.devfile.io/creator\":\"\", \"controller.devfile.io/devworkspace_id\":\"workspacec2f66501846142e8\", \"controller.devfile.io/devworkspace_name\":\"theia-next\"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume{core.Volume{Name:\"claim-devworkspace\", VolumeSource:core.VolumeSource{HostPath:(*core.HostPathVolumeSource)(nil), EmptyDir:(*core.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*core.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*core.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*core.GitRepoVolumeSource)(nil), Secret:(*core.SecretVolumeSource)(nil), NFS:(*core.NFSVolumeSource)(nil), ISCSI:(*core.ISCSIVolumeSource)(nil), Glusterfs:(*core.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*core.PersistentVolumeClaimVolumeSource)(0xc05aa28b58), RBD:(*core.RBDVolumeSource)(nil), Quobyte:(*core.QuobyteVolumeSource)(nil), FlexVolume:(*core.FlexVolumeSource)(nil), Cinder:(*core.CinderVolumeSource)(nil), CephFS:(*core.CephFSVolumeSource)(nil), Flocker:(*core.FlockerVolumeSource)(nil), DownwardAPI:(*core.DownwardAPIVolumeSource)(nil), FC:(*core.FCVolumeSource)(nil), AzureFile:(*core.AzureFileVolumeSource)(nil), ConfigMap:(*core.ConfigMapVolumeSource)(nil), VsphereVolume:(*core.VsphereVirtualDiskVolumeSource)(nil), AzureDisk:(*core.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*core.PhotonPersistentDiskVolumeSource)(nil), Projected:(*core.ProjectedVolumeSource)(nil), PortworxVolume:(*core.PortworxVolumeSource)(nil), ScaleIO:(*core.ScaleIOVolumeSource)(nil), StorageOS:(*core.StorageOSVolumeSource)(nil), CSI:(*core.CSIVolumeSource)(nil), Ephemeral:(*core.EphemeralVolumeSource)(nil)}}}, InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:\"cleanup-workspacec2f66501846142e8\", Image:\"registry.access.redhat.com/ubi8-micro:8.4-81\", Command:[]string{\"/bin/sh\"}, Args:[]string{\"-c\", \"exit 1\"}, WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar(nil), Resources:core.ResourceRequirements{Limits:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:50, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"50m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:104857600, scale:0}, 
d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"100Mi\", Format:\"BinarySI\"}}, Requests:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:5, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"5m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:33554432, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}}, VolumeMounts:[]core.VolumeMount{core.VolumeMount{Name:\"claim-devworkspace\", ReadOnly:false, MountPath:\"/tmp/devworkspaces/\", SubPath:\"\", MountPropagation:(*core.MountPropagationMode)(nil), SubPathExpr:\"\"}}, VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"File\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:\"Never\", TerminationGracePeriodSeconds:(*int64)(0xc057c271d0), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\"ClusterFirst\", NodeSelector:map[string]string(nil), ServiceAccountName:\"\", AutomountServiceAccountToken:(*bool)(nil), NodeName:\"\", SecurityContext:(*core.PodSecurityContext)(0xc05c5b3000), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\"\", Subdomain:\"\", SetHostnameAsFQDN:(*bool)(nil), Affinity:(*core.Affinity)(nil), SchedulerName:\"default-scheduler\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\"\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil), OS:(*core.PodOS)(nil)}}: field is immutable]","stacktrace":"github.com/devfile/devworkspace-operator/controllers/workspace.(*DevWorkspaceReconciler).finalize\n\t/home/aobuchow/git/devworkspace-operator/controllers/workspace/finalize.go:69\ngithub.com/devfile/devworkspace-operator/controllers/workspace.(*DevWorkspaceReconciler).Reconcile\n\t/home/aobuchow/git/devworkspace-operator/controllers/workspace/devworkspace_controller.go:131\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:214"}

This occurred after deleting one of the two workspaces.
Also, this was the status of the workspaces:

NAME           DEVWORKSPACE ID             PHASE     INFO
theia-next     workspacec2f66501846142e8   Error     DevWorkspace common PVC cleanup job failed: see logs for job "cleanup-workspacec2f66501846142e8" for details. Additional information: Detected unrecoverable event FailedMount 4 times: Unable to attach or mount volumes: unmounted volumes=[claim-devworkspace], unattached volumes=[claim-devworkspace kube-api-access-2m57d]: timed out waiting for the condition.
theia-next-2   workspace3036fb58cf68458d   Running   https://workspace3036fb58cf68458d.apps.ci-ln-w5mprw2-72292.origin-ci-int-gce.dev.rhcloud.com/theia/

My cluster has expired, but I will try re-testing very soon on another cluster to see if the issue occurs again, and to verify this PR in general.

amisevsk (Collaborator, Author)

@dkwon17 The issue you're running into is that no reconcile is queued up for that workspace :/

You can trigger one manually by editing e.g. the annotations on the errored workspace, and it should then be deleted. I don't know of a good way to trigger reconciles for all errored workspaces, though.

AObuchow (Collaborator)

> @amisevsk I tested this PR but got some weird results when verifying #845 isn't reproduced. For some reason, on OpenShift 4.10, the cleanup job pod was created thousands of times

Turns out I had 2 controllers running... oops :)

I was able to verify all 3 parts of this PR, and it worked as intended. For #873 I added another annotation to trigger a reconciliation which deleted the errored workspace.

Looks good to me :)

dkwon17 (Collaborator) left a review comment

Thanks again for clarifying, @amisevsk :) For #873, it is working after triggering a reconcile by editing the DW.

openshift-ci bot commented Jun 30, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, AObuchow, dkwon17, ibuziuk


amisevsk added 3 commits June 30, 2022 18:23
Since setting a workspace to the errored status will queue another
reconcile, we only log the "failed to clean up common PVC" message if
the workspace status is not already set to errored.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
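
A minimal sketch of that check (illustrative helper; assumes the logr.Logger that controller-runtime controllers typically use):

import "github.com/go-logr/logr"

// logCleanupFailure only logs the failure if the workspace isn't already
// Errored; setting the status to Errored queues another reconcile that would
// otherwise log the same line again.
func logCleanupFailure(log logr.Logger, alreadyErrored bool, err error) {
	if !alreadyErrored {
		log.Error(err, "Failed to clean up DevWorkspace storage")
	}
}
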
To make sure workspaces that are in an errored state due to PVC cleanup
failing are removed when the common PVC is deleted (e.g. when all
DevWorkspaces are deleted), add a watch and enqueue reconciles for all
common-storage workspaces when the common PVC is deleted.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Update the watch for PVCs to also enqueue reconciles for per-workspace
PVC events. Previously, the per-workspace PVC could be modified/deleted
without the DevWorkspace that owns that PVC noticing.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
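
A minimal sketch of the kind of watch the two commits above describe, using the controller-runtime builder. The mapping logic is illustrative: it enqueues every DevWorkspace in the PVC's namespace, whereas the real handler narrows this down to common-storage workspaces for the shared PVC or to the single owner of a per-workspace PVC.

package sketch

import (
	"context"

	dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// DevWorkspaceReconciler here is a stripped-down stand-in; the real reconciler
// embeds a controller-runtime client in the same way.
type DevWorkspaceReconciler struct {
	client.Client
}

func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *DevWorkspaceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&dw.DevWorkspace{}).
		// Watch PVCs so that deleting the common PVC (or touching a
		// per-workspace PVC) enqueues reconciles for the affected workspaces.
		Watches(&source.Kind{Type: &corev1.PersistentVolumeClaim{}},
			handler.EnqueueRequestsFromMapFunc(func(obj client.Object) []reconcile.Request {
				dwList := &dw.DevWorkspaceList{}
				if err := r.List(context.Background(), dwList, client.InNamespace(obj.GetNamespace())); err != nil {
					return nil
				}
				var requests []reconcile.Request
				for _, ws := range dwList.Items {
					requests = append(requests, reconcile.Request{
						NamespacedName: types.NamespacedName{Name: ws.Name, Namespace: ws.Namespace},
					})
				}
				return requests
			})).
		Complete(r)
}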
openshift-ci bot removed the lgtm label Jul 1, 2022
openshift-ci bot commented Jul 1, 2022

New changes are detected. LGTM label has been removed.

amisevsk (Collaborator, Author) commented Jul 1, 2022

I've added 3 commits to this PR for further testing (sorry all!):

  1. Don't log the "Failed to clean up" message if the workspace is already in an Errored state, to avoid confusion.
  2. Queue reconciles for all common-storage workspaces when the shared PVC is deleted, to avoid the confusion faced during testing and remove the need to manually trigger a reconcile on the errored workspace (the issue discussed in the comments above should no longer occur).
  3. Also queue reconciles for workspaces using per-workspace storage when an event occurs on the PVC they own, something we weren't doing before. This closes #881 (Changes to the per-workspace PVC do not trigger reconciles).

With the second commit's changes, I think this PR closes #873

amisevsk linked an issue Jul 1, 2022 that may be closed by this pull request
dkwon17 (Collaborator) commented Jul 4, 2022

I was able to verify that the new commits fix #873 and #881. The new commits look good to me 👍

AObuchow (Collaborator) commented Jul 4, 2022

I was also able to verify the new commits work as expected in fixing #873 and #881 :) Code looks good to me :)

amisevsk merged commit 2a44ff6 into devfile:main Jul 4, 2022
amisevsk deleted the cleanup-warning branch July 4, 2022 21:18