
Improve handling of errors during PVC cleanup #879

Merged: 6 commits merged into devfile:main on Jul 4, 2022

Conversation

amisevsk (Collaborator)

What does this PR do?

While working on a fix for #877, I began by attempting to unify how workspace status is managed in the finalize function (rather than having multiple places where it may be updated). In doing so, I came across a bug where every time finalize was called, it would first set the on-cluster workspace's status to Terminating, then potentially update it to Errored if PVC cleanup encountered an error. This turns out to be the underlying cause of #845.

See the commit messages in this PR for more information; in summary, this PR:

  1. Reuses the same trick from the reconcile function for handling workspace status: we defer a function that sets the workspace's status, so there is only one status update per call to finalize (see the sketch after this list). This resolves #877 (Handling of cleanup job errors should be improved) and avoids reintroducing #845 (DWO continues to reconcile workspace when common PVC cleanup job fails), since we are no longer updating the workspace constantly.
  2. Changes the finalize function to continue attempting to finalize terminating workspaces even if they are in an errored state (partially undoing the fix for #845), in order to resolve #877.
  3. Adds a minor fix to avoid logging a strange error I've been seeing lately, where we attempt to update the status of a workspace after it has been garbage collected (need to verify this works fully) -- see #878 (StorageError sometimes logged when deleting a DevWorkspace).
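
A minimal sketch of the deferred status-update pattern from item 1, with illustrative type and method names (currentStatus is the only name taken from the actual commits; everything else here is a hypothetical stand-in for the real DWO types):

package sketch

import (
	"context"
	"fmt"
)

// Illustrative stand-ins for the real DevWorkspace API types and reconciler.
type DevWorkspace struct{ Name string }

type DevWorkspaceReconciler struct{}

// currentStatus mirrors the struct reference that the finalize function passes
// around and updates as it goes.
type currentStatus struct {
	phase   string // e.g. "Terminating" or "Error"
	message string
}

// updateStatusOnCluster stands in for the one call that writes the workspace's
// status back to the cluster (the real code goes through the status client).
func (r *DevWorkspaceReconciler) updateStatusOnCluster(ctx context.Context, w *DevWorkspace, s *currentStatus) {
	fmt.Printf("workspace %s: phase=%s message=%q\n", w.Name, s.phase, s.message)
}

// cleanupPVC stands in for syncing the PVC cleanup job and checking its result.
func (r *DevWorkspaceReconciler) cleanupPVC(ctx context.Context, w *DevWorkspace) error {
	return nil
}

// finalize shows the pattern: the status is written exactly once per call, in
// the deferred function, no matter which path the cleanup takes.
func (r *DevWorkspaceReconciler) finalize(ctx context.Context, w *DevWorkspace) error {
	status := &currentStatus{phase: "Terminating"}
	defer func() { r.updateStatusOnCluster(ctx, w, status) }()

	if err := r.cleanupPVC(ctx, w); err != nil {
		// Record the error in memory; the deferred call persists it in one update.
		status.phase = "Error"
		status.message = err.Error()
	}
	return nil
}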

I think this PR may also incidentally resolve the main piece of #873 but this needs more verification. In my brief testing, I've found:

  • The errored workspace from #873 (Errored workspaces aren't removed after PVC cleanup failure) is still not removed, as no reconcile is triggered, but it will be removed the next time it is reconciled (you can trigger a reconcile by e.g. adding an annotation to the workspace).
  • However, in one case the cluster was still not left in a clean state: the errored pods were still present even after the job was removed (not sure why).

What issues does this PR fix or reference?

Closes: #877
Needs verification: #873

Is it tested? How?

  1. Verify that #845 (DWO continues to reconcile workspace when common PVC cleanup job fails) is not reintroduced, via the reproducer in that issue.

  2. Test that #877 (Handling of cleanup job errors should be improved) is resolved. (This isn't a perfect test, but it approximates the situation that causes #877.)

  3. Check whether this PR also resolves #873 (Errored workspaces aren't removed after PVC cleanup failure) by running the reproduction steps there.

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

amisevsk added 3 commits June 27, 2022 20:44
Rework how we update workspace status to avoid multiple calls that
update the workspace status. Previously, every time we enter the
finalize function, we would set the status to terminating, and then
potentially set it to errored.

Instead, we use the same deferred-function handling of the workspace status as
in the main reconcile function -- defer a function that updates the status, and
pass around a currentStatus struct reference that gets updated along the way.
This moves all calls that update the workspace status into one place.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
With the update to terminating workspace status handling (see previous
commit), it turns out it's no longer necessary to stop reconciling
workspaces once their job encounters an error. The previous bug where
DWO would enter a tight loop reconciling failed workspaces turns out to
be due to each finalize call updating the workspace to Terminating
status, then to Errored status, triggering a new reconcile.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
When the last finalizer on a DevWorkspace is cleared, the cluster may
garbage-collect that workspace immediately, which can result in the
object being deleted before we reach the point of attempting to update
the workspace's status. This results in logging a harmless but confusing
error, so we don't try to update the status for terminating workspaces
with no finalizers.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
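
A minimal sketch of the guard described in the commit above (hypothetical helper name; the real check inspects the DevWorkspace's deletion timestamp and remaining finalizers):

// shouldUpdateStatus is an illustrative helper: skip the status update for a
// terminating workspace with no finalizers left, since the cluster may
// garbage-collect it before the update lands, producing the confusing error.
func shouldUpdateStatus(isTerminating bool, finalizers []string) bool {
	return !(isTerminating && len(finalizers) == 0)
}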
openshift-ci bot added the lgtm label Jun 28, 2022
@ibuziuk mentioned this pull request Jun 28, 2022
dkwon17 (Collaborator) commented Jun 28, 2022

I'm having trouble verifying step 1:

After checking out this PR and following the reproduction steps from #845, I see the "Failed to clean up DevWorkspace storage" error logged three times instead of once in the devworkspace-controller-manager-* pod:
(screenshot of controller logs)

Pods (failing cleanup pods are to be expected):
(screenshot of pod list)

I changed the container args in pkg/provision/storage/cleanup.go as explained in the reproduction steps, and created this image: quay.io/dkwon17/devworkspace-controller:cleanup

Any ideas why I see "Failed to clean up DevWorkspace storage" three times instead of once?

amisevsk (Collaborator, Author) commented Jun 28, 2022

Seeing the message multiple times is expected. You'll see that message logged every time an event occurs on any workspace-related object, and it can be triggered by e.g. editing annotations on the pod/PVC/job/DevWorkspace/etc.

The key point is that after occurring a few times, it stops -- DWO is not reconciling that object in a loop (i.e. DWO is not re-queuing its own reconciles).

As for an explanation why you might see it three times:

  1. The first time we detect an error, we log that line and set the workspace to the errored status.
  2. Updating the workspace status is an event that triggers a new reconcile, during which we see the error case is still present and log the line again.
  3. Not sure what's triggering the third one -- it could be something random on the cluster or something strange we're doing.

In general, it's not an ideal situation, but it is how controllers generally have to work (they have to redo all their work on every event for a given object). I'll add a commit to suppress the log line if the workspace's state is already errored, though -- that will make things a little cleaner.

For more clarity, if we were to never reconcile a DevWorkspace again after it reached an errored state, that would directly trigger issues like #877 and #873.

The issue arises when each reconcile triggers another reconcile with no path out of the errored state, causing the controller to enter a tight loop checking a state that is not changing. Previously, what happened on each finalize call was:

  1. Set workspace status to terminating -> if this is a change it will trigger a reconcile
  2. Find error, set workspace status to errored -> This will always trigger a reconcile since we set it to terminating earlier

With the changes, we only ever set the status once per finalize (to Terminating or Errored), meaning that if the workspace is already in an errored state, no change is applied and no new reconcile is triggered (sketched below).
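
Roughly, the before/after difference looks like this (pseudocode-level sketch with hypothetical names, not the actual DWO code):

// Before: each finalize call made two unconditional status writes --
//   setPhase("Terminating") // a change whenever the workspace was Errored -> new reconcile
//   setPhase("Error")       // always a change after the line above -> another reconcile
// After: a single write, skipped when nothing changed, so an already-Errored
// workspace produces no new watch event and no new reconcile.
func needsStatusUpdate(onClusterPhase, desiredPhase string) bool {
	return onClusterPhase != desiredPhase
}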

dkwon17 (Collaborator) commented Jun 29, 2022

I see, thanks for the explanation @amisevsk

I was able to verify that this PR handles test item 2 (#877), but I had trouble verifying item 3 (#873).

After following the steps from #873 (comment), is step 6. (running oc delete dw --all) supposed to properly delete theia-next? In my case, theia-next-2 is deleted, but theia-next is still stuck in the errored state:

$ oc get dw
NAME         DEVWORKSPACE ID             PHASE   INFO
theia-next   workspace90f29c91547e48d4   Error   DevWorkspace PVC cleanup job failed: see logs for job "cleanup-workspace90f29c91547e48d4" for details

AObuchow (Collaborator)

@amisevsk I tested this PR but got some weird results when verifying #845 isn't reproduced. For some reason, on OpenShift 4.10, the cleanup job pod was created thousands of times:
(screenshot of cleanup job pods)

The errors which were getting logged were similar to the following:

{"level":"error","ts":1656541871.0967922,"logger":"controllers.DevWorkspace","msg":"Failed to clean up DevWorkspace storage","Request.Namespace":"devworkspace-controller","Request.Name":"theia-next","devworkspace_id":"workspacec2f66501846142e8","error":"Failed to sync cleanup job with cluster: Job.batch \"cleanup-workspacec2f66501846142e8\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string{\"controller.devfile.io/creator\":\"\", \"controller.devfile.io/devworkspace_id\":\"workspacec2f66501846142e8\", \"controller.devfile.io/devworkspace_name\":\"theia-next\"}: `selector` does not match template `labels`, spec.selector: Invalid value: \"null\": field is immutable, spec.template: Invalid value: core.PodTemplateSpec{ObjectMeta:v1.ObjectMeta{Name:\"\", GenerateName:\"\", Namespace:\"\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{\"controller.devfile.io/creator\":\"\", \"controller.devfile.io/devworkspace_id\":\"workspacec2f66501846142e8\", \"controller.devfile.io/devworkspace_name\":\"theia-next\"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:\"\", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:core.PodSpec{Volumes:[]core.Volume{core.Volume{Name:\"claim-devworkspace\", VolumeSource:core.VolumeSource{HostPath:(*core.HostPathVolumeSource)(nil), EmptyDir:(*core.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*core.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*core.AWSElasticBlockStoreVolumeSource)(nil), GitRepo:(*core.GitRepoVolumeSource)(nil), Secret:(*core.SecretVolumeSource)(nil), NFS:(*core.NFSVolumeSource)(nil), ISCSI:(*core.ISCSIVolumeSource)(nil), Glusterfs:(*core.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*core.PersistentVolumeClaimVolumeSource)(0xc05aa28b58), RBD:(*core.RBDVolumeSource)(nil), Quobyte:(*core.QuobyteVolumeSource)(nil), FlexVolume:(*core.FlexVolumeSource)(nil), Cinder:(*core.CinderVolumeSource)(nil), CephFS:(*core.CephFSVolumeSource)(nil), Flocker:(*core.FlockerVolumeSource)(nil), DownwardAPI:(*core.DownwardAPIVolumeSource)(nil), FC:(*core.FCVolumeSource)(nil), AzureFile:(*core.AzureFileVolumeSource)(nil), ConfigMap:(*core.ConfigMapVolumeSource)(nil), VsphereVolume:(*core.VsphereVirtualDiskVolumeSource)(nil), AzureDisk:(*core.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*core.PhotonPersistentDiskVolumeSource)(nil), Projected:(*core.ProjectedVolumeSource)(nil), PortworxVolume:(*core.PortworxVolumeSource)(nil), ScaleIO:(*core.ScaleIOVolumeSource)(nil), StorageOS:(*core.StorageOSVolumeSource)(nil), CSI:(*core.CSIVolumeSource)(nil), Ephemeral:(*core.EphemeralVolumeSource)(nil)}}}, InitContainers:[]core.Container(nil), Containers:[]core.Container{core.Container{Name:\"cleanup-workspacec2f66501846142e8\", Image:\"registry.access.redhat.com/ubi8-micro:8.4-81\", Command:[]string{\"/bin/sh\"}, Args:[]string{\"-c\", \"exit 1\"}, WorkingDir:\"\", Ports:[]core.ContainerPort(nil), EnvFrom:[]core.EnvFromSource(nil), Env:[]core.EnvVar(nil), Resources:core.ResourceRequirements{Limits:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:50, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"50m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:104857600, scale:0}, 
d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"100Mi\", Format:\"BinarySI\"}}, Requests:core.ResourceList{\"cpu\":resource.Quantity{i:resource.int64Amount{value:5, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"5m\", Format:\"DecimalSI\"}, \"memory\":resource.Quantity{i:resource.int64Amount{value:33554432, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:\"\", Format:\"BinarySI\"}}}, VolumeMounts:[]core.VolumeMount{core.VolumeMount{Name:\"claim-devworkspace\", ReadOnly:false, MountPath:\"/tmp/devworkspaces/\", SubPath:\"\", MountPropagation:(*core.MountPropagationMode)(nil), SubPathExpr:\"\"}}, VolumeDevices:[]core.VolumeDevice(nil), LivenessProbe:(*core.Probe)(nil), ReadinessProbe:(*core.Probe)(nil), StartupProbe:(*core.Probe)(nil), Lifecycle:(*core.Lifecycle)(nil), TerminationMessagePath:\"/dev/termination-log\", TerminationMessagePolicy:\"File\", ImagePullPolicy:\"IfNotPresent\", SecurityContext:(*core.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, EphemeralContainers:[]core.EphemeralContainer(nil), RestartPolicy:\"Never\", TerminationGracePeriodSeconds:(*int64)(0xc057c271d0), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:\"ClusterFirst\", NodeSelector:map[string]string(nil), ServiceAccountName:\"\", AutomountServiceAccountToken:(*bool)(nil), NodeName:\"\", SecurityContext:(*core.PodSecurityContext)(0xc05c5b3000), ImagePullSecrets:[]core.LocalObjectReference(nil), Hostname:\"\", Subdomain:\"\", SetHostnameAsFQDN:(*bool)(nil), Affinity:(*core.Affinity)(nil), SchedulerName:\"default-scheduler\", Tolerations:[]core.Toleration(nil), HostAliases:[]core.HostAlias(nil), PriorityClassName:\"\", Priority:(*int32)(nil), PreemptionPolicy:(*core.PreemptionPolicy)(nil), DNSConfig:(*core.PodDNSConfig)(nil), ReadinessGates:[]core.PodReadinessGate(nil), RuntimeClassName:(*string)(nil), Overhead:core.ResourceList(nil), EnableServiceLinks:(*bool)(nil), TopologySpreadConstraints:[]core.TopologySpreadConstraint(nil), OS:(*core.PodOS)(nil)}}: field is immutable]","stacktrace":"github.com/devfile/devworkspace-operator/controllers/workspace.(*DevWorkspaceReconciler).finalize\n\t/home/aobuchow/git/devworkspace-operator/controllers/workspace/finalize.go:69\ngithub.com/devfile/devworkspace-operator/controllers/workspace.(*DevWorkspaceReconciler).Reconcile\n\t/home/aobuchow/git/devworkspace-operator/controllers/workspace/devworkspace_controller.go:131\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/aobuchow/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.5/pkg/internal/controller/controller.go:214"}

This occurred after deleting one of the two workspaces.
Also, this was the status of the workspaces:

NAME           DEVWORKSPACE ID             PHASE     INFO
theia-next     workspacec2f66501846142e8   Error     DevWorkspace common PVC cleanup job failed: see logs for job "cleanup-workspacec2f66501846142e8" for details. Additional information: Detected unrecoverable event FailedMount 4 times: Unable to attach or mount volumes: unmounted volumes=[claim-devworkspace], unattached volumes=[claim-devworkspace kube-api-access-2m57d]: timed out waiting for the condition.
theia-next-2   workspace3036fb58cf68458d   Running   https://workspace3036fb58cf68458d.apps.ci-ln-w5mprw2-72292.origin-ci-int-gce.dev.rhcloud.com/theia/

My cluster has expired, but I will try re-testing very soon on another cluster to see if the issue occurs again, and to verify this PR in general.

amisevsk (Collaborator, Author)

@dkwon17 The issue you're running into is that no reconcile is queued up for that workspace :/

You can trigger one manually by editing e.g. the annotations on the errored workspace, and it should then be deleted. I don't know of a good way to trigger reconciles for all errored workspaces, though.

AObuchow (Collaborator)

> @amisevsk I tested this PR but got some weird results when verifying #845 isn't reproduced. For some reason, on OpenShift 4.10, the cleanup job pod was created thousands of times

Turns out I had 2 controllers running... oops :)

I was able to verify all 3 parts of this PR, and it worked as intended. For #873 I added another annotation to trigger a reconciliation which deleted the errored workspace.

Looks good to me :)

dkwon17 (Collaborator) left a review comment

Thanks again for clarifying, @amisevsk :) For #873, it is working after triggering a reconcile by editing the DW.

openshift-ci bot commented Jun 30, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, AObuchow, dkwon17, ibuziuk


amisevsk added 3 commits June 30, 2022 18:23
Since setting a workspace to the errored status will queue another
reconcile, we only log the "failed to clean up common PVC" message if
the workspace status is not already set to errored.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
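
A minimal sketch of that check (illustrative helper; assumes the logr.Logger that controller-runtime controllers typically use):

import "github.com/go-logr/logr"

// logCleanupFailure only logs the failure if the workspace isn't already
// Errored; setting the status to Errored queues another reconcile that would
// otherwise log the same line again.
func logCleanupFailure(log logr.Logger, alreadyErrored bool, err error) {
	if !alreadyErrored {
		log.Error(err, "Failed to clean up DevWorkspace storage")
	}
}
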
To make sure workspaces that are in an errored state due to PVC cleanup
failing are removed when the common PVC is deleted (e.g. when all
DevWorkspaces are deleted), add a watch and enqueue reconciles for all
common-storage workspaces when the common PVC is deleted.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Update the watch for PVCs to also enqueue reconciles for per-workspace
PVC events. Previously, the per-workspace PVC could be modified/deleted
without the DevWorkspace that owns that PVC noticing.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
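
A minimal sketch of the kind of watch the two commits above describe, using the controller-runtime builder. The mapping logic is illustrative: it enqueues every DevWorkspace in the PVC's namespace, whereas the real handler narrows this down to common-storage workspaces for the shared PVC or to the single owner of a per-workspace PVC.

package sketch

import (
	"context"

	dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// DevWorkspaceReconciler here is a stripped-down stand-in; the real reconciler
// embeds a controller-runtime client in the same way.
type DevWorkspaceReconciler struct {
	client.Client
}

func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

func (r *DevWorkspaceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&dw.DevWorkspace{}).
		// Watch PVCs so that deleting the common PVC (or touching a
		// per-workspace PVC) enqueues reconciles for the affected workspaces.
		Watches(&source.Kind{Type: &corev1.PersistentVolumeClaim{}},
			handler.EnqueueRequestsFromMapFunc(func(obj client.Object) []reconcile.Request {
				dwList := &dw.DevWorkspaceList{}
				if err := r.List(context.Background(), dwList, client.InNamespace(obj.GetNamespace())); err != nil {
					return nil
				}
				var requests []reconcile.Request
				for _, ws := range dwList.Items {
					requests = append(requests, reconcile.Request{
						NamespacedName: types.NamespacedName{Name: ws.Name, Namespace: ws.Namespace},
					})
				}
				return requests
			})).
		Complete(r)
}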
openshift-ci bot removed the lgtm label Jul 1, 2022
openshift-ci bot commented Jul 1, 2022

New changes are detected. LGTM label has been removed.

amisevsk (Collaborator, Author) commented Jul 1, 2022

I've added 3 commits to this PR for further testing (sorry all!):

  1. Don't log the "Failed to clean up" message if the workspace is already in an Errored state, to avoid confusion.
  2. Queue reconciles for all common-storage workspaces when the shared PVC is deleted, to avoid the confusion faced during testing and remove the need to manually trigger a reconcile on the errored workspace (the issue discussed in the comments above should no longer occur).
  3. Also queue reconciles for workspaces using per-workspace storage when an event occurs on the PVC they own, something we weren't doing before. This closes #881 (Changes to the per-workspace PVC do not trigger reconciles).

With the second commit's changes, I think this PR closes #873

amisevsk linked an issue Jul 1, 2022 that may be closed by this pull request
dkwon17 (Collaborator) commented Jul 4, 2022

I was able to verify that the new commits fix #873 and #881. The new commits look good to me 👍

AObuchow (Collaborator) commented Jul 4, 2022

I was also able to verify the new commits work as expected in fixing #873 and #881 :) Code looks good to me :)

amisevsk merged commit 2a44ff6 into devfile:main Jul 4, 2022
amisevsk deleted the cleanup-warning branch July 4, 2022 21:18