-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve handling of errors during PVC cleanup #879
Conversation
Rework how we update workspace status to avoid multiple calls that update the workspace status. Previously, every time we enter the finalize function, we would set the status to terminating, and then potentially set it to errored. Instead, we use the same deferred-function handling of updating the workspace status from main reconcile function -- defer a function that updates the status and pass around a currentStatus struct reference that needs to be updated. This moves all calls that update the workspace status into one place. Signed-off-by: Angel Misevski <amisevsk@redhat.com>
With the update to terminating workspace status handling (see previous commit), it turns out it's no longer necessary to stop reconciling workspaces once their job encounters an error. The previous bug where DWO would enter a tight loop reconciling failed workspaces turns out to be due to each finalize call updating the workspace to Terminating status, then to Errored status, triggering a new reconcile. Signed-off-by: Angel Misevski <amisevsk@redhat.com>
When the last finalizer on a DevWorkspace is cleared, the cluster may garbage-collect that workspace immediately, which can result in the object being deleted before we reach the point of attempting to update the workspace's status. This results in logging a harmless but confusing error, so we don't try to update the status for terminating workspaces with no finalizers. Signed-off-by: Angel Misevski <amisevsk@redhat.com>
I'm having trouble verifying step 1: After checking out this PR, and following the reproduction steps from #845 I see the Pods (failing cleanup pods are to be expected): I changed the container args in Any ideas why I see the |
Seeing the message multiple times is expected. You'll see that message logged every time an event related to any workspace-related object occurs, and it can be triggered by e.g. editing annotations on the pod/pvc/job/devworkspace/etc. The key is that after occurring a few times, it stops reconciling that object on a loop (i.e. DWO is not re-queuing its own reconciles). As for an explanation why you might see it three times:
In general, it's not an ideal situation, but is how controllers generally have to work (they have to redo all their work every time given a specific object). I'll add a commit to suppress the log line if the container's state is already errored, though -- that will make things a little cleaner. For more clarity, if we were to never reconcile a DevWorkspace again after it reached an errored state, that would directly trigger issues like #877 and #873. The issue arises when each reconcile triggers another reconcile with no path out of the errored state, causing the controller to enter a tight loop on checking a state that is not changing. Previously what happened in each finalize call was
with the changes, we only ever set the state to |
I see, thanks for the explanation @amisevsk I was able to verify that this PR handles 2 (#877) but I had trouble with verifying 3 (#873). After following the steps from #873 (comment), is step 6. (running
|
@amisevsk I tested this PR but got some weird results when verifying #845 isn't reproduced. For some reason, on OpenShift 4.10, the cleanup job pod was created thousands of times: The errors which were getting logged were similar to the following:
This occured after deleting one of the 2 workspaces.
My cluster has expired, but I will try re-testing very soon on another cluster to see if the issue occurs again, and to verify this PR in general. |
@dkwon17 The issue you're running into is that no reconcile is queued up for that workspace :/ You can trigger one manually by editing e.g. the annotations on the errored workspace to trigger a reconcile, and it should then be deleted. I don't know if there's a good way to trigger reconciles for all errored workspaces though. |
Turns out I had 2 controllers running... oops :) I was able to verify all 3 parts of this PR, and it worked as intended. For #873 I added another annotation to trigger a reconciliation which deleted the errored workspace. Looks good to me :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: amisevsk, AObuchow, dkwon17, ibuziuk The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Since setting a workspace to the errored status will queue another reconcile, we only log the "failed to clean up common PVC" message if the workspace status is not already set to errored. Signed-off-by: Angel Misevski <amisevsk@redhat.com>
To make sure workspaces that are in an errored state due to PVC cleanup failing are removed when the common PVC is deleted (e.g. when all DevWorkspaces are deleted), add a watch and enqueue reconciles for all common-storage workspaces when the common PVC is deleted. Signed-off-by: Angel Misevski <amisevsk@redhat.com>
Update the watch for PVCs to also enqueue reconciles for per-workspace PVC events. Previously, the per-workspace PVC could be modified/deleted without the DevWorkspace that owns that PVC noticing. Signed-off-by: Angel Misevski <amisevsk@redhat.com>
New changes are detected. LGTM label has been removed. |
I've added 3 commits to this PR, for further testing (sorry all!)
With the second commit's changes, I think this PR closes #873 |
What does this PR do?
While working on a fix for #877, I began by attempting to unify how workspace status is managed in the finalize function (rather than having multiple places where it may be updated). In doing so, I came across a bug where every time finalize was called, it would first set the on-cluster workspace's status to
Terminating
, then potentially update it toErrored
if PVC cleanup encountered an error. This turns out to be the underlying cause of #845.See commits messages in this PR for more information, but this PR:
reconcile
function in handling workspace status -- we defer a function that sets the workspace's status so that there is only one call per finalize to update the status. This resolves Handling of cleanup job errors should be improved #877 and avoids reintroducing DWO continues to reconcile workspace when common PVC cleanup job fails #845 as we are no longer updating the workspace constantlyI think this PR may also incidentally resolve the main piece of #873 but this needs more verification. In my brief testing, I've found:
What issues does this PR fix or reference?
Closes: #877
Needs verification: #873
Is it tested? How?
Verify that DWO continues to reconcile workspace when common PVC cleanup job fails #845 is not re-introduced (via the reproducer in that issue)
Test that Handling of cleanup job errors should be improved #877 is resolved by:
this isn't a perfect test but approximates the situation that causes Handling of cleanup job errors should be improved #877
Check if this PR also resolves Errored workspaces aren't removed after PVC cleanup failure #873 by running the reproducing process there.
PR Checklist
/test v8-devworkspace-operator-e2e, v8-che-happy-path
to trigger)v8-devworkspace-operator-e2e
: DevWorkspace e2e testv8-che-happy-path
: Happy path for verification integration with Che