[ws-manager] fix workspace status flipping pending to deleted #9438

sagor999 · 2022-04-20T21:27:28Z

Description

Fixes an issue where the workspace status could flip back and forth between PENDING and TERMINATED when the workspace pod is stuck in the pending state (which can happen while we are scaling up). When this happens, ws-manager deletes the pod after 5s and tries again. Deleting the pod triggered a workspace status update that flipped the status to TERMINATED, which caused webapp to delete the workspace token from the db.
This PR fixes this issue.

@geropl there is a small UX issue when a pod is stuck in pending: there doesn't seem to be any way to signal the workspace to stop. It will stay in pending, with no way to stop it, until ws-manager is able to schedule it.

Related Issue(s)

Fixes #8703

How to test

Spin up workspace preview env
Start workspace and observe it starts normally
Stop workspace
Cordon the node
Start workspace again
Observe that ws-manager keeps re-creating the pod every 5s because it is stuck in the pending state, but the dashboard now remains on the Preparing workspace phase. Also observe that ws-manager-bridge now only receives status updates indicating that the phase is pending.

Release Notes

[ws-manager] fix a bug where, when opening a workspace, you would be signed out from git and unable to run git commands

Documentation

sagor999 · 2022-04-20T21:37:53Z

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-pavel-8703.2

csweichel · 2022-04-20T23:42:38Z

components/ws-manager/pkg/manager/manager.go

@@ -200,6 +200,9 @@ func (m *Manager) StartWorkspace(ctx context.Context, req *api.StartWorkspaceReq
 	}
 	span.LogKV("event", "pod description created")

+	// add an annotation to the pod to signal that ws-manager will now attempt to create this pod
+	pod.Annotations[attemptingToCreatePodAnnotation] = "true"


this should be part of the createDefiniteWorkspacePod and corresponding tests (cd pkg/manager && go test -update -force .)

I disagree. 😅
Adding it to createDefiniteWorkspacePod would lose some context as to why it is added, and this whole retry logic is a temporary workaround for the OutOfMemory error. Once we have k8s 1.23.6 and confirm it is working, we might consider removing the retry logic altogether.
Another reason not to add this to createDefiniteWorkspacePod is that it is not obvious that someone has to remove that annotation once the pod is created. So if that function is used somewhere else later, it will have unexpected side effects for whoever uses it.

I understand the intent to keep the code together - it is however a break in current style.

Today, all pod creation is covered by the very same createDefiniteWorkspacePod (CDWP) function and tests. I.e. no modifications happen outside of the initial pod struct production prior to creating it in Kubernetes. Adding this annotation outside of the regular pod creation path means we have no coverage that it's added, and we can no longer be sure that what our fixtures say will be used is what's actually used.

createDefiniteWorkspacePod only makes sense in the path of creating a pod, and is very unlikely to be used elsewhere.

Also, I don't think this workaround will go away any time soon. When 1.23.6 comes around, we might be able to upgrade in SaaS. But self-hosted users might face the very same problem still. IMHO this code is here to stay for a good while longer.

I have no hard feelings on this (would be ok either way), but a clear preference towards making this part of CDWP for the reasons outlined above.

@csweichel I see your point. Fair enough, code and tests updated.
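The outcome of this thread can be sketched roughly as follows: the annotation is set where the pod struct is produced, so fixture-based tests cover it, and it is cleared once creation succeeds. This is a sketch under stated assumptions, not the ws-manager implementation. `buildWorkspacePod` is a hypothetical stand-in for createDefiniteWorkspacePod, and the `cluster` interface, `fakeCluster`, `startWithRetry`, and the annotation's string value are invented for illustration. The sketch mutates the in-memory struct for brevity; against a real cluster the annotation would have to be removed from the pod object through the Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// The identifier matches the diff; the string value is made up for this sketch.
const attemptingToCreatePodAnnotation = "example/attemptingToCreate"

// Pod is a hypothetical model holding only the fields the sketch needs.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// buildWorkspacePod stands in for the single pod-construction path.
// Setting the annotation here means every produced pod carries it.
func buildWorkspacePod(name string) *Pod {
	return &Pod{
		Name:        name,
		Annotations: map[string]string{attemptingToCreatePodAnnotation: "true"},
	}
}

// cluster is a hypothetical abstraction over the operations the retry needs.
type cluster interface {
	Create(p *Pod) error
	// WaitScheduled reports whether the pod left pending within the
	// timeout (5s according to the PR description).
	WaitScheduled(p *Pod) bool
	Delete(p *Pod) error
}

var errUnschedulable = errors.New("pod stayed pending in every attempt")

// startWithRetry re-creates the pod until it is scheduled, then clears the
// annotation so later status updates are derived normally.
func startWithRetry(c cluster, name string, attempts int) (*Pod, error) {
	for i := 0; i < attempts; i++ {
		pod := buildWorkspacePod(name)
		if err := c.Create(pod); err != nil {
			return nil, err
		}
		if c.WaitScheduled(pod) {
			delete(pod.Annotations, attemptingToCreatePodAnnotation)
			return pod, nil
		}
		if err := c.Delete(pod); err != nil {
			return nil, err
		}
	}
	return nil, errUnschedulable
}

// fakeCluster schedules the pod on the succeedOn-th creation attempt.
type fakeCluster struct{ succeedOn, creates, deletes int }

func (f *fakeCluster) Create(p *Pod) error       { f.creates++; return nil }
func (f *fakeCluster) WaitScheduled(p *Pod) bool { return f.creates >= f.succeedOn }
func (f *fakeCluster) Delete(p *Pod) error       { f.deletes++; return nil }

func main() {
	fc := &fakeCluster{succeedOn: 3}
	pod, err := startWithRetry(fc, "ws-demo", 5)
	fmt.Println(pod.Annotations, err, fc.creates, fc.deletes)
}
```

Because the annotation is part of the constructed pod, a fixture comparison against the output of the construction function would catch its accidental removal, which is the coverage concern raised above.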

components/ws-manager/pkg/manager/manager.go

sagor999 · 2022-04-21T00:17:52Z

@csweichel PTAL 🙏

geropl · 2022-04-21T06:38:20Z

@geropl there is a small UX issue when pod is stuck in pending: it doesn't seem like there is any way to signal to stop the workspace. It will be stuck in pending with no way to stop it unless ws-manager will be able to schedule it.

@sagor999 Agreed, thx for hinting: Will add a note to #8274

geropl · 2022-04-22T07:36:25Z

🎉 🚀 🙏

geropl · 2022-04-22T07:36:57Z

Is there an ETA for this? I guess sometime next week?

roboquat added do-not-merge/work-in-progress do-not-merge/release-note-label-needed size/M labels Apr 20, 2022

sagor999 force-pushed the pavel/8703 branch from 1e4d2b5 to 6baaec5 Compare April 20, 2022 21:28

sagor999 changed the title ~~[ws-manager] while creating workspace pod make sure workspace status …~~ [ws-manager] fix workspace status flipping pending to deleted Apr 20, 2022

roboquat added release-note and removed do-not-merge/release-note-label-needed labels Apr 20, 2022

sagor999 marked this pull request as ready for review April 20, 2022 21:59

sagor999 requested a review from a team April 20, 2022 21:59

roboquat removed the do-not-merge/work-in-progress label Apr 20, 2022

github-actions bot added the team: workspace Issue belongs to the Workspace team label Apr 20, 2022

csweichel reviewed Apr 20, 2022


sagor999 force-pushed the pavel/8703 branch from 6baaec5 to e3e1198 Compare April 21, 2022 00:11

roboquat added size/S and removed size/M labels Apr 21, 2022

sagor999 requested a review from csweichel April 21, 2022 00:18

geropl mentioned this pull request Apr 21, 2022

Dashboard: Can't delete workspaces which failed to be created due to lack of memory on pod #8274

Open

[ws-manager] fix workspace status flipping pending to deleted

652967e

sagor999 force-pushed the pavel/8703 branch from e3e1198 to 652967e Compare April 21, 2022 11:30

roboquat added size/M and removed size/S labels Apr 21, 2022

csweichel approved these changes Apr 21, 2022


roboquat merged commit 0c66eb2 into main Apr 21, 2022

roboquat deleted the pavel/8703 branch April 21, 2022 15:14

roboquat added the deployed: workspace Workspace team change is running in production label Apr 23, 2022

roboquat added the deployed Change is completely running in production label Apr 23, 2022
