Call finalizeWorkspaceContent if the workspace Pod in Terminating #11337

Merged
roboquat merged 3 commits into main from jenting/10531
Jul 14, 2022

@jenting jenting commented Jul 13, 2022

Description

When the node goes NotReady, the workspace Pod goes into the Terminating state.
In this case, the workspace Pod's status.containerStatuses.state is still Running.

We should try to back up the workspace content if the Pod is Terminating and the underlying node is not ready or even gone.

These are the node taints when the node turns NotReady:
[image]

These are the current workspace Pod tolerations (spec.tolerations):
[image]

Therefore, the cases are:

  • node.kubernetes.io/disk-pressure and node.kubernetes.io/memory-pressure: the workspace Pod tolerates these taints indefinitely. -> We do not handle this case because no tolerationSeconds is configured.
  • node.kubernetes.io/network-unavailable: the workspace Pod tolerates this taint for 30 seconds. -> We handle this case.
  • node.kubernetes.io/not-ready and node.kubernetes.io/unreachable: the workspace Pod tolerates these taints for 300 seconds. -> We handle this case.

Once (current time - the node taint's timeAdded) exceeds the workspace Pod's tolerationSeconds, ws-manager starts backing up the content.

https://www.loom.com/share/8e0e870e6bed40809d4ac8ac1159b1e2

Related Issue(s)

Fixes #11336

How to test

  1. Create 2 nodes: 1 control plane node and 1 worker node (using the workspace-preview).
  2. Launch a workspace; the workspace Pod should be scheduled on the worker node 🙏.
  3. SSH into the worker node and disable the k3s-agent: systemctl disable k3s-agent.
  4. Wait for the node to enter the NotReady state: kubectl get node -w.
  5. Wait for the workspace Pod to enter the Terminating state: kubectl get pod -l component=workspace -w.
  6. Check that the workspace Pod content is backed up successfully.

Note: after the node is removed, Kubernetes removes the Terminating Pod after a while (about 1 minute).

Release Notes

Try to back up content when the node goes into the NotReady state

Documentation

None

Werft options:

  • /werft with-preview

@jenting jenting marked this pull request as ready for review July 13, 2022 08:59
@jenting jenting requested review from a team July 13, 2022 08:59
@github-actions github-actions bot added team: webapp Issue belongs to the WebApp team team: workspace Issue belongs to the Workspace team labels Jul 13, 2022
@sagor999

/hold
to prevent auto merge

jenting added 3 commits July 14, 2022 13:17
…erminating state w/o backing up

When the node turns into a NotReady state, after a moment, the workspace pod
goes into the terminating state, but the containerStatus.state is still running.

We check that a pod toleration matches a node taint with effect NoExecute and
that the toleration seconds have expired, to make sure that the container's
graceful shutdown has finished before taking the content backup.
Otherwise, it might create an unstable backup.

#11336

Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
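
The matching step named in the commit message (a pod toleration matched against a node taint with effect NoExecute) can be sketched in Go as follows. This is an illustration only, not the code in these commits: Taint, Toleration, tolerates, and matchingNoExecuteTaints are hypothetical names, and tolerates is a simplified rendering of the standard Kubernetes matching rules (empty effect or key on the toleration acts as a wildcard; operator Exists ignores the value; operator Equal, the default, compares values).

```go
package main

// Taint and Toleration are hypothetical, trimmed-down stand-ins for the Kubernetes types.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

const effectNoExecute = "NoExecute"

// tolerates reports whether tol matches taint under the simplified rules described above.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == taint.Value
	default:
		return false
	}
}

// matchingNoExecuteTaints returns the NoExecute taints that at least one toleration matches.
// Taints with other effects (for example NoSchedule) are ignored because they do not evict Pods.
func matchingNoExecuteTaints(taints []Taint, tols []Toleration) []Taint {
	var matched []Taint
	for _, taint := range taints {
		if taint.Effect != effectNoExecute {
			continue
		}
		for _, tol := range tols {
			if tolerates(tol, taint) {
				matched = append(matched, taint)
				break
			}
		}
	}
	return matched
}
```

Only the taints returned here would then be subject to the toleration-seconds expiry check before a backup is attempted.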
@jenting jenting removed the request for review from a team July 14, 2022 13:34
jenting commented Jul 14, 2022

/werft run -a with-preview=true

👍 started the job as gitpod-build-jenting-10531.14
(with .werft/ from main)

@jenting jenting marked this pull request as ready for review July 14, 2022 14:22
jenting commented Jul 14, 2022

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-jenting-10531.15
(with .werft/ from main)

@jenting jenting requested a review from sagor999 July 14, 2022 14:29
@jenting jenting removed the team: webapp Issue belongs to the WebApp team label Jul 14, 2022
@sagor999 sagor999 left a comment

🚀

@sagor999

/unhold

@roboquat roboquat merged commit 95ec04a into main Jul 14, 2022
@roboquat roboquat deleted the jenting/10531 branch July 14, 2022 18:03
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Jul 20, 2022
Linked issue: We did not backup content when the node goes to NotReady + Pod goes to Terminating