Call finalizeWorkspaceContent if the workspace Pod in Terminating #11337
Merged
github-actions
bot
added
team: webapp
Issue belongs to the WebApp team
team: workspace
Issue belongs to the Workspace team
labels
Jul 13, 2022
sagor999
reviewed
Jul 13, 2022
/hold |
…erminating state w/o backing up When the node turns into a NotReady state, after a moment, the workspace pod goes into the terminating state, but the containerStatus.state is still running. We check the pod toleration matches against the node taint, with effect NoExecute and the toleration seconds expired to make sure that the container's graceful shutdown is finished before taking the content backup. Otherwise, it might create an unstable backup. #11336 Signed-off-by: JenTing Hsiao <hsiaoairplane@gmail.com>
Contributor
Author
/werft run -a with-preview=true 👍 started the job as gitpod-build-jenting-10531.14 |
Contributor
Author
/werft run with-clean-slate-deployment 👍 started the job as gitpod-build-jenting-10531.15 |
sagor999
approved these changes
Jul 14, 2022
🚀
/unhold |
roboquat
added
deployed: workspace
Workspace team change is running in production
deployed
Change is completely running in production
labels
Jul 20, 2022
Labels
deployed: workspace
Workspace team change is running in production
deployed
Change is completely running in production
release-note
size/M
team: workspace
Issue belongs to the Workspace team
Description
When the node goes NotReady, the workspace Pod goes into a Terminating state.
In this case, the workspace Pod's `status.containerStatuses.state` is still `Running`.
We should try to back up the workspace content if the Pod is Terminating and the underlying node is not ready or even gone.
This is the node taint if the node turns into a NotReady state.
This is the current workspace Pod spec.toleration
Therefore, for the cases:
- `node.kubernetes.io/disk-pressure` and `node.kubernetes.io/memory-pressure`: the workspace Pod tolerates the taint indefinitely. We do not handle this case because no toleration seconds are configured.
- `node.kubernetes.io/network-unavailable`: the workspace Pod's toleration duration is 30 seconds. We handle this case.
- `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable`: the workspace Pod's toleration duration is 300 seconds. We handle this case.

Once (current time - the node taint's `timeAdded`) exceeds the workspace Pod's toleration seconds, ws-manager starts backing up the content.
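The expiry check described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `Taint` and `Toleration` are simplified stand-ins for the Kubernetes `corev1` types, `workspaceTolerations`, `seconds`, and `tolerationExpired` are hypothetical names, and matching is reduced to key and effect only (operator and value are ignored). The empty effect on the two pressure tolerations is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Taint is a simplified stand-in for corev1.Taint (assumption).
type Taint struct {
	Key       string
	Effect    string
	TimeAdded time.Time
}

// Toleration is a simplified stand-in for corev1.Toleration (assumption).
type Toleration struct {
	Key               string
	Effect            string // empty matches any effect
	TolerationSeconds *int64 // nil means the taint is tolerated indefinitely
}

// seconds is a hypothetical helper that returns a pointer to n.
func seconds(n int64) *int64 { return &n }

// workspaceTolerations mirrors the durations listed in the description.
func workspaceTolerations() []Toleration {
	return []Toleration{
		{Key: "node.kubernetes.io/disk-pressure"},
		{Key: "node.kubernetes.io/memory-pressure"},
		{Key: "node.kubernetes.io/network-unavailable", Effect: "NoExecute", TolerationSeconds: seconds(30)},
		{Key: "node.kubernetes.io/not-ready", Effect: "NoExecute", TolerationSeconds: seconds(300)},
		{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute", TolerationSeconds: seconds(300)},
	}
}

// tolerationExpired reports whether any NoExecute taint has been on the node
// longer than a matching toleration with a finite TolerationSeconds allows.
func tolerationExpired(taints []Taint, tolerations []Toleration, now time.Time) bool {
	for _, taint := range taints {
		if taint.Effect != "NoExecute" {
			continue
		}
		for _, tol := range tolerations {
			if tol.Key != taint.Key {
				continue
			}
			if tol.Effect != "" && tol.Effect != taint.Effect {
				continue
			}
			if tol.TolerationSeconds == nil {
				continue // tolerated indefinitely, so never expires
			}
			limit := time.Duration(*tol.TolerationSeconds) * time.Second
			if now.Sub(taint.TimeAdded) > limit {
				return true
			}
		}
	}
	return false
}

func main() {
	// A node that became unreachable six minutes ago.
	added := time.Now().Add(-6 * time.Minute)
	taints := []Taint{{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute", TimeAdded: added}}
	fmt.Println(tolerationExpired(taints, workspaceTolerations(), time.Now()))
}
```

In this sketch, a taint that no toleration matches is simply ignored; whether the real change treats that case differently is not stated in the description.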
https://www.loom.com/share/8e0e870e6bed40809d4ac8ac1159b1e2
Related Issue(s)
Fixes #11336
How to test
- `systemctl disable k3s-agent`
- `kubectl get node -w`
- `kubectl get pod -l component=workspace -w`

Note: after the node is removed, Kubernetes removes the terminating Pod after a while (about 1 minute).
Release Notes
Documentation
None
Werft options: