
KOTS: stop running workspaces prior to upgrading existing workspace for single cluster ref arch #13147

Closed
2 tasks done
kylos101 opened this issue Sep 20, 2022 · 19 comments · Fixed by #13215
Labels
blocked · self-hosted · type: bug (Something isn't working)

Comments

@kylos101
Contributor

kylos101 commented Sep 20, 2022

Is your feature request related to a problem? Please describe

We do not support live upgrades for the single cluster ref arch while workspaces are running.

Describe the behaviour you'd like

Before KOTS begins a deployment:

  1. Prompt the user to confirm it is okay to proceed with the deploy to an existing cluster, and explain that this should be done during an outage window planned with their business.
  2. Stop workspaces and wait for them to back up and terminate; kubectl delete pods -l component=workspace may suffice (a sketch follows below)
  3. Then deploy Gitpod to the cluster (the assumption is KOTS deletes existing resources and then recreates them)

Additionally, as part of the monthly release cycle, a self-hosted test should be added, so that the upgrade flow with running workspaces is included as part of the testing.
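
A minimal sketch of step 2, assuming client-go and that regular, prebuild and image-build pods all carry the component=workspace label; the namespace, timeout and function name are illustrative, not what the installer actually uses:

```go
package upgrade

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// stopWorkspacesAndWait deletes every workspace pod (regular, prebuild,
// imagebuild) and blocks until they are gone, giving ws-daemon time to
// finish its backups before the rest of the installation is touched.
func stopWorkspacesAndWait(ctx context.Context, client kubernetes.Interface, namespace string) error {
	selector := metav1.ListOptions{LabelSelector: "component=workspace"}

	// Equivalent of `kubectl delete pods -l component=workspace`, without
	// --force: pods keep their normal grace period and finalizers.
	if err := client.CoreV1().Pods(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, selector); err != nil {
		return fmt.Errorf("deleting workspace pods: %w", err)
	}

	// Poll until no workspace pods remain; only then is it safe to let the
	// deploy delete and recreate the Gitpod resources.
	return wait.PollImmediate(5*time.Second, 15*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods(namespace).List(ctx, selector)
		if err != nil {
			return false, err
		}
		return len(pods.Items) == 0, nil
	})
}
```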

Describe alternatives you've considered

N/A, this removes friction from the upgrade experience.

Additional context

The deploy process should not start in a live cluster while workspaces are running.

As of the August KOTS release, when a deploy is done to an existing cluster, resources are deleted first. However, because ws-daemon was deleted, the workspaces could not back up and therefore could not be deleted. It is imperative that we wait for workspace pods (including imagebuild and prebuild) to be deleted before deleting the Gitpod installation.

Customers that experience this issue will incur data loss and, to clean up the stuck pods, must remove the related finalizer from the regular and prebuild workspace pods.
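
For reference, a minimal sketch of that clean-up, assuming client-go; it strips the finalizers from a stuck workspace pod and is only something to reach for once the backup is already lost:

```go
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// removeFinalizers clears all finalizers from a workspace pod that is stuck
// terminating; equivalent to:
//   kubectl patch pod <name> --type=json \
//     -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
func removeFinalizers(ctx context.Context, client kubernetes.Interface, namespace, pod string) error {
	patch := []byte(`[{"op": "remove", "path": "/metadata/finalizers"}]`)
	_, err := client.CoreV1().Pods(namespace).Patch(ctx, pod, types.JSONPatchType, patch, metav1.PatchOptions{})
	return err
}
```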

Dependent Tasks

@kylos101
Contributor Author

@lucasvaltl @corneliusludmann may we ask for your help in treating this as a priority for the September release?

cc: @aledbf @atduarte

@kylos101 kylos101 added the type: bug Something isn't working label Sep 20, 2022
@corneliusludmann
Contributor

Prompt the user to confirm it is okay to proceed with the deploy to an existing cluster, and explain that this should be done during an outage window planned with their business.

I'm afraid we are quite limited regarding the KOTS UX and cannot ask the user. @mrsimonemms any ideas?

@mrsimonemms
Contributor

We cannot add a "this is the impact" type message, but there is always a confirmation before the deployment is made (unless they have auto-deployments configured). Documenting the impact in the Gitpod docs is the only option.

Am I right in thinking that the reason for stopping the workspaces is to enforce the workspaces to backup to the storage?


Suggestions

  1. I'd also suggest that, rather than using kubectl delete, this is written as part of the Golang binary. I've just spent a lot of time removing as much as we can from the bash script, so we should be wary of adding more to it.

Questions

  1. What happens to a workspace that's started before the upgrade process is completed? I can imagine that, as soon as they see the workspace stopping, users will almost instantly trigger a new workspace regardless of whether the upgrade process has finished. If it's the same workspace, is there any danger of those backed-up files being lost?

@lucasvaltl
Contributor

My idea here for an absolute skateboard would be to add a preflight check (should be top of the list of preflight checks in the UI) to check for running workspaces. If workspaces are running, the check should fail and point to the (new) documentation page around stopping workspaces in this PR.
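
A rough sketch of what the check behind such a preflight could look like, assuming client-go and the component=workspace label mentioned earlier in this issue; namespace handling and the error wording are illustrative:

```go
package preflight

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkNoRunningWorkspaces fails when any workspace pods still exist, so the
// admin is pointed at the docs on stopping workspaces before upgrading.
func checkNoRunningWorkspaces(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "component=workspace",
	})
	if err != nil {
		return fmt.Errorf("listing workspace pods: %w", err)
	}
	if n := len(pods.Items); n > 0 {
		return fmt.Errorf("%d workspace pod(s) still running; stop all workspaces before upgrading (see the self-hosted upgrade docs)", n)
	}
	return nil
}
```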

@kylos101
Contributor Author

Am I right in thinking that the reason for stopping the workspaces is to enforce the workspaces to backup to the storage?

Yes. Otherwise, the workspaces will continue to run, KOTS will delete the gitpod installation (including ws-daemon), and those running workspaces will never have their data backed up, resulting in data loss and 😿 users.

@kylos101
Contributor Author

What happens to a workspace that's started before the upgrade process is completed? I can imagine that, as soon as they see the workspace stopping, users will almost instantly trigger a new workspace regardless of whether the upgrade process has finished. If it's the same workspace, is there any danger of those backed-up files being lost?

I'm working on a test for this, @mrsimonemms , where basically we want to prevent users from starting workspaces during outage windows for updates.

Options:

  1. Ideally we'd use gpctl to update the cluster score to 0, or cordon it, so we do not try sending workspace starts to it
  2. Another option may be to kubectl scale --replicas=0 deployment/ws-manager -n gitpod. The UX is poor here because it doesn't fail fast, but it might be a good short-term solution (see the sketch below)
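
A sketch of option 2 done from Go rather than shelling out, using client-go's scale subresource; the deployment name and namespace are taken from the command above:

```go
package upgrade

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleWsManagerToZero prevents new workspace starts during the outage window
// by scaling ws-manager down, the programmatic equivalent of
// `kubectl scale --replicas=0 deployment/ws-manager -n gitpod`.
func scaleWsManagerToZero(ctx context.Context, client kubernetes.Interface) error {
	deployments := client.AppsV1().Deployments("gitpod")
	scale, err := deployments.GetScale(ctx, "ws-manager", metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = 0
	_, err = deployments.UpdateScale(ctx, "ws-manager", scale, metav1.UpdateOptions{})
	return err
}
```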

For awareness, I've created #13150, because we cannot easily test in our preview environments, due to the cluster name showing up as an empty string.

@kylos101
Contributor Author

If workspaces are running, the check should fail and point to the (new) documentation page around stopping workspaces in this https://github.com/gitpod-io/website/pull/2766.

@lucasvaltl That will help for workspaces that are running before the upgrade is attempted, however, we also need to put the Gitpod installation into a state where it doesn't allow users to try starting workspaces...otherwise they'll have a poor experience during the upgrade.

@mrsimonemms
Contributor

Thanks for the clarification @kylos101. I agree with @lucasvaltl's earlier comment of having a 🛹 and then bringing this additional stuff into it. From experience, upgrades tend to take only a couple of minutes to run; if the workspace stop is done immediately before the helm upgrade command, a user will likely not be able to start a workspace quickly enough for it to be a problem in most cases.

@kylos101
Contributor Author

@mrsimonemms do we prompt the user to see which ref arch they're using? If they're using the single cluster ref arch, and there are running workspaces, it would be great if the deploy process can hard fail, sharing that workspaces are currently running.

In other words, my understanding is that the pre-flight checks are soft, and can be ignored. I'd hate for an administrator to shoot themselves in the foot, and cause users to lose data.

@mrsimonemms
Contributor

@kylos101 No, the only prompt is a big "deploy" button - they can choose to skip the pre-flight checks, where there's another "we don't recommend this - it may break things" alert. Again, we don't have any control over this content or whether they can skip it.

The idea is the pre-flight checks are idempotent and that a change only happens when they click "deploy"

@lucasvaltl
Contributor

lucasvaltl commented Sep 21, 2022

@lucasvaltl That will help for workspaces that are running before the upgrade is attempted, however, we also need to put the Gitpod installation into a state where it doesn't allow users to try starting workspaces...otherwise they'll have a poor experience during the upgrade.

@kylos101 Fair! What I proposed at least lessens the pain. If we can also get the installation into a state where new workloads cannot be started - all the better. Was just not sure if we can get something done for this in a reasonable timeframe :)

@mrsimonemms
Contributor

mrsimonemms commented Sep 22, 2022

@kylos101 this command will also stop any running image builds - I presume that is a desired effect of this?

@mrsimonemms mrsimonemms self-assigned this Sep 22, 2022
@mrsimonemms mrsimonemms moved this from 📓Scheduled to ⚒In Progress in 🚚 Security, Infrastructure, and Delivery Team (SID) Sep 22, 2022
@mrsimonemms
Contributor

@kylos101 I've had a play and created a draft PR at #13125. Unfortunately, on the app I'm testing, the workspace pod seems to be stuck on terminating.

I presume that if I were to put a --force or --grace-period, then there's a danger that it will not back up the workspaces properly. Is there any safe way I can prevent the workspace termination from getting stuck?

@kylos101
Contributor Author

kylos101 commented Sep 22, 2022

@kylos101 this command will also stop any running image builds - I presume that is a desired effect of this?

@mrsimonemms Yes sir, that is the desired effect.

I presume that if I were to put a --force or --grace-period, then there's a danger that it will not back up the workspaces properly. Is there any safe way I can prevent the workspace termination from getting stuck?

It is not desirable to use --force or --grace-period.

What type of workspace were you testing with? Regular, prebuild, imagebuild?

@mrsimonemms
Contributor

@mrsimonemms Yes sir, that is the desired effect.

@kylos101 thanks for clarifying.

What type of workspace were you testing with? Regular, prebuild, imagebuild?

Regular workspace this time, but I've seen that behaviour on all types of workspace.

It's one of those funny things I've found over the years: if you run kubectl delete pods <workspace>, it often hangs. That's not a problem when you're on a test instance and just want to kill it, but it's a different thing when you're doing it programmatically on EVERY instance out there.

@mrsimonemms
Contributor

mrsimonemms commented Sep 26, 2022

I've done some more investigation on this and can confirm that gpctl workspaces stop WILL work - eventually...

The problem is that gpctl only authenticates via a kubeconfig file. When running normally, that's fine. However inside a pod, we don't have a kubeconfig file as we're authenticating as a service account.

The refactored Installer has an authClusterOrKubeconfig function, which (as the name implies) allows authentication via the supplied kubeconfig file or via detection of the service account.

Once that's in, we can stop workspaces using gpctl
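
For context, the usual client-go pattern for that kind of fallback looks roughly like this; the function name and error handling are illustrative, not necessarily how the Installer's authClusterOrKubeconfig is implemented:

```go
package auth

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// restConfig prefers the in-cluster service account (when running inside a
// pod) and falls back to the supplied kubeconfig file otherwise.
func restConfig(kubeconfigPath string) (*rest.Config, error) {
	if cfg, err := rest.InClusterConfig(); err == nil {
		return cfg, nil
	}
	return clientcmd.BuildConfigFromFlags("", kubeconfigPath)
}
```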

And it would be very helpful if the gpctl workspaces command received a --namespace flag


EDIT: I may have found a workaround which I'm testing

@mrsimonemms mrsimonemms moved this from 🕶In Review / Measuring to ⚒In Progress in 🚚 Security, Infrastructure, and Delivery Team (SID) Sep 26, 2022
Repository owner moved this from ⚒In Progress to ✨Done in 🚚 Security, Infrastructure, and Delivery Team (SID) Sep 26, 2022
@mrsimonemms
Contributor

I figured out how to use the gpctl function in a service account authorised environment. The deletion of workspaces now happens immediately before deployment and is in a function controlled by @gitpod-io/engineering-workspace

@jenting
Contributor

jenting commented Sep 26, 2022

it would be very helpful if the gpctl workspaces command received a --namespace flag

@mrsimonemms
Could you please open an issue for this?
Or would you like to open a PR to enhance it? That would be great.
Thank you.

@mrsimonemms
Contributor

@jenting I opened #13329 and #13330 last night, but closed them as not required since I found a workaround.

If you want to reopen them and work on them, please do, but it's not urgent any more.
