
KOTS: stop running workspaces prior to upgrading existing workspace for single cluster ref arch #13147

Closed
2 tasks done
kylos101 opened this issue Sep 20, 2022 · 19 comments · Fixed by #13215
Labels
blocked · self-hosted · type: bug (Something isn't working)

Comments

@kylos101
Contributor

kylos101 commented Sep 20, 2022

Is your feature request related to a problem? Please describe

We do not support live upgrades for the single cluster ref arch while workspaces are running.

Describe the behaviour you'd like

Before KOTS begins a deployment:

  1. Prompt the user to confirm it is okay to proceed with the deploy to an existing cluster, and explain that this should be done during an outage window planned with their business.
  2. Stop workspaces and wait for them to back up and terminate; kubectl delete pods -l component=workspace may suffice (a sketch follows below)
  3. Then deploy Gitpod to the cluster (the assumption is KOTS deletes existing resources and then recreates them)

Additionally, as part of the monthly release cycle, a self-hosted test should be added, so that the upgrade flow with running workspaces is included as part of the testing.
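
A minimal sketch of step 2, assuming client-go and that regular, prebuild and image-build pods all carry the component=workspace label; the namespace, timeout and function name are illustrative, not what the installer actually uses:

```go
package upgrade

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// stopWorkspacesAndWait deletes every workspace pod (regular, prebuild,
// imagebuild) and blocks until they are gone, giving ws-daemon time to
// finish its backups before the rest of the installation is touched.
func stopWorkspacesAndWait(ctx context.Context, client kubernetes.Interface, namespace string) error {
	selector := metav1.ListOptions{LabelSelector: "component=workspace"}

	// Equivalent of `kubectl delete pods -l component=workspace`, without
	// --force: pods keep their normal grace period and finalizers.
	if err := client.CoreV1().Pods(namespace).DeleteCollection(ctx, metav1.DeleteOptions{}, selector); err != nil {
		return fmt.Errorf("deleting workspace pods: %w", err)
	}

	// Poll until no workspace pods remain; only then is it safe to let the
	// deploy delete and recreate the Gitpod resources.
	return wait.PollImmediate(5*time.Second, 15*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods(namespace).List(ctx, selector)
		if err != nil {
			return false, err
		}
		return len(pods.Items) == 0, nil
	})
}
```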

Describe alternatives you've considered

N/A, this removes friction from the upgrade experience.

Additional context

The deploy process should not start in a live cluster while workspaces are running.

As of the August KOTS release, when a deploy is done to an existing cluster, resources are deleted first. However, because ws-daemon was deleted, the workspaces could not back up and therefore could not be deleted. It is imperative that we wait for workspace pods (including imagebuild and prebuild) to be deleted before deleting the Gitpod installation.

Customers that experience this issue will incur data loss and, to clean up the stuck pods, must remove the related finalizer from the regular and prebuild workspace pods.
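
For reference, a minimal sketch of that clean-up, assuming client-go; it strips the finalizers from a stuck workspace pod and is only something to reach for once the backup is already lost:

```go
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// removeFinalizers clears all finalizers from a workspace pod that is stuck
// terminating; equivalent to:
//   kubectl patch pod <name> --type=json \
//     -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
func removeFinalizers(ctx context.Context, client kubernetes.Interface, namespace, pod string) error {
	patch := []byte(`[{"op": "remove", "path": "/metadata/finalizers"}]`)
	_, err := client.CoreV1().Pods(namespace).Patch(ctx, pod, types.JSONPatchType, patch, metav1.PatchOptions{})
	return err
}
```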

Dependent Tasks

@kylos101
Contributor Author

@lucasvaltl @corneliusludmann may we ask for your help in treating this as a priority for the September release?

cc: @aledbf @atduarte

@kylos101 kylos101 added the type: bug Something isn't working label Sep 20, 2022
@corneliusludmann
Contributor

Prompt the user to confirm it is okay to proceed with the deploy to an existing cluster, and explain that this should be done during an outage window planned with their business.

I'm afraid we are quite limited regarding the KOTS UX and cannot ask the user. @mrsimonemms any ideas?

@mrsimonemms
Contributor

We cannot add a "this is the impact" type message, but there is always a confirmation before the deployment is made (unless they have auto-deployments configured). Documenting the impact in the Gitpod docs is the only option.

Am I right in thinking that the reason for stopping the workspaces is to enforce the workspaces to backup to the storage?


Suggestions

  1. I'd also suggest that, rather than using kubectl delete, this is written as part of the Golang binary. I've just spent a lot of time removing as much as we can from the bash script, so we should be wary of adding more to it.

Questions

  1. What happens to a workspace that's started before the upgrade process is completed? I can imagine that, as soon as they see the workspace stopping, users will almost instantly trigger a new workspace regardless of whether the upgrade process has finished. If it's the same workspace, is there any danger of those backed-up files being lost?

@lucasvaltl
Contributor

My idea here for an absolute skateboard would be to add a preflight check (should be top of the list of preflight checks in the UI) to check for running workspaces. If workspaces are running, the check should fail and point to the (new) documentation page around stopping workspaces in this PR.
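
A rough sketch of what the check behind such a preflight could look like, assuming client-go and the component=workspace label mentioned earlier in this issue; namespace handling and the error wording are illustrative:

```go
package preflight

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkNoRunningWorkspaces fails when any workspace pods still exist, so the
// admin is pointed at the docs on stopping workspaces before upgrading.
func checkNoRunningWorkspaces(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "component=workspace",
	})
	if err != nil {
		return fmt.Errorf("listing workspace pods: %w", err)
	}
	if n := len(pods.Items); n > 0 {
		return fmt.Errorf("%d workspace pod(s) still running; stop all workspaces before upgrading (see the self-hosted upgrade docs)", n)
	}
	return nil
}
```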

@kylos101
Contributor Author

Am I right in thinking that the reason for stopping the workspaces is to enforce the workspaces to backup to the storage?

Yes. Otherwise, the workspaces will continue to run, KOTS will delete the gitpod installation (including ws-daemon), and those running workspaces will never have their data backed up, resulting in data loss and 😿 users.

@kylos101
Contributor Author

What happens to a workspace that's started before the upgrade process is completed? I can imagine that, as soon as they see the workspace stopping, users will almost instantly trigger a new workspace regardless of whether the upgrade process has finished. If it's the same workspace, is there any danger of those backed-up files being lost?

I'm working on a test for this, @mrsimonemms , where basically we want to prevent users from starting workspaces during outage windows for updates.

Options:

  1. Ideally we'd use gpctl to update the cluster score to 0, or cordon it, so we do not try sending workspace starts to it
  2. Another option may be to kubectl scale --replicas=0 deployment/ws-manager -n gitpod. The UX is poor here because it doesn't fail fast, but it might be a good short-term solution (see the sketch below)
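
A sketch of option 2 done from Go rather than shelling out, using client-go's scale subresource; the deployment name and namespace are taken from the command above:

```go
package upgrade

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleWsManagerToZero prevents new workspace starts during the outage window
// by scaling ws-manager down, the programmatic equivalent of
// `kubectl scale --replicas=0 deployment/ws-manager -n gitpod`.
func scaleWsManagerToZero(ctx context.Context, client kubernetes.Interface) error {
	deployments := client.AppsV1().Deployments("gitpod")
	scale, err := deployments.GetScale(ctx, "ws-manager", metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = 0
	_, err = deployments.UpdateScale(ctx, "ws-manager", scale, metav1.UpdateOptions{})
	return err
}
```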

For awareness, I've created #13150, because we cannot easily test in our preview environments, due to the cluster name showing up as an empty string.

@kylos101
Contributor Author

If workspaces are running, the check should fail and point to the (new) documentation page around stopping workspaces in this https://github.com/gitpod-io/website/pull/2766.

@lucasvaltl That will help for workspaces that are running before the upgrade is attempted, however, we also need to put the Gitpod installation into a state where it doesn't allow users to try starting workspaces...otherwise they'll have a poor experience during the upgrade.

@mrsimonemms
Contributor

Thanks for the clarification @kylos101. I agree with @lucasvaltl's earlier comment of having a 🛹 and then bringing this additional stuff into it. From experience, upgrades tend to take only a couple of minutes to run; if the workspace stop is done immediately before the helm upgrade command, a user will likely not be able to start a workspace quickly enough for it to be a problem in most cases.

@kylos101
Contributor Author

@mrsimonemms do we prompt the user to see which ref arch they're using? If they're using the single cluster ref arch, and there are running workspaces, it would be great if the deploy process can hard fail, sharing that workspaces are currently running.

In other words, my understanding is that the pre-flight checks are soft, and can be ignored. I'd hate for an administrator to shoot themselves in the foot, and cause users to lose data.

@mrsimonemms
Contributor

@kylos101 No, the only prompt is a big "deploy" button - they can choose to skip the pre-flight checks, where there's another "we don't recommend this - it may break things" alert. Again, we don't have any control over this content or whether they can skip it.

The idea is the pre-flight checks are idempotent and that a change only happens when they click "deploy"

@lucasvaltl
Contributor

lucasvaltl commented Sep 21, 2022

@lucasvaltl That will help for workspaces that are running before the upgrade is attempted, however, we also need to put the Gitpod installation into a state where it doesn't allow users to try starting workspaces...otherwise they'll have a poor experience during the upgrade.

@kylos101 Fair! What I proposed at least lessens the pain. If we can also get the installation into a state where new workloads cannot be started - all the better. Was just not sure if we can get something done for this in a reasonable timeframe :)

@mrsimonemms
Contributor

mrsimonemms commented Sep 22, 2022

@kylos101 this command will also stop any running image builds - I presume that is a desired effect of this?

@mrsimonemms mrsimonemms self-assigned this Sep 22, 2022
@mrsimonemms mrsimonemms moved this from 📓Scheduled to ⚒In Progress in 🚚 Security, Infrastructure, and Delivery Team (SID) Sep 22, 2022
@mrsimonemms
Contributor

@kylos101 I've had a play and created a draft PR at #13125. Unfortunately, on the app I'm testing, the workspace pod seems to be stuck on terminating.

I presume that if I were to put a --force or --grace-period, then there's a danger that it will not back up the workspaces properly. Is there any safe way I can prevent the workspace termination from getting stuck?

@kylos101
Contributor Author

kylos101 commented Sep 22, 2022

@kylos101 this command will also stop any running image builds - I presume that is a desired effect of this?

@mrsimonemms Yes sir, that is the desired effect.

I presume that if I were to put a --force or --grace-period, then there's a danger that it will not back up the workspaces properly. Is there any safe way I can prevent the workspace termination from getting stuck?

It is not desirable to use --force or --grace-period.

What type of workspace were you testing with? Regular, prebuild, imagebuild?

@mrsimonemms
Contributor

@mrsimonemms Yes sir, that is the desired effect.

@kylos101 thanks for clarifying.

What type of workspace were you testing with? Regular, prebuild, imagebuild?

Regular workspace this time, but I've seen that behaviour on all types of workspace.

It's one of those funny things I've found over the years: if you run kubectl delete pods <workspace>, it often hangs. That's not a problem when you're on a test instance and just want to kill it, but it's a different thing when you're doing it programmatically on EVERY instance out there.

@mrsimonemms
Contributor

mrsimonemms commented Sep 26, 2022

I've done some more investigation on this and can confirm that gpctl workspaces stop WILL work - eventually...

The problem is that gpctl only authenticates via a kubeconfig file. When running normally, that's fine. However inside a pod, we don't have a kubeconfig file as we're authenticating as a service account.

The refactored Installer has an authClusterOrKubeconfig function, which (as the name implies) allows authentication via the supplied kubeconfig file or via detection of the service account.

Once that's in, we can stop workspaces using gpctl
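
For context, the usual client-go pattern for that kind of fallback looks roughly like this; the function name and error handling are illustrative, not necessarily how the Installer's authClusterOrKubeconfig is implemented:

```go
package auth

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// restConfig prefers the in-cluster service account (when running inside a
// pod) and falls back to the supplied kubeconfig file otherwise.
func restConfig(kubeconfigPath string) (*rest.Config, error) {
	if cfg, err := rest.InClusterConfig(); err == nil {
		return cfg, nil
	}
	return clientcmd.BuildConfigFromFlags("", kubeconfigPath)
}
```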

And it would be very helpful if the gpctl workspaces command received a --namespace flag


EDIT: I may have found a workaround which I'm testing

@mrsimonemms mrsimonemms moved this from 🕶In Review / Measuring to ⚒In Progress in 🚚 Security, Infrastructure, and Delivery Team (SID) Sep 26, 2022
Repository owner moved this from ⚒In Progress to ✨Done in 🚚 Security, Infrastructure, and Delivery Team (SID) Sep 26, 2022
@mrsimonemms
Contributor

I figured out how to use the gpctl function in a service account authorised environment. The deletion of workspaces now happens immediately before deployment and is in a function controlled by @gitpod-io/engineering-workspace

@jenting
Contributor

jenting commented Sep 26, 2022

it would be very helpful if the gpctl workspaces command received a --namespace flag

@mrsimonemms
Could you please open an issue for this?
Or would you like to open a PR to enhance it? That would be great.
Thank you.

@mrsimonemms
Contributor

@jenting I opened #13329 and #13330 last night, but closed them as not required since I found a workaround.

If you want to reopen them and work on them, please do, but it's not urgent any more.
