[dashboard, supervisor, ws-proxy] Break redirect loop on failing workspaces starts #8125

geropl · 2022-02-09T17:01:53Z

Description

This PR breaks the restart cycles we recently oberserve for workspaces that fail quickly enough.

The basic cycle was:

StartWorkspace (dashboard, /start) -> onload: server.startWorkspace
server.startWorkspace returns 👍 inkl. ideUrl (1. instance)
StartWorkspace redirects to ideUrl
StartWorkspace (inframe, IDE url) -> onload: server.startWorkspace:
a. expecting server to return the already running instance from step 1)
b. but that was already stopped with an error, so started a fresh one (2. instance)
c. supervisor frontend suspected a re-start (because it sees stopped updates from 1. instance, followed by pending/creating from 2.), and to properly re-connect the frontend, re-directs to StartWorkspace 🔁

This PR changes three things:

if workspace is not found, ws-proxy now re-direct to /start/?not_found=true. StartWorkspace reads that parameter and sets dontAutostart, which in turn ensures startWorkspace(id) is not called on load
when StartWorkspace is loaded inside an iframe (when already on the ideUrl) dontAutostart: true is set as well.
the reload behavior in supervisor frontend got removed, and replaced with a similar mechanism in StartWorkspace.

Effects on the "start workspace" flow

👍 if you come back to a (very) old workspace tabs that were unloaded, a mere "tab activation" will no longer re-start the workspace. Instead, you have to actively hit "Open Workspace".
👍 we do not call startWorkspace that often, which should help with not confusing the rate-limiter (and potentially fasten the startup process 🤞 )
👎 maybe: If there have been other errors in the startup-chain, they have so-long been covered by the "auto-restart" behavior. Now users have to actively click "Open workspace" again in such cases. Not sure how relevant this is, and IMO we have to find out
👎 ... ?

UI

Currently the cycle breaks with this screen:

Alternatively we could use a modal but I found it to be bad because:

it hides the error message
if we added the error message to the screen we'd just duplicate the original screen
Update: removed that code

Alternatives

Alternatively, and to further strengthen the process, we could try to find a way to pass the instanceId from dashboard to supervisor frontend. This would avoid the whole issue altogether (and make code in supervisor frontend easier), because we know upfront which instance we expect to see here.
The only way I can think of is to make it part of the url; <workspaceId>.ws-<cluster>.gitpod.io/#<instanceId>, for example. But I'm unsure if and how this affects IDE, and if yes, how to avoid that (/cc @akosyakov )

Dismissed here.

Note: There sometimes seems to be CORS issues with redirects from ws-proxy, but sometimes they work. I assumed them to be out-of-scope here, because this PR does not touch CORS or any domains used.

Related Issue(s)

Fixes #8043

How to test

Negative

in this preview env all workspaces will fail: https://gpl-8043-break-redirect-loop.staging.gitpod-dev.com/workspaces
tab 1: start any workspace
tab 2: open the same workspace
note how both end up in the error state
hit "open workspace" again on one tab
note how a) both update, and show up-to-date state, and b) both end up in the error state again

Positive

in this preview env all workspaces should succeed: https://gpl-8043-break-redirect-loop-w.staging.gitpod-dev.com/workspaces
tab 1: start any workspace
tab 2: open the same workspace
note how both end up in the IDE
stop the workspace
note how both end up in "Stopped"
open the workspace again from tab 2
Note how:
- a) both workspaces show up-to-date state
- b) only tab 2 reveals the IDE
- c) tab 1 still shows the up-to-date state (RUNNING), incl. controls to open/reveal the IDE (with fresh credentials)

Release Notes

make workspace startup more robust

Documentation

codecov · 2022-02-09T17:06:24Z

Codecov Report

Merging #8125 (5d02f7a) into main (2da01c7) will decrease coverage by 1.15%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #8125      +/-   ##
==========================================
- Coverage   11.98%   10.82%   -1.16%     
==========================================
  Files          20       18       -2     
  Lines        1193     1025     -168     
==========================================
- Hits          143      111      -32     
+ Misses       1046      912     -134     
+ Partials        4        2       -2

Flag	Coverage Δ
components-gitpod-cli-app	`10.82% <ø> (ø)`
components-local-app-app-darwin-amd64	`?`
components-local-app-app-darwin-arm64	`?`
components-local-app-app-linux-amd64	`?`
components-local-app-app-linux-arm64	`?`
components-local-app-app-windows-386	`?`
components-local-app-app-windows-amd64	`?`
components-local-app-app-windows-arm64	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
components/local-app/pkg/auth/pkce.go
components/local-app/pkg/auth/auth.go

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2da01c7...5d02f7a. Read the comment docs.

geropl · 2022-02-10T08:45:44Z

/hold

do not merge before todays deployment
properly test workspace timeout/stopping as well

jankeromnes · 2022-02-10T08:47:40Z

StartWorkspace (inframe, IDE url) -> onload: server.startWorkspace:

If we no longer wish to make the (inframe, IDE url) auto-start an instance, it seems like in this step we should call something like server.getRunningWorkspaceInstance instead of server.startWorkspace. Wouldn't that be a sufficient fix for this problem?

However, I guess in both cases, we break the behavior of "refresh stopped workspace tab to restart it" (which is probably okay though -- this eager restarting is what caused the problem, and we could instead show the "stopped" screen which has a restart button if you do want to do that).

geropl · 2022-02-10T08:52:43Z

If we no longer wish to make the (inframe, IDE url) auto-start an instance, it seems like in this step we should call something like server.getRunningWorkspaceInstance instead of server.startWorkspace.

👍 That's what we do with this.fetchWorkspaceInfo(undefined). It also (needs to) does authentication, error handling, etc.

Wouldn't that be a sufficient fix for this problem?

No, because we need a "safe place" for ws-proxy/supervisor to redirect too in case of an error. They don't have the overview (access to DB), and need to delegate back to the dashboard/application cluster.
That could be a separate slug /restart instead of /start, or an additional parameter as implemented here. Happy to switch if we can agree on something better. 👍

jankeromnes · 2022-02-10T08:54:49Z

No, because we need a "safe place" for ws-proxy/supervisor to redirect too in case of an error.

Ah, I don't understand why a redirect is needed (instead of just staying on the ideURL, which shows the "stopped" page with buttons). I'll read the code to understand better.

geropl · 2022-02-10T08:56:22Z

I'll read the code to understand better.

What helped me tremendously was to a) have the workspaces auto-fail and b) yarn telepresence dashboard to play through the different scenarios! 👍

jankeromnes · 2022-02-10T09:08:33Z

Thanks! 👍

My current suspicion is that we could simply delete the redirection from supervisor (because it already has an iframe that knows how to show a "Stopped" or error page, right?), however for ws-proxy I don't currently see a good alternative.

I agree that allowing to disable the "start" behavior of StartWorkspace (and keep only the "show instance status" part) seems like a good fix.

components/dashboard/src/start/StartWorkspace.tsx

akosyakov · 2022-02-10T09:12:25Z

How can I test other flows besides with failed workspace?

akosyakov · 2022-02-10T09:18:55Z

The only way I can think of is to make it part of the url; .ws-.gitpod.io/#, for example. But I'm unsure if and how this affects IDE, and if yes, how to avoid that (/cc @akosyakov )

IDE is an owner of URL, I don't think we should mess with it.

jankeromnes

Thanks a lot for diving deeper into this problem. 🙏

I wonder if we could make this fix a bit simpler. Currently, there is:

trigger (with a complicated type)
dontAutostart
dontRedirect
showRestartModal

But:

Do we really need to know where the redirect came from? I think it could be simpler to have just ?autostart=false as the only (optional) query parameter, regardless of where we came from (that's not really important and introduces too much complexity)
dontAutostart could be replaced by just the autostart prop
is dontRedirect really needed? (if we never call server.startWorkspace(), do we ever get a workspaceInstance.ideUrl?)
we don't need an extra showRestartModal, because the existing UI already knows what to show when the instance is stopped / doesn't exist (i.e. it has a convenient Run Workspace button)

Wouldn't it be enough to just have a /start/?autostart=false#wsID query parameter, which "simply" prevents the call to this.startWorkspace()? 💭 (Maybe I'm missing something)

geropl · 2022-02-10T09:34:58Z

How can I test other flows besides with failed workspace?

@akosyakov You could create a branch from this one, workspaces start by default.

IDE is an owner of URL, I don't think we should mess with it.

Agreed, so I'll stay with this solution. 👍

geropl · 2022-02-10T09:45:19Z

@jankeromnes Sorry if I was unclear, IMO we're mixing a few things:

the modal /showRestartModal was kept as a possible alternative and everyone sees it as bad (incl me), so will remove
IMO dontAutostart makes clear that this is the exception
(dontRedirect is a leftover 👍 )

Do we really need to know where the redirect came from?

No, but IMO ?autostart=false is also not ideal. Discussion: #8125 (comment)

is dontRedirect really needed? (if we never call server.startWorkspace(), do we ever get a workspaceInstance.ideUrl?)

Yes it is: We get instance updates (incl. some with ideUrl) whenever we call fetchWorkspaceInfo. And this is needed as well, to solve all kinds of other situations (e.g., stopping situation: we need to eventually switch to stopped, and all updates contain ideURL)

geropl · 2022-02-10T10:13:13Z

My current suspicion is that we could simply delete the redirection from supervisor (because it already has an iframe that knows how to show a "Stopped" or error page, right?), however for ws-proxy I don't currently see a good alternative.

Hm. We needed that indirection because we notice a re-starting workspace, and at the moment dashboard is the only place which is allows to refresh authentication credentials. (In theory we could extend that to IDE-URLs, but IMO we have refused to do that because we wanted avoid the need to properly parse IDEUrls - excluding every path/subdomain that's potentially under user control 🙃 )

With the new behavior it could be that it's superfluous... but I'm not sure, and like to avoid including that optimization here, as it would mean a lot of manual test effort, again. 🙂

components/supervisor/frontend/src/index.ts

geropl · 2022-02-10T11:04:50Z

@akosyakov @jankeromnes The PR is rebased, and cleaned up. Now ws-proxy/supervisor redirect to /start/?error=true#....

Test (always fails): https://gpl-8043-break-redirect-loop.staging.gitpod-dev.com/workspaces
Test (should work): https://gpl-8043-break-redirect-loop-w.staging.gitpod-dev.com/workspaces

gtsiolis

Going through the changes from yesterday, seems that going with the non-modal approach sounds better. 🥊

Here're some points to consider improving although these could be out of the scope of this PR.
1️⃣ Remove the last part of the error message as it does not contain any clear action forward.
2️⃣ Remove the No Changes element as it's probably not relevant or useful here.
3️⃣ Use the red dot indicator as the workspace failed to open.
4️⃣ Keep the same Open action verb for the button label.

BEFORE	AFTER

Thanks for the ping, @geropl! 🏓

geropl · 2022-02-11T13:27:35Z

@gtsiolis I agree with most (all?) of your suggestions, thx! 🙏
I created separate issues:

1️⃣ [ws-manager] Remove trailing "Please contact support..." from error messages #8154
2️⃣ + 3️⃣ [dashboard] Unify workspace (instance) visuals across dashboard #8155

gtsiolis · 2022-02-11T14:21:29Z

@geropl 🙏

components/dashboard/src/start/StartWorkspace.tsx

csweichel

I really like the approach - nice, clean, simple change.
Once the comment is mitigated I'll happily approve :)

components/dashboard/src/start/StartWorkspace.tsx

…ling autostart When triggered: a) inFrame or b) when redirect from IDE url (by ws-proxy)

akosyakov · 2022-02-14T09:21:37Z

If i change IDE settings to enable VS Code Desktop and then restart a workspace from the dashboard then:

a new window is trying to open VS Code Desktop
but old windows are still trying to open web version

In other aspects it worked flawlessly and I like simplification to the supervisor frontend 👍

AlexTugarev · 2022-02-14T09:22:34Z

components/supervisor/frontend/src/index.ts

-                instance.status.phase === 'pending' ||
-                instance.status.phase === 'creating' ||
-                instance.status.phase === 'initializing')) {
-                // reload the page if the workspace was restarted to ensure:


@geropl, how relevant is this removal?
owner token etc. will be fine?

@AlexTugarev see discussions here: #8125 (comment)

I think it is the proper way to go.

owner token etc. will be fine?

Yes. as Anton pointed out:

the mechanism got pushed into StartWorkspace (where it belonged in the first place)

fetchWorkspaceInfo handles owner token renewal (ensureWorkspaceAuth)

We do it on purpose to remove dependencies from supervisor and loading screen. Now a user will need to reload explicitly. We add a new button for that on the loading screen. I think it is fine.

Thanks, I missed that, as it was resolved.

Nice and clean!

akosyakov

please address in the follow-up PR: #8125 (comment)

geropl · 2022-02-14T09:30:05Z

If i change IDE settings to enable VS Code Desktop and then restart a workspace from the dashboard then:

a new window is trying to open VS Code Desktop

but old windows are still trying to open web version

@akosyakov this is relevant, and should also be easy to fix (we need to make sure IDEOptions are fetched on the dontAutostart paths as well`). Let's fix this in q small follow up.

AlexTugarev · 2022-02-14T09:47:16Z

components/dashboard/src/start/StartWorkspace.tsx

+    const startedInstanceId = this.state?.startedInstanceId;
+    if (startedInstanceId !== workspaceInstance.id) {
+      // do we want to switch to "new" instance we just received an update for?
+      const switchToNewInstance = this.state.workspaceInstance?.status.phase === "stopped" && workspaceInstance.status.phase !== "stopped";


if previous instances is in stopping or never reaches stopped, this reads like an dead end.

assuming that we always have a single instance running per workspace, let's just focus on the status/phase of the new.

I agree in general, but think it's not super important. And has been like this for years.

But in light of this change we should watch more sure! 🙏

AlexTugarev

Awesome!

geropl · 2022-02-14T10:06:01Z

/unhold

roboquat added do-not-merge/work-in-progress release-note size/L labels Feb 9, 2022

geropl marked this pull request as ready for review February 9, 2022 19:06

geropl requested review from a team February 9, 2022 19:06

roboquat removed the do-not-merge/work-in-progress label Feb 9, 2022

roboquat added the do-not-merge/hold label Feb 10, 2022

akosyakov reviewed Feb 10, 2022

View reviewed changes

components/dashboard/src/start/StartWorkspace.tsx Outdated Show resolved Hide resolved

jankeromnes reviewed Feb 10, 2022

View reviewed changes

geropl force-pushed the gpl/8043-break-redirect-loop branch from 1b0bd82 to 147b0fc Compare February 10, 2022 10:01

geropl force-pushed the gpl/8043-break-redirect-loop branch from 147b0fc to 85b9c83 Compare February 10, 2022 10:27

akosyakov reviewed Feb 10, 2022

View reviewed changes

components/supervisor/frontend/src/index.ts Outdated Show resolved Hide resolved

geropl force-pushed the gpl/8043-break-redirect-loop branch 2 times, most recently from 57006dc to 9520984 Compare February 11, 2022 10:32

gtsiolis reviewed Feb 11, 2022

View reviewed changes

This was referenced Feb 11, 2022

[ws-manager] Remove trailing "Please contact support..." from error messages #8154

Closed

[dashboard] Unify workspace (instance) visuals across dashboard #8155

Open

svenefftinge reviewed Feb 11, 2022

View reviewed changes

components/dashboard/src/start/StartWorkspace.tsx Outdated Show resolved Hide resolved

components/dashboard/src/start/StartWorkspace.tsx Outdated Show resolved Hide resolved

geropl force-pushed the gpl/8043-break-redirect-loop branch 2 times, most recently from 4c96ef6 to 3228e92 Compare February 11, 2022 15:01

akosyakov reviewed Feb 14, 2022

View reviewed changes

components/dashboard/src/start/StartWorkspace.tsx Show resolved Hide resolved

csweichel reviewed Feb 14, 2022

View reviewed changes

components/dashboard/src/start/StartWorkspace.tsx Outdated Show resolved Hide resolved

[dashboard, ws-proxy, supervisor] Break potential DDOS cycle by disab…

5d02f7a

…ling autostart When triggered: a) inFrame or b) when redirect from IDE url (by ws-proxy)

geropl force-pushed the gpl/8043-break-redirect-loop branch from 3228e92 to 5d02f7a Compare February 14, 2022 09:18

AlexTugarev reviewed Feb 14, 2022

View reviewed changes

csweichel approved these changes Feb 14, 2022

View reviewed changes

akosyakov approved these changes Feb 14, 2022

View reviewed changes

AlexTugarev reviewed Feb 14, 2022

View reviewed changes

AlexTugarev approved these changes Feb 14, 2022

View reviewed changes

roboquat removed the do-not-merge/hold label Feb 14, 2022

roboquat merged commit c0bbf91 into main Feb 14, 2022

roboquat deleted the gpl/8043-break-redirect-loop branch February 14, 2022 10:06

This was referenced Feb 14, 2022

[dashboard] Ensure we fetch IDEOptions onload #8188

Closed

[supervisor] Ensure we fetch IDEOptions/desktopIDEConfig on re-start #8226

Closed

mustard-mh mentioned this pull request Feb 16, 2022

Workspace with phase "Timed out" refresh didn't restart #8246

Closed

geropl mentioned this pull request Feb 16, 2022

[dashboard] Ensure we fetch IDEOptions onload + onNewInstance #8250

Merged

gtsiolis mentioned this pull request Feb 18, 2022

telemetry: show collected data under the toggle #8272

Merged

This was referenced Mar 29, 2022

One frame running phase for every workspace start #8990

Closed

Redirect web page when instance changed #9004

Merged

[dashboard, supervisor, ws-proxy] Break redirect loop on failing workspaces starts #8125

[dashboard, supervisor, ws-proxy] Break redirect loop on failing workspaces starts #8125

Uh oh!

Conversation

geropl commented Feb 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Effects on the "start workspace" flow

UI

Alternatives

Related Issue(s)

How to test

Negative

Positive

Release Notes

Documentation

Uh oh!

codecov bot commented Feb 9, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

geropl commented Feb 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jankeromnes commented Feb 10, 2022

Uh oh!

geropl commented Feb 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jankeromnes commented Feb 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geropl commented Feb 10, 2022

Uh oh!

jankeromnes commented Feb 10, 2022 • edited by werft-gitpod-dev-com bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

akosyakov commented Feb 10, 2022

Uh oh!

akosyakov commented Feb 10, 2022

Uh oh!

jankeromnes left a comment

Choose a reason for hiding this comment

Uh oh!

geropl commented Feb 10, 2022

Uh oh!

geropl commented Feb 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geropl commented Feb 10, 2022

Uh oh!

Uh oh!

geropl commented Feb 10, 2022

Uh oh!

gtsiolis left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

geropl commented Feb 11, 2022

Uh oh!

gtsiolis commented Feb 11, 2022 • edited by werft-gitpod-dev-com bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

csweichel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

akosyakov commented Feb 14, 2022 • edited by werft-gitpod-dev-com bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AlexTugarev Feb 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akosyakov Feb 14, 2022

Choose a reason for hiding this comment

geropl commented Feb 9, 2022 •

edited

Loading

codecov bot commented Feb 9, 2022 •

edited

Loading

geropl commented Feb 10, 2022 •

edited

Loading

geropl commented Feb 10, 2022 •

edited

Loading

jankeromnes commented Feb 10, 2022 •

edited

Loading

jankeromnes commented Feb 10, 2022 •

edited by werft-gitpod-dev-com bot

Loading

geropl commented Feb 10, 2022 •

edited

Loading

gtsiolis left a comment •

edited

Loading

gtsiolis commented Feb 11, 2022 •

edited by werft-gitpod-dev-com bot

Loading

akosyakov commented Feb 14, 2022 •

edited by werft-gitpod-dev-com bot

Loading

AlexTugarev Feb 14, 2022 •

edited

Loading

akosyakov Feb 14, 2022 •

edited

Loading