Skip to content

[dashboard, supervisor, ws-proxy] Break redirect loop on failing workspaces starts #8125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 14, 2022

Conversation

geropl
Copy link
Member

@geropl geropl commented Feb 9, 2022

Description

This PR breaks the restart cycles we recently oberserve for workspaces that fail quickly enough.

The basic cycle was:

  1. StartWorkspace (dashboard, /start) -> onload: server.startWorkspace
  2. server.startWorkspace returns 👍 inkl. ideUrl (1. instance)
  3. StartWorkspace redirects to ideUrl
  4. StartWorkspace (inframe, IDE url) -> onload: server.startWorkspace:
    a. expecting server to return the already running instance from step 1)
    b. but that was already stopped with an error, so started a fresh one (2. instance)
    c. supervisor frontend suspected a re-start (because it sees stopped updates from 1. instance, followed by pending/creating from 2.), and to properly re-connect the frontend, re-directs to StartWorkspace 🔁

This PR changes three things:

Effects on the "start workspace" flow

  • 👍 if you come back to a (very) old workspace tabs that were unloaded, a mere "tab activation" will no longer re-start the workspace. Instead, you have to actively hit "Open Workspace".
  • 👍 we do not call startWorkspace that often, which should help with not confusing the rate-limiter (and potentially fasten the startup process 🤞 )
  • 👎 maybe: If there have been other errors in the startup-chain, they have so-long been covered by the "auto-restart" behavior. Now users have to actively click "Open workspace" again in such cases. Not sure how relevant this is, and IMO we have to find out
  • 👎 ... ?

UI

Currently the cycle breaks with this screen:
image

Alternatively we could use a modal but I found it to be bad because:

  • it hides the error message
  • if we added the error message to the screen we'd just duplicate the original screen
    Update: removed that code

Alternatives

Alternatively, and to further strengthen the process, we could try to find a way to pass the instanceId from dashboard to supervisor frontend. This would avoid the whole issue altogether (and make code in supervisor frontend easier), because we know upfront which instance we expect to see here.
The only way I can think of is to make it part of the url; <workspaceId>.ws-<cluster>.gitpod.io/#<instanceId>, for example. But I'm unsure if and how this affects IDE, and if yes, how to avoid that (/cc @akosyakov )

Dismissed here.

Note: There sometimes seems to be CORS issues with redirects from ws-proxy, but sometimes they work. I assumed them to be out-of-scope here, because this PR does not touch CORS or any domains used.

Related Issue(s)

Fixes #8043

How to test

Negative

Positive

  • in this preview env all workspaces should succeed: https://gpl-8043-break-redirect-loop-w.staging.gitpod-dev.com/workspaces
  • tab 1: start any workspace
  • tab 2: open the same workspace
  • note how both end up in the IDE
  • stop the workspace
  • note how both end up in "Stopped"
  • open the workspace again from tab 2
  • Note how:
    • a) both workspaces show up-to-date state
    • b) only tab 2 reveals the IDE
    • c) tab 1 still shows the up-to-date state (RUNNING), incl. controls to open/reveal the IDE (with fresh credentials)
      image

Release Notes

make workspace startup more robust

Documentation

@codecov
Copy link

codecov bot commented Feb 9, 2022

Codecov Report

Merging #8125 (5d02f7a) into main (2da01c7) will decrease coverage by 1.15%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #8125      +/-   ##
==========================================
- Coverage   11.98%   10.82%   -1.16%     
==========================================
  Files          20       18       -2     
  Lines        1193     1025     -168     
==========================================
- Hits          143      111      -32     
+ Misses       1046      912     -134     
+ Partials        4        2       -2     
Flag Coverage Δ
components-gitpod-cli-app 10.82% <ø> (ø)
components-local-app-app-darwin-amd64 ?
components-local-app-app-darwin-arm64 ?
components-local-app-app-linux-amd64 ?
components-local-app-app-linux-arm64 ?
components-local-app-app-windows-386 ?
components-local-app-app-windows-amd64 ?
components-local-app-app-windows-arm64 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
components/local-app/pkg/auth/pkce.go
components/local-app/pkg/auth/auth.go

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2da01c7...5d02f7a. Read the comment docs.

@geropl geropl marked this pull request as ready for review February 9, 2022 19:06
@geropl geropl requested review from a team February 9, 2022 19:06
@geropl
Copy link
Member Author

geropl commented Feb 10, 2022

/hold

  • do not merge before todays deployment
  • properly test workspace timeout/stopping as well

@jankeromnes
Copy link
Contributor

  1. StartWorkspace (inframe, IDE url) -> onload: server.startWorkspace:

If we no longer wish to make the (inframe, IDE url) auto-start an instance, it seems like in this step we should call something like server.getRunningWorkspaceInstance instead of server.startWorkspace. Wouldn't that be a sufficient fix for this problem?

However, I guess in both cases, we break the behavior of "refresh stopped workspace tab to restart it" (which is probably okay though -- this eager restarting is what caused the problem, and we could instead show the "stopped" screen which has a restart button if you do want to do that).

@geropl
Copy link
Member Author

geropl commented Feb 10, 2022

If we no longer wish to make the (inframe, IDE url) auto-start an instance, it seems like in this step we should call something like server.getRunningWorkspaceInstance instead of server.startWorkspace.

👍 That's what we do with this.fetchWorkspaceInfo(undefined). It also (needs to) does authentication, error handling, etc.

Wouldn't that be a sufficient fix for this problem?

No, because we need a "safe place" for ws-proxy/supervisor to redirect too in case of an error. They don't have the overview (access to DB), and need to delegate back to the dashboard/application cluster.
That could be a separate slug /restart instead of /start, or an additional parameter as implemented here. Happy to switch if we can agree on something better. 👍

@jankeromnes
Copy link
Contributor

jankeromnes commented Feb 10, 2022

No, because we need a "safe place" for ws-proxy/supervisor to redirect too in case of an error.

Ah, I don't understand why a redirect is needed (instead of just staying on the ideURL, which shows the "stopped" page with buttons). I'll read the code to understand better.

@geropl
Copy link
Member Author

geropl commented Feb 10, 2022

I'll read the code to understand better.

What helped me tremendously was to a) have the workspaces auto-fail and b) yarn telepresence dashboard to play through the different scenarios! 👍

@jankeromnes
Copy link
Contributor

jankeromnes commented Feb 10, 2022

Thanks! 👍

My current suspicion is that we could simply delete the redirection from supervisor (because it already has an iframe that knows how to show a "Stopped" or error page, right?), however for ws-proxy I don't currently see a good alternative.

I agree that allowing to disable the "start" behavior of StartWorkspace (and keep only the "show instance status" part) seems like a good fix.

@akosyakov
Copy link
Member

How can I test other flows besides with failed workspace?

@akosyakov
Copy link
Member

The only way I can think of is to make it part of the url; .ws-.gitpod.io/#, for example. But I'm unsure if and how this affects IDE, and if yes, how to avoid that (/cc @akosyakov )

IDE is an owner of URL, I don't think we should mess with it.

Copy link
Contributor

@jankeromnes jankeromnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for diving deeper into this problem. 🙏

I wonder if we could make this fix a bit simpler. Currently, there is:

  • trigger (with a complicated type)
  • dontAutostart
  • dontRedirect
  • showRestartModal

But:

  • Do we really need to know where the redirect came from? I think it could be simpler to have just ?autostart=false as the only (optional) query parameter, regardless of where we came from (that's not really important and introduces too much complexity)
  • dontAutostart could be replaced by just the autostart prop
  • is dontRedirect really needed? (if we never call server.startWorkspace(), do we ever get a workspaceInstance.ideUrl?)
  • we don't need an extra showRestartModal, because the existing UI already knows what to show when the instance is stopped / doesn't exist (i.e. it has a convenient Run Workspace button)

Wouldn't it be enough to just have a /start/?autostart=false#wsID query parameter, which "simply" prevents the call to this.startWorkspace()? 💭 (Maybe I'm missing something)

@geropl
Copy link
Member Author

geropl commented Feb 10, 2022

How can I test other flows besides with failed workspace?

@akosyakov You could create a branch from this one, workspaces start by default.

IDE is an owner of URL, I don't think we should mess with it.

Agreed, so I'll stay with this solution. 👍

@geropl
Copy link
Member Author

geropl commented Feb 10, 2022

@jankeromnes Sorry if I was unclear, IMO we're mixing a few things:

  • the modal /showRestartModal was kept as a possible alternative and everyone sees it as bad (incl me), so will remove
  • IMO dontAutostart makes clear that this is the exception
  • (dontRedirect is a leftover 👍 )

Do we really need to know where the redirect came from?

No, but IMO ?autostart=false is also not ideal. Discussion: #8125 (comment)

is dontRedirect really needed? (if we never call server.startWorkspace(), do we ever get a workspaceInstance.ideUrl?)

Yes it is: We get instance updates (incl. some with ideUrl) whenever we call fetchWorkspaceInfo. And this is needed as well, to solve all kinds of other situations (e.g., stopping situation: we need to eventually switch to stopped, and all updates contain ideURL)

@geropl geropl force-pushed the gpl/8043-break-redirect-loop branch from 1b0bd82 to 147b0fc Compare February 10, 2022 10:01
@geropl
Copy link
Member Author

geropl commented Feb 10, 2022

My current suspicion is that we could simply delete the redirection from supervisor (because it already has an iframe that knows how to show a "Stopped" or error page, right?), however for ws-proxy I don't currently see a good alternative.

Hm. We needed that indirection because we notice a re-starting workspace, and at the moment dashboard is the only place which is allows to refresh authentication credentials. (In theory we could extend that to IDE-URLs, but IMO we have refused to do that because we wanted avoid the need to properly parse IDEUrls - excluding every path/subdomain that's potentially under user control 🙃 )

With the new behavior it could be that it's superfluous... but I'm not sure, and like to avoid including that optimization here, as it would mean a lot of manual test effort, again. 🙂

@geropl geropl force-pushed the gpl/8043-break-redirect-loop branch from 147b0fc to 85b9c83 Compare February 10, 2022 10:27
@geropl
Copy link
Member Author

geropl commented Feb 10, 2022

@akosyakov @jankeromnes The PR is rebased, and cleaned up. Now ws-proxy/supervisor redirect to /start/?error=true#....

Test (always fails): https://gpl-8043-break-redirect-loop.staging.gitpod-dev.com/workspaces
Test (should work): https://gpl-8043-break-redirect-loop-w.staging.gitpod-dev.com/workspaces

@geropl geropl force-pushed the gpl/8043-break-redirect-loop branch 2 times, most recently from 57006dc to 9520984 Compare February 11, 2022 10:32
Copy link
Contributor

@gtsiolis gtsiolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going through the changes from yesterday, seems that going with the non-modal approach sounds better. 🥊

Here're some points to consider improving although these could be out of the scope of this PR.
1️⃣ Remove the last part of the error message as it does not contain any clear action forward.
2️⃣ Remove the No Changes element as it's probably not relevant or useful here.
3️⃣ Use the red dot indicator as the workspace failed to open.
4️⃣ Keep the same Open action verb for the button label.

BEFORE AFTER
Screenshot 2022-02-11 at 1 58 22 PM (2) Screenshot 2022-02-11 at 11 53 28 AM (2)

Thanks for the ping, @geropl! 🏓

@geropl
Copy link
Member Author

geropl commented Feb 11, 2022

@gtsiolis I agree with most (all?) of your suggestions, thx! 🙏
I created separate issues:

@gtsiolis
Copy link
Contributor

gtsiolis commented Feb 11, 2022

@geropl 🙏

64u4qz

@geropl geropl force-pushed the gpl/8043-break-redirect-loop branch 2 times, most recently from 4c96ef6 to 3228e92 Compare February 11, 2022 15:01
Copy link
Contributor

@csweichel csweichel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the approach - nice, clean, simple change.
Once the comment is mitigated I'll happily approve :)

…ling autostart

    When triggered:
     a) inFrame or
     b) when redirect from IDE url (by ws-proxy)
@geropl geropl force-pushed the gpl/8043-break-redirect-loop branch from 3228e92 to 5d02f7a Compare February 14, 2022 09:18
@akosyakov
Copy link
Member

akosyakov commented Feb 14, 2022

If i change IDE settings to enable VS Code Desktop and then restart a workspace from the dashboard then:

  • a new window is trying to open VS Code Desktop
  • but old windows are still trying to open web version

Screenshot 2022-02-14 at 10 19 28

Screenshot 2022-02-14 at 10 19 24

In other aspects it worked flawlessly and I like simplification to the supervisor frontend 👍

instance.status.phase === 'pending' ||
instance.status.phase === 'creating' ||
instance.status.phase === 'initializing')) {
// reload the page if the workspace was restarted to ensure:
Copy link
Member

@AlexTugarev AlexTugarev Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geropl, how relevant is this removal?
owner token etc. will be fine?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexTugarev see discussions here: #8125 (comment)

I think it is the proper way to go.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

owner token etc. will be fine?

Yes. as Anton pointed out:

  • the mechanism got pushed into StartWorkspace (where it belonged in the first place)
  • fetchWorkspaceInfo handles owner token renewal (ensureWorkspaceAuth)

Copy link
Member

@akosyakov akosyakov Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do it on purpose to remove dependencies from supervisor and loading screen. Now a user will need to reload explicitly. We add a new button for that on the loading screen. I think it is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I missed that, as it was resolved.

Nice and clean!

Copy link
Member

@akosyakov akosyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please address in the follow-up PR: #8125 (comment)

@geropl
Copy link
Member Author

geropl commented Feb 14, 2022

If i change IDE settings to enable VS Code Desktop and then restart a workspace from the dashboard then:

  • a new window is trying to open VS Code Desktop
  • but old windows are still trying to open web version

@akosyakov this is relevant, and should also be easy to fix (we need to make sure IDEOptions are fetched on the dontAutostart paths as well`). Let's fix this in q small follow up.

const startedInstanceId = this.state?.startedInstanceId;
if (startedInstanceId !== workspaceInstance.id) {
// do we want to switch to "new" instance we just received an update for?
const switchToNewInstance = this.state.workspaceInstance?.status.phase === "stopped" && workspaceInstance.status.phase !== "stopped";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if previous instances is in stopping or never reaches stopped, this reads like an dead end.

assuming that we always have a single instance running per workspace, let's just focus on the status/phase of the new.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree in general, but think it's not super important. And has been like this for years.

But in light of this change we should watch more sure! 🙏

Copy link
Member

@AlexTugarev AlexTugarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@geropl
Copy link
Member Author

geropl commented Feb 14, 2022

/unhold

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[dashboard, server] Prevent redirect loops that trigger "startWorkspace" in a loop
8 participants