
[ws-manager] workspaces failed to start #10315

Closed
atduarte opened this issue May 29, 2022 · 9 comments · Fixed by #11419 or #11489
Labels
component: ws-manager team: workspace Issue belongs to the Workspace team type: bug Something isn't working

Comments

@atduarte
Contributor

Bug description

Tried opening workspaces for a side-project and they failed to start at least twice.

Two errors I didn't have time to read showed up in quick succession, and the workspace stopped before it had actually started. When I clicked the "Open Workspace" button on the "Stopped" page, I got the following error: "cannot initialize workspace: cannot initialize workspace: no backup found".

Steps to reproduce

It was happening for every workspace start, but it no longer does.

Workspace affected

atduarte-revispt-960nm6i1khz, atduarte-revispt-jsh4ofkpnyz

Expected behavior

No response

Example repository

No response

Anything else?

No response

@atduarte atduarte added type: bug Something isn't working team: workspace Issue belongs to the Workspace team labels May 29, 2022
@kylos101
Contributor

Thanks for the heads-up. We initially addressed this in generation 44, but there's still some work to do; it's scheduled.

@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team May 29, 2022
@kylos101 kylos101 changed the title Workspaces failed to start [ws-manager] workspaces failed to start May 29, 2022
@sagor999 sagor999 self-assigned this Jun 6, 2022
@sagor999 sagor999 moved this from Scheduled to In Progress in 🌌 Workspace Team Jun 6, 2022
@sagor999
Contributor

sagor999 commented Jun 6, 2022

I believe this is due to a failure to make a prior backup, e.g. when ws-daemon crashes.
To make this easier to debug later on, PR #10491 makes sure we surface any actual error when attempting to download the backup.
Once that PR is merged and deployed, I will check the logs again to see if anything else is going on.

@sagor999 sagor999 assigned sagor999 and unassigned sagor999 Jun 6, 2022
@atduarte
Contributor Author

atduarte commented Jun 7, 2022

@sagor999 the example workspace IDs provided are from fresh starts (of a project with prebuilds) that failed; a few minutes later, trying again, they started working.

@sagor999
Contributor

sagor999 commented Jun 7, 2022

atduarte-revispt-960nm6i1khz failed with this error:

cannot pull image: rpc error: code = FailedPrecondition desc = failed to pull and unpack image "reg.ws-eu44xl.gitpod.io:20000/remote/8bef4e73-367a-41b5-94c6-01c1f7d6c879:latest": failed commit on ref "manifest-sha256:844f36297db5efe4e646d7767ef139628f11ca3b0a0a2943b8b7df6226ae3295": "manifest-sha256:844f36297db5efe4e646d7767ef139628f11ca3b0a0a2943b8b7df6226ae3295" failed size validation: 10409 != 10331: failed precondition

which was due to the redis/IPFS/cache issues that we saw before.

atduarte-revispt-jsh4ofkpnyz did fail with "no backup found". Was this workspace opened from a previous workspace that failed due to the error above? I think in some cases we do not handle failed workspaces properly and attempt to download a backup from them.

@sagor999 sagor999 removed their assignment Jun 20, 2022
@sagor999 sagor999 moved this from In Progress to Scheduled in 🌌 Workspace Team Jun 20, 2022
@sagor999
Contributor

@atduarte moving back into "scheduled" for now. Feel free to close this as well, as I think it might have been resolved since then.

@atduarte
Contributor Author

@sagor999 I'm sorry I missed this.

Was this WS opened from previous WS that was failed due to that error above?

It was a fresh workspace start of a project that has a prebuild.

I will close it.

Repository owner moved this from Scheduled to Done in 🌌 Workspace Team Jun 20, 2022
@kylos101 kylos101 reopened this Jul 15, 2022
@kylos101
Contributor

kylos101 commented Jul 15, 2022

We're still seeing this issue in the SaaS on gen54, and it has been reported in self-hosted as of the May release.

cannot initialize workspace: cannot initialize workspace: no backup found: %!!(MISSING)!(MISSING)!(MISSING)w(<nil>)

(screenshot omitted)

Scheduling

@kylos101 kylos101 moved this from Done to Scheduled in 🌌 Workspace Team Jul 15, 2022
@sagor999
Contributor

Here is what happens:
https://cloudlogging.app.goo.gl/EGhVdWhrJeqLSnAX6

First workspace failed with:

cannot pull image: rpc error: code = Unknown desc = failed to pull and unpack image "reg.ws-us54.gitpod.io:20000/remote/322c0ace-3122-4798-b180-145f122a254c:latest": failed to copy: httpReadSeeker: failed open: unexpected status code https://reg.ws-us54.gitpod.io:20000/v2/remote/322c0ace-3122-4798-b180-145f122a254c/blobs/sha256:10700734fce6ae2573453f95f5081483a76493753072536a4ffdb46059e50fb0: 500 Internal Server Error - Server message: unknown: unknown error

Then, one minute later, a second workspace is started (I assume webapp is retrying here):

cannot initialize workspace: cannot initialize workspace: no backup found

This is because of this check (I suspect):

const hasValidBackup = pastInstances.some(
    (i) => !!i.status && !!i.status.conditions && !i.status.conditions.failed,
);

For some reason, I think this might be reporting a valid backup when in reality there isn't one.
Probably due to this:
https://console.cloud.google.com/logs/query;cursorTimestamp=2022-07-14T10:00:52.387485Z;query=%22322c0ace-3122-4798-b180-145f122a254c%22%0Atimestamp%3D%222022-07-14T10:00:52.322Z%22%0AinsertId%3D%220cy55592tyj86y6s%22;timeRange=2022-07-14T00:32:00.000Z%2F2022-07-14T22:16:00.000Z?project=gitpod-191109
We received an empty "failed" condition overriding an existing one!

Upon further investigation, I believe this is what "clears up" prior failed state:

if isPodBeingDeleted(pod) {
    result.Phase = api.WorkspacePhase_STOPPING
    _, podFailedBeforeBeingStopped := pod.Annotations[workspaceFailedBeforeStoppingAnnotation]
    if !podFailedBeforeBeingStopped {
        // While the pod is being deleted we do not care or want to know about any failure state.
        // If the pod got stopped because it failed we will have sent out a Stopping status with a "failure"
        result.Conditions.Failed = ""
    } else {
        // ...
    }
}
PR incoming soon.

@sagor999 sagor999 self-assigned this Jul 15, 2022
@sagor999 sagor999 moved this from Scheduled to In Progress in 🌌 Workspace Team Jul 15, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team Jul 15, 2022
@sagor999 sagor999 reopened this Jul 19, 2022
@sagor999
Contributor

Need to redo the fix, as #11419 was reverted due to an introduced bug.

@atduarte atduarte moved this from Done to In Validation in 🌌 Workspace Team Jul 22, 2022
Repository owner moved this from In Validation to Awaiting Deployment in 🌌 Workspace Team Jul 27, 2022
@kylos101 kylos101 moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Aug 4, 2022
@sagor999 sagor999 moved this from In Validation to Done in 🌌 Workspace Team Aug 26, 2022
Status: Done
3 participants