Stuck on "Loading prebuild..." for 60 minutes #12345

Closed
mikenikles opened this issue Aug 24, 2022 · 17 comments · Fixed by #13515
Closed

Stuck on "Loading prebuild..." for 60 minutes #12345

mikenikles opened this issue Aug 24, 2022 · 17 comments · Fixed by #13515
Labels
feature: prebuilds · team: workspace (Issue belongs to the Workspace team) · type: bug (Something isn't working)

Comments

@mikenikles
Contributor

mikenikles commented Aug 24, 2022

Bug description

Loading a workspace regularly gets stuck on the "Loading prebuild..." screen for 60 minutes, at which point it times out and a "Stopping workspace..." message shows up.

Steps to reproduce

  1. Load https://github.com/gitpod-io/website

Workspace affected

gitpodio-website-yla6f7ogn1q

Expected behavior

The workspace starts.

Example repository

https://github.com/gitpod-io/website

It happens with other repositories too.

Anything else?

It happens fairly often, though loading a workspace does sometimes work. Over the last two weeks, I encountered the issue three times per day on average.

@mikenikles mikenikles added the type: bug label Aug 24, 2022
@mikenikles mikenikles changed the title from Stuck on "Loading prebuild..." with Brave to Stuck on "Loading prebuild..." for 60 minutes Aug 24, 2022
@mikenikles
Contributor Author

After working with Firefox all day, it now also gets stuck on the "Loading prebuild..." screen for 60 minutes until it times out.

Prebuilds run successfully, and I've tried with "incremental prebuilds" both on and off - same bug.

@axonasif axonasif added the feature: prebuilds and team: webapp labels Aug 25, 2022
@axonasif
Member

This has been happening lately but randomly.

@mikenikles
Contributor Author

This was OK for ~5 days, but it started happening again yesterday and today.

@sagor999
Contributor

sagor999 commented Sep 1, 2022

Moved this into the Breakdown phase for the Workspace team.

@kylos101 kylos101 added the team: workspace label and removed the team: webapp label Sep 1, 2022
@kylos101
Contributor

kylos101 commented Sep 2, 2022

I see the same problem with gitpodio-gitpod-aditftfbhhh.

gitpod /workspace/gitpod (main) $ gpctl workspaces list | grep 788c9d51-ea4e-4c78-864c-b8782ea54a18
8df3495b-685d-46e0-9820-009cc3b4afd8        gitpodio-gitpod-aditftfbhhh                 788c9d51-ea4e-4c78-864c-b8782ea54a18        INITIALIZING        REGULAR         ws-788c9d51-ea4e-4c78-864c-b8782ea54a18
kubectl describe pod ws-788c9d51-ea4e-4c78-864c-b8782ea54a18
...
  Warning  Unhealthy  4m34s (x897 over 19m)  kubelet  Readiness probe failed: Get "http://10.20.254.30:22999/_supervisor/v1/status/content/wait/true": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Given these logs, supervisor hasn't signaled that content-init is done.

No errors in the trace either:
[trace screenshot]

@sagor999 @utam0k is there a way to inspect the "current status" of content init? I seem to recall there being a JSON object persisted somewhere that contains this...but am having trouble locating it.
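In case it's useful for anyone else poking at a stuck workspace, here is a minimal sketch of querying that supervisor endpoint by hand, assuming it's reachable on port 22999 from inside the workspace (the same path the readiness probe above uses); this only bounds the blocking wait with a client-side timeout, it isn't the persisted JSON I'm thinking of:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Same endpoint the kubelet readiness probe hits; it blocks until content
	// init has finished, so bound the wait with a client-side timeout.
	url := "http://localhost:22999/_supervisor/v1/status/content/wait/true"

	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("supervisor did not answer in time:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("HTTP %d: %s\n", resp.StatusCode, body)
}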

@kylos101
Contributor

kylos101 commented Sep 6, 2022

Moving this to scheduled. The thought is that we can tackle this a few ways:

  1. Add more logs and traces around the use of gsutil when initializing content, and inspect failing calls where we're potentially swallowing the failure (causing us to wait for the 60m timeout).
  2. Try starting 10 workspaces from the Gitpod repo via prebuild; if one of them hangs, consider using pprof to inspect the current state.
  3. Change the context for the gsutil restore to a 55-minute timeout, so if we hit it, we'll know for sure that's where the delay is, which would be a signal that gsutil has some sort of bug (see the sketch after this list).
  4. Consider upgrading gsutil (gcloud SDK); our version is a bit old.
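For item 3, roughly what I have in mind, as a sketch only: downloadSnapshot below is a hypothetical stand-in for the actual ws-daemon restore call, the point being that an explicit 55-minute bound makes a stall surface as a deadline error instead of the generic 60-minute workspace timeout.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// downloadSnapshot is a hypothetical stand-in for the real ws-daemon restore call.
func downloadSnapshot(ctx context.Context, snapshotURL, dst string) error {
	// ... the actual download and extract would happen here ...
	return nil
}

func restoreWithDeadline(ctx context.Context, snapshotURL, dst string) error {
	ctx, cancel := context.WithTimeout(ctx, 55*time.Minute)
	defer cancel()

	if err := downloadSnapshot(ctx, snapshotURL, dst); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// We now know the delay is in the restore itself, not elsewhere.
			return fmt.Errorf("snapshot restore exceeded 55m: %w", err)
		}
		return err
	}
	return nil
}

func main() {
	// Illustrative snapshot URL; the real one comes from the workspace init config.
	err := restoreWithDeadline(context.Background(), "gs://bucket/snapshot.tar", "/dst")
	fmt.Println("restore result:", err)
}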

@kylos101 kylos101 moved this from Breakdown to Scheduled in 🌌 Workspace Team Sep 6, 2022
@aledbf
Member

aledbf commented Sep 6, 2022

Consider upgrading gsutil (gcloud sdk), our version is a bit older

Before considering this, please check whether the download is actually the cause of the slow behavior.

I would focus not on the download but on what we do after it, that is, decompressing and running chown.

Also check whether more than one workspace is starting (the chown operation is expensive).
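To separate those phases, something like the following (illustrative paths, and assuming gitpod's uid/gid of 33333) would show whether it's the extract or the chown that dominates:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

func main() {
	archive, dst := "/tmp/full.tar", "/dst" // illustrative paths

	// Phase 1: extract the backup/prebuild archive.
	start := time.Now()
	if out, err := exec.Command("tar", "-xf", archive, "-C", dst).CombinedOutput(); err != nil {
		fmt.Printf("tar failed: %v\n%s\n", err, out)
		return
	}
	fmt.Println("extract took", time.Since(start))

	// Phase 2: recursive chown, the part that gets expensive on large trees,
	// especially when several workspaces start on the same node.
	start = time.Now()
	err := filepath.WalkDir(dst, func(path string, _ os.DirEntry, err error) error {
		if err != nil {
			return err
		}
		return os.Lchown(path, 33333, 33333) // assumed gitpod uid/gid
	})
	fmt.Printf("chown took %v (err=%v)\n", time.Since(start), err)
}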

@kylos101
Contributor

kylos101 commented Sep 7, 2022

@aledbf good point re: starting many workspaces at once! We should be able to test that by starting 5-6 workspaces for the Gitpod repo at the same time (occupying a node) when using the large workspace class.

@Furisto Furisto self-assigned this Sep 9, 2022
@Furisto Furisto moved this from Scheduled to In Progress in 🌌 Workspace Team Sep 9, 2022
@mikenikles
Contributor Author

mikenikles commented Sep 12, 2022

In case this helps with troubleshooting: I've used Gitpod's EU cluster for a few days and haven't seen this issue. However, as soon as I switched back to the US cluster half an hour ago, I'm now stuck on "Initializing content …".

Update: VS Code in the browser opened, exactly 60 minutes after I started the workspace.

@adrienthebo
Contributor

I ran into this issue this morning: opening a gitpod-io/gitpod pull request (https://gitpod.io/#github.com/gitpod-io/gitpod/pull/12871) 4 times in succession, the first three workspaces hung for 10 minutes and the last one launched successfully.

This has happened on a regular basis for the last few weeks; about a third of the workspaces I create against gitpod-io/gitpod fail.

This seems to happen most frequently in the morning (US/Pacific), but that might just be down to when I open pull requests. Pull requests also exhibit this the most, but that's also the scenario in which I'm creating the greatest number of new workspaces.

Digging into the logs of past failures turns up an entry like this, up to 90 minutes after the workspace was created and 30 minutes after it was stopped:

prebuilt init was unable to restore snapshot workspaces/gitpodio-gitpod-gnkrwkd79f4/snapshot-1663069004949365382.tar@gitpod-prod-user-a5bfab1e-406f-4e6b-b1c6-8213be105eb5. Resorting the regular Git init

With an associated error:

snapshot initializer:
    github.com/gitpod-io/gitpod/content-service/pkg/initializer.(*SnapshotInitializer).Run
        github.com/gitpod-io/gitpod/content-service@v0.0.0-00010101000000-000000000000/pkg/initializer/snapshot.go:44
  - tar /dst: tar /dst: exit status 2;tar: Unexpected EOF in archive
    tar: Unexpected EOF in archive
    tar: Error is not recoverable: exiting now:
    github.com/gitpod-io/gitpod/ws-daemon/pkg/content.(*remoteContentStorage).Download
        github.com/gitpod-io/gitpod/ws-daemon/pkg/content/initializer.go:395
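For what it's worth, a truncated archive like this could be caught before tar ever runs by checking the bytes written against the advertised Content-Length; a rough sketch (not the actual ws-daemon download code, and the URL is illustrative):

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// downloadAndVerify fails fast if the body is shorter than the advertised
// Content-Length, instead of letting tar discover the truncated archive later.
func downloadAndVerify(url, dst string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()

	n, err := io.Copy(f, resp.Body)
	if err != nil {
		return fmt.Errorf("download interrupted after %d bytes: %w", n, err)
	}
	if resp.ContentLength > 0 && n != resp.ContentLength {
		return fmt.Errorf("truncated download: got %d of %d bytes", n, resp.ContentLength)
	}
	return nil
}

func main() {
	if err := downloadAndVerify("https://example.com/snapshot.tar", "/tmp/snapshot.tar"); err != nil {
		fmt.Println(err)
	}
}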

@sagor999
Contributor

Unexpected EOF in archive

This could be due to a lost connection, and maybe the gcloud util not recovering from it correctly. 🤔

@kylos101
Contributor

@Furisto can you build an ephemeral cluster in the US region, land on the cluster using a VPN, and leverage the additional tracing to debug this issue? I ask so we don't lose a day waiting for the deploy, but also because this issue is rather persistent for folks using the US cluster. 🙏

@sagor999
Contributor

sagor999 commented Sep 20, 2022

You can also use the ephemeral-g67-2 cluster (I am using it to test gen67). It will be up until EOD Tuesday (PST).

@Furisto Furisto removed their assignment Sep 20, 2022
@Furisto Furisto moved this from In Progress to Scheduled in 🌌 Workspace Team Sep 20, 2022
@Furisto
Member

Furisto commented Sep 20, 2022

Unassigning myself because I will be OOO for the next three days. We have additional tracing for this problem now. If anyone has the bandwidth, feel free to pick this up.

@kylos101
Contributor

These instance IDs were created, have no start time, and stopped after 60m, which suggests this problem may not just be related to prebuilds:

a0a48d3d-e44f-45a6-943a-365816c93a76
fdd59231-4eee-4d7c-a1ff-7e9b9aecc895

@svenefftinge
Member

Another case (internal thread)

@sagor999
Contributor

A bit more info on this: https://gitpod.slack.com/archives/C02F19UUW6S/p1664476833923239
tl;dr: it uses the direct HTTP download code path instead of gsutil, which might explain why it sometimes stalls or takes a lot of time.
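If the direct HTTP path is indeed the problem, one mitigation would be to bound each attempt and resume from the bytes already on disk via a Range request. A sketch only, assuming the object store honors Range headers; resumeDownload is a hypothetical helper, not our actual code path:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// resumeDownload sketches a stall-resistant direct HTTP download: every attempt
// is bounded by the client timeout, and a retry resumes from the bytes already
// on disk via a Range header. Assumes the server answers ranged requests with
// 206 Partial Content; a plain 200 here would append duplicate bytes.
func resumeDownload(url, dst string, attempts int) error {
	client := &http.Client{Timeout: 10 * time.Minute}

	for i := 0; i < attempts; i++ {
		f, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
		if err != nil {
			return err
		}

		req, _ := http.NewRequest(http.MethodGet, url, nil)
		if info, statErr := f.Stat(); statErr == nil && info.Size() > 0 {
			req.Header.Set("Range", fmt.Sprintf("bytes=%d-", info.Size()))
		}

		resp, err := client.Do(req)
		if err == nil {
			_, err = io.Copy(f, resp.Body)
			resp.Body.Close()
		}
		f.Close()
		if err == nil {
			return nil
		}
		fmt.Printf("attempt %d failed: %v, retrying\n", i+1, err)
	}
	return fmt.Errorf("download failed after %d attempts", attempts)
}

func main() {
	// Illustrative URL; the real snapshot URL comes from the content initializer.
	fmt.Println(resumeDownload("https://example.com/snapshot.tar", "/tmp/snapshot.tar", 3))
}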

@kylos101 kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team Oct 3, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 4, 2022
@kylos101 kylos101 moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Oct 5, 2022
@aledbf aledbf moved this from In Validation to Done in 🌌 Workspace Team Oct 5, 2022