Prebuilds are stuck in 'queued' #9395

Closed
svenefftinge opened this issue Apr 19, 2022 · 5 comments · Fixed by #10882
Assignees
Labels
team: webapp Issue belongs to the WebApp team type: bug Something isn't working

Comments

@svenefftinge
Member

svenefftinge commented Apr 19, 2022

Bug description

The DB contains more than 6k prebuild entries in the 'queued' state. We should understand what causes them, clean them up, and prevent new ones from being created.

Updating prebuilt.state on failed prebuild start

During this incident we discovered that there seems to be an ongoing issue with updating the state of prebuilds whose instances never reach the RUNNING phase (note the empty startedTime):

SELECT wsi.id, wsi.creationTime, wsi.phasePersisted, wsi.status, wsi.startedTime, wsi.stoppingTime, wsi.stoppedTime
   FROM d_b_workspace_instance AS wsi
   JOIN d_b_prebuilt_workspace AS pws
      ON pws.buildWorkspaceId = wsi.workspaceId
   WHERE pws.state = 'queued'
      AND wsi.phasePersisted = 'stopped'
   ORDER BY wsi.creationTime DESC;

Note: Older entries in particular could also be fallout from other incidents and cleanups, but there seem to be a couple of fresh ones every day.
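The same pattern (a prebuild still 'queued' while its instance is already 'stopped' and never started) could also be checked in application code. The following is a minimal TypeScript sketch under stated assumptions: `JoinedRow` and `findLeakedPrebuilds` are hypothetical names, not part of the Gitpod codebase, the row shape models only columns named in the query, and treating an empty string as "never started" is an assumption about how the empty startedTime is represented.

```typescript
// Hypothetical, simplified shape of one row returned by the join above.
interface JoinedRow {
  instanceId: string;      // wsi.id
  prebuildState: string;   // pws.state
  phasePersisted: string;  // wsi.phasePersisted
  creationTime: string;    // wsi.creationTime, assumed to be an ISO-8601 UTC string
  startedTime: string;     // wsi.startedTime, assumed "" if the instance never started
}

// Returns rows whose prebuild is still 'queued' although the instance has
// already stopped without ever having started, newest first.
function findLeakedPrebuilds(rows: JoinedRow[]): JoinedRow[] {
  return rows
    .filter(
      (r) =>
        r.prebuildState === "queued" &&
        r.phasePersisted === "stopped" &&
        r.startedTime === "",
    )
    // ISO-8601 UTC strings of equal format sort chronologically as plain strings.
    .sort((a, b) => b.creationTime.localeCompare(a.creationTime));
}
```

Unlike the SQL query, this sketch also filters on the empty startedTime, which narrows the result to the instances that never started.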

Steps to reproduce

select * from d_b_prebuilt_workspace where state = 'queued' order by creationTime;

@svenefftinge svenefftinge added the team: webapp Issue belongs to the WebApp team label Apr 19, 2022
@geropl geropl added the type: bug Something isn't working label Apr 19, 2022
@geropl geropl moved this to Scheduled in 🍎 WebApp Team Apr 19, 2022
@svenefftinge
Member Author

svenefftinge commented Apr 21, 2022

The cases have in common that the phase is unknown (which has a timeout of 600 secs vs. 3600 secs for preparing).
The code here sets the DB entries to stopping and stopped after that time.
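To illustrate why the phase matters, here is a minimal TypeScript sketch of a per-phase timeout check. It is not the actual Gitpod code: `PHASE_TIMEOUT_SECONDS` and `hasTimedOut` are hypothetical names, and only the two timeout values mentioned above (600 s for unknown, 3600 s for preparing) are taken from this thread.

```typescript
// Phase timeouts in seconds, as quoted in the comment above.
// Only the two phases mentioned there are listed (assumption: others are omitted).
const PHASE_TIMEOUT_SECONDS = new Map<string, number>([
  ["unknown", 600],
  ["preparing", 3600],
]);

// Returns true if an instance has been in `phase` for longer than the
// configured limit. Phases without a configured limit never time out here.
function hasTimedOut(phase: string, phaseStartedAt: Date, now: Date): boolean {
  const limit = PHASE_TIMEOUT_SECONDS.get(phase);
  if (limit === undefined) {
    return false;
  }
  const elapsedSeconds = (now.getTime() - phaseStartedAt.getTime()) / 1000;
  return elapsedSeconds > limit;
}
```

Under these assumptions, an instance that has been working for 15 minutes is stopped if its phase is unknown, but would be left alone if its phase were preparing.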

Unfortunately, I couldn't find any sign of these workspace instances in the workspace-cluster logs. I also had no luck finding traces in Honeycomb (I checked 20 instances) 😞 .

@geropl
Member

geropl commented Apr 22, 2022

> The cases have in common that the phase is unknown (which has a timeout of 600 secs vs. 3600 secs for preparing).

This sounds related to a fix @andrew-farries is working on here. At some point we seem to have started using "unknown" as the initial phase - which does not make sense to me, at all. 😕 We're seeking to fix this now.

@andrew-farries Can you make sure we introduce separate timeouts for "building", and start out with the same value as "preparing"?
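One way to make sure every phase gets an explicit timeout is to let the type checker enforce it. The sketch below is only an illustration: `StartupPhase`, `STARTUP_TIMEOUT_SECONDS` and `timeoutFor` are hypothetical names, and the phase list is a made-up subset. The values 600 and 3600 are the ones quoted above.

```typescript
// Hypothetical subset of instance phases, for illustration only.
type StartupPhase = "unknown" | "preparing" | "building";

const PREPARING_TIMEOUT_SECONDS = 3600;

// Record<StartupPhase, number> makes the compiler reject this object if any
// phase is missing, so a newly added phase cannot silently lack a timeout.
const STARTUP_TIMEOUT_SECONDS: Record<StartupPhase, number> = {
  unknown: 600,
  preparing: PREPARING_TIMEOUT_SECONDS,
  // "building" starts out with the same value as "preparing", as requested above.
  building: PREPARING_TIMEOUT_SECONDS,
};

function timeoutFor(phase: StartupPhase): number {
  return STARTUP_TIMEOUT_SECONDS[phase];
}
```

Reusing one constant for both entries keeps the two values aligned until someone deliberately changes one of them.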

@svenefftinge
Member Author

svenefftinge commented Apr 22, 2022

Yes, that explains why there are no signs in workspace-cluster: the workspaces are probably building Docker images and timing out in the meantime (after 10 min).

@jldec
Contributor

jldec commented Apr 26, 2022

Suspected example of affected user (internal)
Is there a workaround or DB change we can use to mitigate?

@geropl geropl removed the status in 🍎 WebApp Team May 31, 2022
@geropl geropl mentioned this issue May 31, 2022
11 tasks
@geropl geropl self-assigned this May 31, 2022
@geropl geropl moved this to In Progress in 🍎 WebApp Team May 31, 2022
@geropl
Member

geropl commented May 31, 2022

Having had a look at the data, we do indeed leak a couple of prebuilds in queued every day.
There seem to be two constellations:

  1. image build timed out (after 1h): we currently have no (reliable) way to ensure that whenever an image build fails, we also fail the prebuild, because we assume we're watching the imagebuild from server from start -> end. IMO this can only be solved either by:
    1. making the imagebuild 💯 visible inside webapp, by storing image-build workspaces just like regular and prebuilds 👎
    2. making imagebuild completely transparent to webapp, by encapsulating them inside workspace 👍
  2. prebuild workspaces are stopped after 10mins ~~no image builds, and no idea why those are stopped and by whom (yet)~~ by MetaInstanceController: ~~We currently have a timeout of 10mins for imagebuilds, which is a left over from before the "preparing" | "building" phase split~~ We have a timeout of 1h for imagebuilds (correct), but the "building" phase just got deployed two days ago, so that's why we still see recent entries in the DB. Will:
    1. increase the timeout for "building" to 1h (so it's aligned with the imagebuild timeout itself): this is temporary, and expected to become obsolete once we have solved 1. above
    2. make sure that if we stop a workspace, we also check for and update the corresponding prebuild if necessary
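The second follow-up (checking the corresponding prebuild whenever a workspace is stopped) could look roughly like the TypeScript sketch below. This is not the fix that was merged: `Prebuild`, `StoppedInstance` and `reconcilePrebuildOnStop` are hypothetical names, the in-memory `Map` stands in for the DB, and the terminal state name "aborted" is an assumption.

```typescript
// Hypothetical, simplified records; the real tables have more columns.
interface Prebuild {
  buildWorkspaceId: string;
  state: string; // e.g. "queued"
  error?: string;
}

interface StoppedInstance {
  workspaceId: string;
}

// Intended to be called whenever an instance is marked as stopped.
// If the prebuild for the same workspace is still "queued", it can no longer
// make progress, so it is moved to a terminal state instead of being leaked.
function reconcilePrebuildOnStop(
  instance: StoppedInstance,
  prebuilds: Map<string, Prebuild>, // keyed by buildWorkspaceId; stands in for the DB
): Prebuild | undefined {
  const prebuild = prebuilds.get(instance.workspaceId);
  if (prebuild === undefined || prebuild.state !== "queued") {
    return undefined; // no prebuild, or it already left the queue: nothing to do
  }
  prebuild.state = "aborted"; // terminal state name is an assumption
  prebuild.error = "workspace instance stopped before the prebuild started";
  return prebuild;
}
```

Doing this check on the stop path means it does not matter who stopped the workspace or why, which is the property the follow-up asks for.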

3 participants