
Failed image builds can result in many workspace instances, hiding the original issue #12978

Closed as not planned
@jenting

Description

Bug description

The workspace can't be started. This started out as an image build failure, but we allowed the user to restart the workspace (or restarted it ourselves), resulting in new instances that could not start because there was no backup.

When an image build fails, we offer the user the option to start the workspace with the default image. It is possible that happened here. If so, image-builder would not be the culprit for this particular issue, because the workspace actually started... but then ran into trouble with backups at some point. Please refer to the logs and instance IDs below to piece together the complete picture.

Steps to reproduce

Try using a custom image with just this:

FROM hashicorp/terraform:1.2.9

The build should fail, but you will still be offered the option to restart the workspace; doing so surfaces a message indicating that the workspace backup doesn't exist.

Workspace affected

Refer to logs

Expected behavior

  1. If an image build fails, show the reason for the failure rather than a generic "headless image build failed" message. It is presently difficult to know why image builds fail, so it is important to surface the actual error to users.
  2. If an image build fails and the user does not fall back to our default image, the final status of the workspace instance should not allow restarts. Otherwise the user may be unable to start the workspace (due to the image build failure) and never see why on subsequent restarts.
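The second expectation can be sketched as a small guard. This is a hypothetical illustration, not Gitpod's actual code; the `StopReason` type, constants, and `canRestart` function are all invented for the sketch:

```go
package main

import "fmt"

// StopReason is a hypothetical tag for why an instance stopped.
type StopReason string

const (
	ReasonImageBuildFailed StopReason = "image-build-failed"
	ReasonRegular          StopReason = "regular"
)

// canRestart sketches the desired rule: a workspace whose image never built
// and which has no backup to restore from should not be restartable, because
// a restart can only reproduce the "no backup found" failure.
func canRestart(reason StopReason, hasBackup bool) bool {
	if reason == ReasonImageBuildFailed && !hasBackup {
		return false
	}
	return true
}

func main() {
	fmt.Println(canRestart(ReasonImageBuildFailed, false)) // no restart offered
	fmt.Println(canRestart(ReasonRegular, true))           // restart allowed
}
```

Under this rule, the five failed instances seen below would have been a single, clearly explained failure.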

Example repository

No response

Anything else?

Logs for the workspace in question:
Internal slack conversation: https://gitpod.slack.com/archives/C02SF4A050W/p1663161853069319
workspace logs: https://cloudlogging.app.goo.gl/A2UvJMTtQd2eaFk38
webapp logs: https://cloudlogging.app.goo.gl/YpNtszBdZRGx2VXi8

Here is the failure we saw for the customer's most recent instance:

cannot initialize workspace:
    github.com/gitpod-io/gitpod/content-service/pkg/initializer.InitializeWorkspace
        github.com/gitpod-io/gitpod/content-service@v0.0.0-00010101000000-000000000000/pkg/initializer/initializer.go:439
  - no backup found:
    github.com/gitpod-io/gitpod/content-service/pkg/initializer.(*fromBackupInitializer).Run
        github.com/gitpod-io/gitpod/content-service@v0.0.0-00010101000000-000000000000/pkg/initializer/initializer.go:199

We call Run on the initializer here with hasBackup set to false, yet the stack trace shows that the initializer in play is a fromBackupInitializer, given this error.

Looking at the entire history for this workspace, there are five instances in the logs!

The initial instance failed with this root-cause error:

[adduser -h /home/gitpod -s /bin/sh -D -G 33333 -u 33333 gitpod]: exit status 1: adduser: unknown group 33333:
    github.com/gitpod-io/gitpod/supervisor/pkg/supervisor.addUser
        github.com/gitpod-io/gitpod/supervisor/pkg/supervisor/user.go:172
  - exit status 1

The related message is "cannot ensure Gitpod user exists".

The related instance IDs are:
8e70e9a5-b464-4dab-8826-0ed331b88ca9
0aa0fb2b-0cd9-4d12-9330-552797c6a2a0
349a594f-f55a-47ff-ac64-ccca827c2ffa
e61ebcc4-c300-47f8-a729-2b4d45c19365
56adcae2-66ba-463f-9555-7dace043ff90

Lastly, this is related: #8908

Definition of done

When an image build fails, e.g. because the Gitpod user does not exist, the end user sees the corresponding problem instead of a generic "headless task failed".

Also, update the docs on the Gitpod website so users know how to add the Gitpod user to a custom image that does not use a Gitpod base image.
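The fix for images like the one in the reproduction can be sketched as follows. This is a sketch, not the official docs, and it assumes an Alpine-based image (as hashicorp/terraform is): BusyBox's `adduser -G` flag expects an existing group name, which is why passing the bare GID 33333 fails with "unknown group 33333" and why the group must be created first:

```dockerfile
# Sketch: create the gitpod user (UID/GID 33333) on an Alpine-based image.
FROM hashicorp/terraform:1.2.9
RUN addgroup -g 33333 gitpod \
    && adduser -h /home/gitpod -s /bin/sh -D -G gitpod -u 33333 gitpod
```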

Metadata

Labels

team: workspace (Issue belongs to the Workspace team)
type: bug (Something isn't working)
