Improve monitoring and feedback for errors during workspace cluster selection #8486

Merged
merged 5 commits into main from gpl/7829-clu-sel
Mar 7, 2022

Conversation

Member

@geropl geropl commented Feb 28, 2022

Description

This PR:

  • retries cluster selection for a particular workspace instance up to 2 times, at a 2s interval
  • introduces two new metrics: 'gitpod_server_instance_starts_success_total{retries}' and 'gitpod_server_instance_starts_failed_total'
  • adds a Graph for it on "Meta Overview" [sic]
  • adds an alert InstanceStartFailures (incl. runbook)
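
A rough sketch of the retry-and-count behaviour described above, for readers who do not want to open the diff. It is not the code from this PR: `startWithRetries`, `StartResponse`, `sleep`, and the in-memory counters are hypothetical stand-ins, and "up to 2 times" is read here as one initial attempt plus at most two retries.

```typescript
// Hypothetical in-memory stand-ins for the two counters named in the description.
// A real server would register these with its metrics library instead.
const startsSuccessTotal = new Map<number, number>(); // keyed by the `retries` label
let startsFailedTotal = 0;

type StartResponse = { instanceUrl: string }; // assumed response shape

// Resolves after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Calls `tryStart` once, then retries up to `maxRetries` more times whenever it
 * resolves to undefined (assumed to mean "no cluster accepted the instance"),
 * waiting `intervalMs` between attempts.
 */
async function startWithRetries(
    tryStart: () => Promise<StartResponse | undefined>,
    maxRetries = 2,
    intervalMs = 2000,
): Promise<StartResponse | undefined> {
    for (let retries = 0; retries <= maxRetries; retries++) {
        const resp = await tryStart();
        if (resp) {
            // Success: record how many retries were needed.
            startsSuccessTotal.set(retries, (startsSuccessTotal.get(retries) ?? 0) + 1);
            return resp;
        }
        if (retries < maxRetries) {
            await sleep(intervalMs);
        }
    }
    // Every attempt failed: count one failed instance start.
    startsFailedTotal++;
    return undefined;
}
```

Labelling the success counter with the retry count makes it possible to see how often a start only succeeded after one or two retries, which is presumably what the `{retries}` label in the metric name is for.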

Related Issue(s)

Fixes #7829
Runbook PR: https://github.com/gitpod-io/runbooks/pull/327

How to test

  • start a workspace on this branch
  • kubectl port-forward deployment/grafana 3000
  • Open that port and navigate to "Team WebApp / Meta Overview"

Release Notes

improve robustness of startWorkspace
improve feedback for errors during cluster selection
improve monitoring for cluster selection errors

Documentation

  • /werft with-helm
  • /werft with-observability
  • /werft no-preview

codecov bot commented Feb 28, 2022

Codecov Report

Merging #8486 (572d8b4) into main (90fe82a) will decrease coverage by 17.74%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             main    #8486       +/-   ##
===========================================
- Coverage   28.92%   11.17%   -17.75%     
===========================================
  Files         150       18      -132     
  Lines       23031      993    -22038     
===========================================
- Hits         6662      111     -6551     
+ Misses      15769      880    -14889     
+ Partials      600        2      -598     
Flag Coverage Δ
components-blobserve-app ?
components-blobserve-lib ?
components-common-go-lib ?
components-content-service-api-go-lib ?
components-content-service-app ?
components-content-service-lib ?
components-ee-agent-smith-app ?
components-ee-agent-smith-lib ?
components-ee-kedge-app ?
components-gitpod-cli-app 11.17% <ø> (ø)
components-image-builder-api-go-lib ?
components-image-builder-bob-app ?
components-image-builder-bob-runc-facade ?
components-image-builder-mk3-app ?
components-installation-telemetry-app ?
components-local-app-app-darwin-amd64 ?
components-local-app-app-darwin-arm64 ?
components-local-app-app-linux-amd64 ?
components-local-app-app-linux-arm64 ?
components-local-app-app-windows-386 ?
components-local-app-app-windows-amd64 ?
components-local-app-app-windows-arm64 ?
components-openvsx-proxy-app ?
components-openvsx-proxy-lib ?
components-registry-facade-app ?
components-registry-facade-lib ?
components-service-waiter-app ?
components-supervisor-app ?
components-workspacekit-app ?
components-ws-daemon-api-go-lib ?
components-ws-daemon-app ?
components-ws-daemon-lib ?
components-ws-daemon-nsinsider-app ?
components-ws-manager-api-go-lib ?
components-ws-manager-app ?
components-ws-proxy-app ?
dev-loadgen-app ?
dev-poolkeeper-app ?
install-installer-raw-app ?

Impacted Files Coverage Δ
install/installer/pkg/common/objects.go
components/ws-daemon/pkg/content/initializer.go
components/ws-daemon/pkg/cpulimit/dispatch.go
components/ws-manager/pkg/manager/annotations.go
components/supervisor/pkg/supervisor/git.go
components/common-go/log/redact.go
components/ee/kedge/pkg/kedge/collector.go
components/ws-manager/pkg/clock/clock.go
components/supervisor/pkg/ports/exposed-ports.go
components/blobserve/pkg/blobserve/refstore.go
... and 122 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90fe82a...572d8b4. Read the comment docs.

Member Author

geropl commented Feb 28, 2022

/werft run

👍 started the job as gitpod-build-gpl-7829-clu-sel.1

@roboquat roboquat added size/XL and removed size/L labels Mar 1, 2022
@geropl geropl force-pushed the gpl/7829-clu-sel branch from be1e70d to 52b706f Compare March 1, 2022 15:40
@geropl geropl force-pushed the gpl/7829-clu-sel branch from 52b706f to 23bdf16 Compare March 1, 2022 16:38
@geropl geropl marked this pull request as ready for review March 1, 2022 16:41
@geropl geropl requested a review from a team March 1, 2022 16:41
@github-actions github-actions bot added the "team: webapp" (Issue belongs to the WebApp team) label Mar 1, 2022
Member Author

geropl commented Mar 1, 2022

Note: the build is red because of a bug in the "observability" deployment process, but Gitpod and Grafana do work.
We'll have to override the PR check once this is approved.

if (!resp) {
    increaseFailedInstanceStartCounter("clusterSelectionFailed");
Member

Just in case tryStartOnCluster throws an unhandled error, would it make sense to increment the counter as well?

Member Author

Thx, excellent point! 💯 Will add

Member Author

@AlexTugarev Here it is: 572d8b4
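
For illustration, the suggestion in this thread could be handled roughly as follows. This is a hedged sketch, not the content of commit 572d8b4: `startInstance`, the in-memory `failedStarts` map, and the "startOnClusterFailed" reason are assumptions made for the example; only `tryStartOnCluster`, `increaseFailedInstanceStartCounter`, and "clusterSelectionFailed" appear in the thread.

```typescript
// Hypothetical counter keyed by failure reason; stands in for the real metric helper.
const failedStarts = new Map<string, number>();

function increaseFailedInstanceStartCounter(reason: string): void {
    failedStarts.set(reason, (failedStarts.get(reason) ?? 0) + 1);
}

/**
 * Wraps a start attempt so that both outcomes discussed in the thread are counted:
 * an empty response (no cluster selected) and an unexpected exception.
 */
async function startInstance<T>(tryStartOnCluster: () => Promise<T | undefined>): Promise<T> {
    let resp: T | undefined;
    try {
        resp = await tryStartOnCluster();
    } catch (err) {
        // Count the unhandled error as a failed start, then let it propagate.
        increaseFailedInstanceStartCounter("startOnClusterFailed"); // assumed reason label
        throw err;
    }
    if (!resp) {
        increaseFailedInstanceStartCounter("clusterSelectionFailed");
        throw new Error("no cluster available for this instance");
    }
    return resp;
}
```

Using a separate reason for thrown errors keeps the two failure modes distinguishable on a dashboard, at the cost of one more label value.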

Member Author

geropl commented Mar 7, 2022

/werft run

👍 started the job as gitpod-build-gpl-7829-clu-sel.7

Hope to get a green build now with changed flags...

Member Author

geropl commented Mar 7, 2022

/werft no-preview=true

👎 unknown command: no-preview=true
Use /werft help to list the available commands

Member Author

geropl commented Mar 7, 2022

/werft run

👍 started the job as gitpod-build-gpl-7829-clu-sel.8

@roboquat roboquat merged commit d030750 into main Mar 7, 2022
@roboquat roboquat deleted the gpl/7829-clu-sel branch March 7, 2022 17:00
@roboquat roboquat added the "deployed: webapp" (Meta team change is running in production) and "deployed" (Change is completely running in production) labels Mar 15, 2022
Labels
deployed: webapp (Meta team change is running in production), deployed (Change is completely running in production), release-note, size/XL, team: webapp (Issue belongs to the WebApp team)