Skip to content

[ws-manager-mk2] Loadgen fixes, concurrent reconciliation #16613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 6, 2023

Conversation

WVerlaek
Copy link
Member

@WVerlaek WVerlaek commented Feb 28, 2023

Description

Number of fixes and improvements found from loadtesting:

  • Configure concurrent reconciliation for ws-manager and ws-daemon controllers. Set to max 15 concurrent reconciliations. Configurable through config
  • See comments on PR for the rest

Related Issue(s)

Relates to #11416

How to test

Ran loadgen in ephemeral cluster

Release Notes

NONE

Documentation

Build Options:

  • /werft with-github-actions
    Experimental feature to run the build with GitHub Actions (and not in Werft).
  • leeway-no-cache
    leeway-target=components:all
  • /werft no-test
    Run Leeway with --dont-test
Publish Options
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer Options
  • with-ee-license
  • with-slow-database
  • with-dedicated-emulation
  • with-ws-manager-mk2
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated

Preview Environment Options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • /werft with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh

Comment on lines +227 to +228
WorkspaceMaxConcurrentReconciles: 15,
TimeoutMaxConcurrentReconciles: 15,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set both the timeout and workspace controller's max reconciles to 15. This number is slightly arbitrary but should be sufficient for us looking at the metrics during the loadgen. It's in config, so we can easily change it anyways

Comment on lines +225 to +230
n := len(resp.GetStatus())
if n == 0 {
break
}
ex := resp.GetStatus()[0]
log.Infof("%d workspaces remaining, e.g. %s", n, ex.Id)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some extra logging while stopping workspaces after a load test to view progress. Also include a workspace ID of a stopping workspace, to make it easy to inspect a workspace stuck in stopping

@@ -55,7 +55,7 @@ var benchmarkCommand = &cobra.Command{
}

var load loadgen.LoadGenerator
load = loadgen.NewFixedLoadGenerator(500*time.Millisecond, 300*time.Millisecond)
load = loadgen.NewFixedLoadGenerator(800*time.Millisecond, 300*time.Millisecond)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly decreased the rate, this was creating workspaces too quickly for mk2, as mk2's StartWorkspace request doesn't block and was following the 2/second rate.

For mk1 loadtests, the StartWorkspace request takes seconds (to minutes) to complete, and never reached a rate of 2 starts/second anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slightly decreased the rate, this was creating workspaces too quickly for mk2, as mk2's StartWorkspace request doesn't block and was following the 2/second rate.

@WVerlaek were you hitting a rate limit of ws-manager-mk2, where it wasn't allowing additional gRPC connections? I assume yes, just curious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, some workspaces were failing to start, there were too many starting at once and pulling an image causing some to fail to pull. Increased the delay a bit to slow down workspace creation, but at this rate it's still faster than what mk1 would handle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I see, so it's just a natural breaking limit. Good to know!

What was failing on pull? registry-facade, containerd? Something else? Just curious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The errors were failing to pull image from registry-facade due to IO timeout

@WVerlaek
Copy link
Member Author

Keeping in draft - need to update the tests

@WVerlaek WVerlaek force-pushed the wv/mk2-loadgen-fixes branch from adeb3d3 to b0c58b2 Compare March 2, 2023 14:34
@roboquat roboquat added size/L and removed size/M labels Mar 2, 2023
Comment on lines +194 to +195
// Nothing to dispose if content wasn't ready.
!wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionContentReady))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also finish disposal if ContentReady condition isn't present. Fixes workspaces stuck in Stopping when the condition isn't added due to e.g. workspace startup failure

@WVerlaek WVerlaek marked this pull request as ready for review March 2, 2023 14:40
@WVerlaek WVerlaek requested a review from a team March 2, 2023 14:40
@github-actions github-actions bot added the team: workspace Issue belongs to the Workspace team label Mar 2, 2023
@WVerlaek WVerlaek requested a review from kylos101 March 6, 2023 13:36
@kylos101
Copy link
Contributor

kylos101 commented Mar 6, 2023

Thanks for the ping @WVerlaek , I'll start reviewing now!

@roboquat roboquat merged commit 1631a4a into main Mar 6, 2023
@roboquat roboquat deleted the wv/mk2-loadgen-fixes branch March 6, 2023 14:18
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: workspace Workspace team change is running in production deployed Change is completely running in production release-note-none size/L team: workspace Issue belongs to the Workspace team
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants