[ws-manager-mk2] Loadgen fixes, concurrent reconciliation #16613
WorkspaceMaxConcurrentReconciles: 15,
TimeoutMaxConcurrentReconciles: 15,
Set the max concurrent reconciles to 15 for both the timeout and workspace controllers. The number is somewhat arbitrary, but judging by the metrics during the loadgen it should be sufficient. It's in config, so we can easily change it anyway.
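To illustrate what a cap of 15 concurrent reconciles means in practice, here is a minimal stdlib-only sketch of bounded concurrency. It is not the ws-manager-mk2 code: the real settings presumably feed the controllers' concurrency option, whereas `reconcileAll` and its parameters are hypothetical names invented for this example.

```go
package main

import "sync"

// reconcileAll is a hypothetical helper (not part of ws-manager-mk2).
// It calls reconcile once per key, never running more than maxConcurrent
// calls at the same time, and returns the highest concurrency it observed.
func reconcileAll(keys []string, maxConcurrent int, reconcile func(key string)) int {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		running int
		peak    int
	)
	for _, key := range keys {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrent reconciles are in flight
		go func(key string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot once this reconcile is done

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			reconcile(key)

			mu.Lock()
			running--
			mu.Unlock()
		}(key)
	}
	wg.Wait()
	return peak
}
```

With `maxConcurrent` set to 15, a burst of 100 workspace events is still processed in full, but at most 15 reconciles run at once; the rest wait for a free slot.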
n := len(resp.GetStatus())
if n == 0 {
	break
}
ex := resp.GetStatus()[0]
log.Infof("%d workspaces remaining, e.g. %s", n, ex.Id)
Some extra logging while stopping workspaces after a load test, to show progress. It also includes the ID of one workspace that is still stopping, which makes it easy to inspect a workspace stuck in stopping.
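As a rough sketch of how such a progress loop can be structured, the snippet above can be wrapped in a small polling function. The names `waitUntilStopped`, `list`, and `logf` are hypothetical stand-ins for the loadgen's own gRPC call and logger, which are not shown in this diff.

```go
package main

import "time"

// waitUntilStopped is a hypothetical helper. It polls list until no workspace
// IDs remain. On every poll that still finds workspaces it logs the count and
// one example ID, then sleeps for interval. It returns how many polls found
// workspaces remaining.
func waitUntilStopped(list func() []string, logf func(format string, args ...interface{}), interval time.Duration) int {
	polls := 0
	for {
		ids := list()
		n := len(ids)
		if n == 0 {
			return polls
		}
		// Log one concrete ID so a workspace stuck in stopping can be inspected.
		logf("%d workspaces remaining, e.g. %s", n, ids[0])
		polls++
		time.Sleep(interval)
	}
}
```

Passing the lister and logger as functions keeps the loop testable without a running cluster.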
@@ -55,7 +55,7 @@ var benchmarkCommand = &cobra.Command{
 	}

 	var load loadgen.LoadGenerator
-	load = loadgen.NewFixedLoadGenerator(500*time.Millisecond, 300*time.Millisecond)
+	load = loadgen.NewFixedLoadGenerator(800*time.Millisecond, 300*time.Millisecond)
Slightly decreased the rate. The old setting created workspaces too quickly for mk2: mk2's StartWorkspace request doesn't block, so creation actually followed the 2/second rate.
For mk1 load tests, the StartWorkspace request takes seconds (to minutes) to complete, so it never reached a rate of 2 starts/second anyway.
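The arithmetic behind this change can be sketched as follows. This assumes the first argument to `NewFixedLoadGenerator` is the delay between starts and that the generator waits for each StartWorkspace call to return before sleeping; neither is confirmed by the diff, and both function names below are hypothetical.

```go
package main

import "time"

// startsPerSecond returns the nominal start rate for a fixed delay between
// StartWorkspace calls, assuming each call returns immediately (as on mk2).
func startsPerSecond(delay time.Duration) float64 {
	return float64(time.Second) / float64(delay)
}

// effectiveStartsPerSecond also accounts for how long each call blocks,
// assuming the generator waits for the call to return before sleeping
// (the mk1 behaviour described above).
func effectiveStartsPerSecond(delay, callDuration time.Duration) float64 {
	return float64(time.Second) / float64(delay+callDuration)
}
```

Under these assumptions a 500ms delay gives 2 starts/second and an 800ms delay gives 1.25 starts/second, while a call that blocks for 2 seconds with a 500ms delay yields only 0.4 starts/second.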
> slightly decreased the rate, this was creating workspaces too quickly for mk2, as mk2's StartWorkspace request doesn't block and was following the 2/second rate.
@WVerlaek were you hitting a rate limit of ws-manager-mk2, where it wasn't allowing additional gRPC connections? I assume yes, just curious.
No, some workspaces were failing to start: too many were starting at once and pulling an image, which caused some of the pulls to fail. I increased the delay a bit to slow down workspace creation, but at this rate it's still faster than what mk1 would handle.
Cool, I see, so it's just a natural breaking limit. Good to know!
What was failing on pull? registry-facade, containerd? Something else? Just curious.
The errors were failures to pull the image from registry-facade due to IO timeouts.
Keeping in draft - need to update the tests
adeb3d3 to b0c58b2
// Nothing to dispose if content wasn't ready.
!wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionContentReady))
Also finish disposal if the ContentReady condition isn't present. This fixes workspaces stuck in Stopping when the condition is never added, e.g. due to a workspace startup failure.
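A minimal sketch of the "present and true" check this relies on, using a pared-down local `Condition` type instead of the real Kubernetes one. The helper names mirror the diff, but the bodies are assumptions about the semantics, not the actual `wsk8s` implementation.

```go
package main

// Condition is a pared-down stand-in for a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// conditionPresentAndTrue reports whether a condition of the given type
// exists and has status "True". An absent condition counts as not true.
func conditionPresentAndTrue(conds []Condition, typ string) bool {
	for _, c := range conds {
		if c.Type == typ {
			return c.Status == "True"
		}
	}
	return false
}

// nothingToDispose is a hypothetical wrapper around the expression in the
// diff: there is nothing to dispose when content was never ready, which
// covers both an explicit non-true ContentReady and a missing condition.
func nothingToDispose(conds []Condition) bool {
	return !conditionPresentAndTrue(conds, "ContentReady")
}
```

The point of the change is the missing-condition case: a check that only handled an explicit false value would leave a workspace that failed before the condition was set waiting in Stopping.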
Thanks for the ping @WVerlaek, I'll start reviewing now!
Description
A number of fixes and improvements found during load testing:
Related Issue(s)
Relates to #11416
How to test
Ran loadgen in ephemeral cluster
Release Notes
Documentation