
Load test: Test run-to-completion workflow for Job objects #799

Closed
mm4tt opened this issue Sep 16, 2019 · 8 comments · Fixed by #1998
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@mm4tt
Contributor

mm4tt commented Sep 16, 2019

As a part of #704 the load test was extended to cover Jobs.

The implementation was something of a shortcut: we used "pause" pods that never complete and exercised Jobs the same way we test Deployments, i.e. we create N Jobs of size X, scale them up or down, and then delete them. While this was a good start for testing the overall performance of the job controller, we should use Jobs the way they are designed to be used: instead of pause pods, run pods that finish after some time and test the run-to-completion workflow.
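For illustration, a minimal sketch of the kind of Job this implies (the image, command, and counts here are placeholders, not what the test uses): each pod does a bounded amount of "work" and exits 0, so the Job genuinely runs to completion instead of sitting on pause pods.

```go
// Sketch only: a Job whose pods finish on their own, so the Job reaches
// completion. Image, command, and counts are illustrative placeholders.
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func runToCompletionJob(name string, completions, parallelism int32) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1.JobSpec{
			Completions: &completions,
			Parallelism: &parallelism,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					// Job pods must use Never or OnFailure; Never keeps accounting simple.
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "busybox",
						// Unlike the pause container, this terminates successfully
						// after a short, bounded amount of time.
						Command: []string{"sh", "-c", "sleep 10"},
					}},
				},
			},
		},
	}
}

func main() {
	job := runToCompletionJob("small-job", 10, 5)
	fmt.Printf("job %q: completions=%d parallelism=%d\n",
		job.Name, *job.Spec.Completions, *job.Spec.Parallelism)
}
```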

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2019
@wojtek-t
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 16, 2019
@marseel
Member

marseel commented Dec 17, 2020

@mm4tt @wojtek-t I am actually wondering about one thing. Let's assume we change the jobs to "sleep X".
Currently we have two phases: creating the job and waiting for its pods to be running, and a second phase where we scale the number of pods within the job.
After changing to "sleep X", the first phase will complete the job, so the second part of scaling the job up/down will not work (completions is also immutable). Do you have any idea how we want to handle that, other than just creating the job and waiting for completion?

I have two ideas for what we could do here:

  1. Set parallelism == completions
  2. Set parallelism < completions (for example, parallelism = 1/4 of completions)

In both cases I guess it would make sense to make the sleep a bit random, so we don't have many pods churning and/or being created at the same time (rough sketch below).
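A rough sketch of option 2, with made-up numbers: parallelism pinned to a quarter of completions, plus a per-job random jitter on the sleep so pod creation and termination are spread out.

```go
// Rough sketch of option 2; the concrete numbers are illustrative only.
package main

import (
	"fmt"
	"math/rand"
)

// sleepCommand builds a shell command that sleeps somewhere between
// baseSeconds and baseSeconds+jitterSeconds and then exits 0, so pod churn
// is spread out instead of happening all at once.
func sleepCommand(baseSeconds, jitterSeconds int) []string {
	d := baseSeconds + rand.Intn(jitterSeconds+1)
	return []string{"sh", "-c", fmt.Sprintf("sleep %d", d)}
}

func main() {
	completions := int32(40)
	parallelism := completions / 4 // option 2: parallelism < completions
	fmt.Printf("completions=%d parallelism=%d command=%v\n",
		completions, parallelism, sleepCommand(30, 30))
}
```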

What do you think about it?

@wojtek-t
Member

We can get rid of the scaling phase for jobs at this point. Just run them to completion, ideally with parallelism < completions.

@alculquicondor
Member

/assign

@alculquicondor
Member

Let me know if this plan makes sense:

  1. Add a WaitForJobsCompleted measurement method.
  2. In the load test:
    • Switch the image from pause to something that just finishes. Is there an image I can use? The simplest thing is a busybox with the command exit 0, but let me know if I should add a new image.
    • Remove the WaitForRunningJobs identifier from WaitForControlledPodsRunning.
    • Add WaitForJobsCompleted (a rough sketch of the check it would need to perform follows below).
    • Make the scale a no-op for jobs (keep the same parallelism, although at this point the job would already be finished).
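This is not the actual clusterloader2 measurement, just a sketch of the underlying check a WaitForJobsCompleted measurement would have to perform, assuming a plain client-go clientset; the function name, package name, polling interval, and timeout handling are placeholders.

```go
// Sketch only: poll a Job until it reports the Complete condition, or fail
// fast if it reports Failed. Interval and timeout values are placeholders.
package measurement

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitForJobCompleted(ctx context.Context, c kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		job, err := c.BatchV1().Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range job.Status.Conditions {
			if cond.Type == batchv1.JobComplete && cond.Status == corev1.ConditionTrue {
				return true, nil // all requested completions succeeded
			}
			if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
				return false, fmt.Errorf("job %s/%s failed: %s", namespace, name, cond.Message)
			}
		}
		return false, nil // still running, keep polling
	})
}
```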

cc @jprzychodzen

@wojtek-t
Member

I'm actually reluctant to change that in the current load test. We make some assumptions about the number of pods, etc.
I think we should have a new (more batch-oriented) test for it.
We may decide to merge them eventually, but I don't think we're there yet.

@mborsz - FYI

@jprzychodzen
Contributor

I agree with splitting Jobs into a separate test. Currently we depend strongly on the number of Pods, and changing that during the test can generate unstable results.

As mentioned in kubernetes/enhancements#3113 and kubernetes/enhancements#3111, understanding the performance here requires monitoring values of the job_sync_duration_seconds metric (visible in Prometheus as job_controller_job_sync_duration_seconds_* metrics).

To achieve this, we should tweak

return measurementutil.NewLatencyMetricPrometheus(samples)

to allow generating a PerfDash summary for this metric.
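For context, a self-contained sketch of what monitoring those Prometheus metrics can look like; the Prometheus address, quantile, and time window below are assumptions, and this is independent of the NewLatencyMetricPrometheus / PerfDash change itself.

```go
// Standalone sketch: query the p99 of the job controller's sync duration.
// The Prometheus address, time window, and quantile are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatalf("creating Prometheus client: %v", err)
	}
	v1api := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Histogram quantile over the job_sync_duration_seconds buckets exported
	// by the job controller.
	query := `histogram_quantile(0.99, sum(rate(job_controller_job_sync_duration_seconds_bucket[5m])) by (le))`
	result, warnings, err := v1api.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatalf("querying Prometheus: %v", err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}
	fmt.Printf("job_sync p99: %v\n", result)
}
```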

/cc @aleksandra-malinowska
