
Load test: Test run-to-completion workflow for Job objects #799

Closed
mm4tt opened this issue Sep 16, 2019 · 8 comments · Fixed by #1998
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@mm4tt
Contributor

mm4tt commented Sep 16, 2019

As a part of #704 the load test was extended to cover Jobs.

The implementation was something of a shortcut: we used "pause" pods that never complete and exercised Jobs the same way we test Deployments, i.e. we create N Jobs of size X, scale them up or down, and then delete them. While this was a good start for testing the overall performance of the job controller, we should use Jobs the way they are designed to be used: instead of pause pods, run pods that finish after some time and test the run-to-completion workflow.
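For illustration, a minimal sketch of the kind of Job this implies (the image, command, and counts here are placeholders, not what the test uses): each pod does a bounded amount of "work" and exits 0, so the Job genuinely runs to completion instead of sitting on pause pods.

```go
// Sketch only: a Job whose pods finish on their own, so the Job reaches
// completion. Image, command, and counts are illustrative placeholders.
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func runToCompletionJob(name string, completions, parallelism int32) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: batchv1.JobSpec{
			Completions: &completions,
			Parallelism: &parallelism,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					// Job pods must use Never or OnFailure; Never keeps accounting simple.
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "busybox",
						// Unlike the pause container, this terminates successfully
						// after a short, bounded amount of time.
						Command: []string{"sh", "-c", "sleep 10"},
					}},
				},
			},
		},
	}
}

func main() {
	job := runToCompletionJob("small-job", 10, 5)
	fmt.Printf("job %q: completions=%d parallelism=%d\n",
		job.Name, *job.Spec.Completions, *job.Spec.Parallelism)
}
```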

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2019
@wojtek-t
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 16, 2019
@marseel
Member

marseel commented Dec 17, 2020

@mm4tt @wojtek-t I am actually wondering about one thing. Let's assume we change the jobs to "sleep X".
Currently we have two phases: creating the job and waiting for its pods to be running, and a second phase where we scale the number of pods within the job.
After changing to "sleep X", the first phase will complete the job, so the second part of scaling the job up/down will not work (completions is also immutable). Do you have any idea how we want to handle that, other than just creating the job and waiting for completion?

I have two ideas for what we could do here:

  1. Set parallelism == completions
  2. Set parallelism < completions (for example, parallelism = 1/4 of completions)

In both cases I guess it would make sense to make the sleep a bit random, so we don't have many pods churning and/or being created at the same time (rough sketch below).
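A rough sketch of option 2, with made-up numbers: parallelism pinned to a quarter of completions, plus a per-job random jitter on the sleep so pod creation and termination are spread out.

```go
// Rough sketch of option 2; the concrete numbers are illustrative only.
package main

import (
	"fmt"
	"math/rand"
)

// sleepCommand builds a shell command that sleeps somewhere between
// baseSeconds and baseSeconds+jitterSeconds and then exits 0, so pod churn
// is spread out instead of happening all at once.
func sleepCommand(baseSeconds, jitterSeconds int) []string {
	d := baseSeconds + rand.Intn(jitterSeconds+1)
	return []string{"sh", "-c", fmt.Sprintf("sleep %d", d)}
}

func main() {
	completions := int32(40)
	parallelism := completions / 4 // option 2: parallelism < completions
	fmt.Printf("completions=%d parallelism=%d command=%v\n",
		completions, parallelism, sleepCommand(30, 30))
}
```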

What do you think about it?

@wojtek-t
Member

We can get rid of the scaling phase for jobs at this point. Just run them to completion, ideally with parallelism < completions.

@alculquicondor
Member

/assign

@alculquicondor
Member

Let me know if this plan makes sense:

  1. Add a WaitForJobsCompleted measurement method.
  2. In the load test:
    • Switch the image from pause to something that just finishes. Is there an image I can use? The simplest thing is a busybox with the command exit 0, but let me know if I should add a new image.
    • Remove the WaitForRunningJobs identifier from WaitForControlledPodsRunning.
    • Add WaitForJobsCompleted (a rough sketch of the check it would need to perform follows below).
    • Make the scale a no-op for jobs (keep the same parallelism, although at this point the job would already be finished).
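This is not the actual clusterloader2 measurement, just a sketch of the underlying check a WaitForJobsCompleted measurement would have to perform, assuming a plain client-go clientset; the function name, package name, polling interval, and timeout handling are placeholders.

```go
// Sketch only: poll a Job until it reports the Complete condition, or fail
// fast if it reports Failed. Interval and timeout values are placeholders.
package measurement

import (
	"context"
	"fmt"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitForJobCompleted(ctx context.Context, c kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		job, err := c.BatchV1().Jobs(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, cond := range job.Status.Conditions {
			if cond.Type == batchv1.JobComplete && cond.Status == corev1.ConditionTrue {
				return true, nil // all requested completions succeeded
			}
			if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
				return false, fmt.Errorf("job %s/%s failed: %s", namespace, name, cond.Message)
			}
		}
		return false, nil // still running, keep polling
	})
}
```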

cc @jprzychodzen

@wojtek-t
Member

I'm actually reluctant to change that in the current load test. We make some assumptions about the number of pods, etc.
I think we should have a new (more batch-oriented) test for it.
We may decide to merge them eventually, but I don't think we're there yet.

@mborsz - FYI

@jprzychodzen
Contributor

I agree with splitting Jobs into a separate test. Currently we depend strongly on the number of Pods, and changing that during the test can generate unstable results.

As mentioned in kubernetes/enhancements#3113 and kubernetes/enhancements#3111, understanding the performance here requires monitoring values of the job_sync_duration_seconds metric (visible in Prometheus as job_controller_job_sync_duration_seconds_* metrics).

To achieve this, we should tweak

return measurementutil.NewLatencyMetricPrometheus(samples)

to allow generating a PerfDash summary for this metric.
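For context, a self-contained sketch of what monitoring those Prometheus metrics can look like; the Prometheus address, quantile, and time window below are assumptions, and this is independent of the NewLatencyMetricPrometheus / PerfDash change itself.

```go
// Standalone sketch: query the p99 of the job controller's sync duration.
// The Prometheus address, time window, and quantile are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatalf("creating Prometheus client: %v", err)
	}
	v1api := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Histogram quantile over the job_sync_duration_seconds buckets exported
	// by the job controller.
	query := `histogram_quantile(0.99, sum(rate(job_controller_job_sync_duration_seconds_bucket[5m])) by (le))`
	result, warnings, err := v1api.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatalf("querying Prometheus: %v", err)
	}
	if len(warnings) > 0 {
		log.Printf("warnings: %v", warnings)
	}
	fmt.Printf("job_sync p99: %v\n", result)
}
```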

/cc @aleksandra-malinowska
