Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MetricController: Run only a single job per task #660

Merged
merged 1 commit into from
Jun 27, 2019

Conversation

epa095
Copy link
Contributor

@epa095 epa095 commented Jun 18, 2019

What this PR does / why we need it:
This changes the spec.concurrencyPolicy of the metric collector
cron-job from "Allow" (default) to "Forbid". The cronjob used to
create a new job even if the previous job had not succeeded. On
high-load clusters this could lead to a high number of jobs which
never finished.

Which issue(s) this PR fixes *:
This fixes #659

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:



This change is Reviewable

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@k8s-ci-robot
Copy link

Hi @epa095. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@epa095
Copy link
Contributor Author

epa095 commented Jun 18, 2019

I signed it!

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@epa095
Copy link
Contributor Author

epa095 commented Jun 18, 2019

/assign @richardsliu

@johnugeorge
Copy link
Member

Wondering why previous jobs have not succeeded?

Can you add changes to v1alpha2 also?

This changes the `spec.concurrencyPolicy` of the metric collector
cron-job from "Allow" (default) to "Forbid". The cronjob used to
create a new job even if the previous job had not succeeded. On
high-load clusters this could lead to a high number of jobs which
never finished.

This fixed kubeflow#659
@epa095
Copy link
Contributor Author

epa095 commented Jun 22, 2019

Wondering why previous jobs have not succeeded?

Can you add changes to v1alpha2 also?

Added it to v1alpha2 as well.

In our case we actually saw two problems, and we think both are related to a flaky cluster (on AKS):

  1. Pods started, but slowly. This would cause the metric-collectors to pile up, but finish eventually. In this case katib just put unnecessary stress on the system.
  2. Some pods never started, because... who knows, but probably AKS/network issues. In this case katib/cronjobs kept spawning jobs which never finished, and it brought everything down (which is maybe as expected with a broken cluster). We have a limit of 110 pods on each node, and I think the everlasting-starting pods counted towards this limit.

It should be noted that in the second case above this patch has the effect that we dont spawn infinite non-starting pods, but it also causes some trials to never get their results collected, while in the previous setup one could be lucky and have some of the collectors for a trial start up and manage collecting the result. I think its better to not start up arbitrary many instances, but its a bit of a tradeof.

@johnugeorge
Copy link
Member

@epa095 i am little confused about 2. If pods were not started, cronjobs should fail(https://github.com/kubeflow/katib/blob/master/pkg/util/v1alpha2/metricscollector/metricscollector.go#L49) and failedJobsHistoryLimit is set which determines the max number of failed jobs that should be kept.

@epa095
Copy link
Contributor Author

epa095 commented Jun 22, 2019

"failedJobsHistoryLimit" refers to failed jobs, not jobs stuck in 'starting'.

@hougangliu
Copy link
Member

/lgtm

@hougangliu
Copy link
Member

/ok-to-test

@johnugeorge
Copy link
Member

/retest

1 similar comment
@johnugeorge
Copy link
Member

/retest

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@johnugeorge
Copy link
Member

/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c81818d into kubeflow:master Jun 27, 2019
@gaocegege
Copy link
Member

@epa095

Thanks for your contribution! 🎉 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metric-collector cronjob spawns unlimited jobs
7 participants