MetricController: Run only a single job per task #660

epa095 · 2019-06-18T11:43:45Z

What this PR does / why we need it:
This changes the `spec.concurrencyPolicy` of the metric collector
cronjob from "Allow" (the default) to "Forbid". With "Allow", the
cronjob created a new job on every scheduled run, even if the
previous job had not finished. On high-load clusters this could lead
to a large number of jobs which never finished.
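
For readers unfamiliar with the field, here is a minimal sketch of the change in Go. The types are local stand-ins, redeclared so the example compiles without the `k8s.io/api` module; the policy values "Allow", "Forbid" and "Replace" are the ones the Kubernetes CronJob API accepts, while `CronJobSpec`, `effectivePolicy` and the one-minute schedule are assumptions made for illustration and are not taken from this PR's diff.

```go
package main

import "fmt"

// ConcurrencyPolicy is a local stand-in for the string type the
// Kubernetes batch API uses for spec.concurrencyPolicy.
type ConcurrencyPolicy string

const (
	AllowConcurrent   ConcurrencyPolicy = "Allow"
	ForbidConcurrent  ConcurrencyPolicy = "Forbid"
	ReplaceConcurrent ConcurrencyPolicy = "Replace"
)

// CronJobSpec is a hypothetical, trimmed-down spec with only the
// fields this sketch needs.
type CronJobSpec struct {
	Schedule          string
	ConcurrencyPolicy ConcurrencyPolicy
}

// effectivePolicy applies the API default: an unset policy means "Allow".
func effectivePolicy(s CronJobSpec) ConcurrencyPolicy {
	if s.ConcurrencyPolicy == "" {
		return AllowConcurrent
	}
	return s.ConcurrencyPolicy
}

func main() {
	// Before: policy left unset, so the default applies.
	before := CronJobSpec{Schedule: "*/1 * * * *"}
	// After: the policy is set explicitly to "Forbid".
	after := CronJobSpec{Schedule: "*/1 * * * *", ConcurrencyPolicy: ForbidConcurrent}

	fmt.Println(effectivePolicy(before), effectivePolicy(after))
}
```

With "Forbid", a scheduled run is skipped while a job from an earlier run is still active.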

Which issue(s) this PR fixes:
This fixes #659


googlebot · 2019-06-18T11:43:48Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

ℹ️ Googlers: Go here for more info.

k8s-ci-robot · 2019-06-18T11:43:59Z

Hi @epa095. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

epa095 · 2019-06-18T12:05:26Z

I signed it!

googlebot · 2019-06-18T12:05:33Z

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

epa095 · 2019-06-18T12:13:27Z

/assign @richardsliu

johnugeorge · 2019-06-18T12:49:40Z

Wondering why previous jobs have not succeeded?

Can you add changes to v1alpha2 also?


epa095 · 2019-06-22T09:48:29Z

> Wondering why previous jobs have not succeeded?
>
> Can you add changes to v1alpha2 also?

Added it to v1alpha2 as well.

In our case we actually saw two problems, and we think both are related to a flaky cluster (on AKS):

1. Pods started, but slowly. This would cause the metric collectors to pile up, but finish eventually. In this case Katib just put unnecessary stress on the system.
2. Some pods never started, because... who knows, but probably AKS/network issues. In this case the Katib cronjobs kept spawning jobs which never finished, and it brought everything down (which is maybe to be expected with a broken cluster). We have a limit of 110 pods on each node, and I think the pods that were stuck starting counted towards this limit.

It should be noted that in the second case above this patch means we no longer spawn an unbounded number of non-starting pods, but it also causes some trials to never get their results collected. In the previous setup one could be lucky and have one of the collectors for a trial start up and manage to collect the result. I think it's better not to start up arbitrarily many instances, but it's a bit of a trade-off.
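
The two cases above can be sketched with a toy scheduler. This is only an illustration and not Katib or Kubernetes code: `peakActiveJobs`, the tick model, and the `neverFinishes` marker are all hypothetical names invented for this example. One tick stands for one scheduled run of the cronjob.

```go
package main

import "fmt"

// Policy is a simplified stand-in for the cronjob concurrency policy.
type Policy string

const (
	Allow  Policy = "Allow"
	Forbid Policy = "Forbid"
)

// neverFinishes marks a job that stays stuck, e.g. a pod that never starts.
const neverFinishes = -1

// peakActiveJobs runs the toy scheduler for the given number of ticks and
// returns the largest number of simultaneously active jobs. duration is
// the number of ticks a job needs before it completes.
func peakActiveJobs(policy Policy, ticks, duration int) int {
	var remaining []int // ticks left for each active job
	peak := 0
	for t := 0; t < ticks; t++ {
		// Age the existing jobs and drop the ones that have completed.
		active := remaining[:0]
		for _, r := range remaining {
			if r == neverFinishes {
				active = append(active, r)
			} else if r > 1 {
				active = append(active, r-1)
			}
		}
		remaining = active

		// Allow always starts a job; Forbid skips the run while an
		// earlier job is still active.
		if policy == Allow || len(remaining) == 0 {
			remaining = append(remaining, duration)
		}
		if len(remaining) > peak {
			peak = len(remaining)
		}
	}
	return peak
}

func main() {
	// Case 1: slow jobs that take three ticks each.
	fmt.Println("slow, Allow: ", peakActiveJobs(Allow, 10, 3))
	fmt.Println("slow, Forbid:", peakActiveJobs(Forbid, 10, 3))
	// Case 2: jobs that never finish.
	fmt.Println("stuck, Allow: ", peakActiveJobs(Allow, 10, neverFinishes))
	fmt.Println("stuck, Forbid:", peakActiveJobs(Forbid, 10, neverFinishes))
}
```

In this model, "Allow" with stuck jobs grows by one job per tick without bound, while "Forbid" holds at a single job. The trade-off mentioned above is visible too: under "Forbid" the single stuck job blocks every later run, so nothing is ever collected for that trial.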

johnugeorge · 2019-06-22T10:12:07Z

@epa095 I am a little confused about 2. If the pods were not started, the cronjobs should fail (https://github.com/kubeflow/katib/blob/master/pkg/util/v1alpha2/metricscollector/metricscollector.go#L49), and failedJobsHistoryLimit is set, which determines the maximum number of failed jobs that are kept.

epa095 · 2019-06-22T13:16:01Z

"failedJobsHistoryLimit" refers to failed jobs, not jobs stuck in 'starting'.
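
The distinction can be shown with a small sketch. It is a simplified model and not the Kubernetes controller's implementation: `Job`, `JobState` and `pruneFailed` are hypothetical names, and the only behaviour modelled is that a failed-history limit prunes jobs that have actually reached the failed state and leaves everything else alone.

```go
package main

import "fmt"

// JobState is a simplified, hypothetical job status for this sketch.
type JobState string

const (
	Pending   JobState = "Pending" // pod never got past starting
	Failed    JobState = "Failed"
	Succeeded JobState = "Succeeded"
)

// Job is a stand-in for a job created by a cronjob.
type Job struct {
	Name  string
	State JobState
}

// pruneFailed keeps at most limit failed jobs (the newest ones, assuming
// jobs is ordered oldest first) and never touches jobs in other states.
func pruneFailed(jobs []Job, limit int) []Job {
	failed := 0
	for _, j := range jobs {
		if j.State == Failed {
			failed++
		}
	}
	toDrop := failed - limit // how many of the oldest failed jobs to remove
	kept := make([]Job, 0, len(jobs))
	for _, j := range jobs {
		if j.State == Failed && toDrop > 0 {
			toDrop--
			continue
		}
		kept = append(kept, j)
	}
	return kept
}

// countState reports how many jobs are in the given state.
func countState(jobs []Job, s JobState) int {
	n := 0
	for _, j := range jobs {
		if j.State == s {
			n++
		}
	}
	return n
}

func main() {
	jobs := []Job{
		{"collector-1", Failed}, {"collector-2", Pending},
		{"collector-3", Failed}, {"collector-4", Pending},
		{"collector-5", Failed}, {"collector-6", Pending},
	}
	kept := pruneFailed(jobs, 1)
	fmt.Println("failed kept:", countState(kept, Failed))
	fmt.Println("pending kept:", countState(kept, Pending))
}
```

However low the limit is set, the pending jobs remain, which is why the history limit alone did not stop the pile-up described earlier in the thread.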

hougangliu · 2019-06-26T23:38:38Z

/lgtm

hougangliu · 2019-06-26T23:39:26Z

/ok-to-test

johnugeorge · 2019-06-27T03:06:02Z

/retest

johnugeorge · 2019-06-27T04:40:00Z

/retest

gaocegege

/lgtm

johnugeorge · 2019-06-27T05:37:45Z

/approve

k8s-ci-robot · 2019-06-27T05:37:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gaocegege · 2019-06-27T05:46:24Z

@epa095

Thanks for your contribution! 🎉 👍

k8s-ci-robot requested review from andreyvelich and garganubhav June 18, 2019 11:43

k8s-ci-robot added needs-ok-to-test size/XS labels Jun 18, 2019

k8s-ci-robot assigned richardsliu Jun 18, 2019

johnugeorge mentioned this pull request Jun 21, 2019

if worker failed/pending, merics job will be created each min #667

Closed

epa095 force-pushed the patch-1 branch from 3f1b39d to 7471333 Compare June 22, 2019 09:39

k8s-ci-robot assigned hougangliu Jun 26, 2019

k8s-ci-robot added the lgtm label Jun 26, 2019

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Jun 26, 2019

gaocegege reviewed Jun 27, 2019


k8s-ci-robot assigned gaocegege Jun 27, 2019

k8s-ci-robot added the approved label Jun 27, 2019

k8s-ci-robot merged commit c81818d into kubeflow:master Jun 27, 2019
