Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reconcileTFJobs is always triggered even with no update #796

Merged
merged 1 commit into from
Aug 27, 2018

Conversation

jian-he
Copy link
Contributor

@jian-he jian-he commented Aug 24, 2018

Found that reconcileTFJobs continuously gets triggered even with no update.

This patch fixes that


This change is Reviewable

@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

@googlebot
Copy link

CLAs look good, thanks!

@TravisBuddy
Copy link

Travis tests have failed

Hey @jian-he,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

2nd Build

gometalinter --config=linter_config.json --vendor ./...
cmd/tf-operator.v2/app/options/options.go:1::warning: file is not goimported (goimports)

3rd Build

gometalinter --config=linter_config.json --vendor ./...
cmd/tf-operator.v2/app/options/options.go:1::warning: file is not goimported (goimports)

travis_time:end:387b475f:start=1535144156722657402,finish=1535144362874978215,duration=206152320813

@coveralls
Copy link

coveralls commented Aug 24, 2018

Coverage Status

Coverage decreased (-0.2%) to 58.112% when pulling 27480ec on jian-he:develop into a11a0ee on kubeflow:master.

@TravisBuddy
Copy link

Travis tests have failed

Hey @jian-he,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

2nd Build

gometalinter --config=linter_config.json --vendor ./...
cmd/tf-operator.v2/app/options/options.go:1::warning: file is not goimported (goimports)

3rd Build

gometalinter --config=linter_config.json --vendor ./...
cmd/tf-operator.v2/app/options/options.go:1::warning: file is not goimported (goimports)

travis_time:end:0fed45bd:start=1535144554726391036,finish=1535144724622344200,duration=169895953164

@gaocegege
Copy link
Member

/ok-to-test

Copy link
Member

@ScorpioCPH ScorpioCPH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ScorpioCPH
Copy link
Member

/lgtm
/approve

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rebase and try again

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaocegege, ScorpioCPH

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ScorpioCPH,gaocegege]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 554d750 into kubeflow:master Aug 27, 2018
@jlewi
Copy link
Contributor

jlewi commented Aug 27, 2018

@jian-he can you please explain this change? I think running reconcile periodically is highly desirable as this ensures the job is correctly running. For example, by calling reconcile periodically we ensure that if any pods are interrupted we will detect this and take corrective action.

I don't think we want to depend on getting an event for all possible changes as we could easily miss a change.

I think one of the design principles for K8s controllers is that they are level set. This means we want to periodically check that everything is in the expected state and if not take corrective action.

I think this PR prevents that because on each call to reconcile we won't do anything because the resource version hasn't changed, so we won't actually check that everything is in the expected state.

/cc @gaocegege @ScorpioCPH

@johnugeorge
Copy link
Member

I agree with @jlewi If we somehow miss taking complete action on an update event(may be an transient error or a return), this PR will skip further reconciliation.

@ScorpioCPH
Copy link
Member

I think if any pods' states are changed, the reversion of TFJob will be updated, so we catch this immediately.

@johnugeorge
Copy link
Member

old and current (https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/tfcontroller/tfjob.go#L66) are provided by the upstream. I feel, there is always a chance that controller may have missed updates or prior actions failed. In such cases, we won't take further actions since resource versions are same

@jian-he
Copy link
Contributor Author

jian-he commented Sep 2, 2018

@jlewi @gaocegege @johnugeorge @ScorpioCPH
I commented the reason why this change is made. Could you please take a look ?
#800 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants