reconcile should be triggered on update; even if no changes #800

Closed
jlewi opened this issue Aug 27, 2018 · 11 comments

@jlewi
Contributor

jlewi commented Aug 27, 2018

#796 made a change to prevent reconcile from being called on every update; now reconcile is only called if the resourceVersion changed.

I think the original behavior might be working as intended (WAI). Don't we want to trigger reconcile periodically? Reconcile ensures that everything is running correctly.

@jian-he Can you provide more explanation for the change?

/cc @ScorpioCPH @gaocegege
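
For illustration, a minimal sketch of the kind of resourceVersion guard that #796 describes, using stand-in types rather than the actual tf-operator code:

```go
package main

import "fmt"

// TFJob is a stand-in for the real CRD type; only the field relevant to this
// discussion is shown.
type TFJob struct {
	Name            string
	ResourceVersion string
}

// onUpdate mirrors the behavior after #796: reconcile is only enqueued when
// the resourceVersion actually changed between the old and new objects.
func onUpdate(oldJob, newJob TFJob, enqueue func(TFJob)) {
	if oldJob.ResourceVersion == newJob.ResourceVersion {
		// Periodic informer resyncs deliver updates with an unchanged
		// resourceVersion; with this guard they no longer reach reconcile.
		return
	}
	enqueue(newJob)
}

func main() {
	enqueue := func(j TFJob) { fmt.Printf("reconcile %s (rv=%s)\n", j.Name, j.ResourceVersion) }
	onUpdate(TFJob{"mnist", "42"}, TFJob{"mnist", "42"}, enqueue) // dropped: resync, no change
	onUpdate(TFJob{"mnist", "42"}, TFJob{"mnist", "43"}, enqueue) // enqueued: real update
}
```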

@gaocegege
Member

I think these are two different problems. Calling reconcile on every on-update request is not what we want, since reconcile is idempotent, whereas calling reconcile periodically is something we do want.

@jlewi
Contributor Author

jlewi commented Aug 29, 2018

@gaocegege Isn't update called periodically by the informer?

See #309. I thought the design was to configure the informer to periodically issue an Update event and then handle Update by calling reconcile.
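
For reference, a rough sketch of that design: the informer factory is given a non-zero resync period, and the update handler enqueues unconditionally. This is a fragment only; the factory path, receiver, and handler names are assumptions, not the actual tf-operator code.

```go
// Assumed names: a non-zero resync period makes the shared informer replay
// every cached TFJob through UpdateFunc on that interval, even without changes.
tfJobInformerFactory := tfjobinformers.NewSharedInformerFactory(tfJobClientset, 30*time.Second)
tfJobInformer := tfJobInformerFactory.Kubeflow().V1alpha2().TFJobs().Informer()

tfJobInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: tc.enqueueTFJob,
	// No resourceVersion comparison here, so resync-driven updates also end
	// up triggering reconcile; this is the pre-#796 behavior.
	UpdateFunc: func(old, new interface{}) { tc.enqueueTFJob(new) },
	DeleteFunc: tc.enqueueTFJob,
})
```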

@gaocegege
Member

gaocegege commented Aug 29, 2018

I thought we wanted to implement the logic in the operator instead of relying on the informer. We have a ReconcilerSyncLoopPeriod, although we have not implemented the logic for it yet. Personally, I think we should follow the TODO here to reduce the cost of reconciling.

Actually, I am not sure why we should run it periodically to make sure the state is correct. In our design, the state should be correct even if we do not sync it periodically; if it is not, there must be something wrong, and we should not call reconcile to hide the error.

But I have no strong opinion on it; if you think it is necessary, then I agree to revert the change.
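
For illustration, one way an operator-side loop driven by ReconcilerSyncLoopPeriod could look, re-enqueueing every known TFJob on each tick. The lister, queue, and option names here are assumptions, not existing tf-operator code.

```go
// Sketch only (assumed names): a periodic sync loop owned by the operator
// rather than by the informer's resync mechanism.
go wait.Until(func() {
	tfJobs, err := tc.tfJobLister.List(labels.Everything())
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("listing TFJobs for periodic sync: %v", err))
		return
	}
	for _, tfJob := range tfJobs {
		tc.enqueueTFJob(tfJob) // reconcile should be a no-op if nothing is wrong
	}
}, tc.option.ReconcilerSyncLoopPeriod, stopCh)
```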

@johnugeorge
Member

@gaocegege I feel that there can be cases of transient errors during reconcile, e.g. a call to the API server fails during one reconcile period and the error is returned before the event has been fully processed. In such a case, the state might end up wrong.

@jlewi
Contributor Author

jlewi commented Aug 29, 2018

> I thought we wanted to implement the logic in the operator instead of relying on the informer. We have a ReconcilerSyncLoopPeriod, although we have not implemented the logic for it yet. Personally, I think we should follow the TODO here to reduce the cost of reconciling.

I think we could potentially do both. The informer provides an upper bound for retries, but there might be cases where we know we want to retry sooner than that; e.g. if there was a temporary API error we might want to retry with exponential backoff. I think the controller control loop supports that because we can requeue an update after some time.

My understanding is that K8s controllers are designed to be level based, which means they shouldn't assume that events are properly observed.

So I think calling reconcile periodically is pretty important. Doing that basically ensures that given enough time all jobs will recover.

So I think we should revert #796. Any objection?

I'm more concerned about reliability than I am about performance. So until we have a strong indication that the CRD is using a lot of resources or is unable to scale, let's optimize for reliability.

/cc @richardsliu
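
For illustration, a sketch of the requeue-with-backoff behavior described above, using a client-go rate-limited workqueue; the controller and method names are assumptions.

```go
// Sketch (assumed names): a transient error re-enqueues the key with
// exponential backoff, while the informer resync still provides the upper
// bound on how long a missed event can go unreconciled.
func (tc *TFJobController) processNextWorkItem() bool {
	key, quit := tc.workQueue.Get()
	if quit {
		return false
	}
	defer tc.workQueue.Done(key)

	if err := tc.syncTFJob(key.(string)); err != nil {
		utilruntime.HandleError(fmt.Errorf("error syncing tfjob %q: %v", key, err))
		tc.workQueue.AddRateLimited(key) // retry sooner, with exponential backoff
		return true
	}
	tc.workQueue.Forget(key) // success: reset the backoff for this key
	return true
}
```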

@gaocegege
Member

OK, I will file a PR to revert the changes

/cc @jian-he

@gaocegege
Member

/assign @gaocegege

@jlewi
Contributor Author

jlewi commented Aug 30, 2018

Thanks!

@jian-he
Contributor

jian-he commented Aug 30, 2018

@jlewi @gaocegege @johnugeorge sorry for the late response.

If reconcileTFJob keeps getting called continuously, it will be invoked even for every completed job, e.g. here: if the job is completed, such calls to the api-server will still be made continuously. I think this brings unnecessary overhead to the api-server if the number of completed jobs keeps increasing. It is also possible that people add more heavyweight code in this block without knowing that it will still be called after the job is done, which makes it error prone.

And such a pattern does exist in pretty much every k8s controller: replicaset, daemonset, statefulset, deployment_controller...
(screenshot of the upstream resourceVersion check omitted)
and also in the kubeflow JobController.
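
For illustration, one way to address this concern without dropping updates at the informer would be an early return inside reconcile for jobs that have already reached a terminal state; the function and helper names here are assumptions, not the actual tf-operator code.

```go
// Sketch (assumed names): completed jobs skip the expensive reconcile work,
// so periodic resyncs stop generating pod/service/PDB calls for them.
func (tc *TFJobController) reconcileTFJobs(tfJob *tfv1alpha2.TFJob) error {
	if isSucceeded(tfJob.Status) || isFailed(tfJob.Status) {
		// Terminal state: nothing left to create, scale, or clean up here.
		return nil
	}
	// ... reconcile pods, services, and PDBs for a still-running job ...
	return nil
}
```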

@johnugeorge
Member

johnugeorge commented Sep 2, 2018

@jian-he In all k8s controllers like replicaset, daemonset, and statefulset, I see that their own update events are actually enqueued without resourceVersion checks (while external informers like the pod informers do have version checks), e.g.:
https://github.com/kubernetes/kubernetes/blob/5d8a79f2e1fe381d348f77fb314e13350de519df/pkg/controller/daemon/daemon_controller.go#L183
https://github.com/kubernetes/kubernetes/blob/7f23a743e8c23ac6489340bbb34fa6f1d392db9d/pkg/controller/replicaset/replica_set.go#L232 (See the comment given in updateRS)

I agree with you regarding the api calls. I think we have to revisit the PDB calls during reconcile.

When I went through the code, I had another question: we are populating tfjobNeedsSync based only on expectation counts. Don't we also need to check whether the spec changed? https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/tensorflow/controller.go#L297

@gaocegege
Correct me if I am wrong
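
For illustration, a spec-change check that could complement the expectation-based tfjobNeedsSync; the helper below is hypothetical, not existing tf-operator code.

```go
// Sketch (hypothetical helper): detect spec edits without reacting to pure
// status or metadata updates. Comparing metadata.generation could be an
// alternative where the CRD's generation is bumped on spec changes.
func tfJobSpecChanged(oldJob, newJob *tfv1alpha2.TFJob) bool {
	return !reflect.DeepEqual(oldJob.Spec, newJob.Spec)
}
```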

@jian-he
Contributor

jian-he commented Sep 5, 2018

@johnugeorge

> I think we have to revisit the PDB calls during reconcile.

OK, I opened #824 for this issue.

> Don't we also need to check whether the spec changed?

I think this is needed.
