reconcile should be triggered on update; even if no changes #800

jlewi · 2018-08-27T22:42:58Z

#796 made a change to prevent reconcile from being on every update; now reconcile is only called if the resourceVersion changed.

I think the original behavior might be WAI. I think we want to periodically trigger reconcile? Reconcile ensures that everything is running correctly.

@jian-he Can you provide more explanation for the change?

/cc @ScorpioCPH @gaocegege

gaocegege · 2018-08-28T03:05:33Z

I think it is two different problems. The reconcile on on-update request is not what we want, since it is
idempotent. While we want to call reconcile periodically.

jlewi · 2018-08-29T12:46:51Z

@gaocegege Isn't update called periodically by the informer?

See #309. I thought the design was to configure the informer to periodically issue an Update event and then handle Update by calling reconcile.

gaocegege · 2018-08-29T13:37:23Z

I thought we want to implement the logic in the operator instead of relying on the informer. We have a ReconcilerSyncLoopPeriod although we do not implement the logic about it. Personally, I think we should follow the TODO here to reduce the cost of reconciling.

Actually, I am not sure why we should run it periodically to make sure the state is correct, in our design, the state should be correct even if we do not sync it periodically. If it does not, there must be something wrong and we should not call reconcile to hide the error.

But I have no strong opinion on it, if you think it is necessary, then I agree to revert the change.

johnugeorge · 2018-08-29T15:58:47Z

@gaocegege I feel that there can be cases of transient errors during reconcile. Eg: call to api server fails during one reconcile period and error may be returned before the entire processing of the event. In such a case, state might get wrong

jlewi · 2018-08-29T22:56:40Z

I thought we want to implement the logic in the operator instead of relying on the informer. We have a ReconcilerSyncLoopPeriod although we do not implement the logic about it. Personally, I think we should follow the TODO here to reduce the cost of reconciling.

I think we could potentially do both. The informer provides an upper bound for retries. But there might be cases where we know we want to retry sooner then that; e.g. if there was a temporary API error we might want to retry with exponential backoff. I think the controller control loop supports that because we can requeue an update after some time.

My understanding is that K8s controllers are designed to be level based. Which means they shouldn't assume that events are properly observed.

So I think calling reconcile periodically is pretty important. Doing that basically ensures that given enough time all jobs will recover.

So I think we should revert #796. Any objection?

I'm more concerned about reliability then I am performance. So until we have a strong indication that the CRD is using a lot of resources or is unable to scale; lets optimize fo reliability.

/cc @richardsliu

gaocegege · 2018-08-30T02:35:35Z

OK, I will file a PR to revert the changes

/cc @jian-he

gaocegege · 2018-08-30T03:42:32Z

/assign @gaocegege

jlewi · 2018-08-30T03:52:14Z

Thanks!

jian-he · 2018-08-30T22:25:02Z

@jlewi @gaocegege @johnugeorge sorry for the late response.

If the reconcileTFJob continuously gets called, it will get invoked even for every completed job. e.g.
Here, if the job is completed, such calls to api-server will still be continuously made. I think it unnecessarily brings more overhead to the api-server, if the number of completed jobs keeps on increasing... It's also possible that people add more heavy-operation code in this block, without knowing that the code will be called after the job is done, which makes it error prone...

And such pattern does exist in pretty much every k8s controller, replicaset, daemonset, statefulset, deployment_controller...

and also the kubeflow JobController ,

johnugeorge · 2018-09-02T08:26:20Z

@jian-he In all k8s controllers like replicates, daemonset, statefulset, I see that own update events are actually enqueued without resource version checks(while external informers like pod informers have version checks)
eg:
https://github.com/kubernetes/kubernetes/blob/5d8a79f2e1fe381d348f77fb314e13350de519df/pkg/controller/daemon/daemon_controller.go#L183
https://github.com/kubernetes/kubernetes/blob/7f23a743e8c23ac6489340bbb34fa6f1d392db9d/pkg/controller/replicaset/replica_set.go#L232 (See the comment given in updateRS)

I agree with you regarding the api calls. I think, we have to revisit Pdb calls during the reconciles..

When i went through the code, I got a another question: We are populating tfjobNeedsSync based only on expectation counts. Don't we also need to check if spec got changed ? https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/tensorflow/controller.go#L297

@gaocegege
Correct me if I am wrong

jian-he · 2018-09-05T00:22:01Z

@johnugeorge

. I think, we have to revisit Pdb calls during the reconciles..

ok, I opened #824 for this issue.

Don't we also need to check if spec got changed ?

I think this is needed.

jlewi added priority/p1 area/0.3.0 area/tfjob labels Aug 27, 2018

k8s-ci-robot assigned gaocegege Aug 30, 2018

gaocegege mentioned this issue Aug 30, 2018

Revert "Avoid triggering reconcileTFJobs if no TFJob update (#796)" #805

Merged

k8s-ci-robot closed this as completed in #805 Aug 30, 2018

jian-he mentioned this issue Sep 2, 2018

reconcileTFJobs is always triggered even with no update #796

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reconcile should be triggered on update; even if no changes #800

reconcile should be triggered on update; even if no changes #800

jlewi commented Aug 27, 2018

gaocegege commented Aug 28, 2018

jlewi commented Aug 29, 2018

gaocegege commented Aug 29, 2018 •

edited

Loading

johnugeorge commented Aug 29, 2018

jlewi commented Aug 29, 2018

gaocegege commented Aug 30, 2018

gaocegege commented Aug 30, 2018

jlewi commented Aug 30, 2018

jian-he commented Aug 30, 2018 •

edited

Loading

johnugeorge commented Sep 2, 2018 •

edited

Loading

jian-he commented Sep 5, 2018 •

edited

Loading

reconcile should be triggered on update; even if no changes #800

reconcile should be triggered on update; even if no changes #800

Comments

jlewi commented Aug 27, 2018

gaocegege commented Aug 28, 2018

jlewi commented Aug 29, 2018

gaocegege commented Aug 29, 2018 • edited Loading

johnugeorge commented Aug 29, 2018

jlewi commented Aug 29, 2018

gaocegege commented Aug 30, 2018

gaocegege commented Aug 30, 2018

jlewi commented Aug 30, 2018

jian-he commented Aug 30, 2018 • edited Loading

johnugeorge commented Sep 2, 2018 • edited Loading

jian-he commented Sep 5, 2018 • edited Loading

gaocegege commented Aug 29, 2018 •

edited

Loading

jian-he commented Aug 30, 2018 •

edited

Loading

johnugeorge commented Sep 2, 2018 •

edited

Loading

jian-he commented Sep 5, 2018 •

edited

Loading