reconcile should be triggered on update; even if no changes #800
Comments
I think it is two different problems. The reconcile on on-update request is not what we want, since it is |
@gaocegege Isn't update called periodically by the informer? See #309. I thought the design was to configure the informer to periodically issue an Update event and then handle Update by calling reconcile. |
I thought we wanted to implement the logic in the operator instead of relying on the informer. We have a ReconcilerSyncLoopPeriod, although we have not implemented the logic for it. Personally, I think we should follow the TODO here to reduce the cost of reconciling. Actually, I am not sure why we should run it periodically to make sure the state is correct. In our design, the state should be correct even if we do not sync it periodically. If it is not, there must be something wrong, and we should not call reconcile to hide the error. But I have no strong opinion on it; if you think it is necessary, then I agree to revert the change. |
@gaocegege I feel that there can be cases of transient errors during reconcile. E.g., a call to the API server fails during one reconcile period, and the error is returned before the event has been fully processed. In such a case, the state might end up wrong. |
I think we could potentially do both. The informer provides an upper bound for retries. But there might be cases where we know we want to retry sooner than that; e.g., if there was a temporary API error, we might want to retry with exponential backoff. I think the controller control loop supports that because we can requeue an update after some time. My understanding is that K8s controllers are designed to be level based, which means they shouldn't assume that events are properly observed. So I think calling reconcile periodically is pretty important. Doing that basically ensures that, given enough time, all jobs will recover. So I think we should revert #796. Any objection? I'm more concerned about reliability than I am about performance. So until we have a strong indication that the CRD is using a lot of resources or is unable to scale, let's optimize for reliability. /cc @richardsliu |
OK, I will file a PR to revert the changes /cc @jian-he |
/assign @gaocegege |
Thanks! |
@jlewi @gaocegege @johnugeorge Sorry for the late response. If reconcileTFJob is called continuously, it will be invoked even for every completed job. e.g. And such a pattern does exist in pretty much every k8s controller: replicaset, daemonset, statefulset, deployment_controller... |
@jian-he In all k8s controllers like replicaset, daemonset, and statefulset, I see that a controller's own update events are actually enqueued without resource version checks (while external informers, like pod informers, have version checks). I agree with you regarding the API calls. I think we have to revisit the PDB calls during reconciles. When I went through the code, I had another question: we are populating tfjobNeedsSync based only on expectation counts. Don't we also need to check whether the spec changed? https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/tensorflow/controller.go#L297 @gaocegege |
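The question about spec changes can be sketched as follows. This is only an illustration of the idea, assuming a sync decision that combines satisfied expectations with a spec comparison; the `jobSpec` type and `needsSync` function are hypothetical and do not reflect the actual tf-operator code.

```go
package main

import (
	"fmt"
	"reflect"
)

// jobSpec is a hypothetical, simplified stand-in for a TFJob spec.
type jobSpec struct {
	Replicas map[string]int // replica type name -> desired count
}

// needsSync reports whether a job should be reconciled. It returns true
// when the controller's expectations are satisfied (the usual condition)
// or when the spec differs between the old and current object.
func needsSync(expectationsSatisfied bool, oldSpec, curSpec jobSpec) bool {
	return expectationsSatisfied || !reflect.DeepEqual(oldSpec, curSpec)
}

func main() {
	before := jobSpec{Replicas: map[string]int{"Worker": 2}}
	after := jobSpec{Replicas: map[string]int{"Worker": 4}}

	// Expectations are still pending and the spec is unchanged: no sync.
	fmt.Println(needsSync(false, before, before))
	// Expectations are still pending but the spec changed: sync.
	fmt.Println(needsSync(false, before, after))
}
```

Whether a spec change should bypass pending expectations is a design decision for the maintainers; the sketch only shows what such a check could look like.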
ok, I opened #824 for this issue.
I think this is needed. |
#796 made a change to prevent reconcile from being called on every update; now reconcile is only called if the resourceVersion changed.
I think the original behavior might be WAI. I think we want to periodically trigger reconcile? Reconcile ensures that everything is running correctly.
@jian-he Can you provide more explanation for the change?
/cc @ScorpioCPH @gaocegege
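The difference between the two behaviors discussed in this issue can be sketched like this. It assumes that a periodic informer resync delivers an update event in which the old and current objects carry the same resourceVersion; the `job` type and `enqueueOnUpdate` function are hypothetical and are not the tf-operator handler.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a TFJob with only the field needed here.
type job struct {
	Name            string
	ResourceVersion string
}

// enqueueOnUpdate reports whether an update event should trigger reconcile.
// With skipUnchanged set, events whose resourceVersion did not change
// (such as periodic resync events) are dropped. With it unset, every
// update event, including resyncs, triggers reconcile.
func enqueueOnUpdate(old, cur job, skipUnchanged bool) bool {
	if skipUnchanged && old.ResourceVersion == cur.ResourceVersion {
		return false
	}
	return true
}

func main() {
	// A resync event: old and current objects are identical.
	resync := job{Name: "mnist", ResourceVersion: "42"}

	fmt.Println(enqueueOnUpdate(resync, resync, true))  // false: resync is filtered out
	fmt.Println(enqueueOnUpdate(resync, resync, false)) // true: resync triggers reconcile
}
```

With the resourceVersion check in place, the periodic resync no longer reaches reconcile, which is the loss of level-based behavior that this issue is about.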