Why resync default is so large - 10hours #521
Comments
controller-runtime uses the underlying client-go caches, so the difference isn't there. The main difference is that we force people to actually write level-based controllers (see below). The problem was that nobody actually understood what this knob did -- it doesn't relist from the API server (which is what most people thought it did). All it does is force a re-reconcile of everything in the cache. Generally, you probably don't want to do this -- it's often expensive, it'll mask subtle bugs, and it's usually not necessary. It's actually probably fine to set this to never, except it might save you in production if you've got bugs in your reconcile logic because you didn't write a level-based controller. The only place, IMO, where it's actually reasonable to use this knob is if you're writing something that needs periodic reconciliation, like a horizontal pod autoscaler, but even then we'd recommend returning `RequeueAfter` from your reconciler instead. Does that mostly answer your question?
(as for why the flink operator sets it so low, I'm not sure, except that perhaps they mistakenly copied it from elsewhere. The spark operator probably just copied the sample-controller code, which copied something else, which copied something else, until nobody actually remembered what the option did in the first place)
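For illustration only (not part of the original comments): a minimal sketch of the `RequeueAfter` pattern mentioned above, using a hypothetical `PeriodicReconciler`. The exact `Reconcile` signature varies slightly across controller-runtime versions.

```go
package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PeriodicReconciler is a hypothetical reconciler that needs to run on a
// timer, e.g. to recompute metrics-driven state the way an autoscaler would.
type PeriodicReconciler struct {
	client.Client
}

func (r *PeriodicReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... fetch the object with r.Get(ctx, req.NamespacedName, obj),
	// compute the desired state, and apply any changes ...

	// Ask controller-runtime to reconcile this object again in 30 seconds,
	// independently of watch events or cache resyncs.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```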
@DirectXMan12 wrote:
This setting seems to get passed down to client-go's shared informers as their resync period.
So, at the very least, we know it lists from the cache. If we go back up the stack from the shared informer to the shared processor, the last place to check would be the go-routine that runs the resync loop in the delta FIFO -- and that loop only re-queues objects already in the local store as sync events; nothing there reaches out to the API server. From a concrete perspective, if we try this with a controller-runtime manager and a short sync period, we just see every cached object get re-reconciled, with no extra LIST calls against the API server. TL;DR: I think it may have relisted a long time ago, but it hasn't been that way for a while now.
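To make that behaviour concrete, here is a small self-contained client-go sketch (not from the thread; it assumes a cluster reachable via the default kubeconfig). With a 10-second resync period, the `UpdateFunc` fires for every cached Pod roughly every 10 seconds even when nothing changed on the server: the objects come from the local store, not from a relist.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// 10s resync period: deliberately short so the effect is easy to observe.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// On a resync, oldObj and newObj are the same cached object.
			pod := newObj.(*corev1.Pod)
			fmt.Println("update/resync for", pod.Namespace+"/"+pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run forever; watch the handler fire on every resync
}
```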
Hm that's unfortunate because I've observed the watch stalling on rare occasions, leaving the cache to become more and more out-of-date over time. The relist would have mitigated this automatically, but if there's no more relist, then it appears my only recourse is to reboot the process.
o_O the watch shouldn't stall out ever. I'm surprised we don't time out and then trigger the reconnect. That sounds like a bug.
If you have more data on that, I'd love to see it (and probably SIG API machinery, too, since it's probably a reflector bug)
My logs didn't have anything useful to report. I only see logs from controller-runtime and my own code. Will I get lower-level logs if I increase the log level? Do you have a recommended level to catch potential problems at the informer/reflector level without being too noisy to leave on all the time?
You'll have to hook up client-go's logging (klog) and raise its verbosity -- the lower-level reflector/informer logs don't come through controller-runtime's own log level.
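A minimal sketch of what that wiring can look like, assuming client-go is logging through klog (its default backend); the exact verbosity level here is a guess to tune:

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	// Register klog's flags (-v, -logtostderr, ...) on the default flag set
	// and raise verbosity; client-go's reflector and informer messages start
	// appearing at roughly v=3 to v=5.
	klog.InitFlags(nil)
	_ = flag.Set("v", "4")
	flag.Parse()

	// ... construct the manager and clients as usual; lower-level watch and
	// relist logs from client-go will now show up in the controller's output.
}
```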
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Just to close the loop on my report of the watch stalling: I ended up turning on detailed logging, but also at the same time updating to k8s-1.13.x-based operator-sdk. I haven't seen any watches stall since then.
(closing as per above response) |
Using the default 0 duration period should instruct the watchers to avoid re-listing from the cache and generating onUpdate calls. We should not need this, since our reconcile logic retries if an event fails to update wg peers and never lets go. kubernetes-sigs/controller-runtime#521 (comment) If a watcher fails to connect to the apiservers for a while, it should be able to sync once it is back online and generate enough events to sync our peers again: https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes
This patch adds the --sync-period flag back and passes the provided duration to controller-runtime, which in v2 now handles the relevant client-go configuration on our behalf (whereas in v1 we used client-go directly). We now use a default of 48 hours instead of the 10 minutes we previously used in v1, given the improvements and context around this setting that make relying on it no longer necessary. See also: kubernetes-sigs/controller-runtime#521
@DirectXMan12 can you please tell me how to set SyncPeriod to never? I don't see a way to set a time.Duration to "never" in Golang.
If not explicitly set on its CR, the HCO webhook consumes its TLS configuration from the OpenShift cluster-wide APIServer CR. For performance reasons it does not read it on each request to the HCO CR but consumes a cached representation. The cache was only refreshed by a controller based on an informer. We got reports that, due to the nature of changes in the APIServer CR, the connection to the API server itself could become stuck:

```
W1025 13:50:16.898592 1 reflector.go:424] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
E1025 13:50:16.898683 1 reflector.go:140] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1.APIServer: failed to list *v1.APIServer: Get "https://172.30.0.1:443/apis/config.openshift.io/v1/apiservers?resourceVersion=1572273": dial tcp 172.30.0.1:443: connect: connection refused
I1025 13:50:43.182360 1 trace.go:205] Trace[621733159]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:19.338) (total time: 23843ms):
Trace[621733159]: ---"Objects listed" error:<nil> 23843ms (13:50:43.182)
Trace[621733159]: [23.843677488s] [23.843677488s] END
I1025 13:50:43.716723 1 trace.go:205] Trace[255710357]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:12.260) (total time: 31456ms):
Trace[255710357]: ---"Objects listed" error:<nil> 31456ms (13:50:43.716)
Trace[255710357]: [31.45666834s] [31.45666834s] END
I1025 13:50:43.968506 1 trace.go:205] Trace[2001360213]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (25-Oct-2022 13:50:11.520) (total time: 32447ms):
Trace[2001360213]: ---"Objects listed" error:<nil> 32447ms (13:50:43.968)
Trace[2001360213]: [32.44785055s] [32.44785055s] END
```

On controller-runtime, the default SyncPeriod at which all the watched resources are refreshed is 10 hours (see kubernetes-sigs/controller-runtime#521 for the reasons), but that is too long for this specific use case. Let's ensure we read the APIServer CR at least once every minute. Make the logs less verbose.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2137896

Remove this once kubernetes-sigs/controller-runtime#2032 is properly addressed

Signed-off-by: Simone Tiraboschi <stirabos@redhat.com>
@akashjain971 Although you most probably found an answer to your question or have solved your problem in another way, I want to include the answer for everyone else who wonders:
In client-go, a resync period of 0 means resync is disabled. Therefore I guess passing 0 will turn the resync off. @DirectXMan12 Thank you very much for your elaborate answers in this issue. Very much appreciated!
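For reference, a hedged sketch of setting the manager-wide sync period. The `SyncPeriod` field shown here is the older `manager.Options` location; newer controller-runtime releases moved it under the cache options, so check your version's docs.

```go
package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// There is no literal "never" for a time.Duration: either leave SyncPeriod
	// unset (defaults to ~10h), pass a very long duration, or pass 0, which
	// client-go treats as "resync disabled".
	syncPeriod := 365 * 24 * time.Hour

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	})
	if err != nil {
		panic(err)
	}

	_ = mgr // register controllers and call mgr.Start(...) as usual
}
```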
This is an old issue, but @stefanprodan sent me here and it was really helpful. The sample-controller should really be updated to use a similar value to the 10 hours we see in controller-runtime: cc @dims
@alexellis could I persuade you to cut a PR for sample-controller? pretty please!
Maybe the sample-controller should be archived and its readme could point to kubebuilder and controller-runtime, to keep people from starting their controller journey with sample-controller. It served us well when we got started with controllers many years ago, but controller-runtime is so much better now.
The resync interval is 10 hours and there is a recommendation not to change it.
https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/manager/manager.go#L103
Several operators, like the Lyft Flink operator (controller-runtime based) and the Spark operator (sample-controller based), set it to 30s. Why such a large default value? Is there a difference between this and the caching mechanism used by client-go that justifies it? @DirectXMan12 any ideas?