Question about when topology updater runs #703
I had a glance at the source code of nfd-topology-updater and found that it aggregates all pod resources from the kubelet and updates the result through a gRPC client every 60s (see here).
When taking topology-aware scheduling into consideration, I'm a little confused about the way we update the NUMA topology. If we collect the node topology every minute rather than on pod informer events, what will the scheduler do if the topology status has changed during that minute? Let me give an example.
/cc @swatisehgal @fromanirh
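For reference, a minimal sketch (not the actual nfd-topology-updater code) of the polling pattern the question describes: dial the kubelet podresources socket and list the allocated resources on a fixed interval. The socket path and the 60-second interval are taken from the discussion and typical defaults, not authoritative values.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// The kubelet exposes the podresources API on a local unix socket
	// (path assumed here; it is configurable on the kubelet side).
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("cannot connect to the podresources socket: %v", err)
	}
	defer conn.Close()
	client := podresourcesapi.NewPodResourcesListerClient(conn)

	// Poll on a fixed interval. Anything allocated between two ticks stays
	// invisible until the next tick -- the staleness window this issue is about.
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
		cancel()
		if err != nil {
			log.Printf("podresources List failed: %v", err)
			continue
		}
		for _, pod := range resp.GetPodResources() {
			for _, cnt := range pod.GetContainers() {
				log.Printf("%s/%s/%s cpus=%v devices=%d",
					pod.GetNamespace(), pod.GetName(), cnt.GetName(),
					cnt.GetCpuIds(), len(cnt.GetDevices()))
			}
		}
		// A real updater would aggregate this into per-NUMA-zone availability
		// and publish it (e.g. as a NodeResourceTopology object).
	}
}
```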
Hi @Garrybest, your observation is totally correct. If a batch of pods is scheduled onto a node and the monitoring interval of the podresources API is too slow, the aggregated information will be stale. This is a known issue and is due to the fact that the podresources API needs to be polled. Our long-term plan is to introduce a watch endpoint in the podresources API to make it event-based, which will allow us to obtain up-to-date information on the available resources, and to enable that capability in NFD. We have had discussions in SIG Node regarding this and are currently working on the design. For now, you can play around with https://github.com/k8stopologyawareschedwg/resource-topology-exporter (another exporter, which can replace NFD), where we have smart polling enabled by means of watching for changes in the kubelet state directory. Please refer to the PRs related to this:
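A rough illustration of the "smart polling" idea mentioned above: in addition to a periodic refresh, watch the kubelet state directory and trigger an update as soon as something changes there. The watched path and the fsnotify-based approach are assumptions for illustration; RTE's actual implementation may differ.

```go
package main

import (
	"log"
	"time"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// cpu_manager_state / memory_manager_state checkpoints live here (assumed path).
	if err := watcher.Add("/var/lib/kubelet"); err != nil {
		log.Fatal(err)
	}

	ticker := time.NewTicker(60 * time.Second) // fallback periodic refresh
	defer ticker.Stop()

	for {
		select {
		case ev := <-watcher.Events:
			// A write to the kubelet state files likely means an allocation
			// changed, so refresh immediately instead of waiting for the tick.
			if ev.Op&fsnotify.Write != 0 || ev.Op&fsnotify.Create != 0 {
				log.Printf("kubelet state changed (%s), refreshing topology now", ev.Name)
				// pollPodResourcesAndPublish() // hypothetical helper
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		case <-ticker.C:
			log.Print("periodic refresh")
			// pollPodResourcesAndPublish() // hypothetical helper
		}
	}
}
```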
/cc jlojosnegros
@swatisehgal Do you have a KEP for the watch endpoint? How hard would it be to introduce the PRs you stated above into NFD?
@zvonkok we have plans to update nfd-topology-updater, add missing features, and propose new ones. This is planned for early 2022.
This is very correct. The updater (either nfd-topology-updater or the RTE described here) should be able to update as soon as possible. However, a very aggressive setting may put too much pressure on the cluster, and this is a design constraint which can hardly be solved by the implementation: the more frequent the updates, the more traffic, and thus load, on the updater/cluster/apiserver/etcd. This means that more timely updates are part of the solution, not the complete solution. The cluster admin (or an operator/agent on their behalf) will have to tune the updater and the frequency of the updates depending on the specific workload. Another part of the solution is how the scheduler consumes this data and how it behaves with stale information. This is a problem quite similar to the one the default scheduler faces. We have some plans to review and experiment with the scheduling side once we reach a stable state on the updater side (this of course includes
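One way to picture the pressure/timeliness trade-off described above: even with event-driven triggers, updates can be coalesced behind a rate limiter so the apiserver sees at most one write per interval. This is a sketch, not NFD or RTE code; the 10-second interval and the channel-based trigger are arbitrary choices for the example.

```go
package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// A buffered channel of size 1 plus a non-blocking send coalesces
	// bursts of events into a single pending update.
	trigger := make(chan struct{}, 1)

	// Stand-in for an event source (file watcher, poll loop, ...).
	go func() {
		for {
			select {
			case trigger <- struct{}{}:
			default: // an update is already pending; coalesce
			}
			time.Sleep(500 * time.Millisecond) // bursty source
		}
	}()

	// Publish at most one update every 10 seconds, however bursty the source.
	limiter := rate.NewLimiter(rate.Every(10*time.Second), 1)
	for range trigger {
		if err := limiter.Wait(context.Background()); err != nil {
			log.Fatal(err)
		}
		log.Print("publishing node topology update") // hypothetical publish step
	}
}
```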
Thanks for the explanation. I still have two questions. @swatisehgal @fromanirh
1. It seems like we don't want nfd-topology-updater to watch pods via the apiserver (for example with an informer), right?
2. I think the essence of this issue is that the scheduler selects a node based on the topology information, but it does not select the specific CPUs. The associated topology management, like selecting the NUMA node, is done by the kubelet, not by kube-scheduler. So we may still have some inconsistency problems. Do we have any ideas?
We totally don't want to watch the apiserver, for scalability reasons, if we can help it. And we can: we can either extend the podresources API to notify about resource allocation changes, or get notifications from the runtime.
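Purely illustrative sketch of the event-based model referred to here: if the podresources API (or the runtime) could push allocation-change notifications, the updater would react immediately instead of polling. No such streaming RPC exists in the kubelet today; the interface, event type, and fake source below are invented for the example.

```go
package main

import (
	"context"
	"log"
	"time"
)

// AllocationEvent is a hypothetical "something changed on this node" signal.
type AllocationEvent struct {
	PodNamespace, PodName string
	CPUIDs                []int64
}

// allocationWatcher is the hypothetical streaming API the discussion refers to.
type allocationWatcher interface {
	Watch(ctx context.Context) (<-chan AllocationEvent, error)
}

// fakeWatcher stands in for the kubelet/runtime side and just emits one event.
type fakeWatcher struct{}

func (fakeWatcher) Watch(ctx context.Context) (<-chan AllocationEvent, error) {
	ch := make(chan AllocationEvent, 1)
	go func() {
		defer close(ch)
		ch <- AllocationEvent{PodNamespace: "default", PodName: "demo", CPUIDs: []int64{2, 3}}
		<-ctx.Done()
	}()
	return ch, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	var w allocationWatcher = fakeWatcher{}
	events, err := w.Watch(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// React to each allocation change as it happens -- no staleness window
	// beyond event-delivery latency.
	for ev := range events {
		log.Printf("allocation changed: %s/%s now pinned to CPUs %v; republishing topology",
			ev.PodNamespace, ev.PodName, ev.CPUIDs)
	}
}
```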
We cannot fully solve this problem: we cannot guarantee that the scheduler always has fresh information. We will need to improve the scheduler side, and basically each cluster has to be tuned appropriately to find the right balance between the update frequency and the scheduling effectiveness. We will always have tradeoffs in the current architecture, which is in turn heavily driven by the Kubernetes architecture.
To some extent this is an open question. Further development and refinement are planned for 2022, once we reach a good state on the updater side. RTE is maturing and we will soon push the changes to NFD, while we keep working on adding the missing API to the kubelet.
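On the scheduler-consumption side discussed above, a simplified, non-authoritative sketch of the kind of check a NUMA-aware filter performs: given the per-zone availability exported by the updater, test whether a pod's requests fit entirely within one zone. The structs are stand-ins, not the real NodeResourceTopology CRD types, and the real scheduler plugin is considerably richer.

```go
package main

import "fmt"

// Zone is an illustrative per-NUMA-node view of available resources.
type Zone struct {
	Name      string
	Available map[string]int64 // e.g. "cpu" in cores, "example.com/nic" in count
}

// fitsSingleZone reports whether the request fits entirely inside one zone,
// which is roughly what a single-numa-node style filter needs to know.
func fitsSingleZone(zones []Zone, request map[string]int64) (string, bool) {
	for _, z := range zones {
		ok := true
		for name, qty := range request {
			if z.Available[name] < qty {
				ok = false
				break
			}
		}
		if ok {
			return z.Name, true
		}
	}
	return "", false
}

func main() {
	// Availability as last exported by the updater; if it is a minute old,
	// the decision below may already be wrong -- the staleness problem
	// discussed in this issue.
	zones := []Zone{
		{Name: "node-0", Available: map[string]int64{"cpu": 2, "example.com/nic": 0}},
		{Name: "node-1", Available: map[string]int64{"cpu": 6, "example.com/nic": 1}},
	}
	zone, ok := fitsSingleZone(zones, map[string]int64{"cpu": 4, "example.com/nic": 1})
	fmt.Println(zone, ok) // node-1 true
}
```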