Question about when topology updater runs #703

Closed
Garrybest opened this issue Dec 19, 2021 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@Garrybest
Member

I had a glance at the source code of nfd-topology-updater and found that it aggregates all pod resources from the kubelet and updates the result through a gRPC client every 60s; see here.

When taking topology-aware scheduling into consideration, I'm a little confused about the way we update the NUMA topology. If we collect node topology every minute rather than on pod informer events, what will the scheduler do if the topology status changes during that minute? Let me give an example.

  1. The nfd-client collects the node topology zone info and updates it through nfd-master.
  2. Based on the zone info, the scheduler tries to schedule a pod to this node.
  3. Now the next collection (after about 60s) has not happened yet, so the zone info does not change.
  4. The scheduler still uses the old zone info for scheduling; will there be a consistency issue?
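
For reference, the polling pattern under discussion boils down to something like the sketch below. This is not the actual nfd-topology-updater code; the socket path and the 60s interval are assumptions based on common kubelet defaults and on the interval mentioned above.

```go
// Minimal sketch: poll the kubelet podresources API on a fixed interval.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Default kubelet podresources socket path (an assumption; adjust for your node).
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("cannot connect to the podresources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ticker := time.NewTicker(60 * time.Second) // the polling interval discussed in this issue
	defer ticker.Stop()

	for {
		resp, err := client.List(context.TODO(), &podresourcesapi.ListPodResourcesRequest{})
		if err != nil {
			log.Printf("podresources List failed: %v", err)
		} else {
			// Pods admitted by the kubelet between two ticks are invisible here,
			// which is exactly the staleness window described in this issue.
			for _, pod := range resp.GetPodResources() {
				for _, cnt := range pod.GetContainers() {
					log.Printf("%s/%s container %s pinned CPUs: %v",
						pod.GetNamespace(), pod.GetName(), cnt.GetName(), cnt.GetCpuIds())
				}
			}
		}
		<-ticker.C // wait for the next collection
	}
}
```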
@Garrybest
Member Author

/cc @swatisehgal @fromanirh

@swatisehgal
Contributor

swatisehgal commented Dec 20, 2021

Hi @Garrybest, your observation is totally correct. If a batch of pods is being scheduled on nodes and the monitoring interval of the podresources API is too slow, the aggregated information will be stale. This is a known issue and is due to the fact that the podresources API needs to be polled. Our long-term plan is to introduce a watch endpoint in the podresources API to make it event-based, which will allow obtaining up-to-date information about the available resources, and to enable that capability in NFD. We have had discussions in SIG Node regarding this and are currently working on the design.

For now, you can play around with https://github.com/k8stopologyawareschedwg/resource-topology-exporter (which is another exporter and can replace NFD), where we have smart polling enabled by means of watching for changes in the kubelet state directory. Please refer to the PRs related to this (see also the sketch after this list):

  1. add support for notification file k8stopologyawareschedwg/resource-topology-exporter#54
  2. Throttle RTE events k8stopologyawareschedwg/resource-topology-exporter#85
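
As a rough illustration of the "smart polling" idea (not RTE's actual implementation; the watched path is an assumption), a filesystem watcher on the kubelet state directory can trigger a re-scan as soon as something changes, instead of waiting for the next fixed-interval poll:

```go
// Sketch: react to kubelet state changes instead of polling on a timer.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatalf("cannot create watcher: %v", err)
	}
	defer watcher.Close()

	// Directory where the kubelet keeps its checkpoint/state files (assumed path).
	if err := watcher.Add("/var/lib/kubelet"); err != nil {
		log.Fatalf("cannot watch kubelet state dir: %v", err)
	}

	for {
		select {
		case event := <-watcher.Events:
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				log.Printf("kubelet state changed (%s), triggering topology update", event.Name)
				// Here: run the podresources List + CR update path shown earlier.
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```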

/cc jlojosnegros

@zvonkok
Contributor

zvonkok commented Dec 21, 2021

@swatisehgal Do you have a KEP for the watch endpoint? How hard would it be to introduce the PRs you stated above into NFD?

@ffromani
Contributor

@zvonkok we have plans to update the nfd-topology-updater, add the missing features, and propose the new ones. This is planned for early 2022.

@ffromani
Contributor

> Hi @Garrybest, your observation is totally correct. If a batch of pods is being scheduled on nodes and the monitoring interval of the podresources API is too slow, the aggregated information will be stale. This is a known issue and is due to the fact that the podresources API needs to be polled. Our long-term plan is to introduce a watch endpoint in the podresources API to make it event-based, which will allow obtaining up-to-date information about the available resources, and to enable that capability in NFD. We have had discussions in SIG Node regarding this and are currently working on the design.
>
> For now, you can play around with https://github.com/k8stopologyawareschedwg/resource-topology-exporter (which is another exporter and can replace NFD), where we have smart polling enabled by means of watching for changes in the kubelet state directory. Please refer to the PRs related to this:
>
> 1. [add support for notification file (k8stopologyawareschedwg/resource-topology-exporter#54)](https://github.com/k8stopologyawareschedwg/resource-topology-exporter/pull/54)
> 2. [Throttle RTE events (k8stopologyawareschedwg/resource-topology-exporter#85)](https://github.com/k8stopologyawareschedwg/resource-topology-exporter/pull/85)

This is very correct. The updater - either nfd-topology-updater or the RTE described here - should be able to update as soon as possible. However, a very aggressive setting may put too much pressure on the cluster, and this is a design constraint that can hardly be solved by the implementation: the more frequent the updates, the more traffic, and thus the more load on the updater/cluster/apiserver/etcd.

This means that more timely updates are part of the solution, not the complete solution. The cluster admin (or an operator/agent on their behalf) will have to tune the updater and the frequency of the updates depending on the specific workload.

Another part of the solution is how the scheduler consumes this data and how it behaves with stale information. This is a problem quite similar to the one the default scheduler faces. We have plans to review and experiment with the scheduling side once we reach a stable state on the updater side (this of course includes nfd-topology-updater, bringing it up to speed with the latest features and fixes we are experimenting with and testing on the RTE).
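
One way to picture the throttling trade-off described above is the following purely illustrative sketch, in the spirit of the "Throttle RTE events" PR linked earlier; the 5-second interval is an arbitrary example of the knob a cluster admin would tune:

```go
// Sketch: cap the rate of CR updates no matter how fast change events arrive.
package main

import (
	"log"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	events := make(chan struct{}, 128) // change notifications from the watcher

	// Simulated burst of notifications.
	go func() {
		for i := 0; i < 10; i++ {
			events <- struct{}{}
			time.Sleep(100 * time.Millisecond)
		}
		close(events)
	}()

	// At most one pushed update every 5 seconds, regardless of the event rate.
	limiter := rate.NewLimiter(rate.Every(5*time.Second), 1)

	for range events {
		if !limiter.Allow() {
			// Skip this event; the next allowed update publishes a full
			// snapshot anyway, so information is only delayed, not lost.
			continue
		}
		log.Println("pushing NodeResourceTopology update")
	}
}
```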

@Garrybest
Member Author

Garrybest commented Dec 21, 2021

Thanks for the explanation. I still have two questions. @swatisehgal @fromanirh

  1. How do we get the events (pod creation/deletion) if we don't watch the apiserver?

It seems like we don't want nfd-topology-updater to watch the apiserver; it only interacts with the kubelet now.

  2. Even if we watch the events, the scheduler still has to face stale information; how do we fix this?

Well, the watch could update the CR ASAP, but there may still be a consistency issue. When the scheduler selects a node and assigns a pod to it, the CR is not updated until the watch fires and the update completes, so the scheduler does not have the latest topology information. This is quite similar to what @fromanirh said:

> Another part of the solution is how the scheduler consumes this data and how it behaves with stale information. This is a problem quite similar to the one the default scheduler faces.

I think the essence of this issue is that the scheduler selects a node based on the topology information, but it does not select the specific CPUs. The associated topology management, like selecting the NUMA node, is done by the kubelet, not by kube-scheduler. So we may still have some consistency problems. Do we have any ideas?

@ffromani
Contributor

> Thanks for the explanation. I still have two questions. @swatisehgal @fromanirh
>
> 1. How do we get the events (pod creation/deletion) if we don't watch the apiserver?
>
> It seems like we don't want nfd-topology-updater to watch the apiserver; it only interacts with the kubelet now.

We totally don't want to watch the apiserver for scalability reasons if we can help it. And we can: we can either extend the podresources API to notify about resource allocation changes, or get notifications from the runtime.
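
Purely as a sketch of that direction: none of these types exist in the podresources API today, they are hypothetical names used only to illustrate "notify instead of poll". An event-based client could look roughly like this:

```go
// Hypothetical event-based counterpart to the polled PodResourcesLister.
package main

import "context"

// ResourceAllocationEvent is a hypothetical notification emitted whenever a
// container's exclusive resources (CPUs, devices) are allocated or released.
type ResourceAllocationEvent struct {
	PodNamespace string
	PodName      string
	Container    string
	CPUIDs       []int64
	Deleted      bool
}

// PodResourcesWatcher is a hypothetical interface: instead of being polled,
// it streams allocation changes to the updater.
type PodResourcesWatcher interface {
	Watch(ctx context.Context) (<-chan ResourceAllocationEvent, error)
}

// consume drains events and triggers a topology re-export on each change,
// replacing the fixed 60s polling loop.
func consume(ctx context.Context, w PodResourcesWatcher, export func()) error {
	events, err := w.Watch(ctx)
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case _, ok := <-events:
			if !ok {
				return nil
			}
			export() // push an updated NodeResourceTopology CR
		}
	}
}

func main() {} // sketch only; no concrete Watch implementation is provided here
```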

> 2. Even if we `watch` the events, the scheduler still has to face stale information; how do we fix this?
>
> Well, the watch could update the CR ASAP, but there may still be a consistency issue. When the scheduler selects a node and assigns a pod to it, the CR is not updated until the watch fires and the update completes, so the scheduler does not have the latest topology information. This is quite similar to what @fromanirh said.

We cannot fully solve this problem; we cannot guarantee the scheduler always has fresh information. We will need to improve the scheduler side, and basically each cluster has to be tuned appropriately to find the right balance between update frequency and scheduling effectiveness. We will always have tradeoffs in the current architecture, which is in turn heavily driven by the Kubernetes architecture.

> > Another part of the solution is how the scheduler consumes this data and how it behaves with stale information. This is a problem quite similar to the one the default scheduler faces.
>
> I think the essence of this issue is that the scheduler selects a node based on the topology information, but it does not select the specific CPUs. The associated topology management, like selecting the NUMA node, is done by the kubelet, not by kube-scheduler. So we may still have some consistency problems. Do we have some ideas?

To some extent this is an open question. Further development and refinement is planned for 2022, once we reach a good state on the updater side. RTE is maturing and we will soon push the changes to NFD, while we keep working on adding the missing API to the kubelet.
I can only speak for myself, but I think further work and experimentation on the scheduling side is not only possible in the meantime, but warmly welcomed.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 20, 2022