Question about when topology updater runs #703

Closed
Garrybest opened this issue Dec 19, 2021 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@Garrybest
Member

I had a glance at the source code of nfd-topology-updater and found that it aggregates all pod resources from the kubelet and updates the result through a gRPC client every 60s; see here.

When taking topology-aware scheduling into consideration, I'm a little confused about the way we update the NUMA topology. If we collect node topology every minute rather than on pod informer events, what will the scheduler do if the topology status changes during that minute? Let me give an example.

  1. The nfd-client collects the node topology zone info and updates it through nfd-master.
  2. Based on the zone info, the scheduler tries to schedule a pod to this node.
  3. Now the next collection (after about 60s) has not happened yet, so the zone info does not change.
  4. The scheduler still uses the old zone info for scheduling; will there be a consistency issue?
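
For reference, the polling pattern under discussion boils down to something like the sketch below. This is not the actual nfd-topology-updater code; the socket path and the 60s interval are assumptions based on common kubelet defaults and on the interval mentioned above.

```go
// Minimal sketch: poll the kubelet podresources API on a fixed interval.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// Default kubelet podresources socket path (an assumption; adjust for your node).
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("cannot connect to the podresources socket: %v", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ticker := time.NewTicker(60 * time.Second) // the polling interval discussed in this issue
	defer ticker.Stop()

	for {
		resp, err := client.List(context.TODO(), &podresourcesapi.ListPodResourcesRequest{})
		if err != nil {
			log.Printf("podresources List failed: %v", err)
		} else {
			// Pods admitted by the kubelet between two ticks are invisible here,
			// which is exactly the staleness window described in this issue.
			for _, pod := range resp.GetPodResources() {
				for _, cnt := range pod.GetContainers() {
					log.Printf("%s/%s container %s pinned CPUs: %v",
						pod.GetNamespace(), pod.GetName(), cnt.GetName(), cnt.GetCpuIds())
				}
			}
		}
		<-ticker.C // wait for the next collection
	}
}
```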
@Garrybest
Member Author

/cc @swatisehgal @fromanirh

@swatisehgal
Contributor

swatisehgal commented Dec 20, 2021

Hi @Garrybest, your observation is totally correct. If a batch of pods is being scheduled on nodes and the monitoring interval of the podresources API is too slow, the aggregated information will be stale. This is a known issue and is due to the fact that the podresources API needs to be polled. Our long-term plan is to introduce a watch endpoint in the podresources API to make it event-based, which will allow obtaining up-to-date information about the available resources, and to enable that capability in NFD. We have had discussions in SIG Node regarding this and are currently working on the design.

For now, you can play around with https://github.com/k8stopologyawareschedwg/resource-topology-exporter (which is another exporter and can replace NFD), where we have smart polling enabled by means of watching for changes in the kubelet state directory. Please refer to the PRs related to this (see also the sketch after this list):

  1. add support for notification file k8stopologyawareschedwg/resource-topology-exporter#54
  2. Throttle RTE events k8stopologyawareschedwg/resource-topology-exporter#85
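
As a rough illustration of the "smart polling" idea (not RTE's actual implementation; the watched path is an assumption), a filesystem watcher on the kubelet state directory can trigger a re-scan as soon as something changes, instead of waiting for the next fixed-interval poll:

```go
// Sketch: react to kubelet state changes instead of polling on a timer.
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatalf("cannot create watcher: %v", err)
	}
	defer watcher.Close()

	// Directory where the kubelet keeps its checkpoint/state files (assumed path).
	if err := watcher.Add("/var/lib/kubelet"); err != nil {
		log.Fatalf("cannot watch kubelet state dir: %v", err)
	}

	for {
		select {
		case event := <-watcher.Events:
			if event.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				log.Printf("kubelet state changed (%s), triggering topology update", event.Name)
				// Here: run the podresources List + CR update path shown earlier.
			}
		case err := <-watcher.Errors:
			log.Printf("watch error: %v", err)
		}
	}
}
```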

/cc jlojosnegros

@zvonkok
Contributor

zvonkok commented Dec 21, 2021

@swatisehgal Do you have a KEP for the watch endpoint? How hard would it be to introduce the PRs you stated above into NFD?

@ffromani
Contributor

@zvonkok we have plans to update the nfd-topology-updater, add the missing features, and propose the new ones. This is planned for early 2022.

@ffromani
Contributor

> Hi @Garrybest, your observation is totally correct. If a batch of pods is being scheduled on nodes and the monitoring interval of the podresources API is too slow, the aggregated information will be stale. This is a known issue and is due to the fact that the podresources API needs to be polled. Our long-term plan is to introduce a watch endpoint in the podresources API to make it event-based, which will allow obtaining up-to-date information about the available resources, and to enable that capability in NFD. We have had discussions in SIG Node regarding this and are currently working on the design.
>
> For now, you can play around with https://github.com/k8stopologyawareschedwg/resource-topology-exporter (which is another exporter and can replace NFD), where we have smart polling enabled by means of watching for changes in the kubelet state directory. Please refer to the PRs related to this:
>
> 1. [add support for notification file (k8stopologyawareschedwg/resource-topology-exporter#54)](https://github.com/k8stopologyawareschedwg/resource-topology-exporter/pull/54)
> 2. [Throttle RTE events (k8stopologyawareschedwg/resource-topology-exporter#85)](https://github.com/k8stopologyawareschedwg/resource-topology-exporter/pull/85)

This is very correct. The updater - either nfd-topology-updater or the RTE described here - should be able to update as soon as possible. However, a very aggressive setting may put too much pressure on the cluster, and this is a design constraint that can hardly be solved by the implementation: the more frequent the updates, the more traffic, and thus the more load on the updater/cluster/apiserver/etcd.

This means that more timely updates are part of the solution, not the complete solution. The cluster admin (or an operator/agent on their behalf) will have to tune the updater and the frequency of the updates depending on the specific workload.

Another part of the solution is how the scheduler consumes this data and how it behaves with stale information. This is a problem quite similar to the one the default scheduler faces. We have plans to review and experiment with the scheduling side once we reach a stable state on the updater side (this of course includes nfd-topology-updater, bringing it up to speed with the latest features and fixes we are experimenting with and testing on the RTE).
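
One way to picture the throttling trade-off described above is the following purely illustrative sketch, in the spirit of the "Throttle RTE events" PR linked earlier; the 5-second interval is an arbitrary example of the knob a cluster admin would tune:

```go
// Sketch: cap the rate of CR updates no matter how fast change events arrive.
package main

import (
	"log"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	events := make(chan struct{}, 128) // change notifications from the watcher

	// Simulated burst of notifications.
	go func() {
		for i := 0; i < 10; i++ {
			events <- struct{}{}
			time.Sleep(100 * time.Millisecond)
		}
		close(events)
	}()

	// At most one pushed update every 5 seconds, regardless of the event rate.
	limiter := rate.NewLimiter(rate.Every(5*time.Second), 1)

	for range events {
		if !limiter.Allow() {
			// Skip this event; the next allowed update publishes a full
			// snapshot anyway, so information is only delayed, not lost.
			continue
		}
		log.Println("pushing NodeResourceTopology update")
	}
}
```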

@Garrybest
Member Author

Garrybest commented Dec 21, 2021

Thanks for the explanation. I still have two questions. @swatisehgal @fromanirh

  1. How do we get the events (pod creation/deletion) if we don't watch the apiserver?

It seems like we don't want nfd-topology-updater to watch the apiserver; it only interacts with the kubelet now.

  2. Even if we watch the events, the scheduler still has to face stale information; how do we fix this?

Well, the watch could update the CR ASAP, but there may still be a consistency issue. When the scheduler selects a node and assigns a pod to it, the CR is not updated until the watch fires and the update completes, so the scheduler does not have the latest topology information. This is quite similar to what @fromanirh said:

> Another part of the solution is how the scheduler consumes this data and how it behaves with stale information. This is a problem quite similar to the one the default scheduler faces.

I think the essence of this issue is that the scheduler selects a node based on the topology information, but it does not select the specific CPUs. The associated topology management, like selecting the NUMA node, is done by the kubelet, not by kube-scheduler. So we may still have some consistency problems. Do we have any ideas?

@ffromani
Contributor

> Thanks for the explanation. I still have two questions. @swatisehgal @fromanirh
>
> 1. How do we get the events (pod creation/deletion) if we don't watch the apiserver?
>
> It seems like we don't want nfd-topology-updater to watch the apiserver; it only interacts with the kubelet now.

We totally don't want to watch the apiserver for scalability reasons if we can help it. And we can: we can either extend the podresources API to notify about resource allocation changes, or get notifications from the runtime.
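
Purely as a sketch of that direction: none of these types exist in the podresources API today, they are hypothetical names used only to illustrate "notify instead of poll". An event-based client could look roughly like this:

```go
// Hypothetical event-based counterpart to the polled PodResourcesLister.
package main

import "context"

// ResourceAllocationEvent is a hypothetical notification emitted whenever a
// container's exclusive resources (CPUs, devices) are allocated or released.
type ResourceAllocationEvent struct {
	PodNamespace string
	PodName      string
	Container    string
	CPUIDs       []int64
	Deleted      bool
}

// PodResourcesWatcher is a hypothetical interface: instead of being polled,
// it streams allocation changes to the updater.
type PodResourcesWatcher interface {
	Watch(ctx context.Context) (<-chan ResourceAllocationEvent, error)
}

// consume drains events and triggers a topology re-export on each change,
// replacing the fixed 60s polling loop.
func consume(ctx context.Context, w PodResourcesWatcher, export func()) error {
	events, err := w.Watch(ctx)
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case _, ok := <-events:
			if !ok {
				return nil
			}
			export() // push an updated NodeResourceTopology CR
		}
	}
}

func main() {} // sketch only; no concrete Watch implementation is provided here
```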

> 2. Even if we `watch` the events, the scheduler still has to face stale information; how do we fix this?
>
> Well, the watch could update the CR ASAP, but there may still be a consistency issue. When the scheduler selects a node and assigns a pod to it, the CR is not updated until the watch fires and the update completes, so the scheduler does not have the latest topology information. This is quite similar to what @fromanirh said.

We cannot fully solve this problem; we cannot guarantee the scheduler always has fresh information. We will need to improve the scheduler side, and basically each cluster has to be tuned appropriately to find the right balance between update frequency and scheduling effectiveness. We will always have tradeoffs in the current architecture, which is in turn heavily driven by the Kubernetes architecture.

> > Another part of the solution is how the scheduler consumes this data and how it behaves with stale information. This is a problem quite similar to the one the default scheduler faces.
>
> I think the essence of this issue is that the scheduler selects a node based on the topology information, but it does not select the specific CPUs. The associated topology management, like selecting the NUMA node, is done by the kubelet, not by kube-scheduler. So we may still have some consistency problems. Do we have some ideas?

To some extent this is an open question. Further development and refinement is planned for 2022, once we reach a good state on the updater side. RTE is maturing and we will soon push the changes to NFD, while we keep working on adding the missing API to the kubelet.
I can only speak for myself, but I think further work and experimentation on the scheduling side is not only possible in the meantime, but warmly welcomed.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 20, 2022