Incorrect CPU topology on Single NUMA and Multi socket system leads to performance degradation for POD #2798
Comments
@hanamantagoudvk are you able to check the behavior with the code from the PR mentioned above?
@iwankgb: I have not tested with the above PR. Have you tested the above code changes on a single-NUMA, multi-socket system? Do the changes work there as they did before?
@hanamantagoudvk no, but it was tested on other complex configurations. If you can provide a snapshot of
@hanamantagoudvk content of
```
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
processor : 8
processor : 9
processor : 10
processor : 11
processor : 12
processor : 13
processor : 14
processor : 15
processor : 16
processor : 17
processor : 18
processor : 19
processor : 20
processor : 21
processor : 22
processor : 23
processor : 24
processor : 25
processor : 26
processor : 27
processor : 28
processor : 29
processor : 30
processor : 31
```
```
$ ls sys/devices/system/cpu/
```
@hanamantagoudvk what is really important is the content of the files in all these directories. Can you rsync or copy the whole directory structure, create a tarball, and provide us with a link to it? We need something along these lines.
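For anyone who wants to gather the same information without copying the whole tree, here is a minimal Go sketch that dumps the per-CPU topology files from sysfs. This is not cadvisor's actual parsing code; the choice of files (physical_package_id, core_id, thread_siblings_list) is an assumption about which entries matter for this bug:

```go
// topodump prints the per-CPU sysfs topology files relevant to
// socket/NUMA detection. Run it on the affected host and attach
// the output to the issue.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	cpuDirs, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, dir := range cpuDirs {
		for _, f := range []string{
			"topology/physical_package_id",  // socket ID
			"topology/core_id",              // physical core ID
			"topology/thread_siblings_list", // hyperthread siblings
		} {
			b, err := os.ReadFile(filepath.Join(dir, f))
			if err != nil {
				continue // the file may be absent on some kernels
			}
			fmt.Printf("%s/%s: %s\n", dir, f, strings.TrimSpace(string(b)))
		}
	}
}
```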
Please find the logs |
@iwankgb: I have attached the logs. Please look into it.
@hanamantagoudvk I found what causes this. I'm trying to fix it now. I'll have a solution as soon as possible; please be patient.
@Creatone: I believe the fix will go into the master branch. Is it possible to put the fix into both the 0.37.x and 0.38.x releases as well? K8s version 1.19.3 uses 0.37.0 and K8s version 1.20.3 uses 0.38.x.
@Creatone @iwankgb: Our team has tested the fix you provided on 0.37.x, and it works. However, we see a difference in topology format between 0.35.x (where it worked earlier) and the fix as it stands now. Our concern is whether, given the changed topology format, kubelet also needs to change its code: kubelet assumes in many places that len(sockets) == len(NUMA nodes). Could you please check this?
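To make the concern concrete, here is a hedged sketch of the invariant in question: counting distinct sockets versus NUMA nodes in a cadvisor MachineInfo. It assumes the info/v1 Core struct exposes a SocketID field (only newer cadvisor releases do), and it uses a made-up single-NUMA, dual-socket machine purely for illustration:

```go
// checksockets compares the number of distinct sockets with the number
// of NUMA nodes in a cadvisor MachineInfo, the invariant that kubelet's
// CPU manager implicitly relies on.
package main

import (
	"fmt"

	info "github.com/google/cadvisor/info/v1"
)

// socketsAndNodes returns the distinct socket count and the NUMA node
// count. NOTE: Core.SocketID is an assumption here; only newer cadvisor
// releases expose it.
func socketsAndNodes(mi *info.MachineInfo) (sockets, nodes int) {
	seen := map[int]bool{}
	for _, node := range mi.Topology {
		for _, core := range node.Cores {
			seen[core.SocketID] = true
		}
	}
	return len(seen), len(mi.Topology)
}

func main() {
	// Hypothetical single-NUMA, dual-socket box, purely for illustration.
	mi := &info.MachineInfo{
		Topology: []info.Node{{
			Id: 0,
			Cores: []info.Core{
				{Id: 0, Threads: []int{0, 16}, SocketID: 0},
				{Id: 1, Threads: []int{8, 24}, SocketID: 1},
			},
		}},
	}
	s, n := socketsAndNodes(mi)
	fmt.Printf("sockets=%d numa_nodes=%d equal=%v\n", s, n, s == n)
}
```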
Can you post the topology for both versions, please? |
@iwankgb: I am attaching a zip file containing the topology after the fix, from version 0.35.x (where it worked earlier), and from version 0.37.x (where it did not work).
Issue and Impact seen:
Due to incorrect CPU topology generated by cadvisor (version 0.37), the CNF Pod is allocated CPU 1, which is a hyperthread sibling of CPU 0 (CPU 0 hosts most of the OS and other components). This leads to performance degradation of the Pod.
In earlier versions (k8s 1.18.9, cadvisor 0.35.0), CPU allocation happened from CPUs 2-14, whereas with k8s 1.19.3 and cadvisor 0.37.0, CPU allocation happens from CPUs 1-13.
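Whether CPU 1 really shares a physical core with CPU 0 can be verified from the kernel's own view. A minimal sketch reading thread_siblings_list, which the kernel formats as a cpulist such as "0,1" or "0-1":

```go
// siblings reports the hyperthread siblings of a given CPU by reading
// /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func siblings(cpu int) ([]int, error) {
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu)
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var out []int
	// A cpulist is a comma-separated mix of single IDs and "a-b" ranges.
	for _, part := range strings.Split(strings.TrimSpace(string(b)), ",") {
		if first, last, ok := strings.Cut(part, "-"); ok {
			a, err1 := strconv.Atoi(first)
			z, err2 := strconv.Atoi(last)
			if err1 != nil || err2 != nil {
				return nil, fmt.Errorf("bad range %q", part)
			}
			for i := a; i <= z; i++ {
				out = append(out, i)
			}
		} else {
			n, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			out = append(out, n)
		}
	}
	return out, nil
}

func main() {
	s, err := siblings(1)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// e.g. "[0 1]" means cpu1 shares a physical core with cpu0.
	fmt.Printf("cpu1 thread siblings: %v\n", s)
}
```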
Issue detected version:
k8s version 1.19.3 , cadvisor version 0.37.0
Version where it works well:
k8s version 1.18.9, cadvisor version 0.35.0
Analysis done so far:
We see this issue on a single-NUMA, multi-socket system.
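That single-NUMA, multi-socket shape can be confirmed from sysfs alone, independent of cadvisor, by counting node directories against distinct physical_package_id values. A small sketch:

```go
// numavssockets counts NUMA nodes and physical sockets straight from
// sysfs to confirm the single-NUMA, multi-socket shape described above.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	nodeDirs, _ := filepath.Glob("/sys/devices/system/node/node[0-9]*")

	packages := map[string]bool{}
	cpuDirs, _ := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*")
	for _, dir := range cpuDirs {
		b, err := os.ReadFile(filepath.Join(dir, "topology/physical_package_id"))
		if err != nil {
			continue // offline CPUs may lack topology files
		}
		packages[strings.TrimSpace(string(b))] = true
	}

	// On the system described here this should print numa_nodes=1 sockets=2.
	fmt.Printf("numa_nodes=%d sockets=%d\n", len(nodeDirs), len(packages))
}
```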
On version 0.35.0, the topology generated by cadvisor and fed to kubelet looks as shown below.
Here there is a clear mapping of node/socket to cores/threads.
On version 0.37.0, the topology generated by cadvisor and fed to kubelet looks as shown below.
Here the sockets and NUMA nodes appear to be mixed up.
We suspect that the following commit introduced the bug.