hugepage usage with kubernetes and kata-container #2353
What a comprehensive report about kata hugepages and k8s this issue is!
@liangxianlong But I can't find any hugepages when I check hugepages in the pod. I expected the pod to report the same number of hugepages that kata-runtime allocated on the host. In your case, can you find any hugepages in the pod?
Pod Overhead won't work here, since it currently only accounts for overhead associated with the memory and cpu cgroups. In 2.3.1, I expect things to be killed because we are using more of the hugetlbfs than the kubelet expects us to (i.e., more than is specified in the pod cgroup within the hugetlbfs subsystem). When SandboxCgroupOnly is set to false, the VMM isn't put into the pod cgroup -- that's why it succeeds. Even though that succeeds, it is definitely doing the wrong thing to the node: the node thinks it has N huge pages available, but it actually has far fewer, since our VMM isn't being accounted for.
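For context, a minimal sketch of what Pod Overhead looks like today (the `RuntimeClass` overhead fields are real Kubernetes API; the class name and values are illustrative assumptions):

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata            # assumes a kata runtime class is installed
handler: kata
overhead:
  podFixed:
    memory: "160Mi"     # example VMM + agent overhead
    cpu: "250m"
    # note: hugetlb overhead is not accounted for here, which is
    # exactly the gap discussed in this thread
```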
I think we would have to make some unfortunate changes in Kubernetes in order to facilitate this working, or create an admission controller/webhook which updates the PodSpec if/when huge pages are used and a Kata runtime class is specified. With that, we'd need to:
Hi @egernst, I agree on some change to a k8s controller/webhook to tell the PodSpec to add some hugepage overhead to the hugetlb cgroup, but why do we need to account all of the memory limits into the hugepage overhead? Is it possible to split the guest memory into two parts/slots: the first being normal-page-size memory of the memory-limits size, and the other being hugepages of emptyDir limit + PodOverhead size? That way we can make the hugepage request as small as possible. Yeah, we can use an annotation to pass the hugepage request to the runtime.
@lifupan If we want to do such a split, we need to give both hugepages and normal pages to the guest. Is that possible? I thought a guest could use either hugepages or none at all. Even if it's possible, we'd have to figure out how to hand out different memory regions to different consumers (guest kernel, tmpfs, agent, containers). I don't think it's simply doable. I think we can make all of the guest's memory use hugepages and all of the host-side pod overhead (VMM + runtime) use normal-size memory (that's the current situation, IIUC). The problem is that the k8s hugetlb control group doesn't take Kata's pod overhead into account. A webhook might just work by altering the container memory field, but the Kata runtime would then have to find a way to revert the webhook's modification so that a container still gets its originally claimed amount of memory in the guest. A more intrusive approach is to make k8s aware of this detail by adding hugetlb overhead to PodOverhead, so that the node and scheduler know how to handle it properly. Just my 2 cents.
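To make the webhook idea concrete, here is a hedged sketch of the JSON patch such a mutating webhook might emit; the 2048Mi overhead figure reuses this report's `default_memory`, and everything else is an assumption rather than an agreed design:

```yaml
# Hypothetical mutation, applied only when runtimeClassName == kata and
# the container requests hugepages: inflate the hugetlb limit by the
# guest memory size so the pod cgroup can hold the VMM's hugepage backing.
- op: replace
  path: /spec/containers/0/resources/limits/hugepages-2Mi
  value: 6144Mi   # original 4096Mi request + 2048Mi default_memory
```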
@egernst I'm trying to understand your above comments and am putting some questions here. Please be patient as I try to deconstruct and understand :-) My understanding is that your comments refer to Usecase-2. Please correct me if I'm wrong?
No problem -- I want to make sure you understand and we're on the same page! Yeah, I'm talking about use-case #2. I was under the assumption that you'd want to be backed by huge pages on the host. This is what Kubernetes is accounting for when creating your huge page volume, right? There isn't a way to "provide this" to the application.
Thanks @egernst. Yeah, ideally I would want the VM memory to be backed by hugepages to actually reap the benefit of hugepages. However, my primary goal for the runtime and agent changes was to ensure a regular pod.yaml works and doesn't break when running with Kata (from a user-experience standpoint), and that we document the limitations.
Right: in the Kata case, the hugepage is actually not (and cannot be) consumed. We'll need to look at some of the options you already mentioned, like introducing Pod Overhead for hugepages so that the VM memory gets accounted for in the hugepage limits. Also things like how to handle different hugepage sizes, since AFAIK a VM can only be backed by a single hugepage size.
@egernst @bpradipt do I understand correctly that the two of you have reached a consensus? :-) If so, I'd like to proceed with #3109 and kata-containers/agent#872.
I'm going ahead and proceeding with #3109 and kata-containers/agent#872.
Description of problem
I have summarized some usage scenarios of hugepages below.
1. Configuration
The following two options have an impact on huge pages: `enable_hugepages`, which is always configured as `true`, and `sandbox_cgroup_only`, which is divided into two cases: (1) `true`, (2) `false`. In addition, `default_memory` is always configured as `2048`.
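A sketch of the corresponding `configuration.toml` excerpt (the option names are real Kata settings; the section layout follows the stock config, so treat it as illustrative):

```toml
[hypervisor.qemu]
enable_hugepages = true    # back guest RAM with hugetlbfs
default_memory = 2048      # guest memory size in MiB

[runtime]
sandbox_cgroup_only = true # the scenarios below compare true vs. false
```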
Configure huge pages on the host. If `nr_hugepages` has changed, the `kubelet` should be restarted.
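A hedged example of that host-side setup (the page count is an illustrative assumption):

```sh
# Reserve 2MiB huge pages on the host (example: 2048 pages = 4096Mi),
# then restart kubelet so it re-reads the node's hugepage capacity.
echo 2048 | sudo tee /proc/sys/vm/nr_hugepages
sudo systemctl restart kubelet
```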
2. Scenarios
2.1 pod does not configure huge page parameters and sandbox_cgroup_only=true
The `yaml` file is a plain pod spec with no hugepage requests.
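A minimal sketch of such a pod spec (the image and names are assumptions, not taken from the original report):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod            # hypothetical name
spec:
  runtimeClassName: kata    # assumes a kata RuntimeClass is installed
  containers:
  - name: app
    image: busybox          # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        memory: 256Mi       # note: no hugepages-2Mi request
```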
Restart the `kubelet` service and start the `pod`; I then get the error:

```
Warning FailedCreatePodSandBox 10s kubelet, host-69 Failed create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: Failed to check if grpc server is working: rpc error: code = Unavailable desc = transport is closing: unknown
```
In this case, the usage scenario is incorrect. Because no huge page parameter is configured in the yaml file, the hugetlb limit at the pod level is 0. Since sandbox_cgroup_only = true, the qemu process is placed in a subdirectory under the pod-level cgroup, so the qemu process is restricted by that 0 limit. A more detailed analysis can be found at the following link:
https://github.com/kata-containers/runtime/issues/2172
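A hedged way to inspect that 0 limit on the node (the cgroup path is hypothetical; the real one depends on the cgroup driver, QoS class, and pod UID):

```sh
# With no hugepages in the pod spec the pod-level hugetlb limit is 0,
# yet with sandbox_cgroup_only=true the qemu process sits beneath it.
cat /sys/fs/cgroup/hugetlb/kubepods/pod<UID>/hugetlb.2MB.limit_in_bytes
```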
2.2 pod does not configure huge page parameters and sandbox_cgroup_only = false
sandbox_cgroup_only = false
Start the `pod`. This time the pod can be started successfully: when sandbox_cgroup_only = false, although the pod-level limit is still 0, the qemu process is not placed in the control group under the pod, so it is not restricted.
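One hedged way to confirm where the VMM landed (the process-match pattern is an assumption about the QEMU binary name):

```sh
# With sandbox_cgroup_only=false the qemu process should NOT show a
# hugetlb path under the pod's cgroup subtree.
grep hugetlb /proc/$(pgrep -f qemu-system | head -n1)/cgroup
```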
2.3 pod configures huge page parameters and sandbox_cgroup_only=true
2.3.1 pod fails to start
The `pod` config requests 4096Mi of 2Mi huge pages.
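A hedged sketch of that pod spec (the 4096Mi figure comes from the analysis below; the other names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepage-pod        # hypothetical name
spec:
  runtimeClassName: kata
  containers:
  - name: app
    image: busybox          # placeholder image
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 4096Mi
        memory: 4096Mi      # k8s requires cpu or memory requests alongside hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```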
Modify the host hugepage configuration (2048 pages of 2Mi, i.e. 4096Mi in total) and restart the `kubelet` service. Run the `pod`, and I then get an error. The arithmetic behind it:

```
needHugePages = k8s + qemu = 4096Mi + 2048Mi = 6144Mi
HostHugepages = 2048 * 2Mi = 4096Mi
```
The above error occurs because there is only 4096Mi of huge page memory on the host, but when Kata does its calculation, the 4096Mi from the pod and the 2048Mi configured by Kata itself are added together. When the virtual machine starts, the qemu process detects that the requested huge page memory exceeds the host's configured value, so it reports an error. Also, k8s does not know that Kata itself has configured huge pages for the virtual machine.
2.3.2 pod starts successfully
The `pod` config requests `hugepages-2Mi: 128Mi`.
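A hedged sketch; relative to the 2.3.1 spec above, only the resources stanza changes (the hugepage value comes from the analysis below, the memory limit is an assumption):

```yaml
    resources:
      limits:
        hugepages-2Mi: 128Mi   # written into the pod-level hugetlb cgroup
        memory: 2Gi            # illustrative value
```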
Modify the host hugepage configuration and restart the `kubelet` service. Run the pod; this time the pod starts successfully.
Although the above pod can be launched successfully, there is a hidden problem. The `hugepages-2Mi: 128Mi` value is written into the pod-level control group, while the virtual machine is configured with 2Gi of huge pages. When the workload in the virtual machine uses a large amount of memory, the qemu process, which is placed in a child control group of the pod, will exceed the limit of the pod control group. This issue may cause the virtual machine to go down, and I guess the error would be the same as in 2.1.
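A hedged way to watch that mismatch develop on the node (the path is hypothetical, as before):

```sh
# The limit reflects the 128Mi from the pod spec; usage grows toward the
# VM's ~2Gi hugepage backing as the guest workload touches memory.
cd /sys/fs/cgroup/hugetlb/kubepods/pod<UID>
cat hugetlb.2MB.limit_in_bytes hugetlb.2MB.usage_in_bytes
```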