
hugepage usage with kubernetes and kata-container #2353

Closed
liangxianlong opened this issue Dec 13, 2019 · 13 comments · Fixed by #3109
Labels
bug (Incorrect behaviour) · needs-review (Needs to be assessed by the team.)

Comments

@liangxianlong

Description of problem

I have summarized some usage scenarios for hugepages.

1. Configuration

The following options affect huge pages: enable_hugepages is always set to true, default_memory is always set to 2048, and sandbox_cgroup_only is tested in two cases: (1) true and (2) false.

sandbox_cgroup_only = true / false
enable_hugepages = true
default_memory = 2048
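
For reference, these options live in the Kata runtime's configuration.toml; a quick way to confirm the active values (a sketch -- the path shown is the Kata 1.x default and varies by installation):

# Show the relevant settings in the Kata runtime configuration
# (adjust the path for your install)
grep -E '^(enable_hugepages|default_memory|sandbox_cgroup_only)' \
    /usr/share/defaults/kata-containers/configuration.toml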

Configure hugepages on the host.
If nr_hugepages has changed, the kubelet should be restarted.

[root@host-69 yaml]# sysctl vm.nr_hugepages=1024
vm.nr_hugepages = 1024
[root@host-69 yaml]# cat /proc/meminfo | grep -i huge
AnonHugePages:    280576 kB
HugePages_Total:    1024
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
[root@host-69 yaml]# systemctl restart kubelet
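
Note that sysctl vm.nr_hugepages only affects the running system; to make the reservation survive a reboot (a sketch using the standard sysctl drop-in mechanism):

# Persist the hugepage reservation across reboots
echo 'vm.nr_hugepages = 1024' >> /etc/sysctl.d/99-hugepages.conf
sysctl --system    # reload sysctl settings from all configuration files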

2. Scenarios

2.1 Pod does not configure huge page parameters, and sandbox_cgroup_only=true

The YAML file:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-one
spec:
  runtimeClassName: kata-shimv1
  containers:
  - name: busybox-1
    image: busybox:latest
    stdin: true
    tty: true
    imagePullPolicy: IfNotPresent

Restart the kubelet service:

systemctl restart kubelet

Start the pod; I then get this error:

Warning  FailedCreatePodSandBox  10s        kubelet, host-69   Failed create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: Failed to check if grpc server is working: rpc error: code = Unavailable desc = transport is closing: unknown

In this case, the usage scenario is incorrect. Because no huge page parameters are configured in the YAML file, the hugetlb limit at the pod level is 0. Since sandbox_cgroup_only = true, the qemu process is placed in a subdirectory under the pod-level cgroup, so the qemu process is restricted by that limit. A more detailed analysis can be found at the following link:
https://github.com/kata-containers/runtime/issues/2172
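
The restriction can be seen directly by inspecting the pod-level hugetlb cgroup (a sketch; the pod cgroup name below is borrowed from the 2.3.2 output, and the burstable path segment depends on the pod's QoS class):

# With no hugepage request in the pod spec, the kubelet writes a 0 limit here,
# so any hugetlb allocation by the qemu process inside this cgroup fails.
POD_UID=pod5e2d430f-1ebe-4d80-a6fc-5ae52d897b6e    # replace with your pod's cgroup name
cat /sys/fs/cgroup/hugetlb/kubepods/burstable/${POD_UID}/hugetlb.2MB.limit_in_bytes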

2.2 Pod does not configure huge page parameters, and sandbox_cgroup_only = false

sandbox_cgroup_only = false

The YAML file is the same as in 2.1:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-one
spec:
  runtimeClassName: kata-shimv1
  containers:
  - name: busybox-1
    image: busybox:latest
    stdin: true
    tty: true
    imagePullPolicy: IfNotPresent

Start the pod:

[root@host-69 yaml]# systemctl restart kubelet
[root@host-69 yaml]# kubectl apply -f busybox-one_no_hugepage.yaml
[root@host-69 yaml]# kubectl get pods
NAME          READY   STATUS    RESTARTS   AGE
busybox-one   1/1     Running   0          60s

At this point, the pod starts successfully: when sandbox_cgroup_only = false, although the pod-level limit is still 0, the qemu process is not placed in a control group under the pod and is therefore not restricted.

2.3 Pod configures huge page parameters, and sandbox_cgroup_only=true

2.3.1 Pod fails to start

Pod config:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-one
spec:
  runtimeClassName: kata-shimv1
  containers:
  - name: busybox-1
    image: busybox:latest
    stdin: true
    tty: true
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 4096Mi
        memory: 4096Mi
      requests:
        memory: 4096Mi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

Modify the host hugepage configuration:

[root@host-69 yaml]# sysctl vm.nr_hugepages=2048
vm.nr_hugepages = 2048
[root@host-69 yaml]#

Restart the kubelet service:

systemctl restart kubelet

Run the pod; I then get this error:

[root@host-69 yaml]# kubectl apply -f busybox-one.yaml 
pod/busybox-one created
[root@host-69 yaml]# kubectl get pods
NAME          READY   STATUS              RESTARTS   AGE
busybox-one   0/1     RunContainerError   1          5s
[root@host-69 yaml]# kubectl describe pods busybox-one
 Warning  Failed     25s (x3 over 41s)  kubelet, host-69   Error: failed to create containerd task: OCI runtime create failed: QMP command failed: unable to map backing store for guest RAM: Cannot allocate memory: unknown
needed hugepages = 4096Mi (k8s pod limit) + 2048Mi (qemu default_memory) = 6144Mi
host hugepages   = 2048 pages * 2Mi = 4096Mi

The above error occurs because the host only has 4096Mi of huge page memory, but when Kata calculates the VM size, the 4096Mi from the pod and the 2048Mi configured by Kata itself (default_memory) are added together. When the virtual machine starts, the qemu process detects that the requested huge page memory exceeds the host's configured value, so it reports an error. Note also that k8s does not know that Kata itself has configured huge pages for the virtual machine.
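
Spelled out as a quick shell check (a sketch using the values from this scenario):

# hugepages needed = pod hugepages-2Mi limit + Kata's default_memory
pod_hugepages_mib=4096            # from the pod spec above
kata_default_memory_mib=2048      # default_memory in the Kata config
host_hugepages_mib=$(( 2048 * 2 ))    # vm.nr_hugepages x 2MiB pages
need_mib=$(( pod_hugepages_mib + kata_default_memory_mib ))
echo "need ${need_mib}Mi, host provides ${host_hugepages_mib}Mi"
# -> need 6144Mi, host provides 4096Mi, so qemu cannot map its backing store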

2.3.2 Pod starts successfully

Pod config:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-one
spec:
  runtimeClassName: kata-shimv1
  containers:
  - name: busybox-1
    image: busybox:latest
    stdin: true
    tty: true
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 128Mi
        memory: 128Mi
      requests:
        memory: 128Mi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

Modify the host hugepage configuration:

[root@host-69 yaml]# sysctl vm.nr_hugepages=1088
vm.nr_hugepages = 1088
[root@host-69 yaml]#
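
Presumably, 1088 pages is chosen so the host reservation exactly covers both allocations (a sketch of the sizing):

# (default_memory for the VM + pod hugepages-2Mi limit) / 2MiB page size
echo $(( (2048 + 128) / 2 ))    # -> 1088 pages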

Restart the kubelet service:

systemctl restart kubelet

Run the pod; it starts successfully:

[root@host-69 yaml]# kubectl apply -f busybox-one.yaml
pod/busybox-one created
[root@host-69 yaml]# kubectl get pods
NAME          READY   STATUS    RESTARTS   AGE
busybox-one   1/1     Running   0          4s
[root@host-69 yaml]# cat /proc/meminfo | grep -i huge
AnonHugePages:    331776 kB
HugePages_Total:    1088
HugePages_Free:     1034
HugePages_Rsvd:     1034
HugePages_Surp:        0
Hugepagesize:       2048 kB

Although the above pod starts successfully, there is still a hidden problem. The hugepages-2Mi: 128Mi value is written into the pod-level control group, but the virtual machine is configured with 2Gi of huge pages (default_memory). When the workload in the virtual machine uses a large amount of memory, the qemu process, which is placed in a child control group of the pod, will exceed the pod control group's limit. This can take the virtual machine down, and I guess the resulting error would be the same as in 2.1:

[root@host-69 pod5e2d430f-1ebe-4d80-a6fc-5ae52d897b6e]# pwd
/sys/fs/cgroup/hugetlb/kubepods/burstable/pod5e2d430f-1ebe-4d80-a6fc-5ae52d897b6e
[root@host-69 pod5e2d430f-1ebe-4d80-a6fc-5ae52d897b6e]# cat hugetlb.2MB.limit_in_bytes
134217728
[root@host-69 pod5e2d430f-1ebe-4d80-a6fc-5ae52d897b6e]# cd ./kata_7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b/
[root@host-69 kata_7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b]# cat cgroup.procs 
43436
43438
43444
43458
[root@host-69 kata_7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b]# ps aux | grep 43436
root      4052  0.0  0.0  16340  1112 pts/2    S+   09:49   0:00 grep --color=auto 43436
root     43436  0.0  0.0 2759216 67008 ?       Sl   12月12   0:42 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b -uuid c13cf4c5-21a3-45bf-ba0b-77dbdef716b4 -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/vm/7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=97558M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile= -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.10.0-rc0_agent_8a4d901772.img,size=134217728 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng,rng=rng0,romfile= -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/vm/7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b/kata.sock,server,nowait -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3,fds=4 -device driver=virtio-net-pci,netdev=network-0,mac=26:e5:a9:23:9e:9a,disable-modern=false,mq=on,vectors=4,romfile= -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -object memory-backend-file,id=dimm1,size=2048M,mem-path=/dev/hugepages -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-4.19.86.59-3.1.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=8 agent.use_vsock=false systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket -pidfile /run/vc/vm/7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b/pid -smp 1,cores=1,threads=1,sockets=8,maxcpus=8
root     43438  0.0  0.0      0     0 ?        S    12月12   0:00 [vhost-43436]
root     43442  0.0  0.0      0     0 ?        S    12月12   0:00 [kvm-pit/43436]
[root@host-69 kata_7f3be98b62b18f29acbe93aea60c04e0168d9edafce698ff4184bda8a0c1c38b]#
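
A quick cross-check of the mismatch (a sketch; the numbers are taken from the session output above):

# The pod-level hugetlb limit is 134217728 bytes = 128Mi, while the qemu
# command line above requests a 2048M memory-backend-file from /dev/hugepages,
# i.e. 16x the pod's limit.
echo $(( 134217728 / 1024 / 1024 ))Mi    # -> 128Mi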
liangxianlong added the bug (Incorrect behaviour) and needs-review (Needs to be assessed by the team.) labels on Dec 13, 2019
@liangxianlong
Author

liangxianlong commented Dec 13, 2019

@gnawux @GabyCT @amshinde

@Marshalzxy

What a comprehensive report on Kata hugepages and k8s this issue is!
Did you test pod overhead in scenario 2.3.1?

@wParkhi

wParkhi commented Dec 24, 2019

@liangxianlong
When you finish deploying your pod with kata-runtime, you can check that the host's hugepages have been allocated to the guest (Kata).

But I can't find any hugepages when I check inside the pod.
That is, I attached to the pod and ran the command below to check hugepages:

cat /proc/meminfo | grep Huge
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB

I expected to see, inside the pod, the same number of hugepages that kata-runtime allocated on the host.

In your case, can you find any hugepages in the pod?

bpradipt added commits to bpradipt/runtime that referenced this issue between Dec 6, 2020 and Jan 8, 2021, each carrying:

Fixes: kata-containers#2353

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
@egernst
Member

egernst commented Jan 9, 2021

What a comprehensive report on Kata hugepages and k8s this issue is!
Did you test pod overhead in scenario 2.3.1?

Pod Overhead won't work here, since it currently only accounts for overhead associated with memory and cpu cgroups.

In 2.3.1, I expect things to be killed because we are using more of the hugetlbfs than the kubelet expects us to (i.e., more than is specified in the pod cgroup within the hugetlbfs subsystem). When SandboxCgroupOnly is set to false, the VMM isn't put into the pod cgroup -- that's why it succeeds.

Even with that succeeding, it is definitely doing the wrong thing to the node. The node thinks it has N huge pages available, but it actually has far fewer, since our VMM isn't being accounted for.
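
One way to observe that gap (a sketch; <node-name> is a placeholder):

# What the scheduler believes the node has available:
kubectl describe node <node-name> | grep hugepages-2Mi
# What is actually still free on the host:
grep HugePages_Free /proc/meminfo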

@egernst
Member

egernst commented Jan 9, 2021

I think we would have to make some unfortunate changes in Kubernetes to facilitate this working, or make an admission controller/webhook which will update the PodSpec if/when huge pages are used and a Kata runtime class is specified. With that, we'd need to:

  1. Augment the huge page limit to include: the specified emptyDir limit + all memory limits + the memory limit within the runtime class's PodOverhead field (or add a special/magic emptyDir with the overhead accounting)
  2. Find a way to communicate to the runtime (annotation?) that the VM should be backed by huge pages and what the limit should be for the ephemeral volume that is created in the guest (a hypothetical shape of this is sketched after the list).
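
For illustration only, a hypothetical shape of idea 2 -- none of these annotation keys exist in Kata; they are invented placeholders:

# A webhook could stamp the pod so the runtime knows to back the VM with
# hugepages and how large the guest ephemeral volume should be.
kubectl annotate pod busybox-one \
    io.katacontainers.hypothetical/hugepage-backed=true \
    io.katacontainers.hypothetical/hugepage-limit=4096Mi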

@bergwolf @gnawux @sameo curious if y'all have input here?

@gnawux
Member

gnawux commented Jan 10, 2021

@lifupan @bergwolf

@lifupan
Member

lifupan commented Jan 11, 2021

I think we would have to make some unfortunate changes in Kubernetes to facilitate this working, or make an admission controller/webhook which will update the PodSpec if/when huge pages are used and a Kata runtime class is specified. With that, we'd need to:

  1. Augment the huge page limit to include: the specified emptyDir limit + all memory limits + the memory limit within the runtime class's PodOverhead field (or add a special/magic emptyDir with the overhead accounting)
  2. Find a way to communicate to the runtime (annotation?) that the VM should be backed by huge pages and what the limit should be for the ephemeral volume that is created in the guest.

@bergwolf @gnawux @sameo curious if y'all have input here?

Hi @egernst

I do agree that some change in k8s (a controller/webhook) could tell the PodSpec to add some HugePage overhead to the hugetlb cgroup, but why do we need to account all memory limits into the hugepage overhead? Is it possible to split the guest memory into two parts/slots: the first being normal-page-size memory of the memory-limits size, and the other being huge pages of emptyDir limit + PodOverhead? That way we can make the hugepage request as small as possible.

Yeah, we can use an annotation to pass the hugepage request to the runtime.

@bergwolf
Member

@lifupan If we want to do such a split, we need to give both hugepages and normal pages to the guest. Is that possible? I thought a guest could be using either hugepages or none at all. And even if it is possible, we'd have to figure out how to hand out the different memory regions to different applications (guest kernel, tmpfs, agent, containers). I don't think it's simply doable.

I think we can make all of the guest part of memory use hugepages and all of the host part of the pod overhead (VMM + runtime) use normal-size memory (that's the current situation, IIUC). The problem is that the k8s hugetlb control group doesn't take Kata's pod overhead into account. A webhook might just work by altering the container memory field, but the Kata runtime would need a way to revert the webhook's modification so that a container still uses its originally claimed amount of memory in the guest.

A more intrusive way is to make k8s aware of such details by adding a hugetlb overhead to PodOverhead, so that the node and the scheduler know how to handle it properly.

Just my 2 cents.

bpradipt added further commits to bpradipt/runtime that referenced this issue between Jan 25 and Feb 3, 2021, each carrying:

Fixes: kata-containers#2353

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
@bpradipt
Contributor

I think we would have to make some unfortunate changes in Kubernetes to facilitate this working, or make an admission controller/webhook which will update the PodSpec if/when huge pages are used and a Kata runtime class is specified. With that, we'd need to:

1. Augment the huge page limit to include: the specified emptyDir limit + all memory limits + the memory limit within the runtime class's PodOverhead field (or add a special/magic emptyDir with the overhead accounting)

2. Find a way to communicate to the runtime (annotation?) that the VM should be backed by huge pages and what the limit should be for the ephemeral volume that is created in the guest.

@bergwolf @gnawux @sameo curious if y'all have input here?

@egernst I'm trying to understand your above comments and am putting some questions here. Please be patient as I try to deconstruct and understand :-)

Use case 1: Deploying an existing pod.yaml with hugepage requests/limits using Kata. For this to work functionally, the Kata VM needs to have the same number and type of hugepages that the pod requests. The Kata VM memory itself might not be backed by hugepages.

Use case 2: Deploying an existing pod.yaml with hugepage requests/limits using Kata. For this to work functionally, the Kata VM needs to have the same number and type of hugepages that the pod requests. In addition, for optimal performance, the Kata VM memory needs to be backed by hugepages.

My understanding is that your comments refer to use case 2. Please correct me if I'm wrong.

@egernst
Member

egernst commented Feb 24, 2021

No problem -- I want to make sure you understand and that we're on the same page!

Yeah, I'm talking about use case 2. I was under the assumption that you'd want to be backed by huge pages on the host. This is what Kubernetes is accounting for when creating your huge page volume, right? There isn't a way to "provide this" to the application.

@bpradipt
Contributor

bpradipt commented Feb 25, 2021

No problem -- I want to make sure you understand and that we're on the same page!

Yeah, I'm talking about use case 2. I was under the assumption that you'd want to be backed by huge pages on the host.

Thanks @egernst. Yeah, ideally I would want the VM memory to be backed by hugepages to actually reap the benefit of hugepages. However, my primary goal for the runtime and agent changes was to ensure that a regular pod.yaml works and doesn't break when running with Kata (from a user-experience standpoint), and that we document the limitations.

This is what Kubernetes is accounting for when creating your huge page volume, right? There isn't a way to "provide this" to the application.

Right. In the Kata case, the hugepage is actually not (and cannot be) consumed. We'll need to look at some of the options you already mentioned -- like introducing pod overhead for hugepages so that the VM memory gets accounted for in the hugepage limits, etc. There are also things like how to handle different hugepage sizes, since AFAIK a VM can only be backed by a single hugepage size.
This will be a long-lead item to figure out a workable approach, given these complexities.

@fidencio
Member

@egernst @bpradipt do I understand correctly that the two of you have reached a consensus? :-)

If so, I'd like to proceed with #3109 and kata-containers/agent#872

@fidencio
Member

I'm going ahead and proceeding with #3109 and kata-containers/agent#872

c3d pushed a commit to c3d/runtime that referenced this issue May 7, 2021
Fixes: kata-containers#2353

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>