
Can the cgroup handling be made better? #2090

Closed
liangxianlong opened this issue Sep 27, 2019 · 7 comments
Labels: bug (Incorrect behaviour)

Comments


liangxianlong commented Sep 27, 2019

Description of problem

Hi, I am trying to use PodOverhead with kata-containers, but I have some questions.

  1. The Kata Containers config (the [runtime] section of configuration.toml):
sandbox_cgroup_only=true
  2. The RuntimeClass CRD yaml, runtimeclass_crd.yaml:
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: runtimeclasses.node.k8s.io
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  group: node.k8s.io
  version: v1alpha1
  versions:
    - name: v1alpha1
      served: true
      storage: true
  names:
    plural:   runtimeclasses
    singular: runtimeclass
    kind:     RuntimeClass
  scope: Cluster
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            runtimeHandler:
              type: string
              pattern: '^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)?$'
            runtimeCpuReqOverhead:
              type: string
              pattern: '^([0-9]+([.][0-9])?)|[0-9]+(m)$'
            runtimeCpuLimitOverhead:
              type: string
              pattern: '^([0-9]+([.][0-9])?)|[0-9]+(m)$'
            runtimeMemoryReqOverhead:
              type: string
              pattern: '^[0-9]+([.][0-9]+)+(Mi|Gi|M|G)$'
            runtimeMemoryLimitOverhead:
              type: string
              pattern: '^[0-9]+([.][0-9]+)+(Mi|Gi|M|G)$'
  3. The RuntimeClass yaml, kata_v2_class.yaml:
apiVersion: node.k8s.io/v1beta1  # RuntimeClass is defined in the node.k8s.io API group
kind: RuntimeClass
metadata:
  name: kata-containers  # The name the RuntimeClass will be referenced by
  # RuntimeClass is a non-namespaced resource
handler: kata
overhead:
    podFixed:
        memory: "1024"
        cpu: "1"
  4. The pod yaml, busybox-two.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: busybox-two
spec:
  runtimeClassName: kata-containers
  containers:
  - name: busybox-1
    image: busybox
    stdin: true
    tty: true
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 1
  - name: busybox-2
    image: busybox
    stdin: true
    tty: true
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 1
  5. Execute the commands below (a verification sketch follows them):
kubectl apply -f runtimeclass_crd.yaml
kubectl apply -f kata_v2_class.yaml
kubectl apply -f busybox-two.yaml
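
For reference, a quick way to confirm that the RuntimeClass and its overhead were admitted (a sketch, assuming kubectl access to the cluster):

kubectl get runtimeclass kata-containers -o yaml
# the returned object should carry the handler and the overhead.podFixed values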

We can see the pod is running.

[root@host-69 yaml]# kubectl get pods
NAME          READY   STATUS    RESTARTS   AGE
busybox-two   2/2     Running   0          4h51m

Then, I go to the pod cgroup:

[root@host-69 pod21390dc3-79c0-46f1-9f1d-00be3aec9cc9]# pwd
/sys/fs/cgroup/cpu/kubepods/burstable/pod21390dc3-79c0-46f1-9f1d-00be3aec9cc9
[root@host-69 pod21390dc3-79c0-46f1-9f1d-00be3aec9cc9]# cat cpu.cfs_period_us
100000
[root@host-69 pod21390dc3-79c0-46f1-9f1d-00be3aec9cc9]# cat cpu.cfs_quota_us
300000
[root@host-69 pod21390dc3-79c0-46f1-9f1d-00be3aec9cc9]# cat cgroup.procs 
[root@host-69 pod21390dc3-79c0-46f1-9f1d-00be3aec9cc9]#
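
For context, this pod-level quota matches what kubelet computes from the container limits plus the declared overhead; a sketch of the arithmetic, assuming the 100000 us period shown above:

# 2 containers x 1 CPU limit + 1 CPU podFixed overhead, at a 100000 us period
echo $(( (1 + 1 + 1) * 100000 ))   # 300000, matching cpu.cfs_quota_us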

Then go to the sandbox cgroup:

[root@host-69 kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92]# pwd 
/sys/fs/cgroup/cpu/kubepods/burstable/pod21390dc3-79c0-46f1-9f1d-00be3aec9cc9/kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92
[root@host-69 kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92]# cat cpu.cfs_period_us
100000
[root@host-69 kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92]# cat cpu.cfs_quota_us
200000
[root@host-69 kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92]# cat cgroup.procs 
1840
1943
1946
[root@host-69 kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92]#
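
By contrast, the sandbox cgroup quota appears to be sized from the container limits alone, leaving no CPU budget for the VM and shim; the same sketch of the arithmetic:

# 2 containers x 1 CPU limit, overhead not included
echo $(( (1 + 1) * 100000 ))   # 200000, matching the sandbox cpu.cfs_quota_us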

And the PIDs and processes:

[root@host-69 kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92]# ps aux | grep 1840
root      1840  0.1  0.0 475132 23364 ?        Sl   10:03   0:31 /usr/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /usr/bin/containerd -id 57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92 -debug
[root@host-69 kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92]# ps aux | grep 1943
root      1943  0.4  0.1 2775560 179664 ?      Sl   10:03   1:12 /usr/bin/qemu-vanilla-system-x86_64 -name sandbox-57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92 -uuid 7a7c089a-f268-4f3c-b0ec-4f31f570b0f0 -machine pc,accel=kvm,kernel_irqchip,nvdimm -cpu host -qmp unix:/run/vc/vm/57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92/qmp.sock,server,nowait -m 2048M,slots=10,maxmem=97494M -device pci-bridge,bus=pci.0,id=pci-bridge-0,chassis_nr=1,shpc=on,addr=2,romfile= -device virtio-serial-pci,disable-modern=false,id=serial0,romfile= -device virtconsole,chardev=charconsole0,id=console0 -chardev socket,id=charconsole0,path=/run/vc/vm/57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92/console.sock,server,nowait -device nvdimm,id=nv0,memdev=mem0 -object memory-backend-file,id=mem0,mem-path=/usr/share/kata-containers/kata-containers-image_clearlinux_1.9.0-alpha1_agent_dc235c3d74.img,size=134217728 -device virtio-scsi-pci,id=scsi0,disable-modern=false,romfile= -object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng,rng=rng0,romfile= -device virtserialport,chardev=charch0,id=channel0,name=agent.channel.0 -chardev socket,id=charch0,path=/run/vc/vm/57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92/kata.sock,server,nowait -device virtio-9p-pci,disable-modern=false,fsdev=extra-9p-kataShared,mount_tag=kataShared,romfile= -fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92,security_model=none -netdev tap,id=network-0,vhost=on,vhostfds=3,fds=4 -device driver=virtio-net-pci,netdev=network-0,mac=26:66:a7:0e:f6:aa,disable-modern=false,mq=on,vectors=4,romfile= -global kvm-pit.lost_tick_policy=discard -vga none -no-user-config -nodefaults -nographic -daemonize -object memory-backend-ram,id=dimm1,size=2048M -numa node,memdev=dimm1 -kernel /usr/share/kata-containers/vmlinuz-4.19.71.48-43.1.container -append tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 iommu=off cryptomgr.notests net.ifnames=0 pci=lastbus=0 root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro ro rootfstype=ext4 quiet systemd.show_status=false panic=1 nr_cpus=8 agent.use_vsock=false systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket -pidfile /run/vc/vm/57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92/pid -smp 1,cores=1,threads=1,sockets=8,maxcpus=8
[root@host-69 kata_57d39a01c7f0abc582f9fb32b8c25db7a8b686242dc192844e92a5e119c02c92]# ps aux | grep 1946
root      1946  0.0  0.0      0     0 ?        S    10:03   0:00 [vhost-1943]
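
For reference, a more compact way to name every process in the sandbox cgroup (a sketch using standard ps, run from the cgroup directory):

for p in $(cat cgroup.procs); do ps -o pid=,comm= -p "$p"; done
# should list the shim, qemu and vhost entries shown above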

According to these resources:
https://github.com/kata-containers/runtime/pull/1880
https://github.com/kata-containers/runtime/issues/1879

config: add option to use only the pod cgroup (SandboxCgroupOnly)
Get the sandbox cgroup path based on cgroupsPath from the sandbox container
Join the sandbox cgroup before creating any process
All the Kata processes are inside the sandbox cgroup
The cgroup is not restricted by the runtime, but the caller is free to limit it.

I think we should not write limits into the sandbox cgroup!
@jcvenegas
@egernst

@liangxianlong added the bug (Incorrect behaviour) and needs-review (Needs to be assessed by the team) labels Sep 27, 2019
@Marshalzxy

I also don't understand why Kata's overhead processes (qemu, kata-proxy, shimv2) are limited by the sandbox cgroup. In my opinion, Kata's processes (qemu, kata-proxy, shimv2) should be controlled by the parent cgroup, which is created by k8s.


devimc commented Sep 27, 2019

cc @jcvenegas

@egernst self-assigned this Oct 28, 2019

egernst commented Oct 28, 2019

@Marshalzxy you are right that kubelet will create a cgroup specifically for the pod. In Kubernetes, as is, for most scenarios we shouldn't need to create a child cgroup, and should instead operate within the parent's. As @liangxianlong highlights, today's implementation results in the 'sandbox' being over-constrained, since the Kata-created cgroup is not sized to take pod overhead into account.

There were a couple of initial concerns which led to this:

  1. If we want to handle static CPU management, we may need to edit the cgroup that is constraining the Kata components. We wanted to avoid modifying a cgroup which is managed by "our parent."
  2. If Kata is called from an orchestrator/tool that doesn't create a cgroup, the Kata components will run completely unconstrained. This is the scenario with Docker, for example.

The options to fix all of this are either to just rely on the upper layer doing the right thing, or to size our cgroup to take pod overhead into account. Neither CRI nor OCI has a concept of PodOverhead today; it lives only within the Pod.Spec in Kubernetes. As a result, today we cannot get this information in Kata.
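
Note that the overhead does surface in the pod object itself, where the RuntimeClass admission controller copies it; a quick way to see this (a sketch, assuming kubectl access):

kubectl get pod busybox-two -o jsonpath='{.spec.overhead}'
# prints the podFixed values admitted from the RuntimeClass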

Based on this, I would propose we:

  • First, modify SandboxCgroupOnly to assume that the caller has set up an appropriate sandbox cgroup for us (this is easy: just do nothing; see the sketch after this list). If this is not the case, SandboxCgroupOnly should not be selected.
  • Drive an update to the CRI specification to pass the sandbox details, including CPU/memory limits as well as pod overhead.
  • Long term, if the CRI API is updated to include pod information, Kata can size the host cgroup appropriately.
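
A minimal sketch of what the first bullet implies on the host (hypothetical path, not actual kata-runtime code): the runtime would simply join the pod cgroup created by the caller and write no limit files of its own.

# join the caller-created pod cgroup; deliberately leave cpu.cfs_quota_us alone
POD_CG=/sys/fs/cgroup/cpu/kubepods/burstable/pod<uid>   # hypothetical pod cgroup path
echo $$ > "$POD_CG/cgroup.procs"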

@liangxianlong @Marshalzxy @jcvenegas WDYT?

@egernst added the needs-review and bug labels and removed the bug and needs-review labels Oct 28, 2019

egernst commented Oct 28, 2019

To follow up, I talked to @jcvenegas IRL, and we agree this is a bug. The current implementation should just inherit the parent's constraints.
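
If the fix lands as described, the sandbox-level cgroup should stay at its kernel defaults; a hypothetical post-fix readout (a sketch, not verified output):

cat kata_<sandbox-id>/cpu.cfs_quota_us   # hypothetical path
# -1 (unrestricted); the pod-level quota of 300000 still applies from the parent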

@Marshalzxy

I agree with you that with sandbox_cgroup_only=true Kata should inherit the parent's cgroup configuration, @egernst.
Still, I think there is another bug with the sandbox-cgroup-only feature. When the orchestrator doesn't set a parent cgroup, we should set sandbox_cgroup_only=false, right? In that case Kata creates the sandbox cgroup and container cgroups and puts shimv1 into those cgroups, but leaves the qemu process and kata-proxy alone.
Is there any reason Kata doesn't constrain those processes? @jcvenegas @egernst
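
One way to check where those processes actually land is to read their cgroup membership from /proc (a sketch; the pgrep pattern is an assumption based on the qemu binary name above):

cat /proc/$(pgrep -f qemu-vanilla-system | head -n1)/cgroup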

@liangxianlong (Author)

I have the same question:

Still, I think there is another bug with the sandbox-cgroup-only feature. When the orchestrator doesn't set a parent cgroup, we should set sandbox_cgroup_only=false, right? In that case Kata creates the sandbox cgroup and container cgroups and puts shimv1 into those cgroups, but leaves the qemu process and kata-proxy alone.
Is there any reason Kata doesn't constrain those processes?


egernst commented Oct 29, 2019

@liangxianlong -- I was looking to reproduce your issue exactly -- why do you define a CRD for RuntimeClass? What version of Kubernetes are you using?

And, what version of Kata did you test with?

Thx.

egernst pushed a commit to egernst/runtime that referenced this issue Oct 30, 2019
When SandboxCgroupsOnly is set, we are expected to just inherit our parent's
cgroup settings and to move all Kata threads within that sandbox cgroup. The
initial implementation still adjusted the size of this cgroup. This commit
fixes this.

This commit makes a couple of functional changes, small refactors, and
adds clarifying comments for some functions.

Fixes: kata-containers#2090

Signed-off-by: Eric Ernst <eric.ernst@intel.com>