This repository has been archived by the owner on May 12, 2021. It is now read-only.

[WIP] k8s memory resource hotplug #580

Closed

Conversation

@miaoyq commented Aug 14, 2018

At present, kata-runtime already supports memory hotplug on the hypervisor side; however, the pod resource limits set by Kubernetes cannot yet be satisfied.

This PR tries to satisfy the Kubernetes use case.

Signed-off-by: Yanqiang Miao miao.yanqiang@zte.com.cn

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 165387 KB
Proxy: 4024 KB
Shim: 8772 KB

Memory inside container:
Total Memory: 2043464 KB
Free Memory: 2003316 KB

@opendev-zuul bot commented Aug 14, 2018

Build failed (third-party-check pipeline) integration testing with
OpenStack. For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@amshinde (Member)

@miaoyq Can you add a bit more detail about what problem you are solving here and how you are solving it?

@miaoyq (Author) commented Aug 20, 2018

@amshinde Related to #400
At present, the Kata VM cannot dynamically increase or decrease its memory after the VM has started.
The first container of a k8s pod is the pause container, which does not set a memory resource limit,
so when we create a k8s pod with Kata, the memory of this pod is the default value (2048M).

But in a k8s pod, resource limits are set in the app containers' config, like:

apiVersion: v1
kind: Pod
metadata:
  name: mem-busybox-untrusted
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  containers:
  - name: busybox
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
    resources:
      limits:
        memory: "2500Mi"
      requests:
        memory: "1000Mi"

So we should dynamically increase or decrease the VM's memory according to the app containers' config.

kata-runtime already supports memory hotplug on the hypervisor side, see #470.
This PR mainly uses that feature to hotplug the VM's memory according to the pod's app container config.
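A minimal sketch, in Go, of the intended flow; the helper name calcHotplugMemMB and the fixed 2048 MiB boot default are illustrative assumptions, not the actual kata-runtime code:

package main

import "fmt"

// calcHotplugMemMB is a hypothetical helper: given the app container's memory
// limit in bytes and the memory already assigned to the VM in MiB, it returns
// how many MiB would need to be hot-added (0 if the VM already has enough).
func calcHotplugMemMB(limitBytes int64, currentVMMemMB int) int {
	sizeMB := int(limitBytes / 1024 / 1024)
	// Round up to an even number of MiB, mirroring the alignment used in this PR.
	sizeMB += sizeMB % 2
	if sizeMB <= currentVMMemMB {
		return 0
	}
	return sizeMB - currentVMMemMB
}

func main() {
	// The pod spec above sets a 2500Mi limit; the VM boots with the 2048 MiB default.
	fmt.Println(calcHotplugMemMB(2500*1024*1024, 2048)) // 452
}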

@WeiZhang555 (Member) commented Aug 20, 2018

You need to add "Fixes #470" and a detailed description of the issue to your commit message, or the CI won't pass.

@@ -605,6 +605,11 @@ func ContainerConfig(ocispec CompatOCISpec, bundlePath, cid, console string, det
resources.VCPUs = uint32(utils.ConstraintsToVCPUs(*ocispec.Linux.Resources.CPU.Quota, *ocispec.Linux.Resources.CPU.Period))
}
}
if ocispec.Linux.Resources.Memory != nil {
if ocispec.Linux.Resources.Memory.Limit != nil {
Review comment (Member):

You can make this a single condition:

if ocispec.Linux.Resources.Memory != nil && ocispec.Linux.Resources.Memory.Limit != nil {

sizeMB := int(mem / 1024 / 1024)
// sizeMB needs to be divisible by 2
sizeMB = sizeMB + sizeMB%2
_, err := c.sandbox.hypervisor.hotplugRemoveDevice(&memoryDevice{1, sizeMB}, memoryDev)
Review comment (Member):

To support hot-unplug memory devices, we need to support guest ballooning first (which I don't think we have added yet).

Review comment:

We don't support hot memory unplug. This code should be removed from this PR as it will make the code fail.

@WeiZhang555 (Member)

@miaoyq I played with your PR, but I can't see the expected effect with your test POD spec. The POD's memory is always 2048 MiB, as specified in the Kata configuration file.

Am I missing something? What's the expected behaviour with this PR?

@jodh-intel (Contributor)

Hi @miaoyq - any update on this?

Related: #624.

@miaoyq (Author) commented Aug 29, 2018

@WeiZhang555 @bergwolf @jodh-intel
I've been busy with other things these days, sorry for the delay; this PR is not working properly yet.
I will update this PR based on #624.

@linzichang (Contributor) commented Aug 29, 2018

@miaoyq @jodh-intel @WeiZhang555 @bergwolf We should think about this carefully. We face 3 problems:

  1. Each memory online operation on Linux must be memory-block aligned. That also means the hotplugged memory size needs to be a multiple of the memory block size. We should read /sys/devices/system/memory/block_size_bytes on the guest OS to get this size (see the sketch after this list).
  2. "Memory hot-add" has two phases. The first phase, known as "physical hotplug", is when QEMU virtually (or physically) hotplugs a memory device, raising an ACPI notify IRQ to the Linux kernel; the kernel handles the IRQ, sets up the memory sections, allocates page structs, etc. The second phase, known as "logical hotplug", uses the sysfs interface to online the memory, after which the kernel's buddy allocator manages it. In the first phase the kernel needs to allocate page structs; the size is (hot-added memory size / 4KB) * sizeof(struct page), roughly a 40:1 or 50:1 ratio (I haven't calculated it very accurately). This may not be a problem, but it affects the choice of the default VM boot memory size and the per-hot-add size. For example, for a VM booted with 128MB, the maximum size that can be hot-added at once may be about 4G.
  3. How do we choose the default boot size (currently 2G)? If the pod doesn't need 2G, we may use hot-remove, but that faces the same memory-block problem, and in addition we need to care about ballooning and page migration, which do not always succeed.
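A minimal sketch, in Go, of reading that value inside the guest; it assumes the standard sysfs layout, where the file contains a hex string without a 0x prefix (for example 8000000 for 128 MiB blocks):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// guestMemoryBlockSize reads the guest's memory block size in bytes from sysfs.
func guestMemoryBlockSize() (uint64, error) {
	data, err := os.ReadFile("/sys/devices/system/memory/block_size_bytes")
	if err != nil {
		return 0, err
	}
	// The file contains a hexadecimal string such as "8000000".
	return strconv.ParseUint(strings.TrimSpace(string(data)), 16, 64)
}

func main() {
	size, err := guestMemoryBlockSize()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("memory block size: %d MiB\n", size>>20)
}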

@cedriccchen (Contributor)

Indeed it's a better solution to hot-add memory aligned to the guest OS memory block size. We now have these facts:

  1. Hot-added memory should be aligned to the memory section size, but the size of a memory section is architecture dependent. For example, power uses 16MB, ia64 uses 1GB, x86_64 uses 128MB, ppc64le uses 256MB.
  2. Each memory block is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX (XXX is the memory block id), and it is the unit of memory online/offline. In the sparse memory model, the memory block size is a multiple of the memory section size. We can online a memory block only if we hotplug a memory-block-sized DIMM; see pages_correctly_reserved in the kernel source.

We can get the guest OS memory block size by reading /sys/devices/system/memory/block_size_bytes in the VM.
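A minimal sketch of the alignment this implies; roundUpToBlock is a hypothetical helper, and the 128 MiB block size is just the typical x86_64 value:

package main

import "fmt"

// roundUpToBlock rounds a requested hotplug size up to the next multiple of the
// guest memory block size, so that every hotplugged DIMM can be onlined.
func roundUpToBlock(requestBytes, blockBytes uint64) uint64 {
	if requestBytes%blockBytes == 0 {
		return requestBytes
	}
	return (requestBytes/blockBytes + 1) * blockBytes
}

func main() {
	const blockBytes = 128 << 20 // 128 MiB, typical for x86_64
	// A 452 MiB request becomes 512 MiB once block-aligned.
	fmt.Println(roundUpToBlock(452<<20, blockBytes) >> 20) // 512
}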

@devimc commented Aug 29, 2018

My two cents,

there are still some points that haven't been discussed:

  • How many mem slots should a VM have?
  • Should the number of mem slots be configurable?
  • As with CPU sockets, each mem slot consumes memory; how many MBs?

Since memory hot-remove is almost impossible (or at least not reliable), we should track how much memory was assigned to / required by a container. For example, if a container has 2GB but is updated or removed (the PODs scenario), its memory can still be used by other new or existing containers.

@miaoyq (Author) commented Aug 30, 2018

Hot add memory should be aligned to memory section size, but the size of a memory section is architecture dependent. For example, power uses 16MB, ia64 uses 1GB, x86_64 uses 128MB, ppc64le uses 256MB.

Maybe we can make the memory section size configurable; for different architectures Kata will use a different QEMU binary, for example x86_64 uses qemu-system-x86_64 (or a related binary) and ppc64le uses qemu-system-ppc64le.

@linzichang (Contributor)

@miaoyq

  1. Physical hot-add needs to be memory-section aligned. Logical hot-add (online) needs to be memory-block aligned. The memory block size is a multiple of the memory section size, so overall we need to be "memory block" aligned.
  2. The memory section size and memory block size are already configurable, but only in the kernel config/code; QEMU doesn't know what a memory block/section is. In the x86 kernel code, when total memory is less than 64G the memory block size equals the memory section size; otherwise the memory block is 2G:
static unsigned long probe_memory_block_size(void)
{
	/* start from 2g */
	unsigned long bz = 1UL<<31;

	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
		pr_info("Using 2GB memory block size for large-memory system\n");
		return 2UL * 1024 * 1024 * 1024;
	}

	/* less than 64g installed */
	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
		return MIN_MEMORY_BLOCK_SIZE;

	/* get the tail size */
	while (bz > MIN_MEMORY_BLOCK_SIZE) {
		if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
			break;
		bz >>= 1;
	}

	printk(KERN_DEBUG "memory block size : %ldMB\n", bz >> 20);

	return bz;
}

There are two ways to determine the memory block size:

  1. Read /sys/devices/system/memory/block_size_bytes in the guest OS through the agent.
  2. Modify the kernel config and code to set the memory block size to what we want.

@cedriccchen (Contributor)

@miaoyq

For example, power uses 16MB, ia64 uses 1GB, x86_64 uses 128MB, ppc64le uses 256MB.

This is not absolute. The memory section size may differ depending on the kernel config, so it's difficult to make it configurable on the host.

@miaoyq (Author) commented Aug 30, 2018

@linzichang @clarecch
Thanks for your explanation.

1. Read /sys/devices/system/memory/block_size_bytes in the guest OS through the agent.
2. Modify the kernel config and code to set the memory block size to what we want.

If so, I think the first option is a little better: we can add an interface to the agent via which the runtime can ask the agent for the block size.
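A minimal sketch, in Go, of what such an interface could look like on the runtime side; the method name GetGuestMemoryBlockSize and the fakeAgent type are hypothetical, not the real kata agent API:

package main

import "fmt"

// agent abstracts the runtime's view of the guest agent for this sketch.
type agent interface {
	// GetGuestMemoryBlockSize returns the guest memory block size in bytes.
	GetGuestMemoryBlockSize() (uint64, error)
}

// fakeAgent stands in for a real agent connection.
type fakeAgent struct{}

func (fakeAgent) GetGuestMemoryBlockSize() (uint64, error) {
	return 128 << 20, nil // pretend the guest reported 128 MiB blocks
}

func main() {
	var a agent = fakeAgent{}
	blockSize, err := a.GetGuestMemoryBlockSize()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("guest memory block size: %d MiB\n", blockSize>>20)
}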

@cedriccchen (Contributor)

@miaoyq @devimc @linzichang Maybe we can get the memory block size from the agent at kata-runtime create or kata-runtime start, then store it in a file. That way we don't need to ask the agent for the block size on every hotplug.
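A minimal sketch of that caching idea; the mem_block_size file under a per-sandbox run directory and the query callback are assumptions, not existing kata-runtime behaviour:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// cachedBlockSize returns the memory block size for a sandbox, calling the
// agent (via query) only on the first use and caching the result in a file.
func cachedBlockSize(runDir string, query func() (uint64, error)) (uint64, error) {
	cache := filepath.Join(runDir, "mem_block_size")
	if data, err := os.ReadFile(cache); err == nil {
		return strconv.ParseUint(string(data), 10, 64)
	}
	size, err := query()
	if err != nil {
		return 0, err
	}
	return size, os.WriteFile(cache, []byte(strconv.FormatUint(size, 10)), 0o600)
}

func main() {
	dir, _ := os.MkdirTemp("", "sandbox")
	defer os.RemoveAll(dir)
	query := func() (uint64, error) { return 128 << 20, nil } // stand-in for the agent call
	size, _ := cachedBlockSize(dir, query)
	fmt.Printf("block size: %d MiB\n", size>>20)
}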

@miaoyq (Author) commented Aug 31, 2018

@clarecch Agree.

@cedriccchen (Contributor)

there are still some points that haven't been discussed:

How many mem slots should a VM have?
Should the number of mem slots be configurable?
As with CPU sockets, each mem slot consumes memory; how many MBs?

@grahamwhaley Do you have any ideas about memory slots? I'm not very familiar with them. I have read clearcontainers/runtime#380. Was there any outcome from the memory slot discussion?

@grahamwhaley (Contributor) commented Aug 31, 2018

How many mem slots should a VM have?
Should the number of mem slots be configurable?
As with CPU sockets, each mem slot consumes memory; how many MBs?
@grahamwhaley Do you have any ideas about memory slots? I'm not very familiar with them. I have read clearcontainers/runtime#380. Was there any outcome from the memory slot discussion?

Hi @clarecch - heh, good digging, finding that 1yr old thread! I think any measurements we did before on number-of-slot overheads (~1.5yr ago now) will be out of date - we would have to re-run them if we need that data.

But, I think this will always be difficult - I very much doubt we can find a default that will suit all situations and users. I think we need to make this configurable (in the runtime toml config file), and set some sensible defaults. The only sensible default I can think of might be:

  • split MAXMEM into a fixed number of slots (say 16 or 32 - with some math/rounding to make sure we end up on nice boundaries etc.). This has the flexibility that it will probably work on all systems (big and small memory systems).

In the config file I think we should probably offer two config options to give the user flexibility (see the sketch after this list):

  • the ability to set the number of slots (so each slot is MAXMEM/nslots)
  • the ability to set the slot size (so nslots = MAXMEM/slotsize)
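A minimal sketch of how those two options could interact; maxMemMB, slots and slotSizeMB are hypothetical config values, not existing kata-runtime settings:

package main

import "fmt"

// slotPlan derives the number of slots and the per-slot size from the two
// hypothetical config options; at most one of slots or slotSizeMB is set.
func slotPlan(maxMemMB, slots, slotSizeMB uint32) (nslots, sizeMB uint32) {
	switch {
	case slots != 0:
		return slots, maxMemMB / slots
	case slotSizeMB != 0:
		return maxMemMB / slotSizeMB, slotSizeMB
	default:
		return 16, maxMemMB / 16 // fall back to a fixed default of 16 slots
	}
}

func main() {
	// A host with 64 GiB of MAXMEM and a user-requested slot size of 1 GiB.
	n, s := slotPlan(64*1024, 0, 1024)
	fmt.Printf("%d slots of %d MiB\n", n, s) // 64 slots of 1024 MiB
}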

@cedriccchen (Contributor)

@grahamwhaley very enlightening!

@grahamwhaley (Contributor)

Maybe I'm thinking about this wrong as well... I don't think slots have a fixed size, do they? (That is, I think you can plug different-sized DIMMs into each slot in QEMU.)
In that case, maybe the number of slots does not really relate to the MAXSIZE of memory, but rather to how many times we expect to hot-add memory to a pod.
Do we only expect to hotplug one DIMM for each extra container added to a pod? In that case, the number of slots relates directly to the maximum number of containers we ever expect to add to a pod. That may change our view here. 8 or 16 slots still feels sensible to me - but does anybody have evidence or numbers around how many containers make up a pod?

@cedriccchen (Contributor)

I think in some cases we will hotplug multiple DIMMs: one (or more) for each container in a pod that contains multiple containers.

@linzichang (Contributor) commented Sep 4, 2018

@devimc I changed defaultMemSlots from 2 to 255. The PSS of the QEMU process increases by 11MB.

@grahamwhaley (Contributor)

Thanks @linzichang, thanks for grabbing that data. 11Mb over how much - what is the total PSS (so we can see what % increase there is)?
Also, was that with KSM enabled on the host or not, and how many containers did you run?

Our PSS footprint can be between maybe 48Mb (20 containers, KSM) and 135Mb (20 containers, no KSM), for instance on my local system - in which case 11Mb could represent anywhere from a 23% to an 8% footprint increase. Both of those I would call significant enough to need a discussion :-)

You could maybe use the reportgen from the tests repo to generate us a report.
Over at: https://github.com/kata-containers/tests/tree/master/metrics/report
If you run the grabdata.sh with and without your changes installed and then makereport.sh, the resulting pdf report may help us assess.

grabdata can take some time, and if you are only looking at the density/footprint then the version on the pending PR kata-containers/tests#650 allows you to tell grabdata to only run a subset of tests (like, just density). I think it would also be good in this case to see if there is any significant change in boot time as well though.

virtLog.Debugf("hot adding %d B memory", mem)
sizeMB := int(mem / 1024 / 1024)
// sizeMB needs to be divisible by 2
sizeMB = sizeMB + sizeMB%2
Review comment (Contributor):

It might be better to move the log call to here and log sizeMB (and maybe mem too):

virtLog.WithField("memory-mb", sizeMB).Info("hot adding memory")

Same comment for the log call in removeResources().

@@ -605,6 +605,11 @@ func ContainerConfig(ocispec CompatOCISpec, bundlePath, cid, console string, det
resources.VCPUs = uint32(utils.ConstraintsToVCPUs(*ocispec.Linux.Resources.CPU.Quota, *ocispec.Linux.Resources.CPU.Period))
}
}
if ocispec.Linux.Resources.Memory != nil {
if ocispec.Linux.Resources.Memory.Limit != nil {
resources.Mem = uint32(*ocispec.Linux.Resources.Memory.Limit)
Review comment (Contributor):

This is casting an int64 into a uint32:

Maybe ContainerResources.Mem should be changed to int64 too?
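A minimal sketch of why the narrowing cast matters; the 5 GiB figure is just an illustrative limit:

package main

import "fmt"

func main() {
	// A 5 GiB memory limit, as the OCI spec stores it (int64 bytes).
	limit := int64(5 * 1024 * 1024 * 1024)

	// Narrowing to uint32 silently wraps for limits of 4 GiB or more.
	truncated := uint32(limit)

	fmt.Println(limit)     // 5368709120
	fmt.Println(truncated) // 1073741824, i.e. the 5 GiB limit wrapped to 1 GiB
}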

@devimc commented Sep 10, 2018

@linzichang

@devimc I changed defaultMemSlots from 2 to 255. The PSS of the QEMU process increases by 11MB.

thanks, have you measured how much memory the guest kernel consumes? I wouldn't like to see issue #295 again, but now with memory slots.

@linzichang (Contributor)

@devimc You have @'d the wrong guy twice :)

@devimc commented Sep 10, 2018

@linzichang 😄 I'm sorry

@sboeuf left a comment

@miaoyq please rework this PR according to all the comments you've got. Thanks!

sizeMB := int(mem / 1024 / 1024)
// sizeMB needs to be divisible by 2
sizeMB = sizeMB + sizeMB%2
_, err := c.sandbox.hypervisor.hotplugRemoveDevice(&memoryDevice{1, sizeMB}, memoryDev)
Review comment:

We don't support hot memory unplug. This code should be removed from this PR as it will make the code fail.

@linzichang (Contributor)

@sboeuf @devimc If @miaoyq has no time to work on this, @clarecch and I can help rework it.

@miaoyq (Author) commented Sep 11, 2018

@sboeuf @linzichang @clarecch I'm working on this.
If I have any doubts, I will discuss with you. Thanks. :-)

@linzichang (Contributor)

@miaoyq I need this feature soon. We have already talked about the solution in #624 (comment). I will implement some of the methods we both need for updating memory, to help speed up this PR. Let's get this done quicker together. Thank you very much for reworking this :)

@linzichang (Contributor)

@miaoyq I think you should first take care of the memory footprint we mention in #580 (comment) #580 (comment) #580 (comment), because I haven't tested it thoroughly yet.

@miaoyq (Author) commented Sep 11, 2018

@linzichang I think you're a bit more familiar with this, so in order not to affect your usage, feel free to rework it. :-)

@sboeuf added the enhancement (Improvement to an existing feature) and wip labels Sep 12, 2018
@egernst mentioned this pull request Sep 18, 2018
@raravena80 (Member)

Hi, @linzichang @miaoyq any updates on this? thx!

@miaoyq (Author) commented Sep 25, 2018

@raravena80
I think @linzichang and @clarecch have finished this locally.
@linzichang Could you submit this feature as a new PR?

@linzichang (Contributor)

@miaoyq @raravena80 I will open a new PR in the next few days.

@miaoyq (Author) commented Sep 25, 2018

@linzichang Thanks. :-)

@linzichang (Contributor)

Rework PR #786

@miaoyq (Author) commented Sep 26, 2018

Rework PR #786

@linzichang I will close this PR, thanks!

@miaoyq miaoyq closed this Sep 26, 2018
@miaoyq miaoyq deleted the k8s-mem-resoure-hotplug branch April 23, 2019 02:13