This repository has been archived by the owner on May 12, 2021. It is now read-only.

[WIP] k8s memory resource hotplug #580

Closed

Conversation

@miaoyq commented Aug 14, 2018

At present, kata-runtime already supports memory hotplug on the hypervisor side; however, the pod resource limits set by Kubernetes cannot yet be satisfied.

This PR tries to satisfy the Kubernetes use case.

Signed-off-by: Yanqiang Miao miao.yanqiang@zte.com.cn

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 165387 KB
Proxy: 4024 KB
Shim: 8772 KB

Memory inside container:
Total Memory: 2043464 KB
Free Memory: 2003316 KB

@opendev-zuul bot commented Aug 14, 2018

Build failed (third-party-check pipeline) integration testing with
OpenStack. For information on how to proceed, see
http://docs.openstack.org/infra/manual/developers.html#automated-testing

@amshinde (Member)

@miaoyq Can you add a bit more detail about what problem you are solving here and how you are solving it?

@miaoyq (Author) commented Aug 20, 2018

@amshinde Related to #400
At present, the Kata VM cannot dynamically increase or decrease its memory after the VM has started.
The first container of a k8s pod is the pause container, which does not set a memory resource limit,
so when we create a k8s pod with Kata, the memory of this pod is the default value (2048M).

But in a k8s pod, resource limits are set in the app containers' config, like:

apiVersion: v1
kind: Pod
metadata:
  name: mem-busybox-untrusted
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  containers:
  - name: busybox
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
    resources:
      limits:
        memory: "2500Mi"
      requests:
        memory: "1000Mi"

So we should dynamically increase or decrease the VM's memory according to the app containers' config.

kata-runtime already supports memory hotplug on the hypervisor side, see #470.
This PR mainly uses that feature to hotplug the VM's memory according to the pod's app container config.
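A minimal sketch, in Go, of the intended flow; the helper name calcHotplugMemMB and the fixed 2048 MiB boot default are illustrative assumptions, not the actual kata-runtime code:

package main

import "fmt"

// calcHotplugMemMB is a hypothetical helper: given the app container's memory
// limit in bytes and the memory already assigned to the VM in MiB, it returns
// how many MiB would need to be hot-added (0 if the VM already has enough).
func calcHotplugMemMB(limitBytes int64, currentVMMemMB int) int {
	sizeMB := int(limitBytes / 1024 / 1024)
	// Round up to an even number of MiB, mirroring the alignment used in this PR.
	sizeMB += sizeMB % 2
	if sizeMB <= currentVMMemMB {
		return 0
	}
	return sizeMB - currentVMMemMB
}

func main() {
	// The pod spec above sets a 2500Mi limit; the VM boots with the 2048 MiB default.
	fmt.Println(calcHotplugMemMB(2500*1024*1024, 2048)) // 452
}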

@WeiZhang555 (Member) commented Aug 20, 2018

You need to add "Fixes #470" and a detailed description of the issue to your commit message, or the CI won't pass.

@@ -605,6 +605,11 @@ func ContainerConfig(ocispec CompatOCISpec, bundlePath, cid, console string, det
resources.VCPUs = uint32(utils.ConstraintsToVCPUs(*ocispec.Linux.Resources.CPU.Quota, *ocispec.Linux.Resources.CPU.Period))
}
}
if ocispec.Linux.Resources.Memory != nil {
if ocispec.Linux.Resources.Memory.Limit != nil {
Review comment (Member):

You can make this a single condition:

if ocispec.Linux.Resources.Memory != nil && ocispec.Linux.Resources.Memory.Limit != nil {

sizeMB := int(mem / 1024 / 1024)
// sizeMB needs to be divisible by 2
sizeMB = sizeMB + sizeMB%2
_, err := c.sandbox.hypervisor.hotplugRemoveDevice(&memoryDevice{1, sizeMB}, memoryDev)
Review comment (Member):

To support hot-unplug memory devices, we need to support guest ballooning first (which I don't think we have added yet).

Review comment:

We don't support hot memory unplug. This code should be removed from this PR as it will make the code fail.

@WeiZhang555 (Member)

@miaoyq I played with your PR, but I can't see the expected effect with your test POD spec. The POD's memory is always 2048 MiB, as specified in the Kata configuration file.

Am I missing something? What's the expected behaviour with this PR?

@jodh-intel (Contributor)

Hi @miaoyq - any update on this?

Related: #624.

@miaoyq (Author) commented Aug 29, 2018

@WeiZhang555 @bergwolf @jodh-intel
I've been busy with other things these days, sorry for the delay; this PR is not working properly yet.
I will update this PR based on #624.

@linzichang (Contributor) commented Aug 29, 2018

@miaoyq @jodh-intel @WeiZhang555 @bergwolf We should think about this carefully. We face 3 problems:

  1. Each memory online operation on Linux must be memory-block aligned. That also means the hotplugged memory size needs to be a multiple of the memory block size. We should read /sys/devices/system/memory/block_size_bytes on the guest OS to get this size (see the sketch after this list).
  2. "Memory hot-add" has two phases. The first phase, known as "physical hotplug", is when QEMU virtually (or physically) hotplugs a memory device, raising an ACPI notify IRQ to the Linux kernel; the kernel handles the IRQ, sets up the memory sections, allocates page structs, etc. The second phase, known as "logical hotplug", uses the sysfs interface to online the memory, after which the kernel's buddy allocator manages it. In the first phase the kernel needs to allocate page structs; the size is (hot-added memory size / 4KB) * sizeof(struct page), roughly a 40:1 or 50:1 ratio (I haven't calculated it very accurately). This may not be a problem, but it affects the choice of the default VM boot memory size and the per-hot-add size. For example, for a VM booted with 128MB, the maximum size that can be hot-added at once may be about 4G.
  3. How do we choose the default boot size (currently 2G)? If the pod doesn't need 2G, we may use hot-remove, but that faces the same memory-block problem, and in addition we need to care about ballooning and page migration, which do not always succeed.
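A minimal sketch, in Go, of reading that value inside the guest; it assumes the standard sysfs layout, where the file contains a hex string without a 0x prefix (for example 8000000 for 128 MiB blocks):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// guestMemoryBlockSize reads the guest's memory block size in bytes from sysfs.
func guestMemoryBlockSize() (uint64, error) {
	data, err := os.ReadFile("/sys/devices/system/memory/block_size_bytes")
	if err != nil {
		return 0, err
	}
	// The file contains a hexadecimal string such as "8000000".
	return strconv.ParseUint(strings.TrimSpace(string(data)), 16, 64)
}

func main() {
	size, err := guestMemoryBlockSize()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("memory block size: %d MiB\n", size>>20)
}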

@cedriccchen (Contributor)

Indeed it's a better solution to hot-add memory aligned to the guest OS memory block size. We now have these facts:

  1. Hot-added memory should be aligned to the memory section size, but the size of a memory section is architecture dependent. For example, power uses 16MB, ia64 uses 1GB, x86_64 uses 128MB, ppc64le uses 256MB.
  2. Each memory block is described under /sys/devices/system/memory as /sys/devices/system/memory/memoryXXX (XXX is the memory block id), and it is the unit of memory online/offline. In the sparse memory model, the memory block size is a multiple of the memory section size. We can online a memory block only if we hotplug a memory-block-sized DIMM; see pages_correctly_reserved in the kernel source.

We can get the guest OS memory block size by reading /sys/devices/system/memory/block_size_bytes in the VM.
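A minimal sketch of the alignment this implies; roundUpToBlock is a hypothetical helper, and the 128 MiB block size is just the typical x86_64 value:

package main

import "fmt"

// roundUpToBlock rounds a requested hotplug size up to the next multiple of the
// guest memory block size, so that every hotplugged DIMM can be onlined.
func roundUpToBlock(requestBytes, blockBytes uint64) uint64 {
	if requestBytes%blockBytes == 0 {
		return requestBytes
	}
	return (requestBytes/blockBytes + 1) * blockBytes
}

func main() {
	const blockBytes = 128 << 20 // 128 MiB, typical for x86_64
	// A 452 MiB request becomes 512 MiB once block-aligned.
	fmt.Println(roundUpToBlock(452<<20, blockBytes) >> 20) // 512
}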

@devimc commented Aug 29, 2018

My two cents,

there are still some points that haven't been discussed:

  • How many mem slots should a VM have?
  • Should the number of mem slots be configurable?
  • As with CPU sockets, each mem slot consumes memory; how many MBs?

Since memory hot-remove is almost impossible (or at least not reliable), we should track how much memory was assigned to / required by a container. For example, if a container has 2GB but is updated or removed (the PODs scenario), its memory can still be used by other new or existing containers.

@miaoyq (Author) commented Aug 30, 2018

Hot add memory should be aligned to memory section size, but the size of a memory section is architecture dependent. For example, power uses 16MB, ia64 uses 1GB, x86_64 uses 128MB, ppc64le uses 256MB.

Maybe we can make the memory section size configurable; for different architectures Kata will use a different QEMU binary, for example x86_64 uses qemu-system-x86_64 (or a related binary) and ppc64le uses qemu-system-ppc64le.

@linzichang (Contributor)

@miaoyq

  1. Physical hot-add needs to be memory-section aligned. Logical hot-add (online) needs to be memory-block aligned. The memory block size is a multiple of the memory section size, so overall we need to be "memory block" aligned.
  2. The memory section size and memory block size are already configurable, but only in the kernel config/code; QEMU doesn't know what a memory block/section is. In the x86 kernel code, when total memory is less than 64G the memory block size equals the memory section size; otherwise the memory block is 2G:
static unsigned long probe_memory_block_size(void)
{
	/* start from 2g */
	unsigned long bz = 1UL<<31;

	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
		pr_info("Using 2GB memory block size for large-memory system\n");
		return 2UL * 1024 * 1024 * 1024;
	}

	/* less than 64g installed */
	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
		return MIN_MEMORY_BLOCK_SIZE;

	/* get the tail size */
	while (bz > MIN_MEMORY_BLOCK_SIZE) {
		if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
			break;
		bz >>= 1;
	}

	printk(KERN_DEBUG "memory block size : %ldMB\n", bz >> 20);

	return bz;
}

There are two ways to determine the memory block size:

  1. Read /sys/devices/system/memory/block_size_bytes in the guest OS through the agent.
  2. Modify the kernel config and code to set the memory block size to what we want.

@cedriccchen (Contributor)

@miaoyq

For example, power uses 16MB, ia64 uses 1GB, x86_64 uses 128MB, ppc64le uses 256MB.

This is not absolute. The memory section size may differ depending on the kernel config, so it's difficult to make it configurable on the host.

@miaoyq (Author) commented Aug 30, 2018

@linzichang @clarecch
Thanks for your explanation.

1. Read /sys/devices/system/memory/block_size_bytes in the guest OS through the agent.
2. Modify the kernel config and code to set the memory block size to what we want.

If so, I think the first option is a little better: we can add an interface to the agent via which the runtime can ask the agent for the block size.
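A minimal sketch, in Go, of what such an interface could look like on the runtime side; the method name GetGuestMemoryBlockSize and the fakeAgent type are hypothetical, not the real kata agent API:

package main

import "fmt"

// agent abstracts the runtime's view of the guest agent for this sketch.
type agent interface {
	// GetGuestMemoryBlockSize returns the guest memory block size in bytes.
	GetGuestMemoryBlockSize() (uint64, error)
}

// fakeAgent stands in for a real agent connection.
type fakeAgent struct{}

func (fakeAgent) GetGuestMemoryBlockSize() (uint64, error) {
	return 128 << 20, nil // pretend the guest reported 128 MiB blocks
}

func main() {
	var a agent = fakeAgent{}
	blockSize, err := a.GetGuestMemoryBlockSize()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("guest memory block size: %d MiB\n", blockSize>>20)
}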

@cedriccchen (Contributor)

@miaoyq @devimc @linzichang Maybe we can get the memory block size from the agent at kata-runtime create or kata-runtime start, then store it in a file. That way we don't need to ask the agent for the block size on every hotplug.
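A minimal sketch of that caching idea; the mem_block_size file under a per-sandbox run directory and the query callback are assumptions, not existing kata-runtime behaviour:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// cachedBlockSize returns the memory block size for a sandbox, calling the
// agent (via query) only on the first use and caching the result in a file.
func cachedBlockSize(runDir string, query func() (uint64, error)) (uint64, error) {
	cache := filepath.Join(runDir, "mem_block_size")
	if data, err := os.ReadFile(cache); err == nil {
		return strconv.ParseUint(string(data), 10, 64)
	}
	size, err := query()
	if err != nil {
		return 0, err
	}
	return size, os.WriteFile(cache, []byte(strconv.FormatUint(size, 10)), 0o600)
}

func main() {
	dir, _ := os.MkdirTemp("", "sandbox")
	defer os.RemoveAll(dir)
	query := func() (uint64, error) { return 128 << 20, nil } // stand-in for the agent call
	size, _ := cachedBlockSize(dir, query)
	fmt.Printf("block size: %d MiB\n", size>>20)
}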

@miaoyq (Author) commented Aug 31, 2018

@clarecch Agree.

@cedriccchen (Contributor)

there are still some points that haven't been discussed:

How many mem slots should a VM have?
Should the number of mem slots be configurable?
As with CPU sockets, each mem slot consumes memory; how many MBs?

@grahamwhaley Do you have any ideas about memory slots? I'm not very familiar with them. I have read clearcontainers/runtime#380. Was there any outcome from the memory slot discussion?

@grahamwhaley (Contributor) commented Aug 31, 2018

How many mem slots should a VM have?
Should the number of mem slots be configurable?
As with CPU sockets, each mem slot consumes memory; how many MBs?
@grahamwhaley Do you have any ideas about memory slots? I'm not very familiar with them. I have read clearcontainers/runtime#380. Was there any outcome from the memory slot discussion?

Hi @clarecch - heh, good digging, finding that 1yr old thread! I think any measurements we did before on number-of-slot overheads (~1.5yr ago now) will be out of date - we would have to re-run them if we need that data.

But, I think this will always be difficult - I very much doubt we can find a default that will suit all situations and users. I think we need to make this configurable (in the runtime toml config file), and set some sensible defaults. The only sensible default I can think of might be:

  • split MAXMEM into a fixed number of slots (say 16 or 32 - with some math/rounding to make sure we end up on nice boundaries etc.). This has the flexibility that it will probably work on all systems (big and small memory systems).

In the config file I think we should probably offer two config options to give the user flexibility (see the sketch after this list):

  • the ability to set the number of slots (so each slot is MAXMEM/nslots)
  • the ability to set the slot size (so nslots = MAXMEM/slotsize)
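A minimal sketch of how those two options could interact; maxMemMB, slots and slotSizeMB are hypothetical config values, not existing kata-runtime settings:

package main

import "fmt"

// slotPlan derives the number of slots and the per-slot size from the two
// hypothetical config options; at most one of slots or slotSizeMB is set.
func slotPlan(maxMemMB, slots, slotSizeMB uint32) (nslots, sizeMB uint32) {
	switch {
	case slots != 0:
		return slots, maxMemMB / slots
	case slotSizeMB != 0:
		return maxMemMB / slotSizeMB, slotSizeMB
	default:
		return 16, maxMemMB / 16 // fall back to a fixed default of 16 slots
	}
}

func main() {
	// A host with 64 GiB of MAXMEM and a user-requested slot size of 1 GiB.
	n, s := slotPlan(64*1024, 0, 1024)
	fmt.Printf("%d slots of %d MiB\n", n, s) // 64 slots of 1024 MiB
}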

@cedriccchen (Contributor)

@grahamwhaley very enlightening!

@grahamwhaley (Contributor)

Maybe I'm thinking about this wrong as well... I don't think slots have a fixed size, do they? (That is, I think you can plug different-sized DIMMs into each slot in QEMU.)
In that case, maybe the number of slots does not really relate to the MAXSIZE of memory, but rather to how many times we expect to hot-add memory to a pod.
Do we only expect to hotplug one DIMM for each extra container added to a pod? In that case, the number of slots relates directly to the maximum number of containers we ever expect to add to a pod. That may change our view here. 8 or 16 slots still feels sensible to me - but does anybody have evidence or numbers around how many containers make up a pod?

@cedriccchen (Contributor)

I think in some cases we will hotplug multiple DIMMs: one (or more) for each container in a pod that contains multiple containers.

@linzichang (Contributor) commented Sep 4, 2018

@devimc I changed defaultMemSlots from 2 to 255. The PSS of the QEMU process increases by 11MB.

@grahamwhaley (Contributor)

Thanks @linzichang, thanks for grabbing that data. 11Mb over how much - what is the total PSS (so we can see what % increase there is)?
Also, was that with KSM enabled on the host or not, and how many containers did you run?

Our PSS footprint can be between maybe 48Mb (20 containers, KSM) and 135Mb (20 containers, no KSM), for instance on my local system - in which case 11Mb could represent anywhere from a 23% to an 8% footprint increase. Both of those I would call significant enough to need a discussion :-)

You could maybe use the reportgen from the tests repo to generate us a report.
Over at: https://github.com/kata-containers/tests/tree/master/metrics/report
If you run the grabdata.sh with and without your changes installed and then makereport.sh, the resulting pdf report may help us assess.

grabdata can take some time, and if you are only looking at the density/footprint then the version on the pending PR kata-containers/tests#650 allows you to tell grabdata to only run a subset of tests (like, just density). I think it would also be good in this case to see if there is any significant change in boot time as well though.

virtLog.Debugf("hot adding %d B memory", mem)
sizeMB := int(mem / 1024 / 1024)
// sizeMB needs to be divisible by 2
sizeMB = sizeMB + sizeMB%2
Review comment (Contributor):

It might be better to move the log call to here and log sizeMB (and maybe mem too):

virtLog.WithField("memory-mb", sizeMB).Info("hot adding memory")

Same comment for the log call in removeResources().

@@ -605,6 +605,11 @@ func ContainerConfig(ocispec CompatOCISpec, bundlePath, cid, console string, det
resources.VCPUs = uint32(utils.ConstraintsToVCPUs(*ocispec.Linux.Resources.CPU.Quota, *ocispec.Linux.Resources.CPU.Period))
}
}
if ocispec.Linux.Resources.Memory != nil {
if ocispec.Linux.Resources.Memory.Limit != nil {
resources.Mem = uint32(*ocispec.Linux.Resources.Memory.Limit)
Review comment (Contributor):

This is casting an int64 into a uint32:

Maybe ContainerResources.Mem should be changed to int64 too?
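A minimal sketch of why the narrowing cast matters; the 5 GiB figure is just an illustrative limit:

package main

import "fmt"

func main() {
	// A 5 GiB memory limit, as the OCI spec stores it (int64 bytes).
	limit := int64(5 * 1024 * 1024 * 1024)

	// Narrowing to uint32 silently wraps for limits of 4 GiB or more.
	truncated := uint32(limit)

	fmt.Println(limit)     // 5368709120
	fmt.Println(truncated) // 1073741824, i.e. the 5 GiB limit wrapped to 1 GiB
}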

@devimc commented Sep 10, 2018

@linzichang

@devimc I changed defaultMemSlots from 2 to 255. The PSS of the QEMU process increases by 11MB.

thanks, have you measured how much memory the guest kernel consumes? I wouldn't like to see issue #295 again, but now with memory slots.

@linzichang (Contributor)

@devimc You have @'d the wrong guy twice :)

@devimc commented Sep 10, 2018

@linzichang 😄 I'm sorry

@sboeuf left a comment

@miaoyq please rework this PR according to all the comments you've got. Thanks!

sizeMB := int(mem / 1024 / 1024)
// sizeMB needs to be divisible by 2
sizeMB = sizeMB + sizeMB%2
_, err := c.sandbox.hypervisor.hotplugRemoveDevice(&memoryDevice{1, sizeMB}, memoryDev)
Review comment:

We don't support hot memory unplug. This code should be removed from this PR as it will make the code fail.

@linzichang (Contributor)

@sboeuf @devimc If @miaoyq has no time to work on this, @clarecch and I can help rework it.

@miaoyq (Author) commented Sep 11, 2018

@sboeuf @linzichang @clarecch I'm working on this.
If I have any doubts, I will discuss with you. Thanks. :-)

@linzichang (Contributor)

@miaoyq I need this feature soon. We have already talked about the solution in #624 (comment). I will implement some of the methods we both need for updating memory, to help speed up this PR. Let's get this done quicker together. Thank you very much for reworking this :)

@linzichang (Contributor)

@miaoyq I think you should first take care of the memory footprint we mention in #580 (comment) #580 (comment) #580 (comment), because I haven't tested it thoroughly yet.

@miaoyq (Author) commented Sep 11, 2018

@linzichang I think you're a bit more familiar with this, so in order not to affect your usage, feel free to rework it. :-)

@sboeuf added the enhancement (Improvement to an existing feature) and wip labels Sep 12, 2018
@egernst mentioned this pull request Sep 18, 2018
@raravena80 (Member)

Hi, @linzichang @miaoyq any updates on this? thx!

@miaoyq (Author) commented Sep 25, 2018

@raravena80
I think @linzichang and @clarecch have finished this locally.
@linzichang Could you submit this feature as a new PR?

@linzichang (Contributor)

@miaoyq @raravena80 I will open a new PR in the next few days.

@miaoyq (Author) commented Sep 25, 2018

@linzichang Thanks. :-)

@linzichang (Contributor)

Rework PR #786

@miaoyq (Author) commented Sep 26, 2018

Rework PR #786

@linzichang I will close this PR, thanks!

@miaoyq miaoyq closed this Sep 26, 2018
@miaoyq miaoyq deleted the k8s-mem-resoure-hotplug branch April 23, 2019 02:13