This repository has been archived by the owner on May 12, 2021. It is now read-only.

support PMEM inside Kata Containers when running under Kubernetes #2262

Closed
pohly opened this issue Nov 25, 2019 · 39 comments · Fixed by #2515
Assignees: devimc
Labels: feature (New functionality), needs-review (Needs to be assessed by the team)

Comments


pohly commented Nov 25, 2019

When using PMEM-CSI to manage PMEM storage, individual apps are going to have volumes created for them by Kubernetes with that storage driver and then want to use them like RAM, i.e. mmap a file or the entire volume. Depending on the application, MAP_SYNC and thus additional persistency guarantees will be needed.

Filesystem volumes currently get passed into kata-containers with 9p or virtio-fs. Neither of them supports MAP_SYNC. While virtio-fs supports mmap, performance is likely to be lower than native access because not all pages can be mapped at once (see below).

Block volumes are passed in as SCSI disk devices which is even worse (no mmap).

Describe the solution you'd like

A way to pass in the volume such that all of it can be mapped into an application's address space with MAP_SYNC semantics. Once that mapping exists, applications should be able to read and write bytes with native performance (= as if they weren't running under Kata Containers).

At this point, the most promising approach for achieving this seems to be to detect such special volumes and map them to a QEMU memory backend object and nvdimm device (https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt). How to activate this special behavior is to be decided.
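As a rough sketch of that approach (all paths and sizes below are hypothetical placeholders, not part of this proposal), QEMU's nvdimm.txt combines a file-backed memory object with an nvdimm device roughly like this:

```shell
# Hypothetical sketch following QEMU's docs/nvdimm.txt: expose a file that
# lives on PMEM as an nvdimm device inside the guest.
# share=on,pmem=on require QEMU built with --enable-libpmem; maxmem must be
# large enough for RAM plus the nvdimm backing file.
qemu-system-x86_64 \
    -machine pc,nvdimm=on \
    -m 2G,slots=2,maxmem=8G \
    -object memory-backend-file,id=mem1,share=on,pmem=on,mem-path=/path/to/volume.img,size=4G \
    -device nvdimm,id=nvdimm1,memdev=mem1
```

Inside the guest this shows up as /dev/pmem0, which can then be mounted with DAX.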

Describe alternatives you've considered

virtio-fs was considered, but doesn't meet all objectives because:

  • It does not map all pages at once. Instead, it maintains a cache of mapped pages which is considerably smaller ("a few GB") than the available PMEM. This should lead to lower performance and unpredictable latency spikes.
  • Because a page might not be currently mapped when written to, it does not meet MAP_SYNC requirements.

Before raising this feature request

This was discussed on freenode IRC, #kata-dev, on 2019-11-25:

(11:42:07 AM) pohly: stefanha: hello. I am trying to understand how (and how well) virtio-fs supports mmap. Background: I work on PMEM-CSI, a driver which enables the use of PMEM in Kubernetes. Ultimately the goal is that an application can do mmap(MAP_SYNC) and then do byte reads/writes directly to the underlying hardware. That works without kata-containers involved. I now looked at kata-containers 1.9.1 with kata-qemu-virtiofs. I can see that this passes the dax-capable filesystem (XFS, in case that matters) into the qemu instance with virtiofs. A test program can do mmap(MAP_SYNC) on a file.
(11:43:13 AM) pohly: But... it can also do that with 9p as file system and with the container root filesystem served by virtio-fs although that filesystem on the host does not support dax (hosted by plain SSD).
(11:45:00 AM) pohly: I was under the (perhaps mistaken) impression that virtio-fs would somehow support mmap. I thought I had read that somewhere. Is that really true?
(11:46:42 AM) pohly: I checked /proc/<pid>/maps for the /opt/kata/bin/qemu-virtiofs-system-x86_64 process that runs the pod. It doesn't have any entry for the file that currently is mapped inside the container.
(11:51:52 AM) brtknr: pohly: following this discussion
(11:57:44 AM) davidgiluk: pohly: is the mount mounted with DAX?
(11:58:25 AM) pohly: Yes: kataShared on /data type virtio_fs (rw,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other,dax)
(11:58:35 AM) pohly: That is inside qemu.
(11:59:16 AM) pohly: And also outside of it: /dev/mapper/ndbus0region0fsdax-e7660acd0fd86e6aea32589af51903654f6a4e41 on /var/lib/kubelet/pods/6576fed5-5488-4ee4-a6a2-578c5519ae9c/volumes/kubernetes.io~csi/my-csi-volume/mount type xfs (rw,relatime,attr2,dax,inode64,noquota)
(12:00:53 PM) stefanha: pohly: virtio-fs isn't intended for pmem. QEMU won't use MAP_SYNC.
(12:01:10 PM) stefanha: pohly: If you need MAP_SYNC semantics then QEMU's nvdimm device can do that.
(12:01:37 PM) stefanha: pohly: MAP_SYNC support could be added to virtio-fs but today it doesn't do that.
(12:01:48 PM) pohly: stefanha: if virtio-fs doesn't support MAP_SYNC, shouldn't it then reject the mmap call?
(12:03:02 PM) stefanha: pohly: Probably. Inside the guest the virtio-fs and FUSE code isn't doing anything that violates MAP_SYNC,
(12:03:19 PM) stefanha: but the problem is that the host side doesn't necessarily honor those semantics.
(12:03:48 PM) pohly: But plain mmap works?
(12:03:57 PM) stefanha: pohly: Yep, plain mmap is supported.
(12:04:13 PM) pohly: Should I then see a /proc/<pid>/maps entry for the file? I don't have that.
(12:04:50 PM) pohly: Or am I checking the wrong process? I looked at qemu-virtiofs-system-x86_64, because that is where the code runs.
(12:05:39 PM) stefanha: pohly: There isn't necessarily a 1:1 mmap relationship between guest application mmaps and host qemu-virtiofs-system-x86_64 mmaps.
(12:05:54 PM) stefanha: pohly: What are you trying to confirm by looking at qemu-virtiofs-system-x86_64 mmaps?
(12:06:26 PM) pohly: Looking more closely I do see one entry that has at least the right size: 7f2f1bffe000-7f2f1bfff000 ---p 00000000 00:00 0
(12:06:40 PM) pohly: But it doesn't have a file name associated with it. Should it have that?
(12:07:06 PM) pohly: I am trying to verify that a file on the host has indeed been mapped into the address space of the process running inside qemu.
(12:07:55 PM) pohly: If that isn't the case, then how does mmap support work?
(12:07:56 PM) stefanha: pohly: The lack of filename could be due to file descriptor passing
(12:08:10 PM) stefanha: The file is opened by virtiofsd and passed to QEMU. Maybe that's why no name is reported.
(12:08:15 PM) stefanha: But that's just a guess.
(12:08:30 PM) pohly: That might be it. Let me remove the mapping inside qemu...
(12:08:31 PM) davidgiluk: the name normally does show up
(12:08:53 PM) davidgiluk: pohly: Have you accessed the mmap'd area, or just done the mmap?
(12:10:20 PM) pohly: Just the mmap. So it's waiting for a page fault before doing anything on the host side? I can add that.
(12:11:12 PM) stefanha: Yes, that sounds likely.
(12:11:52 PM) stefanha: pohly: But again, if your goal is to get pmem semantics then virtio-fs in its current state doesn't guarantee that.
(12:12:15 PM) davidgiluk: pohly: Yes, I think so - remember for virtiofs we only have a fixed sized cache window, so we can't guarantee to mmap the whole region
(12:12:19 PM) stefanha: pohly: QEMU has -device nvdimm and -device virtio-pmem-pci for that.
(12:13:36 PM) pohly: Using those for a mounted filesystem in kata-containers isn't going to be easy.
(12:14:22 PM) pohly: virtio-fs looked much more promising ;-}
(12:15:00 PM) davidgiluk: stefanha: What stops us passing the MAP_SYNC all the way through?
(12:17:06 PM) pohly: davidgiluk: even if you do, "fixed size cache window" sounds like another big roadblock. PMEM comes in higher capacity than DRAM, that's partly why it is appealing for some workloads.
(12:18:10 PM) pohly: MAP_SYNC isn't even needed for all workloads. In fact, most apps currently don't depend on it.
(12:18:33 PM) pohly: So virtio-fs may already be a good step forward and sufficient.
(12:19:09 PM) pohly: OTOH, if it needs to set up and tear down mappings on the host side often, then that may affect performance.
(12:20:07 PM) pohly: memcached uses PMEM as DRAM replacement and stores its data there. Predictable access times for that data probably is important.
(12:20:41 PM) davidgiluk: pohly: Right; if you've got a single PMEM device to pass through then as stefan says using the -device stuff is the right way; if you're trying to pass through files that on the host are mounted on a filesystem that's backed by pmem, then virtiofs might be interesting
(12:22:06 PM) pohly: davidgiluk: we are trying the former. PMEM-CSI basically splits up a single PMEM device and hands out portions of it to individual apps. We cannot assume that only a single app uses that device; that would be rather limiting.
(12:22:26 PM) pohly: Ahem, I meant "we are trying the latter"...
(12:23:23 PM) davidgiluk: pohly: But does the PMEM-CSI portions look like individual block devices that you then put a filesystem on, and is that filesystem built in the host or the guest?
(12:26:10 PM) pohly: davidgiluk: it is a block device. But applications in Kubernetes typically will ask for a filesystem, so PMEM-CSI formats and mounts that device.
(12:26:32 PM) pohly: And then Kubernetes passes the directory name of the mounted FS to the runtime.
(12:27:01 PM) pohly: I heard that kata-containers sometimes does tricks like then passing the device into qemu and mounting again inside.
(12:27:38 PM) pohly: That's a bit dirty, because there are two Linux kernels which both might write to the same block device.
(12:27:40 PM) davidgiluk: pohly: OK, if it's a device+filesystem just for that container then it does feel like passing that block device into the container is right rather than passing the filesystem through virtiofs
(12:28:27 PM) pohly: davidgiluk: yes, that would be the better alternative, except for the "is already mounted" part.
(12:29:49 PM) pohly: Also, does it have to be some actual device? Currently the block devices are either LVM logical volumes or PMEM namespaces (/dev/pmem*).
(12:30:16 PM) pohly: We can't use PCI device pass-through - it's not even on the PCI bus.
(12:30:32 PM) pohly: Nor do we want to pass in the entire NVDIMM.
(12:39:19 PM) davidgiluk: pohly: Does one of the chunks of a PMEM-CSI look like a pmem device? i.e. would it make sense to pass that in using -device nvdimm or virtio-pmem-pci ?
(12:40:58 PM) pohly: davidgiluk: I need to check what those options expect, but for the LVM case the answer is probably "no" - it's just a logical volume.
(12:43:14 PM) pohly: Hmm, according to https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt the "mem-path" can be an ordinary file. So we could just point that at the block device.
(12:45:06 PM) pohly: But how would kata-containers even recognize that it needs to do this special handling? All it gets is a path to a mounted filesystem or a loop device (block mode, which also works in Kubernetes).
(12:46:03 PM) pohly: This sounds doable, but I fear that it will be rather hacky and I have no idea how many different components need to be adapted to make this work.
(12:47:41 PM) pohly: May I copy this discussion into an issue in https://github.com/kata-containers/runtime/issues? Is that the right tracker for "add PMEM support to kata-containers"?
(12:51:49 PM) davidgiluk: pohly: Yeh probably best to make an issue; I'm also not sure the best way to wire it through - but if it looks like a block device, and that block device is intended just for this container, then treat it as a block device and let the guest handle it
(12:52:58 PM) gwhaley: pohly: include 'devimc' on that Issue, if not already - he'll have a good idea I think of what knitting would be required.
(12:53:27 PM) gwhaley: yes, the hard bit is how to annotate that volume/mount/device to ensure it ends up mapped via the correct route. It may be that 'annotations' are the route.
(12:53:40 PM) gwhaley: oh, amshinde might have good input as well
(12:55:03 PM) gwhaley: so, historically we've always noted that nvdimm/dax could be used to pass items in (kata uses it for iirc the kernel image, or is it the rootfs....) - but, I don't believe there is a defined mechanism to set that all up via the orchestrators and runtime, and I don't think I've ever seen anybody actually using an nvdimm/dax mount/map for themselves ... yet....
(12:59:52 PM) pohly: gwhaley: /opt/kata/share/kata-containers/kata-containers-image_clearlinux_1.9.1_agent_d4bbd8007f.img is passed via "-object" + "-device nvdimm".
(01:00:47 PM) pohly: Looks like the rootfs. There's also "root=/dev/pmem0p1".
(01:00:50 PM) gwhaley: pohly: right, the rootfs for the VM (I can never remember if it is the rootfs or the kernel we do it with ;-) )... so, we use it, we know it works.... now it would be how do we enable 'users' to do it...
(01:03:11 PM) pohly: davidgiluk: to get closure on this: when actually writing into the memory mapped region via virtio-fs, I do see map entries on the host side, including the file name.
(01:03:41 PM) pohly: davidgiluk: how large is this "fixed size cache window"?
(01:04:51 PM) davidgiluk: pohly: It's configurable via an option, normally a few GB
(01:05:08 PM) gwhaley: https://github.com/kata-containers/runtime/blob/master/cli/config/configuration-qemu-virtiofs.toml.in#L118-L131 :-)
(01:05:24 PM) ***davidgiluk disappears for 2 hours
(01:05:31 PM) ***gwhaley goes for lunch...
(01:06:43 PM) pohly: So a lot less than the hundreds of GB that people may have as PMEM. Might be worth testing how that affects performance. Thanks!

CC @devimc @GabyCT

@pohly pohly added feature New functionality needs-review Needs to be assessed by the team. labels Nov 25, 2019

devimc commented Nov 25, 2019

@pohly thanks for raising this, I'll take a look later

@devimc devimc self-assigned this Nov 25, 2019

pohly commented Dec 10, 2019

Here are the options that we investigated in combination with the nvdimm device support in QEMU, none of which work, and why:

  • create a "fsdax" namespace, use /dev/pmem0.1 as backing store for an nvdimm device: QEMU cannot mmap /dev/pmem0.1, will silently go through the host page cache, and inside the VM we don't have DAX semantics although it looks to the guest kernel as if we do
  • create an LVM LV on a "fsdax" namespace: same problem
  • create a "devdax" namespace (= /dev/dax0.0): can be used as backing store with proper DAX semantics in QEMU, but cannot be mounted on the host, which is a requirement because the volume may have to move between apps running under Kata Containers and those which don't
  • create a file on a DAX-capable filesystem, format it as a filesystem via /dev/loop, use it as backing store for QEMU: appears as a "raw" namespace (= /dev/pmem0) inside QEMU and can only be mounted without -odax there; /dev/loop on the host cannot be mounted with -odax (a limitation of the Linux kernel)
  • as before, but inside QEMU convert from "raw" to "fsdax" (ndctl create-namespace -f -e namespace0.0 --mode=memory inside QEMU, from https://nvdimm.wiki.kernel.org/): this has some space overhead, changes the size of /dev/pmem0, and an existing filesystem would get destroyed
  • a file on a DAX-capable filesystem, partitioned similarly to the Kata Containers rootfs (MBR #1, DAX metadata, MBR #2, rootfs), pass the entire file into QEMU, loop-mount just the rootfs on the host: this results in a /dev/pmem0p1 inside QEMU which supports -odax properly, but on the host we are still stuck because of the /dev/loop limitation

The last option is the one which could be made to work by enhancing the Linux kernel so that /dev/loop supports -odax when it binds a file which itself is on a DAX-capable filesystem (or in general supports mmap(MAP_SYNC)). We checked with Dan Williams and he confirmed that this could be implemented. He also said that it would be generally useful. It's now in the backlog of his team, with an unknown ETA.

@devimc: do you have links to documentation and/or code for the Kata Container rootfs?

@grahamwhaley
Contributor

@pohly - the best resource for kata rootfs info is going to be over in the osbuilder repo at https://github.com/kata-containers/osbuilder/tree/master/rootfs-builder


pohly commented Dec 10, 2019

The underlying spec is https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf


pohly commented Dec 18, 2019

@devimc: I have image creation working in PMEM-CSI such that it works under QEMU. But I am still tweaking the code and then need to hook it into volume creation through Kubernetes. At that point it would be great if you could also enhance Kata Containers to pass such special volumes into QEMU with an nvdimm device. I'll ping you when PMEM-CSI is ready for that.

In the meantime, one more question: why is the rootfs at 3MiB (https://github.com/kata-containers/osbuilder/blob/dbbf16082da3de37d89af0783e023269210b2c91/image-builder/image_builder.sh#L98-L102)?

https://nvdimm.wiki.kernel.org/2mib_fs_dax states that partitions must be aligned at multiples of 2MiB for huge pages to work. /dev/pmem0 inside QEMU starts at MBR#2 (right?), and then /dev/pmem0p1 starts at the 1MiB offset relative to that, i.e. it is not aligned properly.
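The arithmetic behind that claim can be checked directly; a small sketch using the offsets from this discussion:

```shell
# Alignment check for the layout discussed above: /dev/pmem0 starts at
# MBR#2 (2MiB into the image) and /dev/pmem0p1 a further 1MiB in, so the
# partition sits at 3MiB absolute, which is not a multiple of 2MiB.
pmem0_start=$((2 * 1024 * 1024))  # image offset where /dev/pmem0 begins
p1_offset=$((1 * 1024 * 1024))   # partition offset relative to /dev/pmem0
align=$((2 * 1024 * 1024))       # 2MiB huge-page alignment requirement
abs=$((pmem0_start + p1_offset))
if [ $((abs % align)) -eq 0 ]; then echo aligned; else echo misaligned; fi
# prints "misaligned"
```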

I haven't found the specification for this "MBR#1 + DAX + MBR#2" content. The NVDIMM_Namespace_Spec.pdf file doesn't cover this, does it? Do you know where this content of an NVDIMM is specified? Or did I miss it in that spec? I only briefly skimmed it.

pohly added a commit to pohly/pmem-CSI that referenced this issue Dec 18, 2019
The resulting file can be used as backing store for a QEMU nvdimm
device. This is based on the approach that is used for the Kata
Containers rootfs
(https://github.com/kata-containers/osbuilder/blob/dbbf16082da3de37d89af0783e023269210b2c91/image-builder/image_builder.sh)
and reuses some of the same code, but also differs from it in some
regards:
- The start of the partition is aligned at a multiple of the 2MiB
  huge page size (kata-containers/runtime#2262 (comment)).
- The size of the QEMU object is the same as the nominal size of the
  file. In Kata Containers the size is a fixed 128MiB
  (kata-containers/osbuilder#391 (comment)).

devimc commented Dec 18, 2019

@pohly

At that point it would be great if you could also enhance Kata Containers to pass such special volumes into QEMU with an nvdimm device. I'll ping you when PMEM-CSI is ready for that.

sure thing, just let me know when it's ready to use

why is the rootfs at 3MiB (https://github.com/kata-containers/osbuilder/blob/dbbf16082da3de37d89af0783e023269210b2c91/image-builder/image_builder.sh#L98-L102)?

there is no special reason for that; it is just the next MB available for use:
0 - 2 MB -> MBR#1 + DAX
2 - 3 MB -> MBR#2
but if you want you can shrink the first part, so it will look like:
0 - 1 MB -> MBR#1 + DAX
1 - 2 MB -> MBR#2

/dev/pmem0 inside QEMU starts at MBR#2 (right?)

right

and then /dev/pmem0p1 starts at the 1MiB offset relative to that, i.e. it is not aligned properly

yes, it starts there and I think this falls back to 4K page faults (it would be nice to see if aligning to 2M can reduce the number of page faults and hence the memory footprint; what do you think?)

I haven't found the specification for this "MBR#1 + DAX + MBR#2" content. The NVDIMM_Namespace_Spec.pdf file doesn't cover this, does it?

no, it doesn't, and you won't find anything related to this, since this is Kata-specific. This hack was used to support all hypervisors with the same image: kernels and hypervisors that support DAX/NVDIMM read MBR#2, otherwise MBR#1 is read.


pohly commented Dec 18, 2019

but if you want you can shrink the first part, it will look like:
0 - 1 MB -> MBR#1 + DAX
1 - 2 MB -> MBR#2

How does the kernel find the DAX meta information? I was under the impression that it has to be at the fixed offset. Or does it scan all the initial sectors? I really should read that spec carefully... 😁

yes, it starts there and I think this falls back to 4K page faults (it would be nice to see if aligning to 2M can reduce the number of page faults and hence the memory footprint; what do you think?)

It might not matter for the Kata Container rootfs because it's only going to be used for reading and writing files and less for mapping pages. But I am not sure. For PMEM-CSI I'm going to use proper alignment and we'll have to write a test that verifies that huge pages work.

this hack was used to support all hypervisors using the same image: Kernels and hypervisors that support DAX/NVDIMM read the MBR#2, otherwise MBR#1 is read.

So the non-hacky solution would be to drop MBR#1, right? I currently have it in the PMEM-CSI code, but it could also be removed if it turns out to be unnecessary. OTOH, I think it is the MBR that makes file <imagefile> return useful information, so it might be worthwhile to keep it just for that.


devimc commented Dec 18, 2019

How does the kernel find the DAX meta information? I was under the impression that it has to be at the fixed offset. Or does it scan all the initial sectors? I really should read that spec carefully... grin

afaik, this is not documented in any spec; even worse, there is no tool to set it. The short answer is: the NVDIMM driver looks for the NVDIMM signature [1] at a 4KiB offset, take a look [2]
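To make the "signature at 4KiB" concrete: per the nsdax source linked in [1], the pfn info block starts with the string NVDIMM_PFN_INFO. A minimal, unprivileged sketch that writes and re-reads that signature in a scratch file (the real info block carries more fields; only the signature is shown here):

```shell
# Write the 15-byte PFN signature at the 4KiB offset where the kernel
# NVDIMM driver looks for it, then read it back to verify.
img=$(mktemp)
truncate -s 1M "$img"
printf 'NVDIMM_PFN_INFO' | dd of="$img" bs=1 seek=4096 conv=notrunc status=none
sig=$(dd if="$img" bs=1 skip=4096 count=15 status=none)
[ "$sig" = "NVDIMM_PFN_INFO" ] && echo "signature found"
rm -f "$img"
# prints "signature found"
```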

It might not matter for the Kata Container rootfs because it's only going to be used for reading and writing files and less for mapping pages. But I am not sure. For PMEM-CSI I'm going to use proper alignment and we'll have to write a test that verifies that huge pages work.

heads up, you should check different sizes, not just 128M

So the non-hacky solution would be to drop MBR#1, right?

right

I currently have it in the PMEM-CSI code, but it could also be removed if it turns out to be unnecessary. OTOH, I think it is the MBR that makes file <imagefile> return useful information, so it might be worthwhile to keep it just for that.

yes, I recommend keeping it; otherwise you would have to specify an offset (losetup -o) to mount it on the host

[1] - https://github.com/kata-containers/osbuilder/blob/master/image-builder/nsdax.gpl.c#L32
[2] - https://github.com/torvalds/linux/blob/2187f215ebaac73ddbd814696d7c7fa34f0c3de0/drivers/nvdimm/pfn_devs.c#L438-L596


pohly commented Dec 19, 2019

One more thought about MBRs: is MBR #2 really needed? The alternative is to put the filesystem into the space currently used by MBR#2+rootfs and in the VM mount /dev/pmem0. That should work (right?) and it would be simpler.

MBR #1 can be kept for the sake of convenience (losetup without -o, file command).
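For illustration, the convenience that MBR #1 buys is essentially the 2-byte boot signature 0x55AA at offset 510 of sector 0, which tools like file key on when classifying an image. An unprivileged sketch on a scratch file:

```shell
# Write the MBR boot signature (0x55 0xAA at offset 510) into a scratch
# image and read it back; this is the marker `file` uses to report a
# DOS/MBR boot sector.
img=$(mktemp)
truncate -s 1M "$img"
printf '\x55\xaa' | dd of="$img" bs=1 seek=510 conv=notrunc status=none
bootsig=$(dd if="$img" bs=1 skip=510 count=2 status=none | od -An -tx1 | tr -d ' \n')
echo "$bootsig"   # 55aa
rm -f "$img"
```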

pohly added a commit to pohly/pmem-CSI that referenced this issue Dec 19, 2019 (same commit message as above)

devimc commented Dec 19, 2019

One more thought about MBRs: is MBR #2 really needed? The alternative is to put the filesystem into the space currently used by MBR#2+rootfs and in the VM mount /dev/pmem0. That should work (right?) and it would be simpler.

yes, you're right, I included an MBR just in case we want to support swap (useful for DinD, I think)

MBR #1 can be kept for the sake of convenience (losetup without -o, file command).

👍

pohly added a commit to pohly/pmem-CSI that referenced this issue Jan 13, 2020 (same commit message as above)

pohly commented Jan 14, 2020

intel/pmem-csi#500 and its branch https://github.com/pohly/pmem-CSI/commits/kata-containers contain a functional PoC where a new kataContainers: "true" parameter in a storage class or inline volume spec results in the following setup:

  • volume created normally (/dev/mapper/ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770 for LVM mode)
  • that volume mounted
  • an image file created inside that volume with the name kata-containers-pmem-csi-vm.img
  • a loop device bound to that file with the right offset for the actual filesystem
  • that loop device mounted and returned to Kubernetes (and thus the container runtime) as the volume for the app

Example on a QEMU host:

$ _work/pmem-govm/ssh.3 lsblk --bytes
NAME                                                     MAJ:MIN RM         SIZE RO TYPE MOUNTPOINT
loop0                                                      7:0    0   4123000832  0 loop /var/lib/kubelet/pods/5c29c8ae-00be-4198-882f-8e77942fe79d/volumes/kubernetes.io~csi/pvc-5821622a-2a38-421b-80be-359137463927/mount
vda                                                      252:0    0 429496729600  0 disk 
└─vda1                                                   252:1    0 429495664128  0 part /
vdb                                                      252:16   0       380928  0 disk 
pmem0                                                    259:0    0  32212254720  0 disk 
└─ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770 253:0    0   4294967296  0 lvm  /var/lib/kubelet/pods/5c29c8ae-00be-4198-882f-8e77942fe79d/volumes/kubernetes.io~csi/pvc-5821622a-2a38-421b-80be-359137463927/mount/kata-containers-host-volume
$ _work/pmem-govm/ssh.3 losetup
NAME       SIZELIMIT  OFFSET AUTOCLEAR RO BACK-FILE                                                                                                                                                                                       DIO LOG-SEC
/dev/loop0         0 2097152         0  0 /var/lib/kubelet/pods/5c29c8ae-00be-4198-882f-8e77942fe79d/volumes/kubernetes.io~csi/pvc-5821622a-2a38-421b-80be-359137463927/mount/kata-containers-host-volume/kata-containers-pmem-csi-vm.img   1     512

What Kata Containers needs to do is:

  • check if a mounted volume is backed by a loop device
  • check if that loop device is attached to a file called kata-containers-host-volume/kata-containers-pmem-csi-vm.img
  • if yes:
    • unmount the volume
    • set up an nvdimm device for that file with no namespace labels and with share=on,pmem=on (-object memory-backend-file,id=mem1,share=on,pmem=on,mem-path=/var/lib/kubelet/pods/5c29c8ae-00be-4198-882f-8e77942fe79d/volumes/kubernetes.io~csi/pvc-5821622a-2a38-421b-80be-359137463927/mount/kata-containers-host-volume/kata-containers-pmem-csi-vm.img,size=xxxx -device nvdimm,id=nvdimm1,memdev=mem1)
    • inside the VM, mount /dev/pmem0 with -odax
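The detection steps above boil down to a suffix match on the loop device's backing file, as reported by the kernel under /sys/block/<loop>/loop/backing_file. A sketch with a hypothetical path (pod and PVC IDs are placeholders):

```shell
# Hypothetical backing-file path as it would appear in
# /sys/block/loopN/loop/backing_file; the well-known suffix is the signal
# that this volume should get the nvdimm passthrough treatment.
backing_file="/var/lib/kubelet/pods/abc/volumes/kubernetes.io~csi/pvc-123/mount/kata-containers-host-volume/kata-containers-pmem-csi-vm.img"
case "$backing_file" in
  */kata-containers-host-volume/kata-containers-pmem-csi-vm.img) result=match ;;
  *) result=no-match ;;
esac
echo "$result"   # match
```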

QEMU must have been built with --enable-libpmem.

The size=xxxx is the total length of the file.

Unmounting is necessary (a) to avoid accessing the blocks through two different filesystems at the same time (inside QEMU and outside), and (b) because kata-containers-pmem-csi-vm.img is underneath the mounted volume and not accessible while the volume is mounted.

The image file is placed underneath the mount point simply because it was a convenient place. If for some reason unmounting has drawbacks (do we perhaps need to keep it for idempotency?), then I could try to find a different mount point.

We have to mount because Kubernetes expects it. We also cannot pass back any hints to the container runtime; all we can do is pick a unique name such that the checks above are unlikely to match a scenario where Kata Containers should not do the special passthrough.

It should be possible to reproduce the setup above as follows:

  • check out my kata-containers branch
  • set up a local Docker registry (https://docs.docker.com/registry/deploying/)
  • make push-test-image
  • TEST_DISTRO=fedora make start
  • set KUBECONFIG
  • kubectl create -f deploy/common/pmem-storageclass-kata.yaml
  • kubectl create -f deploy/common/pmem-kata-pvc.yaml
  • kubectl create -f deploy/common/pmem-kata-app.yaml

To test with Kata Containers, install it in the cluster and edit pmem-kata-app.yaml to have the necessary runtime class.


devimc commented Jan 14, 2020

thanks @pohly, I have some questions

an image file created inside that volume with the name kata-containers-pmem-csi-vm.img

does this image contain DAX metadata at a 4k offset? one partition, right?

  • an image file created inside that volume with the name kata-containers-pmem-csi-vm.img
  • a loop device bound to that file with the right offset for the actual filesystem
  • that loop device mounted and returned to Kubernetes (and thus the container runtime) as the volume for the app

help me to understand this part,

  1. volume ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770 is created
    and mounted here /var/lib/kubelet/pods/5c29c8ae-00be-4198-882f-8e77942fe79d/volumes/kubernetes.io~csi/pvc-5821622a-2a38-421b-80be-359137463927/mount/kata-containers-host-volume ?

  2. kata-containers-pmem-csi-vm.img is part of ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770 and a loop device is created for this img file (loop), right?

this is the part that I really do not understand:
3. loop0 is mounted at /var/lib/kubelet/pods/5c29c8ae-00be-4198-882f-8e77942fe79d/volumes/kubernetes.io~csi/pvc-5821622a-2a38-421b-80be-359137463927/mount, which is the parent directory of where ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770 is mounted. So ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770 is mounted under the loop device whose backing file (kata-containers-pmem-csi-vm.img) is part of ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770, right? what kind of black magic is this? which came first, the chicken or the egg?


pohly commented Jan 14, 2020

does this image contain DAX metadata at a 4k offset? one partition, right?

Yes, and yes: https://github.com/intel/pmem-csi/blob/3b83db4eee92eb12594d0a42d515306fcf871ecc/pkg/imagefile/imagefile.go#L24-L38

This probably will look familiar 😁 I just removed the MBR #2.

kata-containers-pmem-csi-vm.img is part of ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770 and a loop device is created for this img file (loop), right?

Correct.

so ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770 is mounted in the loop device whose backed file (kata-containers-pmem-csi-vm.img) is part of ndbus0region0-1cf34b9421df04b14f1924766c61509182b05770, right? what kind of black magic is this? which came first the chicken or the egg?

First came kata-containers-pmem-csi-vm.img, then the loop device (which keeps that file open), then the mount at /var/lib/kubelet/pods/5c29c8ae-00be-4198-882f-8e77942fe79d/volumes/kubernetes.io~csi/pvc-5821622a-2a38-421b-80be-359137463927/mount. It doesn't matter that this mount point is a non-empty directory; one can mount on top of it anyway. The result is that the content inside it still exists, but it can't be seen anymore, because ...37463927/mount is now the newly mounted loop device (= kata-containers-pmem-csi-vm.img).


devimc commented Jan 14, 2020

First came the kata-containers-pmem-csi-vm.img, then the loop device (which keeps that file open), then the mount at /var/lib/kubelet/pods/5c29c8ae-00be-4198-882f-8e77942fe79d/volumes/kubernetes.io~csi/pvc-5821622a-2a38-421b-80be-359137463927/mount. It doesn't matter that this mount point is a non-empty directory, one can mount on top of it anyway. The result is that the content inside it still exists, but it can't be seen anymore because ...37463927/mount is now the newly mounted loop device (= kata-containers-pmem-csi-vm.img).

are you aware that changes in both host and guest filesystems won't be reflected? i.e. kubectl cp ... may not work

check if that loop device is attached to a file called kata-containers-host-volume/kata-containers-pmem-csi-vm.img

I don't like this part, since we are forcing the use of a specific file name. How about using the DAX metadata (does the loop device contain DAX metadata at the 4k offset?) to determine whether the volume should be unmounted and the img file used as the backend for an nvdimm device?

@pohly
Copy link
Author

pohly commented Jan 14, 2020

are you aware that changes in both host and guest filesystems won't be reflected?

What do you mean? Kubernetes only ever gets to see the content of the image file, never of the volume that contains the image file.

I don't like this part, since we are forcing the use of a specific file name. How about using the DAX metadata (does the loop device contain DAX metadata at the 4k offset?) to determine whether the volume should be unmounted and the img file used as the backend for an nvdimm device?

It's harder to implement for you (you will have to implement DAX metadata parsing instead of doing a string compare). I'm undecided whether that gives us a better indicator for "treat this in a special way" than picking a well-known filename, but I don't have any strong objections either.

However, currently this doesn't work: the loop device is attached with a 2MiB offset (i.e. covers just the filesystem) and thus you cannot read the DAX metadata through the loop device. You would have to unmount first, but you don't know yet whether you need to unmount.

Let me check tomorrow whether I can move this internal mount point somewhere else where it isn't shadowed by the final mount.

@devimc
Copy link

devimc commented Jan 14, 2020

What do you mean? Kubernetes only ever gets to see the content of the image file, never of the volume that contains the image file.

same as devicemapper: copying to and from these volumes won't work because host and guest don't share a directory, they share a device, and changes are not reflected. For example, a file created in the guest should be visible in the host where the volume is mounted. Have you tried this?

It's harder to implement for you (you will have to implement DAX metadata parsing instead of doing a string compare)

actually, we don't need a full parser; looking for the pfn signature [1] should be enough, what do you think?

However, currently this doesn't work: the loop device is attached with a 2MiB offset (i.e. covers just the filesystem) and thus you cannot read the DAX metadata through the loop device. You would have to unmount first, but you don't know yet whether you need to unmount.

do I need to unmount it to get the backing file (kata-containers-pmem-csi-vm.img)?

Let me check tomorrow whether I can move this internal mount point somewhere else where it isn't shadowed by the final mount.

if we can access the backing file (img file), this won't be needed

[1] - https://github.com/kata-containers/osbuilder/blob/master/image-builder/nsdax.gpl.c#L32
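The check proposed here amounts to reading a few bytes at the 4 KiB offset and comparing them against the PFN signature from nsdax.gpl.c. A minimal sketch, self-contained for illustration (it fabricates a dummy image first; a real implementation would read the loop device's backing file, and the exact signature length/NUL handling should be taken from the kernel's pfn.h):

```shell
# Hypothetical PFN-signature probe; /tmp/example.img is a stand-in for
# the volume's backing file.
img=/tmp/example.img

# Fabricate a dummy image carrying the signature at offset 4096.
truncate -s 8K "$img"
printf 'NVDIMM_PFN_INFO' | dd of="$img" bs=1 seek=4096 conv=notrunc status=none

# The actual check: read 15 bytes at the 4 KiB offset and compare.
sig=$(dd if="$img" bs=1 skip=4096 count=15 status=none)
if [ "$sig" = "NVDIMM_PFN_INFO" ]; then
  echo "pfn signature found"     # prints "pfn signature found"
else
  echo "no pfn signature"
fi
```

If the signature is present, the runtime can treat the file as an nvdimm backend instead of mounting it as an ordinary filesystem.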

@pohly
Copy link
Author

pohly commented Jan 15, 2020

same as devicemapper: copying to and from these volumes won't work because host and guest don't share a directory, they share a device, and changes are not reflected. For example, a file created in the guest should be visible in the host where the volume is mounted.

Only one pod gets access to the volume at any time, so this isn't a problem.

actually, we don't need a full parser, look for the pfn signature[1] should be enough, what do you think?

Yes, might be good enough.

do I need to unmount it to get the backing file (kata-containers-pmem-csi-vm.img)?

Yes.

@devimc
Copy link

devimc commented Jan 15, 2020

Let me check tomorrow whether I can move this internal mount point somewhere else where it isn't shadowed by the final mount.

ok, let me know if you can move it

@pohly
Copy link
Author

pohly commented Jan 15, 2020

I pushed one additional commit which moves the image file into something like /var/lib/pmem-csi.intel.com.mount/csi-249d0639d5246da9070cb9da98b9d329ac6d7823f573fa43e2f7ad6dfdf3a72b/kata-containers-pmem-csi-vm.img.

@devimc
Copy link

devimc commented Jan 15, 2020

I pushed one additional commit which moves the image file into something like /var/lib/pmem-csi.intel.com.mount/csi-249d0639d5246da9070cb9da98b9d329ac6d7823f573fa43e2f7ad6dfdf3a72b/kata-containers-pmem-csi-vm.img.

does this mean that unmounting the loop device is no longer required to get the img file?

@pohly
Copy link
Author

pohly commented Jan 15, 2020

does this mean that unmounting the loop device is no longer required to get the img file?

Correct.

It might be better to unmount anyway once it has been determined that the special treatment is necessary, just to be on the safe side regarding conflicting writes. Nothing should be using the mounted filesystem on the host side, but who knows what it might write by itself anyway...

@pohly
Copy link
Author

pohly commented Feb 26, 2020

Hmm, I get the full path:

$ TEST_DISTRO=fedora make start
...
$ export KUBECONFIG=/nvme/gopath/src/github.com/intel/pmem-csi/_work/fedora/kube.config
$ kubectl create -f deploy/common/pmem-app-ephemeral.yaml
$ kubectl get pods -o wide | grep my-csi-app
my-csi-app-inline-volume      1/1     Running   0          25m   10.244.2.2   pmem-csi-pmem-govm-worker1   <none>           <none>
$ _work/pmem-govm/ssh.1 losetup
NAME       SIZELIMIT  OFFSET AUTOCLEAR RO BACK-FILE                                                                                                                              DIO LOG-SEC
/dev/loop0         0 2097152         0  0 /var/lib/pmem-csi.intel.com.mount/csi-2089cfd58dc8909cc7ebd67d1d18ab35cc5c5230b9f0c8b7113e758965369552/kata-containers-pmem-csi-vm.img   1     512

However, this also isn't the path on the host:

$ _work/pmem-govm/ssh.1 ls -l /var/lib/pmem-csi.intel.com.mount/csi-2089cfd58dc8909cc7ebd67d1d18ab35cc5c5230b9f0c8b7113e758965369552/kata-containers-pmem-csi-vm.img
ls: cannot access '/var/lib/pmem-csi.intel.com.mount/csi-2089cfd58dc8909cc7ebd67d1d18ab35cc5c5230b9f0c8b7113e758965369552/kata-containers-pmem-csi-vm.img': No such file or directory

That /var/lib/pmem-csi.intel.com.mount only exists inside the PMEM-CSI driver container. That's a bug, it should also be visible on the host. I'll fix that.

What I can't reproduce is the missing path. Where do you run this losetup command? What's the Linux kernel and OS?

@devimc
Copy link

devimc commented Feb 26, 2020

I'm using clearlinux + crio

$ TEST_DISTRO=clear TEST_DISTRO_VERSION=31760 CLEAR_IMG_VERSION=31760 TEST_CRI=crio make -e start
$ _work/pmem-govm/ssh.3 losetup
NAME       SIZELIMIT  OFFSET AUTOCLEAR RO BACK-FILE                        DIO LOG-SEC
/dev/loop0         0 2097152         0  0 /kata-containers-pmem-csi-vm.img   1     512

@pohly
Copy link
Author

pohly commented Feb 26, 2020

I've tried with the same OS, but still get the full path:

$ _work/pmem-govm/ssh.1 losetup
NAME       SIZELIMIT  OFFSET AUTOCLEAR RO BACK-FILE                                                                                                                              DIO LOG-SEC
/dev/loop0         0 2097152         0  0 /var/lib/pmem-csi.intel.com/mount/csi-42debbc8fbc8d22567525b2f190046d7760c9aca26a14cfca053d1edac559644/kata-containers-pmem-csi-vm.img   1     512
$ _work/pmem-govm/ssh.1 uname -a
Linux pmem-csi-pmem-govm-worker1 5.3.13-406.kvm #2 SMP Mon Nov 25 07:48:06 PST 2019 x86_64 GNU/Linux
$ _work/pmem-govm/ssh.1 cat /etc/os-release 
NAME="Clear Linux OS"
VERSION=1
ID=clear-linux-os
ID_LIKE=clear-linux-os
VERSION_ID=31760
PRETTY_NAME="Clear Linux OS"
ANSI_COLOR="1;35"
HOME_URL="https://clearlinux.org"
SUPPORT_URL="https://clearlinux.org"
BUG_REPORT_URL="mailto:dev@lists.clearlinux.org"
PRIVACY_POLICY_URL="http://www.intel.com/privacy"

Note that this is with the current tip of my local branch which changes the path so that it is inside /var/lib/pmem-csi.intel.com/mount. I still have a bit more work to do before the image file itself shows up there; I seem to be missing bi-directional mount propagation for the state directory.
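For context, mounts created inside a container only become visible on the host when the volume mount in the pod spec uses bidirectional propagation. A hedged sketch of what that looks like for a CSI node driver's state directory; the names and paths below are illustrative, not PMEM-CSI's actual manifest:

```yaml
# Hypothetical excerpt from a node driver DaemonSet: the state directory is
# mounted with Bidirectional propagation so mounts made inside the container
# (e.g. under /var/lib/pmem-csi.intel.com/mount) propagate back to the host.
# The container must be privileged for Bidirectional propagation to work.
containers:
- name: pmem-driver
  securityContext:
    privileged: true
  volumeMounts:
  - name: state-dir
    mountPath: /var/lib/pmem-csi.intel.com
    mountPropagation: Bidirectional
volumes:
- name: state-dir
  hostPath:
    path: /var/lib/pmem-csi.intel.com
    type: DirectoryOrCreate
```

Without this, the loop device's backing file path reported inside the driver container does not exist on the host, which matches the symptom above.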

@pohly
Copy link
Author

pohly commented Feb 26, 2020

FWIW, this was for kubectl create -f deploy/common/pmem-kata-app-ephemeral.yaml which is the fastest way to test volume and pod creation.

@pohly
Copy link
Author

pohly commented Feb 26, 2020

And now that also works, with commit ba3a5f4f:

$ _work/pmem-govm/ssh.3 losetup
NAME       SIZELIMIT  OFFSET AUTOCLEAR RO BACK-FILE                                                                                                                              DIO LOG-SEC
/dev/loop0         0 2097152         0  0 /var/lib/pmem-csi.intel.com/mount/csi-be3cb5042f61fbc404f12da3300f05bc13309852e841d70a2c2b45890421e630/kata-containers-pmem-csi-vm.img   1     512
$ _work/pmem-govm/ssh.3 ls -l /var/lib/pmem-csi.intel.com/mount/csi-be3cb5042f61fbc404f12da3300f05bc13309852e841d70a2c2b45890421e630/kata-containers-pmem-csi-vm.img
-rw-r--r-- 1 root root 2099249152 Feb 26 16:40 /var/lib/pmem-csi.intel.com/mount/csi-be3cb5042f61fbc404f12da3300f05bc13309852e841d70a2c2b45890421e630/kata-containers-pmem-csi-vm.img

@pohly
Copy link
Author

pohly commented Feb 26, 2020

@devimc was your pmem-csi container image perhaps a bit older? I don't remember whether it ever used an absolute path, but I can't think of another explanation right now. Anyway, please run make push-test-image && make restart after updating to the ba3a5f4f commit; then you should be running the same image as I do.

@devimc
Copy link

devimc commented Feb 26, 2020

@pohly thanks, let me try again

@devimc
Copy link

devimc commented Feb 26, 2020

@pohly now I can see a full path, but it points to nothing ... ?

$ losetup
NAME       SIZELIMIT  OFFSET AUTOCLEAR RO BACK-FILE                                                                                                        DIO LOG-SEC
/dev/loop0         0 2097152         0  0 /var/lib/pmem-csi.intel.com/mount/814c070e84c5014a88c72f7164567d986c3db28a/kata-containers-pmem-csi-vm.img         1     512
$ sudo ls -l /var/lib/pmem-csi.intel.com/mount/814c070e84c5014a88c72f7164567d986c3db28a/kata-containers-pmem-csi-vm.img
ls: cannot access '/var/lib/pmem-csi.intel.com/mount/814c070e84c5014a88c72f7164567d986c3db28a/kata-containers-pmem-csi-vm.img': No such file or directory

@devimc
Copy link

devimc commented Feb 26, 2020

@pohly I'm using tip ba3a5f4f ("kata support: fix exposing image file on host")

@pohly
Copy link
Author

pohly commented Feb 26, 2020

@devimc: sorry, I forgot to mention that you need to re-deploy PMEM-CSI to get the bi-directional mount change.

You can do that by re-running TEST_DISTRO=clear ... make start and then waiting until at least the node pods have been restarted.

devimc pushed a commit to devimc/kata-packaging that referenced this issue Feb 28, 2020
Enable libpmem to support PMEM when running under Kubernetes.

see kata-containers/runtime#2262

According to QEMU's nvdimm documentation: When 'pmem' is 'on' and QEMU is
built with libpmem support, QEMU will take necessary operations to guarantee
the persistence of its own writes to the vNVDIMM backend.

fixes kata-containers#958

Signed-off-by: Julio Montes <julio.montes@intel.com>
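To make the commit's point concrete: with pmem=on on a memory-backend-file, QEMU (when built with libpmem) guarantees persistence of its own writes to the backend. An illustrative command line, with hypothetical paths and sizes:

```shell
# Illustrative only: back a guest NVDIMM with the volume's image file.
qemu-system-x86_64 \
  -machine pc,nvdimm=on \
  -m 2G,slots=2,maxmem=4G \
  -object memory-backend-file,id=mem1,share=on,mem-path=kata-containers-pmem-csi-vm.img,size=2G,align=2M,pmem=on \
  -device nvdimm,id=nvdimm1,memdev=mem1
```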
devimc pushed a commit to devimc/kata-runtime that referenced this issue Mar 6, 2020
A persistent memory volume MUST meet the following conditions:
* A loop device must be mounted in the directory passed as volume
* The loop device must have a backing file
* The backing file must have the PFN signature at offset 4k [1][2]

The backing file is used as the backend file for an NVDIMM device in the guest

fixes kata-containers#2262

[1] - https://github.com/kata-containers/osbuilder/blob/master/image-builder/nsdax.gpl.c
[2] - https://github.com/torvalds/linux/blob/master/drivers/nvdimm/pfn.h

Signed-off-by: Julio Montes <julio.montes@intel.com>
@bergwolf
Copy link
Member

@pohly Sorry for chiming in late, but I'm not sure that PMEM-CSI is the right option for your use case. Did you consider using a device plugin to pass the pmem device directly to Kata, so that Kata can plug it into the guest as an nvdimm device?

I see that @devimc already put up PR #2515, but I really don't feel a host loop device is a good option for fast devices like pmem.

@pohly
Copy link
Author

pohly commented Mar 11, 2020

Did you consider using device plugin to pass the pmem device directly to kata and then kata can plug it to the guest as nvdimm device?

That makes deploying applications harder (because they need to create and mount a filesystem, which implies granting them more privileges than otherwise needed). Splitting up an NVDIMM into smaller pieces for use by more than one app at a time probably also wouldn't work.

I really don't feel a host loop device is a good option for fast devices like pmem.

The loop device is not used when the app runs inside Kata Containers, so I don't think we need to worry about that.

The loop device is used a) for compatibility with apps not running under Kata Containers (which then currently can't use MAP_SYNC, but at least normal file access works and the Linux kernel might eventually remove that limitation) and b) for practical reasons, because we have to give Kubernetes a mounted filesystem to satisfy the CSI API.

pohly added a commit to pohly/pmem-CSI that referenced this issue Mar 31, 2020
The resulting file can be used as backing store for a QEMU nvdimm
device. This is based on the approach that is used for the Kata
Container rootfs
(https://github.com/kata-containers/osbuilder/blob/dbbf16082da3de37d89af0783e023269210b2c91/image-builder/image_builder.sh)
and reuses some of the same code, but also differs from that in some
regards:
- The start of the partition is aligned to a multiple of the 2MiB
  huge page size (kata-containers/runtime#2262 (comment)).
- The size of the QEMU object is the same as the nominal size of the
  file. In Kata Containers the size is a fixed 128MiB
  (kata-containers/osbuilder#391 (comment)).