support PMEM inside Kata Containers when running under Kubernetes #2262
Comments
@pohly thanks for raising this, I'll take a look later
Here are the options that we investigated in combination with the nvdimm device support in QEMU, and why none of them work:
The last option is the one which could be made to work by enhancing the Linux kernel such that /dev/loop supports -o dax when it binds a file which itself is on a DAX-capable filesystem (or in general supports DAX).

@devimc: do you have links to documentation and/or code for the Kata Container rootfs?
@pohly - the best resource for kata rootfs info is going to be over in the osbuilder repo at https://github.com/kata-containers/osbuilder/tree/master/rootfs-builder
The underlying spec is https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
@devimc: I have image creation working in PMEM-CSI such that it works under QEMU. But I am still tweaking the code and then need to hook it into volume creation through Kubernetes. At that point it would be great if you could also enhance Kata Containers to pass such special volumes into QEMU with an nvdimm device. I'll ping you when PMEM-CSI is ready for that.

In the meantime, a few more questions: why is the rootfs at 3MiB (https://github.com/kata-containers/osbuilder/blob/dbbf16082da3de37d89af0783e023269210b2c91/image-builder/image_builder.sh#L98-L102)? https://nvdimm.wiki.kernel.org/2mib_fs_dax states that partitions must be aligned at multiples of 2MiB for huge pages to work. I also haven't found the specification for this "MBR#1 + DAX + MBR#2" content.
The resulting file can be used as backing store for a QEMU nvdimm device. This is based on the approach that is used for the Kata Container rootfs (https://github.com/kata-containers/osbuilder/blob/dbbf16082da3de37d89af0783e023269210b2c91/image-builder/image_builder.sh) and reuses some of the same code, but also differs from it in some regards:

- The start of the partition is aligned at a multiple of the 2MiB huge page size (kata-containers/runtime#2262 (comment)).
- The size of the QEMU object is the same as the nominal size of the file. In Kata Containers the size is a fixed 128MiB (kata-containers/osbuilder#391 (comment)).
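For illustration, here is a minimal Go sketch of that layout under the assumptions discussed in this thread (PFN/DAX metadata at a 4 KiB offset, filesystem partition starting at 2 MiB, QEMU object size equal to the nominal file size). The real tool writes an MBR and a complete, checksummed pfn_sb structure (see nsdax.gpl.c); this sketch only stamps the signature as a placeholder, and the file name and sizes are made-up examples.

```go
package main

import (
	"log"
	"os"
)

const (
	pfnOffset  = 4 * 1024          // PFN/DAX metadata sits at 4 KiB, after MBR #1
	dataOffset = 2 * 1024 * 1024   // filesystem starts at a 2 MiB boundary (huge page size)
	pfnSig     = "NVDIMM_PFN_INFO" // signature the guest's nvdimm driver looks for
)

// createImage creates a sparse image whose filesystem area is 2 MiB aligned.
// fsSize is the nominal size of the filesystem placed at dataOffset; the QEMU
// object later uses the total file size rather than a fixed 128 MiB.
// NOTE: a real image also needs the MBR and a complete, checksummed pfn_sb
// (what nsdax.gpl.c writes); only the signature is stamped here.
func createImage(path string, fsSize int64) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := f.Truncate(dataOffset + fsSize); err != nil {
		return err
	}
	_, err = f.WriteAt([]byte(pfnSig), pfnOffset)
	return err
}

func main() {
	if err := createImage("pmem-volume.img", 256*1024*1024); err != nil {
		log.Fatal(err)
	}
}
```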
sure thing, just let me know when it's ready to use
there is no special reason for that, this is just the next MB available for use
right
yes, it starts there and I think this falls back to 4K page faults (it would be nice to see if alignment to 2M can reduce the number of pages and hence the memory footprint, what do you think?)
no, it doesn't, and you won't find anything related to this, since this is Kata-specific; this hack was used to support all hypervisors using the same image:
How does the kernel find the DAX meta information? I was under the impression that it has to be at the fixed offset. Or does it scan all the initial sectors? I really should read that spec carefully... 😁
It might not matter for the Kata Container rootfs because it's only going to be used for reading and writing files and less for mapping pages. But I am not sure. For PMEM-CSI I'm going to use proper alignment and we'll have to write a test that verifies that huge pages work.
So the non-hacky solution would be to drop MBR#1, right? I currently have it in the PMEM-CSI code, but it could also be removed if it turns out to be unnecessary. OTOH, I think it is the MBR that makes the image convenient to handle on the host (losetup without an offset, the file command).
afaik, this is not documented in any spec, and even worse there is no tool to set it; the short answer is: the NVDIMM driver looks for the NVDIMM signature [1] at a 4 KiB offset, take a look [2]
heads up, you should check different sizes, not just 128M
right
yes, I recommend you keep it, otherwise you would have to specify an offset (losetup -o) to mount it on the host

[1] - https://github.com/kata-containers/osbuilder/blob/master/image-builder/nsdax.gpl.c#L32
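For reference, a sketch in Go of that check, assuming the 4 KiB offset and the PFN_SIG value ("NVDIMM_PFN_INFO") from nsdax.gpl.c and drivers/nvdimm/pfn.h; the function name is illustrative.

```go
package pmem

import (
	"bytes"
	"os"
)

// hasPFNSignature reports whether the file or block device at path carries the
// PFN signature ("NVDIMM_PFN_INFO") that the Linux nvdimm driver expects at a
// 4 KiB offset.
func hasPFNSignature(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	want := []byte("NVDIMM_PFN_INFO")
	got := make([]byte, len(want))
	if _, err := f.ReadAt(got, 4096); err != nil {
		return false, err
	}
	return bytes.Equal(got, want), nil
}
```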
One more thought about MBRs: is MBR #2 really needed? The alternative is to put the filesystem into the space currently used by MBR#2+rootfs and in the VM mount /dev/pmem0. That should work (right?) and it would be simpler. MBR #1 can be kept for the sake of convenience (losetup without -o, file command).
yes, you're right, I included an MBR just in case we want to support
👍
intel/pmem-csi#500 and its branch https://github.com/pohly/pmem-CSI/commits/kata-containers contain a functional PoC where a new
Example on a QEMU host:
What Kata Containers needs to do is:
QEMU must have been built with libpmem support.

Unmounting is necessary a) to avoid accessing the blocks through two different filesystems at the same time (inside QEMU and outside) and b) because the

The latter is done because it was a convenient place. If for some reason unmounting has drawbacks (do we perhaps need to keep it for idempotency?), then I could try to find a different mount point. We have to mount because Kubernetes expects it. We also cannot pass back any hints for the container runtime; all we can do is pick some unique name such that the checks above are unlikely to match a scenario where Kata Containers should not do the special passthrough.

It should be possible to reproduce the setup above as follows:
To test with Kata Containers, install it in the cluster and edit
thanks @pohly I have some questions
does this image contain DAX metadata at a 4k offset? one partition, right?
help me to understand this part,
this is the part that I really do not understand
This probably will look familiar 😁 I just removed the MBR #2.
Correct.
First came the
are you aware that changes in both host and guest filesystems won't be reflected? i.e.
I don't like this part, since we are forcing the use of a specific file name; how about using DAX metadata (does the loop device contain DAX metadata at a 4k offset?) to determine if the volume should be unmounted and the img file used as backend for an nvdimm device?
What do you mean? Kubernetes only ever gets to see the content of the image file, never of the volume that contains the image file.
It's harder to implement for you (you will have to implement DAX metadata parsing instead of doing a string compare). I'm undecided whether that gives us a better indicator for "treat this in a special way" than picking a well-known filename, but I don't have any strong objections either.

However, currently this doesn't work: the loop device is attached with a 2MiB offset (i.e. covers just the filesystem) and thus you cannot read the DAX metadata through the loop device. You would have to unmount first, but you don't know yet whether you need to unmount. Let me check tomorrow whether I can move this internal mount point somewhere else where it isn't shadowed by the final mount.
same as devicemapper: copying to and from these volumes won't work because host and guest don't share a directory, they share a device, and changes are not reflected. For example, create a file in the guest and this new file should be visible on the host where the volume is mounted. Have you tried this?
actually, we don't need a full parser; looking for the pfn signature [1] should be enough, what do you think?
do I need to unmount it to get the backend file (kata-containers-pmem-csi-vm.img)?
if we can access the backend file (img file) this won't be needed

[1] - https://github.com/kata-containers/osbuilder/blob/master/image-builder/nsdax.gpl.c#L32
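For reference, a sketch of how a loop device's backing file can be looked up through sysfs (names illustrative); whether that file is then usable without unmounting depends on where it lives, as discussed below.

```go
package pmem

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// loopBackingFile returns the file backing a loop device such as /dev/loop0,
// read from sysfs, without detaching the device.
func loopBackingFile(loopDev string) (string, error) {
	name := filepath.Base(loopDev) // e.g. "loop0"
	data, err := os.ReadFile(fmt.Sprintf("/sys/block/%s/loop/backing_file", name))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}
```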
Only one pod gets access to the volume at any time, so this isn't a problem.
Yes, might be good enough.
Yes.
ok, let me know if you can move it
I pushed one additional commit which moves the image file into something like
does this mean that
Correct. It might be better to unmount anyway once it has been determined that the special treatment is necessary, just to be on the safe side regarding conflicting writes. Nothing should be using the mounted filesystem on the host side, but who knows what it might write by itself anyway...
Hmm, I get the full path:
However, this also isn't the path on the host:
What I can't reproduce is the missing path. Where do you run this
I'm using clearlinux + crio
I've tried with the same OS, but still get the full path:
Note that this is with the current tip of my local branch which changes the path so that it is inside
FWIW, this was for
And now that also works, with commit ba3a5f4f:
@devimc was your pmem-csi container image perhaps a bit older? I don't remember whether it ever used an absolute path, but I can't think of another explanation right now. Anyway, please
@pohly thanks, let me try again
@pohly now I can see a full path, but it points to nothing ... ?
@pohly I'm using tip
@devimc: sorry, I forgot to mention that you need to re-deploy PMEM-CSI to get the bi-directional mount change. You can do that by re-running
Enable libpmem to support PMEM when running under Kubernetes.
See kata-containers/runtime#2262.

According to QEMU's nvdimm documentation: when 'pmem' is 'on' and QEMU is built with libpmem support, QEMU will take necessary operations to guarantee the persistence of its own writes to the vNVDIMM backend.

fixes kata-containers#958

Signed-off-by: Julio Montes <julio.montes@intel.com>
A persistent memory volume MUST meet the following conditions:

* A loop device must be mounted in the directory passed as volume
* The loop device must have a backing file
* The backing file must have the PFN signature at offset 4k [1][2]

The backing file is used as backend file for an NVDIMM device in the guest.

fixes kata-containers#2262

[1] - https://github.com/kata-containers/osbuilder/blob/master/image-builder/nsdax.gpl.c
[2] - https://github.com/torvalds/linux/blob/master/drivers/nvdimm/pfn.h

Signed-off-by: Julio Montes <julio.montes@intel.com>
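A sketch of how a runtime could apply those three conditions to a volume path, assuming the usual /proc/mounts and sysfs layouts; this is an illustration with made-up names, not the actual kata-runtime code.

```go
package pmem

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// pfnSig is the signature the Linux nvdimm driver expects at a 4 KiB offset
// (PFN_SIG in drivers/nvdimm/pfn.h, written by osbuilder's nsdax tool).
const pfnSig = "NVDIMM_PFN_INFO"

// nvdimmBackingFile checks whether volumePath is a mounted loop device whose
// backing file carries the PFN signature, and if so returns that backing file
// so it can be handed to QEMU as the nvdimm backend.
func nvdimmBackingFile(volumePath string) (string, error) {
	// 1. Find the source device of the mount point in /proc/mounts.
	mounts, err := os.Open("/proc/mounts")
	if err != nil {
		return "", err
	}
	defer mounts.Close()

	var device string
	scanner := bufio.NewScanner(mounts)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[1] == volumePath {
			device = fields[0]
		}
	}
	if !strings.HasPrefix(device, "/dev/loop") {
		return "", fmt.Errorf("%s is not backed by a loop device", volumePath)
	}

	// 2. The loop device must have a backing file (read from sysfs).
	data, err := os.ReadFile(fmt.Sprintf("/sys/block/%s/loop/backing_file",
		filepath.Base(device)))
	if err != nil {
		return "", err
	}
	backingFile := strings.TrimSpace(string(data))

	// 3. The backing file must have the PFN signature at offset 4 KiB.
	f, err := os.Open(backingFile)
	if err != nil {
		return "", err
	}
	defer f.Close()
	sig := make([]byte, len(pfnSig))
	if _, err := f.ReadAt(sig, 4096); err != nil {
		return "", err
	}
	if !bytes.Equal(sig, []byte(pfnSig)) {
		return "", fmt.Errorf("%s has no PFN signature", backingFile)
	}
	return backingFile, nil
}
```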
@pohly Sorry for chiming in late, but I'm not sure that PMEM-CSI is the right option for your use case. Did you consider using a device plugin to pass the pmem device directly to Kata so that Kata can plug it into the guest as an nvdimm device? I see that @devimc already put up a PR #2515, but I really don't feel a host loop device is a good option for fast devices like pmem.
That makes deploying applications harder (because they need to create and mount a filesystem, which implies granting them more privileges than otherwise needed). Splitting up an NVDIMM into smaller pieces for use by more than one app at a time probably also wouldn't work.
The loop device is not used when the app runs inside Kata Containers, so I don't think we need to worry about that. The loop device is there a) for compatibility with apps not running under Kata Containers (which then currently can't use
When using PMEM-CSI to manage PMEM storage, individual apps are going to have volumes created for them by Kubernetes with that storage driver and then want to use them like RAM, i.e. mmap a file or the entire volume. Depending on the application, MAP_SYNC and thus additional persistency guarantees will be needed.

Filesystem volumes currently get passed into kata-containers with 9p or virtio-fs. Neither of them supports MAP_SYNC. While virtio-fs supports mmap, performance is likely to be lower than native access because not all pages can be mapped at once (see below). Block volumes are passed in as SCSI disk devices, which is even worse (no mmap).
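To make the requirement concrete, here is a sketch of what such an application does with the volume, assuming golang.org/x/sys/unix and an example file path; MAP_SYNC has to be combined with MAP_SHARED_VALIDATE and only works on a DAX-capable mount.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Open a file on the PMEM-backed volume (path is just an example).
	f, err := os.OpenFile("/data/pmem-file", os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	const size = 64 * 1024 * 1024
	if err := f.Truncate(size); err != nil {
		log.Fatal(err)
	}

	// MAP_SYNC is only valid together with MAP_SHARED_VALIDATE and requires
	// a filesystem mounted with DAX support.
	data, err := unix.Mmap(int(f.Fd()), 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_SHARED_VALIDATE|unix.MAP_SYNC)
	if err != nil {
		log.Fatalf("mmap(MAP_SYNC) failed: %v", err)
	}
	defer unix.Munmap(data)

	// Stores to data[] now go straight to the persistent memory backing the
	// mapping; no explicit msync/fsync is needed for persistence.
	copy(data, []byte("hello pmem"))
}
```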
Describe the solution you'd like
A way to pass in the volume such that all of it can be mapped into an application's address space with MAP_SYNC semantics. Once that mapping exists, applications should be able to read and write bytes with native performance (= as if they weren't running under Kata Containers).

At this point, the most promising approach for achieving this seems to be to detect such special volumes and map them to QEMU objects and an nvdimm device (https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt). How to activate this special behavior is to be decided.
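For illustration, a rough sketch of the extra QEMU arguments this implies, written as a Go helper the way a runtime might assemble its command line. It follows https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt; pmem=on is the property that needs a libpmem-enabled QEMU build, and all ids and names are made-up examples.

```go
package pmem

import "fmt"

// nvdimmArgs returns the extra QEMU arguments needed to expose backingFile as
// an nvdimm device, roughly following QEMU's docs/nvdimm.txt. sizeBytes should
// be the nominal size of the backing file. The machine additionally needs
// enough memory slots/maxmem configured via -m (omitted here).
func nvdimmArgs(backingFile string, sizeBytes int64) []string {
	return []string{
		// NVDIMM support has to be enabled on the machine type.
		"-machine", "pc,nvdimm=on",
		// share=on lets guest writes reach the backing file; pmem=on tells a
		// libpmem-enabled QEMU to guarantee persistence of its own writes.
		"-object", fmt.Sprintf(
			"memory-backend-file,id=pmem0,share=on,pmem=on,mem-path=%s,size=%d",
			backingFile, sizeBytes),
		"-device", "nvdimm,id=nvdimm0,memdev=pmem0",
	}
}
```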
Describe alternatives you've considered
virtio-fs was considered, but doesn't meet all objectives because:
Before raising this feature request
This was discussed on freenode IRC, #kata-dev, on 2019-11-25:

(11:42:07 AM) pohly: stefanha: hello. I am trying to understand how (and how well) virtio-fs supports mmap. Background: I work on PMEM-CSI, a driver which enables the use of PMEM in Kubernetes. Ultimately the goal is that an application can do mmap(MAP_SYNC) and then do byte read/writes directly to the underlying hardware. That works without kata-containers involved. I now looked at kata-containers 1.9.1 with the kata-qemu-virtiofs. I can see that this passes the dax-capable filesystem (XFS, in case that this matters) into the qemu instance with virtiofs. A test program can do mmap(MAP_SYNC) on a file.
(11:43:13 AM) pohly: But... it can also do that with 9p as file system and with the container root filesystem served by virtio-fs although that filesystem on the host does not support dax (hosted by plain SSD).
(11:45:00 AM) pohly: I was under the (perhaps mistaken) impression that virtio-fs would somehow support mmap. I though I had read that somewhere. Is that really true?
(11:46:42 AM) pohly: I checked the /proc//maps for the /opt/kata/bin/qemu-virtiofs-system-x86_64 process that runs the pod. It doesn't have any entry for the file that currently is mapped inside the container.
(11:51:52 AM) brtknr: pohly: following this discussion
(11:57:44 AM) davidgiluk: pohly: is the mount mounted with DAX?
(11:58:25 AM) pohly: Yes: kataShared on /data type virtio_fs (rw,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other,dax)
(11:58:35 AM) pohly: That is inside qemu.
(11:59:16 AM) pohly: And also outside of it: /dev/mapper/ndbus0region0fsdax-e7660acd0fd86e6aea32589af51903654f6a4e41 on /var/lib/kubelet/pods/6576fed5-5488-4ee4-a6a2-578c5519ae9c/volumes/kubernetes.io~csi/my-csi-volume/mount type xfs (rw,relatime,attr2,dax,inode64,noquota)
(12:00:53 PM) stefanha: pohly: virtio-fs isn't intended for pmem. QEMU won't use MAP_SYNC.
(12:01:10 PM) stefanha: pohly: If you need MAP_SYNC semantics then QEMU's nvdimm device can do that.
(12:01:37 PM) stefanha: pohly: MAP_SYNC support could be added to virtio-fs but today it doesn't do that.
(12:01:48 PM) pohly: stefanha: if virtio-fs doesn't support MAP_SYNC, shouldn't it then reject the mmap call?
(12:03:02 PM) stefanha: pohly: Probably. Inside the guest the virtio-fs and FUSE code isn't doing anything that violates MAP_SYNC,
(12:03:19 PM) stefanha: but the problem is that the host side doesn't necessarily honor those semantics.
(12:03:48 PM) pohly: But plain mmap works?
(12:03:57 PM) stefanha: pohly: Yep, plain mmap is supported.
(12:04:13 PM) pohly: Should I then see a /proc//maps entry for the file? I don't have that.
(12:04:50 PM) pohly: Or am I checking the wrong process? I looked at qemu-virtiofs-system-x86_64, because that is where the code runs.
(12:05:39 PM) stefanha: pohly: There isn't necessarily a 1:1 mmap relationship between guest application mmaps and host qemu-virtiofs-system-x86_64 mmaps.
(12:05:54 PM) stefanha: pohly: What are you trying to confirm by looking at qemu-virtiofs-system-x86_64 mmaps?
(12:06:26 PM) pohly: Looking more closely I do see one entry that has at least the right size: 7f2f1bffe000-7f2f1bfff000 ---p 00000000 00:00 0
(12:06:40 PM) pohly: But it doesn't have a file name associated with it. Should it have that?
(12:07:06 PM) pohly: I am trying to verify that a file on the host has indeed been mapped into the address space of the process running inside qemu.
(12:07:55 PM) pohly: If that isn't the case, then how does mmap support work?
(12:07:56 PM) stefanha: pohly: The lack of filename could be due to file descriptor passing
(12:08:10 PM) stefanha: The file is opened by virtiofsd and passed to QEMU. Maybe that's why no name is reported.
(12:08:15 PM) stefanha: But that's just a guess.
(12:08:30 PM) pohly: That might be it. Let me remove the mapping inside qemu...
(12:08:31 PM) davidgiluk: the name normally does show up
(12:08:53 PM) davidgiluk: pohly: Have you accessed the mmap'd area, or just done the mmap?
(12:10:20 PM) pohly: Just the mmap. So it's waiting for a page fault before doing anything on the host side? I can add that.
(12:11:12 PM) stefanha: Yes, that sounds likely.
(12:11:52 PM) stefanha: pohly: But again, if your goal is to get pmem semantics then virtio-fs in its current state doesn't guarantee that.
(12:12:15 PM) davidgiluk: pohly: Yes, I think so - remember for virtiofs we only have a fixed sized cache window, so we can't guarantee to mmap the whole region
(12:12:19 PM) stefanha: pohly: QEMU has -device nvdimm and -device virtio-pmem-pci for that.
(12:13:36 PM) pohly: Using those for a mounted filesystem in kata-containers isn't going to be easy.
(12:14:22 PM) pohly: virtio-fs looked much more promising ;-}
(12:15:00 PM) davidgiluk: stefanha: What stops us passing the MAP_SYNC all the way through?
(12:17:06 PM) pohly: davidgiluk: even if you do, "fixed size cache window" sounds like another big roadblock. PMEM comes in higher capacity than DRAM, that's partly why it is appealing for some workloads.
(12:18:10 PM) pohly: MAP_SYNC isn't even needed for all workloads. In fact, most apps currently don't depend on it.
(12:18:33 PM) pohly: So virtio-fs may already be a good step forward and sufficient.
(12:19:09 PM) pohly: OTOH, if it needs to set up and tear down mappings on the host side often, then that may affect performance.
(12:20:07 PM) pohly: memcached uses PMEM as DRAM replacement and stores its data there. Predictable access times for that data probably is important.
(12:20:41 PM) davidgiluk: pohly: Right; if you've got a single PMEM device to pass through then as stefan says using the -device stuff is the right way; if you're trying to pass through files that on the host are mountedon a filesystem that's backed by pmem, then virtiofs might be interesting
(12:22:06 PM) pohly: davidgiluk: we are trying the former. PMEM-CSI basically splits up a single PMEM device and hands out portions of it to individual apps. We cannot assume that only a single app uses that device; that would be rather limiting.
(12:22:26 PM) pohly: Ahem, I meant "we are trying the latter"...
(12:23:23 PM) davidgiluk: pohly: But does the PMEM-CSI portions look like individual block devices that you then put a filesystem on, and is that filesystem built in the host or the guest?
(12:26:10 PM) pohly: davidgiluk: it is a block device. But applications in Kubernetes typically will ask for a filesystem, so PMEM-CSI formats and mounts that device.
(12:26:32 PM) pohly: And then Kubernetes passes the directory name of the mounted FS to the runtime.
(12:27:01 PM) pohly: I heard that kata-containers sometimes does tricks like then passing the device into qemu and mounting again inside.
(12:27:38 PM) pohly: That's a bit dirty, because there are two Linux kernels which both might write to the same block device.
(12:27:40 PM) davidgiluk: pohly: OK, if it's a device+filesystem just for that container then it does feel like passing that block device into the container is right rather than passing the filesystem through virtiofs
(12:28:27 PM) pohly: davidgiluk: yes, that would be the better alternative, except for the "is already mounted" part.
(12:29:49 PM) pohly: Also, does it have to be some actual device? Currently the block devices are either LVM logical volumes or PMEM namespaces (/dev/pmem).
(12:30:16 PM) pohly: We can't use PCI device pass-through - it's not even on the PCI bus.
(12:30:32 PM) pohly: Nor do we want to pass in the entire NVDIMM.
(12:39:19 PM) davidgiluk: pohly: Does one of the chunks of a PMEM-CSI look like a pmem device? i.e. would it make sense to pass that in using -device nvdimm or virtio-pmem-pci ?
(12:40:58 PM) pohly: davidgiluk: I need to check what those options expect, but for the LVM case the answer is probably "no" - it's just a logical volume.
(12:43:14 PM) pohly: Hmm, according to https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt the "mem-path" can be an ordinary file. So we could just point that at the block device.
(12:45:06 PM) pohly: But how would kata-containers even recognize that it needs to do this special handling? All it gets is a path to a mounted filesystem or a loop device (block mode, which also works in Kubernetes).
(12:46:03 PM) pohly: This sounds doable, but I fear that it will be rather hacky and I have no idea how many different components need to be adapted to make this work.
(12:47:41 PM) pohly: May I copy this discussion into an issue in https://github.com/kata-containers/runtime/issues? Is that the right tracker for "add PMEM support to kata-containers"?
(12:51:49 PM) davidgiluk: pohly: Yeh probably best to make an issue; I'm also not sure the best way to wire it through - but if it looks like a block device, and that block device is intended just for this container, then treat it as a block device and let the guest handle it
(12:52:58 PM) gwhaley: pohly: include 'devimc' on that Issue, if not already - he'll have a good idea I think of what knitting would be required.
(12:53:27 PM) gwhaley: yes, the hard bit is how to annotate that volume/mount/device to ensure it ends up mapped via the correct route. It may be that 'annotations' are the route.
(12:53:40 PM) gwhaley: oh, amshinde might have good input as well
(12:55:03 PM) gwhaley: so, historically we've always noted that nvdimm/dax could be used to pass items in (kata uses it for iirc the kernel image, or is it the rootfs....) - but, I don't believe there is a defined mechanism to set that all up via the orchestrators and runtime, and I don't think I've ever seen anybody actually using an nvdimm/dax mount/map for themselves ... yet....
(12:59:52 PM) pohly: gwhaley: /opt/kata/share/kata-containers/kata-containers-image_clearlinux_1.9.1_agent_d4bbd8007f.img is passed via "-object" + "-device nvdimm".
(01:00:47 PM) pohly: Looks like the rootfs. There's also "root=/dev/pmem0p1".
(01:00:50 PM) gwhaley: pohly: right, the rootfs for the VM (I can never remember if it is the rootfs or the kernel we do it with ;-) )... so, we use it, we know it works.... now it would be how do we enable 'users' to do it...
(01:03:11 PM) pohly: davidgiluk: to get closure on this: when actually writing into the memory mapped region via virtio-fs, I do see map entries on the host side, including the file name.
(01:03:41 PM) pohly: davidgiluk: how large is this "fixed size cache window"?
(01:04:51 PM) davidgiluk: pohly: It's configurable via an option, normally a few GB
(01:05:08 PM) gwhaley: https://github.com/kata-containers/runtime/blob/master/cli/config/configuration-qemu-virtiofs.toml.in#L118-L131 :-)
(01:05:24 PM) ***davidgiluk disappears for a 2 hours
(01:05:31 PM) ***gwhaley goes for lunch...
(01:06:43 PM) pohly: So a lot less than the hundreds of GB that people may have as PMEM. Might be worth testing how that affects performance. Thanks!
CC @devimc @GabyCT