
Commit

WIP: kata containers
pohly committed Jan 13, 2020
1 parent 8328aed commit a2160c4
Showing 3 changed files with 221 additions and 59 deletions.
55 changes: 55 additions & 0 deletions README.md
@@ -165,6 +165,45 @@ In direct device mode, the driver does not attempt to limit space
use. It also does not mark "own" namespaces. The _Name_ field of a
namespace is set to the VolumeID.

### Kata Containers support

[Kata Containers](https://katacontainers.io) runs applications inside a
virtual machine. This poses a problem for App Direct mode, because
access to the filesystem prepared by PMEM-CSI is provided inside the
virtual machine through the 9p or virtio-fs filesystems. Neither
supports App Direct mode:
- 9p does not support `mmap` at all.
- virtio-fs only supports `mmap` without `MAP_SYNC`, i.e. without DAX
  semantics.

PMEM-CSI solves this as follows:
- PMEM-CSI creates a volume as usual, either in direct mode or LVM mode.
- Inside that volume it sets up an ext4 filesystem.
- Inside that filesystem it creates a `pmem-csi-vm.img` file that contains
  a partition table, DAX metadata and a partition that takes up most of the
  space available in the volume.
- That partition is bound to a `/dev/loop` device and then formatted
  with the filesystem type requested for the volume.
- When an application needs access to the volume, PMEM-CSI mounts
  that `/dev/loop` device.
- An application not running under Kata Containers then uses
  that filesystem normally, *but* due to limitations in the Linux
  kernel, mounting may have to be done without `-o dax`, in which case
  App Direct access does not work.
- When the Kata Containers runtime is asked to provide access to that
  filesystem, it instead passes the underlying `pmem-csi-vm.img`
  file into QEMU as an [nvdimm
  device](https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt)
  and, inside the VM, mounts the `/dev/pmem0p1` partition that the
  Linux kernel sets up based on the DAX metadata that PMEM-CSI placed
  in the file. Inside the VM, App Direct semantics are fully
  supported.

Because such volumes can only be used with full DAX semantics inside
Kata Containers, and because of the additional space overhead, Kata
Containers support has to be enabled explicitly via a [storage class
parameter](#usage-on-kubernetes).
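
The steps above are implemented in the node driver (see `nodeserver.go` in
this commit). The following sketch condenses the host-side preparation into
a single helper, reusing the `imagefile` and `volumepathhandler` packages
from this repository; the helper name, the hard-coded path, and the omission
of error recovery and the final mount are simplifications for illustration,
not the driver's actual API.

```go
// Sketch: prepare a PMEM-CSI volume for Kata Containers on the host.
// Assumes the volume's ext4 filesystem is already mounted at hostMount.
package main

import (
	"fmt"
	"path/filepath"

	"golang.org/x/sys/unix"

	"github.com/intel/pmem-csi/pkg/imagefile"
	"github.com/intel/pmem-csi/pkg/volumepathhandler"
)

func prepareKataVolume(hostMount string) (string, error) {
	// Size the image file so that it fills the free space of the
	// filesystem that was created inside the volume.
	var stat unix.Statfs_t
	if err := unix.Statfs(hostMount, &stat); err != nil {
		return "", fmt.Errorf("statfs %s: %w", hostMount, err)
	}
	imageSize := imagefile.Bytes(stat.Bfree * uint64(stat.Bsize))

	// Create pmem-csi-vm.img: partition table, DAX metadata and one
	// ext4 partition filling the rest of the file.
	imageFile := filepath.Join(hostMount, "pmem-csi-vm.img")
	if err := imagefile.Create(imageFile, imageSize, imagefile.Ext4); err != nil {
		return "", fmt.Errorf("create image file: %w", err)
	}

	// Attach the partition inside the file to a /dev/loop device,
	// skipping the fixed header (partition table + DAX metadata).
	handler := volumepathhandler.VolumePathHandler{}
	return handler.AttachFileDeviceWithOffset(imageFile, int64(imagefile.HeaderSize))
}

func main() {
	// Example invocation; the mount point is a placeholder.
	loopDev, err := prepareKataVolume("/var/lib/pmem-csi/example-volume")
	if err != nil {
		panic(err)
	}
	fmt.Println("mount this on the host instead of the volume:", loopDev)
}
```

The loop device returned here is what gets mounted at the target path on the
host; the Kata Containers runtime bypasses it and consumes `pmem-csi-vm.img`
directly as an nvdimm device.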

### Driver modes

The PMEM-CSI driver supports running in different modes, which can be
@@ -400,6 +439,21 @@ application](deploy/common/pmem-app-cache.yaml) example.
* A node is only chosen the first time a pod starts. After that it will always restart
on that node, because that is where the persistent volume was created.

[Kata Containers support](#kata-containers-support) is enabled via
the `kataContainers` storage class parameter. It accepts the following
values:
* `true/1/t/TRUE`
  Create the filesystem inside a partition inside a file, then try to mount
  it on the host through a loop device with `-o dax`, falling back to
  mounting without `-o dax` when the kernel does not support that. Currently,
  Linux up to and including 5.4 does not support it. In other words, on the
  host such volumes are usable, but only without DAX. Inside Kata
  Containers, DAX works.
* `false/0/f/FALSE` (default)
  Create the filesystem directly on the volume.

Raw block volumes are only supported with `kataContainers: false`.
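
The accepted values listed above are exactly those understood by Go's
`strconv.ParseBool`, which suggests `ParseBool`-style parsing. Below is a
minimal sketch of how such a parameter could be interpreted; the helper name
`parseKataContainers` is hypothetical and not part of the driver's
`volumeParameters` API.

```go
// Sketch: interpret the kataContainers storage class parameter.
package main

import (
	"fmt"
	"strconv"
)

// parseKataContainers returns false when the parameter is unset (the
// default: create the filesystem directly on the volume).
func parseKataContainers(parameters map[string]string) (bool, error) {
	value, ok := parameters["kataContainers"]
	if !ok {
		return false, nil
	}
	enabled, err := strconv.ParseBool(value) // accepts true/1/t/TRUE, false/0/f/FALSE
	if err != nil {
		return false, fmt.Errorf("invalid kataContainers value %q: %v", value, err)
	}
	return enabled, nil
}

func main() {
	for _, value := range []string{"true", "1", "t", "TRUE", "FALSE", "maybe"} {
		enabled, err := parseKataContainers(map[string]string{"kataContainers": value})
		fmt.Printf("%q -> enabled=%v err=%v\n", value, enabled, err)
	}
}
```

In the driver itself this choice surfaces as `parameters.getKataContainers()`
in `nodeserver.go` below.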

Volume requests embedded in the Pod spec are provisioned as ephemeral volumes. The volume request can use the fields below as [`volumeAttributes`](https://kubernetes.io/docs/concepts/storage/volumes/#csi):

|key|meaning|optional|values|
@@ -409,6 +463,7 @@

See the provided [example application](deploy/kubernetes-1.15/pmem-app-ephemeral.yaml) for
ephemeral volume usage.

## Prerequisites

### Software required
193 changes: 140 additions & 53 deletions pkg/pmem-csi-driver/nodeserver.go
@@ -17,14 +17,17 @@ import (

"github.com/container-storage-interface/spec/lib/go/csi"
"golang.org/x/net/context"
"golang.org/x/sys/unix"
"google.golang.org/grpc"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/status"
"k8s.io/klog"
"k8s.io/kubernetes/pkg/util/mount"

"github.com/intel/pmem-csi/pkg/imagefile"
pmdmanager "github.com/intel/pmem-csi/pkg/pmem-device-manager"
pmemexec "github.com/intel/pmem-csi/pkg/pmem-exec"
"github.com/intel/pmem-csi/pkg/volumepathhandler"
)

const (
@@ -145,19 +148,28 @@ func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
ephemeral = device == nil && !ok && len(srcPath) == 0
}

var parameters volumeParameters
if ephemeral {
device, err := ns.createEphemeralDevice(ctx, req)
p, err := parseVolumeParameters(ephemeralVolumeParameters, req.GetVolumeContext())
if err != nil {
return nil, status.Error(codes.InvalidArgument, "ephemeral inline volume parameters: "+err.Error())
}
parameters = p

device, err := ns.createEphemeralDevice(ctx, req, parameters)
if err != nil {
// createEphemeralDevice() returns status.Error, so safe to return
return nil, err
}
srcPath = device.Path
mountFlags = append(mountFlags, "dax")
} else {
// Validate parameters. We don't actually use any of them here, but a sanity check is worthwhile anyway.
if _, err := parseVolumeParameters(persistentVolumeParameters, req.GetVolumeContext()); err != nil {
// Validate parameters.
p, err := parseVolumeParameters(persistentVolumeParameters, req.GetVolumeContext())
if err != nil {
return nil, status.Error(codes.InvalidArgument, "persistent volume context: "+err.Error())
}
parameters = p

if device, err = ns.cs.dm.GetDevice(req.VolumeId); err != nil {
if errors.Is(err, pmdmanager.ErrDeviceNotFound) {
@@ -177,26 +189,12 @@ func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
sort.Strings(mountFlags)
joinedMountFlags := strings.Join(mountFlags[:], ",")

rawBlock := false
switch req.VolumeCapability.GetAccessType().(type) {
case *csi.VolumeCapability_Block:
rawBlock = true
// For block volumes, source path is the actual Device path
srcPath = device.Path
targetDir := filepath.Dir(targetPath)
// Make sure that parent directory of target path is existing, otherwise create it
targetPreExisting, err := ensureDirectory(ns.mounter, targetDir)
if err != nil {
return nil, err
}
f, err := os.OpenFile(targetPath, os.O_CREATE, os.FileMode(0644))
defer f.Close()
if err != nil && !os.IsExist(err) {
if !targetPreExisting {
if rerr := os.Remove(targetDir); rerr != nil {
klog.Warningf("Could not remove created mount target %q: %v", targetDir, rerr)
}
}
return nil, status.Errorf(codes.Internal, "Could not create target device file %q: %v", targetPath, err)
}
case *csi.VolumeCapability_Mount:
if !ephemeral && len(srcPath) == 0 {
return nil, status.Error(codes.FailedPrecondition, "Staging target path missing in request")
@@ -231,19 +229,99 @@ func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublis
return nil, status.Error(codes.AlreadyExists, "Volume published but is incompatible")
}
}
}

if err := os.Mkdir(targetPath, os.FileMode(0755)); err != nil {
// Kubernetes is violating the CSI spec and creates the
// directory for us
// (https://github.com/kubernetes/kubernetes/issues/75535). We
// allow that by ignoring the "already exists" error.
if !os.IsExist(err) {
return nil, status.Error(codes.Internal, "make target dir: "+err.Error())
}
if !parameters.getKataContainers() {
// A normal volume: just bind-mount the filesystem or device.
if err := ns.mount(srcPath, targetPath, mountFlags, rawBlock); err != nil {
return nil, status.Error(codes.Internal, err.Error())
}
ns.volInfo[req.VolumeId] = volumeInfo{readOnly: readOnly, targetPath: targetPath, mountFlags: joinedMountFlags}
return &csi.NodePublishVolumeResponse{}, nil
}

// When we get here, the volume is for Kata Containers and the
// following has already happened:
// - persistent filesystem volumes: formatted and mounted at the staging path (= srcPath)
// - persistent raw block volumes: formatted
// - ephemeral inline filesystem volumes: formatted
//
// To support Kata Containers for ephemeral and persistent filesystem volumes,
// we now need to:
// - mount the ephemeral inline volume
// - create the image file inside the mounted volume
// - attach the partition inside that file to a loop device
// - mount the loop device instead of the original volume
//
// All of that has to be idempotent (because we might get
// killed while working on this) *and* we have to undo it when
// returning a failure (because then NodePublishVolume might
// not be called again - see in particular the final errors in
// https://github.com/kubernetes/kubernetes/blob/ca532c6fb2c08f859eca13e0557f3b2aec9a18e0/pkg/volume/csi/csi_client.go#L627-L649).

if rawBlock {
// We cannot pass block devices with DAX semantic into QEMU.
// TODO: add validation of CreateVolumeRequest.VolumeCapabilities and already detect the problem there.
return nil, status.Error(codes.InvalidArgument, "raw block volumes are incompatible with Kata Containers")
}

hostMount := srcPath
if ephemeral {
// Create a mount point inside the target path (chosen because it is unique, a directory, deterministic,
// and can be created by us).
if err := os.Mkdir(targetPath, os.FileMode(0755)); err != nil && !os.IsExist(err) {
return nil, status.Error(codes.Internal, "create target directory: "+err.Error())
}

// Mount the volume there.
hostMount = filepath.Join(targetPath, "ephemeral-volume")
if err := ns.mount(srcPath, hostMount, []string{} /* mount flags */, false /* raw block */); err != nil {
return nil, status.Error(codes.Internal, "mount ephemeral volume: "+err.Error())
}
}

// There's some overhead for the imagefile inside the host filesystem, but that should be small
// relative to the size of the volumes, so we simply create an image file that is as large as
// the mounted filesystem allows. Create() is not idempotent, so we have to check for the
// file before overwriting something that was already created earlier.
imageFile := filepath.Join(hostMount, "pmem-csi-vm.img")
if _, err := os.Stat(imageFile); err != nil {
if !os.IsNotExist(err) {
return nil, status.Error(codes.Internal, "unexpected error while checking for image file: "+err.Error())
}
var stat unix.Statfs_t
if err := unix.Statfs(hostMount, &stat); err != nil {
return nil, status.Error(codes.Internal, "unexpected error while checking the volume statistics: "+err.Error())
}
imageSize := imagefile.Bytes(stat.Bfree * uint64(stat.Bsize))
var imageFsType imagefile.FsType
switch fsType {
case "xfs":
imageFsType = imagefile.Xfs
case "":
fallthrough
case "ext4":
imageFsType = imagefile.Ext4
default:
return nil, status.Error(codes.InvalidArgument, fmt.Sprintf("fsType %q not supported for Kata Containers", fsType))
}
if err := imagefile.Create(imageFile, imageSize, imageFsType); err != nil {
return nil, status.Error(codes.Internal, "create Kata Container image file: "+err.Error())
}
}

// If this offset ever changes, then we have to make future versions of PMEM-CSI more
// flexible and dynamically determine the offset. For now we assume that the
// file was created by the current version and thus use the fixed offset.
offset := int64(imagefile.HeaderSize)
handler := volumepathhandler.VolumePathHandler{}
loopDev, err := handler.AttachFileDeviceWithOffset(imageFile, offset)
if err != nil {
return nil, status.Error(codes.Internal, "create loop device: "+err.Error())
}

if err := ns.mount(srcPath, targetPath, mountFlags); err != nil {
// TODO: Try to mount with dax first, fall back to mount without it if not supported.
if err := ns.mount(loopDev, targetPath, []string{}, false); err != nil {
return nil, status.Error(codes.Internal, err.Error())
}

@@ -286,18 +364,24 @@ func (ns *nodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpu

// Check if the target path is really a mount point. If its not a mount point do nothing
if notMnt, err := ns.mounter.IsLikelyNotMountPoint(targetPath); notMnt || err != nil && !os.IsNotExist(err) {
klog.V(5).Infof("NodeUnpublishVolume: %s is not mount point, skip", targetPath)
return &csi.NodeUnpublishVolumeResponse{}, nil
klog.V(5).Infof("NodeUnpublishVolume: %s is not mount point, skipping unmount", targetPath)
} else {
// Unmounting the image
klog.V(3).Infof("NodeUnpublishVolume: unmount %s", targetPath)
if err := ns.mounter.Unmount(targetPath); err != nil {
return nil, status.Error(codes.Internal, err.Error())
}
klog.V(5).Infof("NodeUnpublishVolume: volume id:%s targetpath:%s has been unmounted", req.VolumeId, targetPath)
}

// Unmounting the image
klog.V(3).Infof("NodeUnpublishVolume: unmount %s", targetPath)
if err := ns.mounter.Unmount(targetPath); err != nil {
return nil, status.Error(codes.Internal, err.Error())
if err := os.Remove(targetPath); err != nil {
return nil, status.Error(codes.Internal, "unexpected error while removing target path: "+err.Error())
}
klog.V(5).Infof("NodeUnpublishVolume: volume id:%s targetpath:%s has been unmounted", req.VolumeId, targetPath)

os.Remove(targetPath) // nolint: gosec, errorchk
if parameters.getKataContainers() {
// TODO: undo the Kata Containers setup from NodePublishVolume,
// i.e. detach the loop device and unmount the image file's host mount.
}

if parameters.getPersistency() == persistencyEphemeral {
if _, err := ns.cs.DeleteVolume(ctx, &csi.DeleteVolumeRequest{VolumeId: vol.ID}); err != nil {
@@ -373,7 +457,7 @@ func (ns *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVol

mountOptions = append(mountOptions, "dax")

if err = ns.mount(device.Path, stagingtargetPath, mountOptions); err != nil {
if err = ns.mount(device.Path, stagingtargetPath, mountOptions, false /* raw block */); err != nil {
return nil, status.Error(codes.Internal, err.Error())
}

@@ -433,12 +517,7 @@ func (ns *nodeServer) NodeExpandVolume(context.Context, *csi.NodeExpandVolumeReq

// createEphemeralDevice creates new pmem device for given req.
// On failure it returns one of status errors.
func (ns *nodeServer) createEphemeralDevice(ctx context.Context, req *csi.NodePublishVolumeRequest) (*pmdmanager.PmemDeviceInfo, error) {
parameters, err := parseVolumeParameters(ephemeralVolumeParameters, req.GetVolumeContext())
if err != nil {
return nil, status.Error(codes.InvalidArgument, "ephemeral inline volume parameters: "+err.Error())
}

func (ns *nodeServer) createEphemeralDevice(ctx context.Context, req *csi.NodePublishVolumeRequest, parameters volumeParameters) (*pmdmanager.PmemDeviceInfo, error) {
// Create new device, using the same code that the normal CreateVolume also uses,
// so internally this volume will be tracked like persistent volumes.
volumeID, _, err := ns.cs.createVolumeInternal(ctx, parameters, req.VolumeId,
@@ -463,8 +542,8 @@ func (ns *nodeServer) createEphemeralDevice(ctx context.Context, req *csi.NodePu
return device, nil
}

// provisionDevice initializes the device with requested filesystem
// and mounts at given targetPath.
// provisionDevice initializes the device with requested filesystem.
// It can be called multiple times for the same device (idempotent).
func (ns *nodeServer) provisionDevice(device *pmdmanager.PmemDeviceInfo, fsType string) error {
if fsType == "" {
// Empty FsType means "unspecified" and we pick default, currently hard-coded to ext4
@@ -493,7 +572,8 @@ func (ns *nodeServer) provisionDevice(device *pmdmanager.PmemDeviceInfo, fsType
return nil
}

func (ns *nodeServer) mount(sourcePath, targetPath string, mountOptions []string) error {
// mount creates the target path (parent must exist) and mounts the source there. It is idempotent.
func (ns *nodeServer) mount(sourcePath, targetPath string, mountOptions []string, rawBlock bool) error {
notMnt, err := ns.mounter.IsLikelyNotMountPoint(targetPath)
if err != nil && !os.IsNotExist(err) {
return fmt.Errorf("failed to determain if '%s' is a valid mount point: %s", targetPath, err.Error())
@@ -502,13 +582,20 @@ func (ns *nodeServer) mount(sourcePath, targetPath string, mountOptions []string
return nil
}

if err := os.Mkdir(targetPath, os.FileMode(0755)); err != nil {
// Kubernetes is violating the CSI spec and creates the
// directory for us
// (https://github.com/kubernetes/kubernetes/issues/75535). We
// allow that by ignoring the "already exists" error.
if !os.IsExist(err) {
return fmt.Errorf("failed to create '%s': %s", targetPath, err.Error())
// Create target path, using a file for raw block bind mounts
// or a directory for filesystems. Might already exist from a
// previous call or because Kubernetes erroneously created it
// for us.
if rawBlock {
f, err := os.OpenFile(targetPath, os.O_CREATE, os.FileMode(0644))
if err == nil {
defer f.Close()
} else if !os.IsExist(err) {
return fmt.Errorf("create target device file: %w", err)
}
} else {
if err := os.Mkdir(targetPath, os.FileMode(0755)); err != nil && !os.IsExist(err) {
return fmt.Errorf("create target directory: %w", err)
}
}

