-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4639: adding OCI VolumeSource #4642
Conversation
sallyom
commented
May 17, 2024
•
edited by saschagrunert
Loading
edited by saschagrunert
- One-line PR description: adding OCI image and/or artifact VolumeSource
- VolumeSource: OCI Artifact and/or Image #4639
da2605a
to
872e7a5
Compare
Can this feature be provided out-of-tree (for example, using CSI)? |
<!-- | ||
What other approaches did you consider, and why did you rule them out? These do | ||
not need to be as detailed as the proposal, but should include enough | ||
information to express the idea and why it was not acceptable. | ||
--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If CSI is an option for this, explain how that alternative could look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A CSI driver would be a better approach IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Runtimes already pull and handle images using the CRI. I'm wondering how a CSI parallel pull, auth and storage implementation is better than extending an existing solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a project that took this approach. https://github.com/converged-computing/oras-csi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vsoch very cool! is it possible to gate pulls based on signature verification & policy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry missed this! Not at the moment, but I don't see why not.
We propose to add a new `VolumeSource` that supports OCI images and/or artifacts. This `VolumeSource` will allow users to mount an OCI image or artifact directly into a pod, | ||
making the files within the image accessible to the containers without the need for a shell or additional image manipulation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which kinds of artifacts? Not all OCI artifacts represent filesystems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here I'm referencing files - the equivalent of an OCI scratch image - for container volume mounts - an artifact that can be represented as a file is all that makes sense here I think - but really, anything that can be mounted within a pod's or container's filesystem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will the volume driver know what kind of OCI artifact it's trying to pull and mount? I know “container image” is one of the options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we handle artifacts/images for pods shared across containers in a completely different way than artifacts/images of individual containers... hmm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this does end up in-tree, I'd hope to see a dedicated artifact type that means “here are files, don't try to run them”; this is just data.
An example: you run an antimalware service with scanning patterns for email messages. The scanner uses a blue/green approach where the same application container image gets deployed alongside an updated scanning patterns image, n times an hour. Both the application image artifact and the scanning pattern artifact are signed. The files in that signature image consist of compiled patterns that would work as a volume source, but there's no executable code in there.
(I do like the idea of making signature verification an eventual goal; that's a useful feature to leave room for).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me what needs clarifying is:
- are we planning to support OCI container images as a readonly volume type, using CRI to handle those images but not to run any code?
- are we planning to support arbitrary OCI artifacts as volume sources, include OCI container images as one of potentially many kinds of artifact you can mount?
If ① then CRI feels like the better fit; that way we can use the Pod's RuntimeClass to fetch the image, in case it matters. If ② then we can't assume CRI knows about that kind of artifact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way we should document the alternative, even if we rule it out. It's good for KEPs to mention a sketch of the design(s) we didn't pick as well as the one we did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the arbitrary OCI artifacts note.. needs investigation/discussion.. We are talking about setting up a call to discuss next week.
There are a number of ways artifacts have been implemented pre. and post oci 1.1 specs.
@SFTM to your 1. proposal question: What does it mean to provide a mount volume into the pod's containers but not to run any code? I get that we are not expanding a pod to be an arbitrary container image/artifact with it's own command to run (yet). Sure. But what does it mean to mount a container image or artifact, but not to run any code there or from there? Does the volume include platforms/versions or just the one image the oci tooling would've chosen if it was being run on the node as a container by a runtime engine, does the volume include the entire index manifest tree/chain, manifest configs layers and all? Or just the layers that would otherwise makeup the container's rootfs?
To your 2. proposal we also can't assume any CSI driver would know about all particular kinds of artifacts, registries and how to pull, parse & mount them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the arbitrary OCI artifacts note.. needs investigation/discussion.. We are talking about setting up a call to discuss next week.
The OCI volume source should be a mountable artifact in the first place.
Arbitrary OCI artifacts may not work very well. ORAS supports pushing any object to the registry and generating its OCI artifacts, but not all of them can be mounted as volumes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If ① then CRI feels like the better fit; that way we can use the Pod's RuntimeClass to fetch the image, in case it matters
This may matter for VM-based runtimes that use block devices to expose the image rootfs.
But what does it mean to mount a container image or artifact, but not to run any code there or from there? Does the volume include platforms/versions or just the one image the oci tooling would've chosen if it was being run on the node as a container by a runtime engine, does the volume include the entire index manifest tree/chain, manifest configs layers and all? Or just the layers that would otherwise makeup the container's rootfs?
As a basic idea here: the one image the container runtime would have chosen (i.e., matching platforms in the case of an index) as a union (or flattened), but without the config (since the config does not affect any files). IOW the union of the layers that would otherwise make up the container's rootfs (and probably without any runtime-generated files such as /etc/hosts
and without runtime-generated mounts like /proc
or /sys
).
If ② then we can't assume CRI knows about that kind of artifact.
To your 2. proposal we also can't assume any CSI driver would know about all particular kinds of artifacts, registries and how to pull, parse & mount them.
An argument in favor of the CSI approach is that a cluster admin could deploy a CSI driver that matches whatever artifact types they expect to have used in the cluster. On the other hand, another approach that we could take on the runtime side is to build out an extension mechanism such that the runtime delegates to some sort of mount-provider program based on the mediatype that is found after pulling the artifact.
Anyone know what happened with this? I'll add this as an Alternative, I'm gathering info csi-driver-image-populator |
Signed-off-by: sallyom <somalley@redhat.com>
872e7a5
to
b00c454
Compare
Signed-off-by: sallyom <somalley@redhat.com>
b00c454
to
e19f0a2
Compare
Signed-off-by: sallyom <somalley@redhat.com>
d3a6089
to
fd15ec5
Compare
Noticing that there are a lot of groups working on "similar" things. |
- Update Kubelet and CRI to recognize and handle new media types associated with OCI artifacts. | ||
- Ensure that pulling and storing these artifacts is as efficient and secure as with OCI images. | ||
|
||
**Lifecycling and Garbage Collection:** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
### [KEP 1495: Volume Populators](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1495-volume-populators) | ||
|
||
The volume-populators API extension allows you to populate a volume with data from an external data source when the volume is created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Volume populator with a datasource pointing to a OCI artifact on a registry could still be an option. But with all the options k8s has, prudent to pick one.
### Custom CSI Plugin | ||
|
||
See [https://github.com/warm-metal/container-image-csi-driver](https://github.com/warm-metal/container-image-csi-driver) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
volumes and container images. The in-tree OCI VolumeSource will utilize these existing mechanisms. | ||
|
||
2. **Integration with Kubernetes:** | ||
- **Optimal Performance:** Deep integration with the scheduler and kubelet ensures optimal performance and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More details needed about how one would handle large OCI artifacts - AI/ML inference data use case for example. The real win we are looking for here is the latency/network cost of pulling said artifact and not needing to repeat this across multiple pod restarts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://kubernetes.io/docs/concepts/storage/volumes/#gitrepo
^ could be similar to this, deprecated though in favor on initContainer/emptyDir
We likely need a "persistent" (cache?) volume source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://kubernetes.io/docs/concepts/storage/volumes/#gitrepo ^ could be similar to this, deprecated though in favor on initContainer/emptyDir
We likely need a "persistent" (cache?) volume source.
On-demand loading with a separate standalone cache can handle large artifact sizes very well. We have done some POC about it.
If the artifacts support lazy loading, then only some metadata needs to be saved locally, and there is no need to worry too much about garbage collection or layer data reuse. We only need a local cache system to save temporary data, and the management of images/artifacts can be easier.
an additional operational burden. For a generic, vendor-agnostic, and widely-adopted solution this would not make sense. | ||
- External CSI plugins implement their own lifecycle management and garbage collection mechanisms, | ||
yet these already exist in-tree for OCI images. | ||
- Performance: There is additional overhead with an out-of-tree CSI plugin, especially in scenarios requiring frequent image pulls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you expand more on what is this additional overhead?
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
fb66d1b
to
9546ed9
Compare
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
9546ed9
to
e23bc03
Compare
Requesting approval from @kubernetes/sig-node-feature-requests |
Co-authored-by: Brandon Mitchell <git@bmitch.net>
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
excellent work here! /lgtm |
/approve |
@kubernetes/prod-readiness-reviewers PTAL |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mrunalp, sallyom, soltysh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
||
To clear old volumes, all workloads using the `VolumeSource` needs to be | ||
recreated after restarting the kubelets. The kube-apiserver does only the API | ||
validation whereas the kubelets serve the implementation. This means means that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
s/means means/means/