
Support storing Ollama [non-]OCI image layers #2075

Draft
wants to merge 1 commit into base: main

Conversation

@yeahdongcn (Author) commented Aug 26, 2024

Background:

Kubernetes 1.31 introduced a new feature: Read-Only Volumes Based on OCI Artifacts. I believe this feature could be very useful for deploying a dedicated model alongside Ollama in Kubernetes.

Ollama has introduced several new media types (e.g. application/vnd.ollama.image.model) for storing GGUF models, system prompts, and more. Each layer is essentially a file and does not need to be untarred.

A PR for containers/image has added layerFilename to addedLayerInfo, and this PR handles the layer creation through the overlay driver.
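
For context, the containers/storage side of the change reduces to one new optional field on ApplyDiffOpts. A simplified sketch of the shape (not the full patch):

package graphdriver

import (
    "io"
    "os"
)

// Simplified sketch of the option added by this PR: when LayerFilename is
// set, the graph driver stores the incoming blob as a single file with that
// name inside the layer instead of untarring it.
type ApplyDiffOpts struct {
    Diff              io.Reader
    MountLabel        string
    IgnoreChownErrors bool
    ForceMask         *os.FileMode
    LayerFilename     *string // nil for ordinary tar layers
}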

Please see the following logs for instructions on how to mount the Ollama image as a volume:

# Copied from testdata and added mounts information
❯ cat container.json
{
  "metadata": {
    "name": "podsandbox-sleep"
  },
  "image": {
    "image": "registry.docker.com/ollama/ollama:latest"
  },
  "command": [
    "/bin/sleep",
    "6000"
  ],
  "args": [
    "6000"
  ],
  "working_dir": "/",
  "envs": [
    {
      "key": "PATH",
      "value": "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    },
    {
      "key": "GLIBC_TUNABLES",
      "value": "glibc.pthread.rseq=0"
    }
  ],
  "annotations": {
    "pod": "podsandbox"
  },
  "log_path": "",
  "stdin": false,
  "stdin_once": false,
  "tty": false,
  "linux": {
    "security_context": {
      "namespace_options": {
        "pid": 1
      },
      "readonly_rootfs": false
    },
    "resources": {
      "cpu_period": 10000,
      "cpu_quota": 20000,
      "cpu_shares": 512,
      "oom_score_adj": 30,
      "memory_limit_in_bytes": 268435456
    }
  },
  "mounts": [
    {
      "host_path": "",
      "container_path": "/volume",
      "image": {
        "image": "registry.ollama.ai/library/tinyllama:latest"
      },
      "readonly": true
    }
  ]
}
# copied from testdata
❯ cat sandbox_config.json
{
        "metadata": {
                "name": "podsandbox1",
                "uid": "redhat-test-crio",
                "namespace": "redhat.test.crio",
                "attempt": 1
        },
        "hostname": "crictl_host",
        "log_directory": "",
        "dns_config": {
                "servers": [
                        "8.8.8.8"
                ]
        },
        "port_mappings": [],
        "resources": {
                "cpu": {
                        "limits": 3,
                        "requests": 2
                },
                "memory": {
                        "limits": 50000000,
                        "requests": 2000000
                }
        },
        "labels": {
                "group": "test"
        },
        "annotations": {
                "owner": "hmeng",
                "security.alpha.kubernetes.io/seccomp/pod": "unconfined",
                "com.example.test": "sandbox annotation"
        },
        "linux": {
                "cgroup_parent": "pod_123-456.slice",
                "security_context": {
                        "namespace_options": {
                                "network": 2,
                                "pid": 1,
                                "ipc": 0
                        },
                        "selinux_options": {
                                "user": "system_u",
                                "role": "system_r",
                                "type": "svirt_lxc_net_t",
                                "level": "s0:c4,c5"
                        }
                }
        }
}
❯ sudo crictl --timeout=200s --runtime-endpoint unix:///run/crio/crio.sock run ./container.json ./sandbox_config.json
INFO[0005] Pulling container image: registry.docker.com/ollama/ollama:latest 
INFO[0005] Pulling image registry.ollama.ai/library/tinyllama:latest to be mounted to container path: /volume 
7e437894449f6429799cc5ef236c4a4570a69e3769bf324bbf700045e383cae8
❯ sudo crictl --timeout=200s --runtime-endpoint unix:///run/crio/crio.sock ps
CONTAINER           IMAGE                                        CREATED             STATE               NAME                ATTEMPT             POD ID              POD
7e437894449f6       registry.docker.com/ollama/ollama:latest     8 seconds ago       Running             podsandbox-sleep    0                   4d1766fdf286b       unknown
❯ sudo crictl --timeout=200s --runtime-endpoint unix:///run/crio/crio.sock exec -it 7e437894449f6 bash
root@crictl_host:/# cd volume/
root@crictl_host:/volume# ls -l
total 622772
-rw-r--r-- 1 root root 637699456 Aug 26 08:32 model
-rw-r--r-- 1 root root        98 Aug 26 08:32 params
-rw-r--r-- 1 root root        31 Aug 26 08:32 system
-rw-r--r-- 1 root root        70 Aug 26 08:32 template
root@crictl_host:/volume# 

openshift-ci bot (Contributor) commented Aug 26, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yeahdongcn
Once this PR has been reviewed and has the lgtm label, please assign flouthoc for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@rhatdan (Member) commented Aug 26, 2024

@baude is working on storing and producing artifacts. He is currently looking at storing these at a higher level than containers/storage, but under the container storage directory tree.

I will point him to this issue.

BTW, have you looked at ramalama, an alternative to ollama?

@baude (Member) commented Aug 26, 2024

Yes, we are working on something called libartifact that will mimic some of the behaviors of c/s but will be singularly purposed for OCI artifacts. That work has started but is in its infancy.

@cgwalters (Contributor) commented:

yes, we are working on something called libartifact

Is there a tracking/design issue for this? I have Opinions on this.

@cgwalters (Contributor) left a review:

Thanks for working on this!

@@ -790,6 +790,23 @@ func supportsOverlay(home string, homeMagic graphdriver.FsMagic, rootUID, rootGI
return supportsDType, fmt.Errorf("'overlay' not found as a supported filesystem on this host. Please ensure kernel is new enough and has overlay support loaded.: %w", graphdriver.ErrNotSupported)
}

func cp(r io.Reader, dest string, filename string) error {
if seeker, ok := r.(io.Seeker); ok {
Contributor:

I think this is something that should have generally been set up by the caller.

}
}

f, err := os.Create(filepath.Join(dest, filename))
Contributor:

Let's use ioutils.NewAtomicFileWriter to ensure we have more transactional/idempotent semantics (see its various uses in this repo).
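
A minimal sketch of that suggestion (the cp signature follows the PR; the file mode and error handling are assumptions):

package overlay

import (
    "io"
    "path/filepath"

    "github.com/containers/storage/pkg/ioutils"
)

// cp stages the blob in a temporary file in the destination directory and
// renames it over the final path when Close succeeds, so a partially written
// file is not left behind under the final name.
func cp(r io.Reader, dest string, filename string) error {
    f, err := ioutils.NewAtomicFileWriter(filepath.Join(dest, filename), 0o644)
    if err != nil {
        return err
    }
    _, copyErr := io.Copy(f, r)
    // Note: a production version would also need to make sure a copy that
    // failed because the *reader* broke mid-stream is not committed.
    closeErr := f.Close()
    if copyErr != nil {
        return copyErr
    }
    return closeErr
}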

Collaborator:

Layer creation already has transaction semantics via incompleteFlag; doing that for individual files inside a layer is unnecessary.

InUserNS: unshare.IsRootless(),
}); err != nil {
return 0, err
if options.LayerFilename != nil {
Contributor:

I am not a real expert in c/storage but it is a fact that the codebase predates the creation/concept of artifacts, and is very much designed around storing layers.

I suspect that this options.LayerFilename conditional thing could use a bit more design bikeshedding. I haven't looked...but for example, I think it may make more sense to actually have a separate datastore path entirely for artifacts that just happens to share code with c/storage, instead of trying to co-locate artifacts.

@mtrmac (Collaborator) left a review:

Thanks!

We need to have a proper design discussion about storing non-image artifacts, and what does that mean.

Do not merge until that happens.

(Also ☠️ privilege escalation vulnerability.)


The “image volume” feature explicitly talks about image volumes. I don’t see that arbitrarily extending it to also accept other content is obviously desirable.

And even if it were desirable at the Kubernetes level, that doesn’t at all imply that the data should be stored in “images” with “layers” and OverlayFS and that it should be possible to podman run the thing.


… and even if we wanted to present this data as storage.Image/storage.Layer objects, I don’t know why we would want to use overlay this way; that just adds overhead.

Maybe we need that, maybe we don’t; but that requires actually thinking about that.


All of this is at best a sketch of an implementation. It corrupts data. It silently does the wrong thing on other code paths.

It doesn’t allow pushing the consumed data afterwards, and it’s not obvious how the design for that would work within the existing .Diff API.

@@ -72,6 +72,7 @@ type ApplyDiffOpts struct {
MountLabel string
IgnoreChownErrors bool
ForceMask *os.FileMode
LayerFilename *string
Collaborator:

What happens in all of the other storage drivers which weren’t modified?

Comment on lines +2480 to +2481
layer.CompressedSize = size
layer.UncompressedSize = size
Collaborator:

All of this just doesn’t work; the data was already interpreted as a tar file, and uncompressed. This is at best pretending that didn’t happen.

return 0, err
if options.LayerFilename != nil {
logrus.Debugf("Applying file in %s", applyDir)
err := cp(options.Diff, applyDir, *options.LayerFilename)
@mtrmac (Collaborator) commented Aug 26, 2024:

☠️

This is an unrestricted path traversal vulnerability, run as root.
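
For reference, the usual mitigation is to refuse anything that is not a plain base name before the value ever reaches the filesystem. A standalone sketch (function name and wiring are illustrative, not from the PR):

package main

import (
    "fmt"
    "path/filepath"
    "strings"
)

// validateLayerFilename rejects empty names, "." and "..", and anything
// containing a path separator, so a value like "../../etc/cron.d/evil" can
// never escape the directory it is joined onto.
func validateLayerFilename(name string) error {
    if name == "" || name == "." || name == ".." ||
        name != filepath.Base(name) ||
        strings.ContainsAny(name, `/\`) {
        return fmt.Errorf("invalid layer filename %q", name)
    }
    return nil
}

func main() {
    for _, n := range []string{"model", "system", "../../etc/cron.d/evil", "a/b"} {
        fmt.Println(n, "=>", validateLayerFilename(n))
    }
}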

return 0, err
if options.LayerFilename != nil {
logrus.Debugf("Applying file in %s", applyDir)
err := cp(options.Diff, applyDir, *options.LayerFilename)
Collaborator:

options.Diff is an uncompressed stream (that the caller is interpreting as a tar). This just doesn’t work for arbitrary blobs.

@mtrmac (Collaborator) commented Aug 26, 2024

Compare also containers/skopeo#2395 (and, I think, ollama/ollama#6510): “Ollama” itself is not even remotely an OCI-compliant registry right now.

What is the ecosystem and community situation for this data format? Is this a multi-faceted interoperability effort, where all of these things are going to happily fall into place, or is this setting us up on an adversarial interoperability path?

Cc: @baude , another thing to have a very clear position on.

@cgwalters (Contributor) commented:

This is an unrestricted path traversal vulnerability, run as root.

Right, good call out. While we should probably debate all of this in a proper design issue, my strawman starting point would be to support storing OCI artifacts in a possibly tweaked version of the standard OCI image layout in a standard way in a subdirectory of a containers-storage, i.e. /var/lib/containers/storage/artifacts or so.

@mtrmac (Collaborator) commented Aug 26, 2024

I think a useful general intuition is “remote OCI artifact is no more specific than an URL; storing OCI artifacts is no more specific than storing files”.

It is obviously possible to work at that level of abstraction (to pull URLs, to store files), and it happens all the time; but it’s extremely rarely the most convenient abstraction for applications to work with, with no other support.


Applications don’t inherently want to traverse OCI structures. So maybe we should hard-code some specific artifact MIME types, and provide “native” storage / filesystem formats? (This pair of PRs is already very strongly going in that direction, hard-coding aspects of one specific structure, which is not even OCI.)


To me, the primary point is that container runtimes don’t have to be involved in consuming artifacts. For any proposed artifact use:

  • A container can pull that artifact at runtime into ephemeral storage or a volume. Why is doing that by the container runtime better?
  • An image creator can pull the artifact’s data into an ordinary image layer as files. (Perhaps exactly one layer per one artifact, to maximize reuse.) Why is native support by a container runtime better?

And we should only be considering use cases where the two alternatives above are not suitable, i.e. where a container-runtime native feature adds a clearly-identified value.

What are we doing here?

  • Squeezing out every bit of performance, and mounting filesystems which don’t use overlay? (Is that material? What do the numbers say?)
  • Giving users an independent name for some data, separate from the application? That runs directly counter to the “a container image is a whole application in one, what you test in CI is what you run in production” proposition of containers.
  • Naming the data because it is so large that only one copy can ever fit on a node, and we are going to be scheduling applications to k8s nodes based on data locality? I guess… really?
  • “Someone in position of authority said that we must support pulling artifacts”? Not specific enough of a reason.

@cgwalters (Contributor) commented:

Giving users an independent name for some data, separate from the application? That runs directly counter to the “a container image is a whole application in one, what you test in CI is what you run in production” proposition of containers.

We absolutely expect people to dynamically link (i.e. runtime) containers with things like ConfigMaps, and that also makes sense for "data" like AI models and other things.

A container can pull that artifact at runtime into ephemeral storage or a volume. Why is
doing that by the container runtime better?

I think it's more that there are some common needs between "artifacts" and runnable container images. For example, signing support, runtime integrity, an API to name and version them, support for concurrent reads/writes, garbage collection, etc. Whether those needs are met by literally having them in the same codebase as "the container runtime" or whether they're better met by the two things using shared libraries is more of an implementation debate - but the need for some common functionality is IMO clear.

On the topic of runtime integrity, for example, I am strongly of the opinion that the underlying technical heart for us to use to store "files" (container images and artifact content that can be mounted) is composefs - it allows robust online verification, deduplication, etc. And there is some composefs-related code in this repository (though I definitely want to change some of how it works...more on that later).

@yeahdongcn (Author) commented Aug 27, 2024

I think a useful general intuition is “remote OCI artifact is no more specific than an URL; storing OCI artifacts is no more specific than storing files”. […]

I'm sending out these two PRs (my initial work) to bring your attention to using OCI images to deliver artifacts like GGUF models and to integrate them into the Kubernetes ecosystem. Considering the original purpose of containers/image and containers/storage, which may be more specific to Docker (container) images than to OCI images, it might not be appropriate to add vendor-specific code to them.

@yeahdongcn (Author) commented:

@baude is working on storing and producing artifacts. He is currently looking at storing these at a higher level than containers/storage, but under the container storage directory tree.

I will point him to this issue.

BTW, have you looked at ramalama, an alternative to ollama?

Ramalama looks cool! But I'm looking for a way to avoid reinventing the CLI and instead leverage existing infrastructure.

@mtrmac (Collaborator) commented Aug 27, 2024

We absolutely expect people to dynamically link (i.e. runtime) containers with things like ConfigMaps

That’s a very fair point.


I think it's more that there are some common needs between "artifacts" and runnable container images. For example, signing support,

Fair, and that sort of implies the same on-registry storage, but it doesn’t imply storing them in the same overlay-based format, or even under the same namespace, as images.

runtime integrity

We need to define local storage mechanism first. Maybe there will be nothing in common.

an API to name and version them

(I wouldn’t say that what we have with container images, basically two plain text strings, is the best that can be done. It isn’t even clearly defined what is the “application” that is being versioned and what is the version number, see OpenShift shipping dozens of functionally-different images all in one repo. And that, in turn, is one of the blockers to a meaningful concept of “an application update”.)

, support for concurrent reads/writes

That’s ~not a native Kubernetes feature, IIRC (“use NFS”), not shared with containers, and ~incompatible with the “runtime integrity” desire.


On the topic of runtime integrity for examlpe I am strongly of the opinion that the underlying technical heart for us to use to store "files" (container images and artifact content that can be mounted) is composefs

I think that’s possible but at this point it’s not obvious to me the details will be shared — for images we want runtime integrity not just at a layer level, but for the whole layer sequence, and that might not apply to other data.

@mtrmac (Collaborator) commented Aug 27, 2024

I'm sending out these two PRs (my initial work) to bring your attention to using OCI images to deliver artifacts like GGUF models and to integrate them into the Kubernetes ecosystem. Considering the original purpose of containers/image and containers/storage, which may be more specific to Docker (container) images than to OCI images, it might not be appropriate to add vendor-specific code to them.

c/storage and c/image does target OCI images, not just the ~frozen schema[12] formats. And I think prototyping / figuring out how to distribute large data (of any kind) in the OCI ecosystem is clearly in scope of the project. But starting that conversation with directly using a ~completely OCI-foreign format is a very surprising place to start. This PR, as proposed, has ~nothing in common with, as you say, “using OCI images to deliver…”.

E.g. I could well imagine that somebody somewhere defines an OCI artifact MIME type + format for storing this data. Maybe that format should have a specific implementation in c/image + c/storage, separate from a general OCI artifact feature. In such a hypothetical ecosystem, it might well make sense to have a utility that converts the Ollama data into the OCI artifact format, as a compatibility/interoperability mechanism.

Or, to say this another way, if the Ollama format should become the way to “use OCI images”, that needs to be proposed to the OCI image-spec maintainers. (And considering that the OCI artifact format, with its full generality, exists, I don’t know why they would be inclined to support that proposal, but it’s also not up to me.)

@mtrmac (Collaborator) commented Aug 27, 2024

I think it's more that there are some common needs between "artifacts" and runnable container images. For example, signing support, runtime integrity, an API to name and version them, support for concurrent reads/writes, garbage collection, etc.

On second thought — name/version the AI data on the build system side, not on the consumer side. Build the AI data into the container image as a (bit-for-bit reproducible) layer. Net effect: most of the features listed are already there.

  • Signing: Inherited from image support.
  • Runtime integrity: Inherited from image support (if any)
  • API to name/version: Becomes unnecessary, included in the application. If you want a new version of the data, deploy a new version of the application, on-registry blobs will be shared.
  • Garbage collection: inherited from image support, the need to do that separately completely goes away.

The one thing that I see missing from the above is the efficiency of avoiding overlay (does that matter??). For that, I could plausibly see an OCI image format extension, where any layer can:

  • opt out of overlay, instead be mounted directly into some path nested inside the running container
  • optionally use some other non-tar file format for the data, maybe native EROFS, or something optimized for rdiff updates.
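
To make that concrete, such an extension could be expressed as layer annotations. A purely hypothetical sketch; the media type and annotation keys below are made up for illustration and exist in no spec:

package main

import (
    "encoding/json"
    "fmt"

    digest "github.com/opencontainers/go-digest"
    v1 "github.com/opencontainers/image-spec/specs-go/v1"
)

func main() {
    data := []byte("example model data") // stand-in for GGUF model bytes

    // A layer descriptor that opts out of overlay and asks to be mounted at a
    // fixed path inside the running container (hypothetical annotation keys).
    desc := v1.Descriptor{
        MediaType: "application/vnd.example.model.layer.v1", // hypothetical
        Digest:    digest.FromBytes(data),
        Size:      int64(len(data)),
        Annotations: map[string]string{
            "org.example.layer.mount-path": "/model", // hypothetical
            "org.example.layer.no-overlay": "true",   // hypothetical
        },
    }
    out, err := json.MarshalIndent(desc, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out))
}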

@mtrmac changed the title from "Support storing Ollama OCI image layers" to "Support storing Ollama [non-]OCI image layers" on Aug 27, 2024
@cgwalters (Contributor) commented:

, support for concurrent reads/writes
That’s ~not a native Kubernetes feature, IIRC (“use NFS”), not shared with containers, and ~incompatible with the “runtime integrity” desire.

I more meant support for pulling multiple artifacts at a time and garbage collection, not for mutating individual files in artifacts. We're not talking about log files or databases as artifacts, we're talking about configmaps, AI models, dpkg/rpm/language-package-manager-etc content (this is a big one).

@mtrmac (Collaborator) commented Aug 27, 2024

Make the data layers part of the application image; that inherits existing concurrent pulling.

@cgwalters (Contributor) commented:

(There's probably a better place to discuss a design for OCI artifacts, not totally sure where; I did create https://github.com/cgwalters/composefs-oci recently which touches on this and starts with composefs as the heart of things, for general interest)

Make the data layers part of the application image; that inherits existing concurrent pulling.

By "application image" here you mean an OCI image (not an artifact)? If so your suggestion seems tantamount to "have a build process that transforms OCI artifacts into a tarball which can become a layer" which is basically saying "don't use OCI artifacts natively" as far as I can tell.

The use cases I have in mind very much want to maintain the identity of an artifact end-to-end from the registry to the client system.

@mtrmac (Collaborator) commented Sep 11, 2024

By "application image" here you mean an OCI image (not an artifact)? If so your suggestion seems tantamount to "have a build process that transforms OCI artifacts into a tarball which can become a layer" which is basically saying "don't use OCI artifacts natively" as far as I can tell.

Close enough. OCI artifacts are by nature immutable, they are not mutable volumes.

The difference between a pair of (immutable OCI image used as a root FS, immutable OCI artifact mounted into the root FS), and (immutable OCI image also containing the contents of the artifact) is… just in whether the metadata is combined at build time or at runtime. To combine the two, we are talking about conceptually cheap metadata operations either way, although aligning all inputs to allow that cheap metadata operation to happen can be pretty complex.

If we have to implement something complex, I’d rather do it in the concentrated build pipeline than at runtime in every single cluster node. That removes complexity from the runtimes; removes complexity from whole application deployment pipeline; removes complexity from the application tracking — everything has a single “image reference” that uniquely references the whole thing, instead of having to add a new “and also mount these artifacts” feature all over the ecosystem. A hypothetical fancy AI model management product can build a single clearly-identified application image as an output, and that application image can then be deployed to ~existing clusters.

(I do acknowledge that ConfigMaps are a thing, but K8s imposes severe size and count limitations on them (because they are implemented by storing them in the etcd consensus-maintained database); and they naturally follow a split between “application vendor” and “sysadmin deploying the application”, that might be entirely different organizations. What is the role split between application author and model data creator that would warrant a similarly separate OCI artifact?)

@cgwalters (Contributor) commented:

This discussion also relates to e.g. opencontainers/image-spec#1197 (comment) and the linked opencontainers/image-spec#1190.

And perhaps what would make the most sense is for "us" to form a consensus that we take to the spec?

The difference between a pair of (immutable OCI image used as a root FS, immutable OCI artifact mounted into the root FS), and (immutable OCI image also containing the contents of the artifact) is… just in whether the metadata is combined at build time or at runtime.

I don't agree. I think tooling that uses OCI artifacts "natively" would include maintaining their independent identity all the way to the client end - for example, key metadata such as an org.opencontainers.image.version label would (if implemented naively) just get lost when translating the artifact to a tar layer. (And if implemented not-naively, some interesting design questions arise as to the mapping of artifact to tar)

everything has a single “image reference” that uniquely references the whole thing, instead of having to add a new “and also mount these artifacts” feature all over the ecosystem.

There's always a long-running tension between "split into lots of individually updatable bits" and "stitch things together with higher level snapshots". The way OCP does the "release image" is in this vein, as is "big bag of free-floating RPMs" and a "compose".

Recently in bootc we also added logically bound images which are very much in this space, and note it's actually an OCI image that has references to child images, but tied together as you say into one super "image reference". (Except it has the downside today that bootc doesn't know how much data it will need to download to succeed at an update, xref containers/bootc#128 (comment) )

I agree with you that the problem domain of "mount these artifacts" is a bit open ended and would require some standardization work that could be trivially bypassed by just materializing them as tar layers in a container image context.

A hypothetical fancy AI model management product can build a single clearly-identified application image as an output, and that application image can then be deployed to ~existing clusters.

When one is talking about Large Data like AI models (but not exclusively), I think in many operational contexts it will be very much desired for them to be decoupled from the application and "dynamically linked" in effect.

I do acknowledge that ConfigMaps are a thing, but K8s imposes severe size and count limitations on them (because they are implemented by storing them in the etcd consensus-maintained database)

Yes but that's just because OCI artifacts didn't exist when ConfigMaps were invented! I think it would make total sense to store them as artifacts in a registry - do you agree?

On the bootc side I also want to support configmaps attached to the host that can be rev'd independently from the base OS image and dynamically updated without restart (if desired) for basically exactly the same reason Kubernetes does.

@cgwalters (Contributor) commented Sep 11, 2024

Also I wanted to say another downside IMO of flattening artifacts to tar layers is that because OCI images are architecture dependent, if we're talking about Large artifacts, one needs to be sure those tar layers are generated "reproducibly" and do not include random junk data with floating timestamps and the like, so you don't duplicate your artifact data across N architectures (and also lacking reproducibility would mean it just gets duplicated on each rebuild, even if the data didn't change).
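
For illustration, a reproducible tar layer mostly comes down to a fixed entry order and pinned metadata. A minimal sketch (file names and the pinned timestamp are illustrative):

package main

import (
    "archive/tar"
    "os"
    "sort"
    "time"
)

// writeReproducibleLayer writes the given files into a tar stream with sorted
// entry order, a fixed timestamp, and zeroed ownership, so the same input
// bytes always produce the same layer digest, on every architecture and on
// every rebuild.
func writeReproducibleLayer(out *os.File, files map[string][]byte) error {
    names := make([]string, 0, len(files))
    for name := range files {
        names = append(names, name)
    }
    sort.Strings(names) // deterministic entry order

    tw := tar.NewWriter(out)
    epoch := time.Unix(0, 0) // pinned mtime; any fixed value works
    for _, name := range names {
        hdr := &tar.Header{
            Name:    name,
            Mode:    0o644,
            Size:    int64(len(files[name])),
            ModTime: epoch,
            Uid:     0,
            Gid:     0,
        }
        if err := tw.WriteHeader(hdr); err != nil {
            return err
        }
        if _, err := tw.Write(files[name]); err != nil {
            return err
        }
    }
    return tw.Close()
}

func main() {
    f, err := os.Create("model-layer.tar")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    files := map[string][]byte{"model/weights.gguf": []byte("example model bytes")}
    if err := writeReproducibleLayer(f, files); err != nil {
        panic(err)
    }
}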

@cgwalters (Contributor) commented:

Make the data layers part of the application image; that inherits existing concurrent pulling.
....
instead of having to add a new “and also mount these artifacts” feature all over the ecosystem.

OK just to play out your suggestion a bit more...if "we" wanted to encourage that approach I think there'd need to be some recommended tooling at least integrated with a Containerfile-style build flow as a default entrypoint.

I think a strawman for that would be a lot like COPY --link, except also with support for mapping artifact (non-tar) layers to tar...which still has all the open questions about how exactly artifact layers appear as files in a generic context. But maybe a simple strawman would be COPY --link --from=aimodel / /aimodel, and one would end up with e.g. /aimodel/layers/[0..n] that are integer indices into the layers. Hmm, and we should probably also have /aimodel/manifest.json so an artifact-aware tool can parse that for metadata. And at this point we're just defining a standard serialization of an OCI artifact to a single filesystem tree, which is very much what I was also looking at in containers/composefs#294 as a way to ensure that metadata + data is covered by fsverity.

@mtrmac (Collaborator) commented Sep 11, 2024

A hypothetical fancy AI model management product can build a single clearly-identified application image as an output, and that application image can then be deployed to ~existing clusters.

When one is talking about Large Data like AI models (but not exclusively), I think in many operational contexts it will be very much desired for them to be decoupled from the application and "dynamically linked" in effect.

“Dynamically linking” by uploading a new manifest to a registry… is different but not obviously inferior, of course as long as that does not trigger a transfer of gigabytes of the raw data. (Compare your proposal to replace config maps with on-registry data, that’s the same thing!) I can see an argument that it’s a “worse is better” solution, giving up on explicitly modeling the versioning of the data, to get the benefit of much easier integration into existing platforms.

I do acknowledge that ConfigMaps are a thing, but K8s imposes severe size and count limitations on them (because they are implemented by storing them in the etcd consensus-maintained database)

Yes but that's just because OCI artifacts didn't exist when ConfigMaps were invented! I think it would make total sense to store them as artifacts in a registry - do you agree?

That’s very unclear to me; it would force giving every user a writeable registry space, and something would need to manage credentials to that registry space, both for users and for nodes deploying the configuration; vs. using native Kubernetes RBAC. And there is some overhead to talking to the registry, even assuming the data is stored directly inside the manifest to avoid extra per-item roundtrips. I suppose most, or all, of that could be hidden by tooling.

@mtrmac (Collaborator) commented Sep 11, 2024

OK just to play out your suggestion a bit more...if "we" wanted to encourage that approach I think there'd need to be some recommended tooling at least integrated with a Containerfile-style build flow as a default entrypoint.

Yes, definitely. Well, users might want to use some higher-level tools instead, but the underlying capability needs to be there.

I think a strawman for that would be a lot like COPY --link

Yes, vaguely.

except also with support for mapping artifact (non-tar) layers to tar

This is still assuming that artifacts play a role at all. Maybe?

...which still has all the open questions about how exactly artifact layers appear as files in a generic context but maybe a simple strawman would be COPY --link --from=aimodel / /aimodel and one would end up with e.g. /aimodel/layers/[0..n] that are integer indicies into the layers. Hmm and we should probably also have /aimodel/manifest.json so an artifact-aware tool can parse that for metadata.

That sounds technically correct but inconvenient to use in applications, forcing them to include an OCI parser. And in a naive implementation it would result in two on-registry representations, one as an artifact with raw data, and one as a tar containing the /layers/0 files.


My strawman would be

LAYER quay.io/models/summarize:v1 /model

which:

  • parses the summarize image (or a very image-like artifact with ordinary tar layers)
  • Finds[1] a layer which contains /model; there must be only one, and it must not contain anything else
  • Links exactly that layer into the current image as a new layer; and arranges things so that both the local storage, and the compressed on-registry storage, is certainly reused (not guaranteed for on-registry storage today)
  • Marks the created layer with an annotation that this layer is for the /model directory [2]
  • (Statically guarantees that the resulting image has no other layers affecting the /model directory?)
  • (Annotates the created layer with the originating registry and version, perhaps, why not. That models the version explicitly, OTOH it might leak internal host names.)

Where the [1] step finding the right layer would probably rely on the annotation created in [2]. (We’d also need a “bootstrapping” version that creates the annotation on ordinary COPY.)

This allows

  • guaranteed on-registry layer reuse
  • any arbitrary filesystem layout the application likes, as long as it is confined to a directory subtree
  • directly reusing data from other images
  • distributing models directly with applications, or in non-executable artifacts
  • possibly migrating to other filesystem formats, like EROFS / … ; whenever the container runtime adds support for such a format, no other changes to the feature need to happen.

@cgwalters (Contributor) commented:

That sounds technically correct but inconvenient to use in applications, forcing them to include an OCI parser. And in a naive implementation it would result in two on-registry representations, one as an artifact with raw data, and one as a tar containing the /layers/0 files.

The app wouldn't need to be OCI aware exactly; assuming the artifact has just one layer, it could just call open(/aimodel/layers/0, O_RDONLY | O_CLOEXEC) to get the data. Reading the manifest (or config, if relevant for the artifact) is optional, there only if it wants to e.g. access version metadata.
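
For illustration, consuming that hypothetical /aimodel layout really is just a couple of file operations; a sketch assuming the strawman paths above, with no OCI library required:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "os"
)

func main() {
    // Open the single artifact layer directly. (Go's os.Open opens read-only
    // with O_CLOEXEC set.)
    blob, err := os.Open("/aimodel/layers/0")
    if err != nil {
        panic(err)
    }
    defer blob.Close()
    n, err := io.Copy(io.Discard, blob) // stand-in for feeding the model runtime
    if err != nil {
        panic(err)
    }
    fmt.Println("model bytes:", n)

    // Optionally read the serialized manifest for metadata such as the
    // org.opencontainers.image.version annotation.
    raw, err := os.ReadFile("/aimodel/manifest.json")
    if err != nil {
        panic(err)
    }
    var manifest struct {
        Annotations map[string]string `json:"annotations"`
        Layers      []json.RawMessage `json:"layers"`
    }
    if err := json.Unmarshal(raw, &manifest); err != nil {
        panic(err)
    }
    fmt.Println("layers:", len(manifest.Layers))
    fmt.Println("version:", manifest.Annotations["org.opencontainers.image.version"])
}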

My strawman would be

I don't understand how your proposal avoids the double representation (once as tar, once not) either (mine doesn't), but fundamentally OCI images need to be tarballs today.

I guess, bigger picture, it probably does make sense to just try to standardize the "ship AI models as uncompressed tar" approach, plus we probably need a standard for "this is architecture-independent data". It's just a bit ugly.

@tarilabs (Member) commented:

Hi 👋
I found this thread while working on an OCI Artifact use case.

We're looking into the usage of OCI Artifacts as a first-class citizen for a complementary storage solution for Model Registry.
We're using OCI Artifacts by following the OCI spec.
We have the necessary file(s) in a layer, without necessarily tar-ring them.

I would like guidance, please, on whether in the short term it is best for me to:

  • consider a "transitioning flag" in the tooling, where for the time being it would be more pragmatic for me to push OCI "Artifacts" where each file/layer is actually a tarball and uses a bespoke config,
  • or whether I can fully rely on the elements from the spec above.

Some example use cases are summarized on this website, along with some demos.
Thank you

@mtrmac (Collaborator) commented Sep 12, 2024

I don't understand how your proposal avoids the double representation (once as tar, once not) either

The way I’m thinking about it, the registry upload would only be as tar. The data might exist as non-tar in other places, sure.

@mtrmac (Collaborator) commented Sep 12, 2024

I would like guidance, please, on whether in the short term it is best for me to:

  • … each file/layer is actually a tarball and uses a bespoke config,
  • or whether I can fully rely on the elements from the spec above.

I think a useful framing is that “publishing OCI artifacts” is pretty close to like “publishing directories of files”; just the fact that it is an OCI artifact makes no promise of interoperability. Interoperability comes from consumers and producers agreeing on these other details.

Very short-term, the only thing I’m aware of that is in(?) some kind of production is https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/, which requires an ordinary image; I think an artifact would not work at all.

Beyond that… I think this is one of the many places where the design is being hashed out.
