Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Advanced filtering for unpack #391

Closed
gorkem opened this issue Jul 4, 2024 · 9 comments · Fixed by #446
Closed

Advanced filtering for unpack #391

gorkem opened this issue Jul 4, 2024 · 9 comments · Fixed by #446
Assignees
Labels
enhancement New feature or request

Comments

@gorkem
Copy link
Contributor

gorkem commented Jul 4, 2024

Describe the problem you're trying to solve
It should be possible to unpack only named artifacts from an artifact group. For instance, it should be possible to unpack only README.md from docs.

Describe the solution you'd like
Add a new flag to unpack command like --filter the value of the filter should be able to indicate the artifact type and a name/patch to match. Any names/paths that partially or fully to the filter should be extracted.

And example would be kit unpack --filter=code:*config.* which would extract all the files that has "config." in the path names

  • The flag should support glob patterns for flexibility.
  • Ensure that the implementation is backward compatible. It should be possible to do --filter:docs:* to extract all docs layers.
@gorkem gorkem added the enhancement New feature or request label Jul 4, 2024
@amisevsk
Copy link
Contributor

amisevsk commented Jul 4, 2024

Do we want the filter to work per-layer, or per-file? Using the yet to be merged docs layers as an example, if we have a kitfile

docs
  - name: model-documentation
    path: docs/

can my glob (--filter=docs:README*) extract all readme files? How do we handle deeply-nested files?

My concern in this case is that we're writing a fairly complicated filter spec to handle files that the user is not necessarily familiar with (what are the filenames of files that are of interest to me? Is it named README.md or readme.md?). If I hand you a modelkit, will you be able to meaningfully use filters like this to get what you want?

My initial conception of this sort of feature would be that it works more on layers: you have to include more context in the kitfile, but if you have something like

docs:
  - name: main-documentation
    path: docs/
  - name: readme
    path: README.md
  - name: changelog
    path: CHANGELOG.md

you could use filters as follows:

  • --filter=docs to extract all docs layers
  • --filter=docs:readme to extract only the readme
  • Something like --filter=docs:readme,changelog to extract the readme and changelog

This is simpler but has some benefits:

  1. Kit doesn't need to unpack all layers in order to do the unpack (with filepath globs, we'd have to look in the big main-documentation layer even if you just want the readme)
  2. It's arguably more useful for sharing modelkits, as you can at-a-glance see what's relevant for extracting
  3. It's simpler to understand and doesn't require understanding how unix-style file globbing works.
  4. It avoids some strange edge cases that are otherwise tricky (e.g. unpacking some/deeply/nested/README.md that references other files and is broken if you only have it)

@rmtuckerphx
Copy link

rmtuckerphx commented Jul 18, 2024

@amisevsk I like your simplified filter idea based on name.

I was thinking about this in relation to datasets where I might have training data files but also have a hyperparameter JSON file that is used during inference. The training files could be really large but the JSON file would like be really small.

I would want the most efficient way possible to unpack only the params.

datasets:
  - description: training data
    name: training-data
    path: ./data/forum-to-2023-train.csv

  - description: validation data
    name: validation-data
    path: ./data/test.csv

  - description: hyperparameters 
    name: params-config
    path: ./data/params.json

   // alternatively
  - description: hyperparameters 
    name: params-config
    path: ./data/params.json
    mediaType: application/vnd.myorg.oci.params.v1.tar

Would there need to be a way to specify if a Kitfile entry gets it's own entry in ModelKit's manifest under layers?
Maybe we could assign a custom mediaType and only unpack that type.
--filter=mediatype:application/vnd.myorg.oci.params.v1.tar

@gorkem
Copy link
Contributor Author

gorkem commented Jul 18, 2024

@rmtuckerphx I have considered custom media types before but communicating them to a person who had just received a modelkit felt like more trouble than it's worth.

@amisevsk I like the idea of simpler filtering without the globs. If I understand correctly, it means the filter will match to the path exactly? For instance --filter=docs:README.md,some/deeply/nested/README.md would be result extracting 2 files.

@rmtuckerphx
Copy link

@gorkem Custom media types might not be something that beginners choose to use but I know exactly how I'd use it and the fact that I can unpack only a specific, named mediaType is so powerful.

Here is my idea for custom media types that fits with the simplified named filter idea: --filter=media:<name>

Add a new media section where each section can include a list of files:

media
  - description: files needed by service calling infer on model
    name: service
    mediaType: application/vnd.myorg.oci.service.v1.tar # custom media type
    parts:
      - description: hyperparameters
        path: ./service/params.json
      - description: another file
        path: ./path/to/files

@amisevsk
Copy link
Contributor

amisevsk commented Jul 19, 2024

I find the idea of custom media types interesting, we should split that into a separate issue for consideration. It might make sense as an advanced user feature, but I worry that it would introduce some complexity/strangeness and might be a non-goal -- ultimately, Kit will have to handle custom media types identically to some other media type (if it can even parse them correctly), so they're not meaningfully unique media types, in a functional sense. Using media types in this way feels like a potential misuse of what a media type is intended to convey (it's for the machine to know how to handle the data, not for the user). In other words, can this functionality be captured in a sort of "group" overlay on top of existing layers?

If I understand correctly, it means the filter will match to the path exactly? For instance --filter=docs:README.md,some/deeply/nested/README.md would be result extracting 2 files.

I'm thinking more restricted than that; you can't filter for actual files inside the tarballs, you can only filter for entire layers (and it's on the creator of the Kitfile to make those layers meaningful). For the hyperparameters example above

datasets:
  - description: training data
    name: training-data
    path: ./data/forum-to-2023-train.csv

  - description: validation data
    name: validation-data
    path: ./data/test.csv

  - description: hyperparameters 
    name: params-config
    path: ./data/params.json

   // alternatively
  - description: hyperparameters 
    name: params-config
    path: ./data/params.json
    mediaType: application/vnd.myorg.oci.params.v1.tar

you would use kit unpack --filter=datasets:params-config to get just the hyperparameters layer, with everything it includes.

This approach saves us work (we wouldn't even have to download the other dataset layers). With something like --filter=docs:README.md,some/deeply/nested/README.md, there's no great way of knowing which docs layer might contain some/deeply/nested/README.md, so we'd effectively have to pull and unpack all docs layers and discard 90% of the work.

@rmtuckerphx
Copy link

@amisevsk My understanding was that all datasets were included in the same layer. If each named dataset is a separate layer and I can get to my 2k params-config quickly without having to download (worry about) my 550MB training-data then I'm fine.

@rmtuckerphx
Copy link

I proved to myself that 2 data files will get 2 layers:

Kitfile

manifestVersion: v1.0.0

package:
  name: rain-kit-test
  version: 1.0.0
  description: Kit Test

docs:
  - path: ./README.md
    description: Readme file

datasets:
  - name: train-data
    path: data/train.json
  - name: config
    path: data/config.json

$ kit inspect myreg/rain-kit-test:1.0.1

{
  "digest": "sha256:281f07e19706860a54ae452e4e30488bf816eec8a98cbd1f9b9d05801be78884",
  "cliVersion": "0.3.0-1653de5",
  "kitfile": {
    "manifestVersion": "v1.0.0",
    "package": {
      "name": "rain-kit-test",
      "version": "1.0.0",
      "description": "Kit Test"
    },
    "datasets": [
      {
        "name": "train-data",
        "path": "data/train.json"
      },
      {
        "name": "config",
        "path": "data/config.json"
      }
    ],
    "docs": [
      {
        "path": "./README.md",
        "description": "Readme file"
      }
    ]
  },
  "manifest": {
    "schemaVersion": 2,
    "config": {
      "mediaType": "application/vnd.kitops.modelkit.config.v1+json",
      "digest": "sha256:f0745adda55d9f5abea9f689cb35e02dc9c7e1159e550169fdda78d6ddcb804b",
      "size": 270
    },
    "layers": [
      {
        "mediaType": "application/vnd.kitops.modelkit.dataset.v1.tar",
        "digest": "sha256:e0d71d869f0d196d8d97716424a5649fb7136ab43322e2c0f1687ae732a26df9",
        "size": 2048
      },
      {
        "mediaType": "application/vnd.kitops.modelkit.dataset.v1.tar",
        "digest": "sha256:1d770b1f7a362e8cb2245ef1e18a06ef81d7378efb0a3d8f70dfe63960e154b0",
        "size": 2048
      },
      {
        "mediaType": "application/vnd.kitops.modelkit.docs.v1.tar",
        "digest": "sha256:8ddea24d6194ead2e0532475b99eb343670e83496e57708842af9ab8507305c4",
        "size": 2048
      }
    ],
    "annotations": {
      "ml.kitops.modelkit.cli-version": "0.3.0-1653de5"
    }
  }
}

Is there a way to only pull from the registry the specific dataset (layer) for config?
That would prevent network traffic of extra layers.

Let's say that I have a model, train data, test data and config all in the same ModelKit stored in a registry.
For the data scientists on my team, they care about model, train data and test data.
For deployment, there is a service that dynamically loads the model from the registry when it starts and also needs the config. The service doesn't care about the large train data file used during model training. It only cares about the model.

Should there be 2 separage ModelKits in the registry for each of those purposes with only the files needed for the context or can we use a single ModelKit?

@gorkem
Copy link
Contributor Author

gorkem commented Jul 24, 2024

This enhancement request is exactly for the issue you are pointing to. The current filter implementation only allows us to download all layers of the same media type (namely docs, code, model, dataset). This enhancement is to introduce an advanced filtering to select one or more layers with the same media type.

@amisevsk
Copy link
Contributor

Sorry, I was away for a few weeks. You're right that each entry in a Kitfile (dataset, code, docs, or model) is saved as a separate layer in the modelkit (so each element can be downloaded independently). However, for our initial implementation for kit unpack filters, we restricted the filters based only on type (e.g. all datasets, all code, etc.). While this works great for "I want just the model and none of the code that comes with it", it would be great to also allow "I just want this dataset instead of all of them" as in your example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants