Advanced filtering for unpack #391
Comments
Do we want the filter to work per-layer, or per-file? Using the yet to be merged docs layers as an example, if we have a kitfile
docs:
  - name: model-documentation
    path: docs/
can my glob ( My concern in this case is that we're writing a fairly complicated filter spec to handle files that the user is not necessarily familiar with (what are the filenames of files that are of interest to me? Is it named README.md or readme.md?). If I hand you a modelkit, will you be able to meaningfully use filters like this to get what you want? My initial conception of this sort of feature would be that it works more on layers: you have to include more context in the kitfile, but if you have something like
docs:
  - name: main-documentation
    path: docs/
  - name: readme
    path: README.md
  - name: changelog
    path: CHANGELOG.md
you could use filters as follows:
This is simpler but has some benefits:
@amisevsk I like your simplified filter idea based on I was thinking about this in relation to I would want the most efficient way possible to unpack only the
Would there need to be a way to specify if a Kitfile entry gets its own entry in ModelKit's manifest under
@rmtuckerphx I have considered custom media types before but communicating them to a person who had just received a modelkit felt like more trouble than it's worth. @amisevsk I like the idea of simpler filtering without the globs. If I understand correctly, it means the filter will match to the
@gorkem Custom media types might not be something that beginners choose to use but I know exactly how I'd use it and the fact that I can unpack only a specific, named mediaType is so powerful. Here is my idea for custom media types that fits with the simplified named filter idea: Add a new
I find the idea of custom media types interesting, we should split that into a separate issue for consideration. It might make sense as an advanced user feature, but I worry that it would introduce some complexity/strangeness and might be a non-goal -- ultimately, Kit will have to handle custom media types identically to some other media type (if it can even parse them correctly), so they're not meaningfully unique media types, in a functional sense. Using media types in this way feels like a potential misuse of what a media type is intended to convey (it's for the machine to know how to handle the data, not for the user). In other words, can this functionality be captured in a sort of "group" overlay on top of existing layers?
I'm thinking more restricted than that; you can't filter for actual files inside the tarballs, you can only filter for entire layers (and it's on the creator of the Kitfile to make those layers meaningful). For the hyperparameters example above
datasets:
  - description: training data
    name: training-data
    path: ./data/forum-to-2023-train.csv
  - description: validation data
    name: validation-data
    path: ./data/test.csv
  - description: hyperparameters
    name: params-config
    path: ./data/params.json
  // alternatively
  - description: hyperparameters
    name: params-config
    path: ./data/params.json
    mediaType: application/vnd.myorg.oci.params.v1.tar
you would use This approach saves us work (we wouldn't even have to download the other dataset layers). With something like
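To make the layer-level idea concrete, here is a minimal sketch of selecting whole Kitfile entries by name, so that only the matching layers would need to be fetched. It is not Kit's implementation: the `KITFILE` dictionary and the `select_layers` helper are hypothetical names introduced for illustration, and the data mirrors the datasets example above.

```python
# Hypothetical in-memory form of the "datasets" section quoted above.
# Kit's real data structures may look different.
KITFILE = {
    "datasets": [
        {"name": "training-data", "path": "./data/forum-to-2023-train.csv"},
        {"name": "validation-data", "path": "./data/test.csv"},
        {"name": "params-config", "path": "./data/params.json"},
    ]
}


def select_layers(kitfile, section, names):
    """Return the entries of `section` whose `name` is one of `names`.

    Each returned entry stands for one whole layer; nothing inside a
    layer's tarball is inspected.
    """
    wanted = set(names)
    return [entry for entry in kitfile.get(section, []) if entry.get("name") in wanted]


selected = select_layers(KITFILE, "datasets", ["params-config"])
print([entry["path"] for entry in selected])
```

Because the match is on the entry's `name` alone, the small params layer can be picked out without touching the larger training-data layer, which is the saving described above.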
@amisevsk My understanding was that all datasets were included in the same layer. If each named dataset is a separate layer and I can get to my 2k params-config quickly without having to download (worry about) my 550MB training-data then I'm fine.
I proved to myself that 2 data files will get 2 layers: Kitfile
$ kit inspect myreg/rain-kit-test:1.0.1
Is there a way to pull only the specific dataset (layer) for config from the registry? Let's say that I have a model, train data, test data and config all in the same ModelKit stored in a registry. Should there be 2 separate ModelKits in the registry for each of those purposes with only the files needed for the context, or can we use a single ModelKit?
This enhancement request is exactly for the issue you are pointing to. The current filter implementation only allows us to download all layers of the same media type (namely docs, code, model, dataset). This enhancement is to introduce advanced filtering to select one or more layers with the same media type.
Sorry, I was away for a few weeks. You're right that each entry in a Kitfile (dataset, code, docs, or model) is saved as a separate layer in the modelkit (so each element can be downloaded independently). However, for our initial implementation for
Describe the problem you're trying to solve
It should be possible to unpack only named artifacts from an artifact group. For instance, it should be possible to unpack only README.md from docs.
Describe the solution you'd like
Add a new flag to the unpack command, like --filter. The value of the filter should be able to indicate the artifact type and a name/path to match. Any names/paths that partially or fully match the filter should be extracted. An example would be
kit unpack --filter=code:*config.*
which would extract all the files that have "config." in their path names, or --filter=docs:* to extract all docs layers.
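As a rough illustration of the proposed filter value, the sketch below splits a `type:pattern` string and tests a path against it. This is only one possible reading of the proposal, not Kit's behavior: `parse_filter` and `matches` are hypothetical helpers, and using Python's `fnmatch` semantics (case-sensitive, with `*` also matching `/`) is an assumption.

```python
from fnmatch import fnmatchcase


def parse_filter(spec):
    """Split a filter such as 'code:*config.*' into (artifact_type, pattern)."""
    artifact_type, sep, pattern = spec.partition(":")
    if not sep or not artifact_type or not pattern:
        raise ValueError(f"expected '<type>:<pattern>', got {spec!r}")
    return artifact_type, pattern


def matches(spec, artifact_type, path):
    """Return True if `path` of the given artifact type passes the filter."""
    wanted_type, pattern = parse_filter(spec)
    # Assumption: case-sensitive glob matching where '*' may span '/'.
    return artifact_type == wanted_type and fnmatchcase(path, pattern)


print(matches("code:*config.*", "code", "src/config.yaml"))
print(matches("docs:README.md", "docs", "readme.md"))
```

With case-sensitive matching, a filter for `README.md` does not pick up `readme.md`, which is the kind of surprise raised in the comments about per-file filters.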