(stable) support for multiple artifact version in the same stage #250
Comments
Linking some issues where we've discussed this in the past:
cc @omesser @dmpetrov - this is basically a request for "Free-form labels mechanics" complementary to the "Envs mechanics" we have in place. I in fact made it possible in the CLI with some optional args, see this README section for an example. @tibor-mach, is that enough for now? I mean, is it ok to provide this option (
@aguschin hmm, I can imagine using both approaches as you mention. Although to me, those would then probably be two conceptually different registries... mixing both approaches in a single registry with a single gto instance seems kind of messy, sort of like mixing gitflow and trunk-based development in a single repository. It might make sense for different teams to use different approaches, or to handle different types of artifacts, but then I'd like to see them separated into different repositories. So I think having an option to enable this workflow is enough (at least for me). The main reason I brought this up was to add an argument to the discussion for why this is a valid option and so should not be outright deprecated. The kanban-style GTO also makes sense as an option, I'd say.
I have been thinking about using GTO in a data registry (for data used in ML model training). For this purpose, I find the support for multiple versions in the same stage quite handy.
Reasoning:
Machine learning models are implicitly coupled with the data engineering pipelines that produce their features (and, where labelling is not done manually, with the pipelines that produce their training labels). It is crucial to keep the features used in training in sync with those used in production, so it is a good idea to version the training datasets together with the versions of the relevant data engineering pipeline. So far so good: GTO can do this, and one version per stage is sufficient.
However, this coupling also means that ML experiments really already start at the data engineering level. For example, I might have an idea that a specific new aggregation of a customer's purchasing history would be a beneficial new feature for our model(s). So I create a new feature branch of our data pipeline where I implement this feature. This pipeline also generates a new version of a training dataset, which is assigned (e.g.) to the "dev" stage in the data registry. A colleague has a similar idea and wants to test out her new feature at the same time. Now we end up with 2 dataset versions in "dev" (or 3 in case we don't use a kanban workflow and the dataset generated by the current production data pipeline is also assigned to "dev"). This is fine - we don't want to pollute our production data pipeline with new transformations (which create those new features) unless we can show they are useful for anything. But to show that, we need to be able to run ML experiments with those "candidate" datasets.
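To make the scenario concrete, here is a minimal sketch of the requested semantics - a stage that can hold several artifact versions at once. This is a hypothetical illustration of the data model, not GTO's actual API; the class and method names are invented for this example.

```python
from collections import defaultdict


class StageRegistry:
    """Hypothetical registry where a stage maps to a *list* of versions,
    rather than a single one (the behaviour requested in this issue)."""

    def __init__(self):
        self._stages = defaultdict(list)

    def assign(self, version, stage):
        # Assigning the same version twice is a no-op.
        if version not in self._stages[stage]:
            self._stages[stage].append(version)

    def versions(self, stage):
        # All versions currently assigned to the stage, in assignment order.
        return list(self._stages[stage])


registry = StageRegistry()
# Two colleagues each promote a candidate dataset to "dev":
registry.assign("v1.1.0-featureA", "dev")
registry.assign("v1.1.0-featureB", "dev")
# The current production dataset stays in "prod" (in a non-kanban
# workflow it could additionally be assigned to "dev"):
registry.assign("v1.0.0", "prod")

print(registry.versions("dev"))  # both candidates coexist in "dev"
```

The point is simply that `versions("dev")` returns a list, so experiments can be run against every candidate dataset rather than only the most recently assigned one.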