(stable) support for multiple artifact versions in the same stage #250

Closed

tibor-mach opened this issue Sep 5, 2022 · 3 comments

@tibor-mach

I have been thinking about using GTO in a data registry (for the data used in ML model training). For that purpose, I find support for multiple versions in the same stage quite handy.

Reasoning:

Machine learning models are implicitly coupled with the data engineering pipelines that produce their features (and with the pipelines that produce training labels, in situations where this is not done manually). It is crucial to keep the features used in training in sync with those used in production, so it is a good idea to version the training datasets together with the versions of the relevant data engineering pipeline. So far so good: GTO can do this, and one version per stage is sufficient.
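
To illustrate (artifact and version names here are made up), the flow GTO already covers looks like this:

```console
# register the dataset version that release v1.2.0 of the pipeline produced
$ gto register customer-features --version v1.2.0

# promote exactly that version together with the pipeline release
$ gto assign customer-features prod --version v1.2.0
```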

However, this coupling also means that ML experiments really start already at the data engineering level. For example, I might have an idea that a specific new aggregation of a customer's purchasing history could be a beneficial new feature for our model(s). So I create a new feature branch of our data pipeline where I implement this feature. The pipeline also generates a new version of the training dataset, which is assigned to (e.g.) the "dev" stage in the data registry. A colleague has a similar idea and wants to test out her new feature at the same time. Now we end up with 2 dataset versions in "dev" (or 3, if we don't use a kanban workflow and the dataset generated by the current production data pipeline is also assigned to "dev"). This is fine - we don't want to pollute our production data pipeline with new transformations (the ones that create those new features) unless we can show they are actually useful. But to show that, we need to be able to run ML experiments with those "candidate" datasets.
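
For example (again with invented names), my colleague and I would both end up doing something like:

```console
# my feature branch: new purchasing-history aggregation
$ gto register training-dataset --version v1.1.0
$ gto assign training-dataset dev --version v1.1.0

# my colleague's feature branch: her candidate feature
$ gto register training-dataset --version v1.2.0
$ gto assign training-dataset dev --version v1.2.0
```

and we need the registry to keep (and show) both versions in "dev", not just the latest one.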

@aguschin
Contributor

aguschin commented Sep 7, 2022

Linking some issues where we've discussed this in the past:

@aguschin
Contributor

aguschin commented Sep 7, 2022

cc @omesser @dmpetrov - this is basically a request for "Free-form labels mechanics" complementary to the "Envs mechanics" we have in place. In fact, I made this possible in the CLI with some optional args; see this README section for an example.

@tibor-mach, is this enough for now? I mean, is it ok to have to provide this option (--versions-per-stage) each time in gto show? As I see it, it may not be enough if you want to use both the current approach and the "Free-form labels mechanics" approach at the same time; then you need some way to distinguish between them.
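
Roughly like this (just a sketch - the exact flags and output format are in the README section linked above; treat the -1 "show all" value as an assumption):

```console
# by default gto show displays a single latest version per stage;
# the optional arg asks it to list all of them instead
$ gto show training-dataset --versions-per-stage -1
```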

@tibor-mach
Author

@aguschin hmm, I can imagine using both approaches, as you mention. Although to me, those would then probably be two conceptually different registries... mixing both approaches in a single registry with a single gto instance seems kind of messy, sort of like mixing gitflow and trunk-based development in a single repository. It might make sense for different teams to use different approaches, or to handle different types of artifacts differently, but then I'd like to see that separated into different repositories.

So I think having an option to enable this workflow is enough (at least for me). The main reason I brought this up was to add an argument to the discussion for why this is a valid option that should not be outright deprecated. The kanban-style GTO also makes sense as an option, I'd say.
