(stable) support for multiple artifact versions in the same stage #250

Closed

tibor-mach opened this issue Sep 5, 2022 · 3 comments

@tibor-mach

I have been thinking about using GTO in a data registry (for the data used in ML model training). For that purpose, I find support for multiple versions in the same stage quite handy.

Reasoning:

Machine learning models are implicitly coupled with the data engineering pipelines that produce their features (and with the pipelines that produce training labels, in situations where this is not done manually). It is crucial to keep the features used in training in sync with those used in production, so it is a good idea to version the training datasets together with the versions of the relevant data engineering pipeline. So far so good: GTO can do this, and one version per stage is sufficient.
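
To illustrate (artifact and version names here are made up), the flow GTO already covers looks like this:

```console
# register the dataset version that release v1.2.0 of the pipeline produced
$ gto register customer-features --version v1.2.0

# promote exactly that version together with the pipeline release
$ gto assign customer-features prod --version v1.2.0
```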

However, this coupling also means that ML experiments really start already at the data engineering level. For example, I might have an idea that a specific new aggregation of a customer's purchasing history could be a beneficial new feature for our model(s). So I create a new feature branch of our data pipeline where I implement this feature. The pipeline also generates a new version of the training dataset, which is assigned to (e.g.) the "dev" stage in the data registry. A colleague has a similar idea and wants to test out her new feature at the same time. Now we end up with 2 dataset versions in "dev" (or 3, if we don't use a kanban workflow and the dataset generated by the current production data pipeline is also assigned to "dev"). This is fine - we don't want to pollute our production data pipeline with new transformations (the ones that create those new features) unless we can show they are actually useful. But to show that, we need to be able to run ML experiments with those "candidate" datasets.
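
For example (again with invented names), my colleague and I would both end up doing something like:

```console
# my feature branch: new purchasing-history aggregation
$ gto register training-dataset --version v1.1.0
$ gto assign training-dataset dev --version v1.1.0

# my colleague's feature branch: her candidate feature
$ gto register training-dataset --version v1.2.0
$ gto assign training-dataset dev --version v1.2.0
```

and we need the registry to keep (and show) both versions in "dev", not just the latest one.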

@aguschin
Contributor

aguschin commented Sep 7, 2022

Linking some issues where we've discussed this in the past:

@aguschin
Contributor

aguschin commented Sep 7, 2022

cc @omesser @dmpetrov - this is basically a request for "Free-form labels mechanics" complementary to the "Envs mechanics" we have in place. In fact, I made this possible in the CLI with some optional args; see this README section for an example.

@tibor-mach, is this enough for now? I mean, is it ok to have to provide this option (--versions-per-stage) each time in gto show? As I see it, it may not be enough if you want to use both the current approach and the "Free-form labels mechanics" approach at the same time; then you need some way to distinguish between them.
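
Roughly like this (just a sketch - the exact flags and output format are in the README section linked above; treat the -1 "show all" value as an assumption):

```console
# by default gto show displays a single latest version per stage;
# the optional arg asks it to list all of them instead
$ gto show training-dataset --versions-per-stage -1
```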

@tibor-mach
Author

@aguschin hmm, I can imagine using both approaches, as you mention. Although to me, those would then probably be two conceptually different registries... mixing both approaches in a single registry with a single gto instance seems kind of messy, sort of like mixing gitflow and trunk-based development in a single repository. It might make sense for different teams to use different approaches, or to handle different types of artifacts differently, but then I'd like to see that separated into different repositories.

So I think having an option to enable this workflow is enough (at least for me). The main reason I brought this up was to add an argument to the discussion for why this is a valid option that should not be outright deprecated. The kanban-style GTO also makes sense as an option, I'd say.
