Taking a step back and looking at the big picture #3549

@elgehelge

I would like to discuss my view on how a good ML engineering tool should work. I am not saying my view is the best one; I am inviting you to discuss it with me so that we can all get one step closer to understanding what the perfect tool would look like.

The two areas where my views conflict most with DVC are parameterized pipelines and everything related to treating caching and pipeline definition as separate concerns.

Okay, here come my lengthy thoughts, roughly grouped into a handful of concepts:

Pipeline:
What ML engineering is actually about. You might think ML is about building a model, but it is actually about building a pipeline that outputs models. This might not be true for a quick ad-hoc data analysis, but in academia and in production software, reproducibility is a requirement. Reproducibility boils down to being able to create a new model of the same kind. When we can do this, we can also begin to iteratively build better models or update models with newer data. For this reason it is important that pipelines are "deterministic", meaning that you should be able to reproduce the same-ish model over and over again. This also means that pipelines should have no side effects and that all steps should depend on each other through a Directed Acyclic Graph (DAG). Everything we are building is part of the DAG. Also, since many models are stochastic and we do not want to risk overfitting to a random seed, we need a loose definition of "deterministic" which allows models to vary slightly. Validating that it is the "same" model can be done by verifying that specific metrics are within an acceptable tolerance. A pipeline consists of code and raw data.
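To make "same-ish" concrete, here is a minimal sketch of such a tolerance check. It assumes both runs write their metrics to JSON files; the metric names, tolerances and file paths are made up for illustration.

```python
import json

# Hypothetical tolerance check: a reproduced model counts as "the same"
# if every tracked metric is within an absolute tolerance of the reference run.
TOLERANCES = {"accuracy": 0.01, "auc": 0.005}  # assumed metric names

def is_same_model(reference_path: str, reproduced_path: str) -> bool:
    with open(reference_path) as f:
        reference = json.load(f)
    with open(reproduced_path) as f:
        reproduced = json.load(f)
    return all(
        abs(reference[name] - reproduced[name]) <= tol
        for name, tol in TOLERANCES.items()
    )

if __name__ == "__main__":
    print(is_same_model("metrics_reference.json", "metrics_reproduced.json"))
```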

Experiments:
An experiment is a specific version of a pipeline (code and raw data). The goal of an experiment is to learn something. That is, code up an alternative reality, produce an output, study it, and decide whether this alternative reality is the preferred one or not. Since you might want to abandon an unsuccessful experiment, Git branches work well for representing experiments. This enables side-by-side comparison of code and is already the default in software engineering, but since the world is much more predictable in software engineering, experiments succeed more often and we might not realise that they can be understood as experiments.

Raw data:
Raw data is the raw material which, combined with code, produces our model. So just like code, raw data needs to be version controlled if we want reproducibility. If the code and the data get out of sync, by updating the data (schema or distribution) or rolling back the code to a previous point in time, we might end up in a broken state. Broken could mean errors in the code, or it could mean a (maybe unnoticed) decrease in model performance. Raw data is often big, so ordinary version control systems like Git do not handle it well. A simple solution to data versioning (which I suppose works in many companies today) is an append-only data store. This could be as simple as creating a new subfolder called "v2" and editing the code to read the data from the new path, or it could mean querying an ever-growing append-only database with a timestamp to get the same dataset each time, and updating the timestamp in a controlled manner whenever a newer model needs to be trained. A better solution is to use tools that exist today, like DVC.
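A rough sketch of the timestamp approach, using a hypothetical in-memory list of rows with an `ingested_at` field; a real setup would issue the equivalent filter as a query against the append-only database.

```python
from datetime import datetime, timezone

# Hypothetical append-only store: rows are only ever added, each with an
# ingestion timestamp. Pinning a cutoff in version control gives a
# reproducible "dataset version" without copying any data; the cutoff is
# bumped deliberately whenever a newer model should be trained.
DATA_CUTOFF = datetime(2020, 3, 1, tzinfo=timezone.utc)

def training_rows(rows):
    """Keep only the rows that existed at the pinned cutoff."""
    return [row for row in rows if row["ingested_at"] <= DATA_CUTOFF]

rows = [
    {"ingested_at": datetime(2020, 2, 1, tzinfo=timezone.utc), "value": 1.0},
    {"ingested_at": datetime(2020, 4, 1, tzinfo=timezone.utc), "value": 2.0},  # not visible yet
]
print(training_rows(rows))  # only the February row
```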

Artifacts:
Artifacts are everything produced by a run of the pipeline, including modified data, models, metrics and logs. Artifacts are deterministic(ish) outputs that can be reproduced at any time. For this reason they should not be part of any version control system, since they are inherently versioned by their input dependencies. We want to cache the output of each pipeline step to save the time of reproducing artifacts, and we might want a shared cache that can be accessed by every compute instance running the pipelines: each team member, the CI build server and your cloud compute instances for model training. Caching becomes increasingly important as the time to compute pipeline steps increases.
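One way to picture "inherently versioned by the input dependencies" is a cache keyed by a digest of everything that went into a step. This is only a sketch of the idea, not how DVC implements its cache; all names are illustrative.

```python
import hashlib
import json

# An artifact is identified by a hash of everything that produced it:
# the code, the input data and the parameters. A shared cache (e.g. object
# storage reachable by teammates, CI and training machines) can be keyed
# by this digest.
def cache_key(code_digest: str, input_digests: list, params: dict) -> str:
    payload = json.dumps(
        {"code": code_digest, "inputs": sorted(input_digests), "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def run_step(step_fn, key: str, cache: dict):
    """Reuse the cached artifact if this exact step has already been run."""
    if key not in cache:
        cache[key] = step_fn()
    return cache[key]

cache = {}
key = cache_key("abc123", ["d4e5f6"], {"learning_rate": 0.01})
result = run_step(lambda: "trained-model-bytes", key, cache)  # computed once
result = run_step(lambda: "trained-model-bytes", key, cache)  # served from cache
```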

Run (of a pipeline):
A run of a pipeline is like a function call. In this metaphor the function represents everything that is static and version controlled: the pipeline (code and raw data). The input to the function represents everything that is variable: the pipeline parameters (which could include model parameters), as well as the environment. Keeping track of which pipeline parameters (and sometimes environment variables) were used to produce a specific artifact is absolutely crucial for reproducibility, as well as for making the caching of artifacts work properly. An important point here is that multiple runs, and thus multiple versions of the artifacts, can be created from the same version of a pipeline/experiment, e.g. if you want to assess how much your produced models vary in terms of prediction quality. If you want to assess this variance, you want to run the exact same pipeline several times and record all metrics for comparison against each other. A simple way to solve the reproducibility problem is to dump the arguments/parameters to a file so that the information gets included in the output artifacts. Another way is to track the pipeline parameters like any other metric, e.g. by sending them off to a metrics server. Both are hacks, and to my knowledge no great tool that can handle this exists. I would love DVC to go in this direction.
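A minimal sketch of the "dump the parameters to a file" hack, assuming a local `artifacts/` directory and made-up parameter names.

```python
import hashlib
import json
from pathlib import Path

# Write the exact parameters next to the produced artifacts, so every model
# ships with the inputs that created it. Paths and parameter names are
# illustrative, not a real project layout.
def record_run(params: dict, out_dir: str = "artifacts") -> Path:
    run_id = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()[:12]
    run_dir = Path(out_dir) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "params.json").write_text(json.dumps(params, indent=2, sort_keys=True))
    return run_dir  # the training step then writes model.pkl, metrics.json, logs/ here

run_dir = record_run({"learning_rate": 0.01, "n_estimators": 200, "seed": 42})
print(run_dir)
```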

Model and metric overview:
We need some way to keep track of models and their metrics. Metrics are a special kind of artifact. They are also the product of a pipeline, but they are special in the sense that they represent knowledge about a previous step in the pipeline. Being able to compare them is what matters. Often the comparison is aimed at the commit where the current branch was branched out from, but sometimes you might want to compare with other branches as well. DVC handles simple metric comparison but does not come with a nice UI, while MLflow does not support the Git tree well and quickly becomes a mess.
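A sketch of comparing against the branch point, assuming metrics live in a Git-tracked `metrics.json` and that the base branch is `master`; this is plain Git plumbing, not a DVC or MLflow feature.

```python
import json
import subprocess

# "Compare against the commit the branch was created from":
# git merge-base finds that commit, git show reads the metrics file as it
# existed there, and the diff is computed per metric.
def metrics_at_branch_point(metrics_path: str = "metrics.json", base: str = "master") -> dict:
    fork_point = subprocess.run(
        ["git", "merge-base", "HEAD", base],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    old_contents = subprocess.run(
        ["git", "show", f"{fork_point}:{metrics_path}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(old_contents)

def metric_diff(current: dict, baseline: dict) -> dict:
    return {name: current[name] - baseline[name] for name in current if name in baseline}

current = json.load(open("metrics.json"))
print(metric_diff(current, metrics_at_branch_point()))
```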

Local development while maintaining reproducibility:
Sometimes you want to iterate locally without requiring review from your peers and without polluting your Git history. Ideally this should be possible while still making use of caching, both globally and locally. For instance, you might still use the old preprocessing in your new uncommitted local experimentation, so a cache of that step would still be useful. You also might want to compare your local, yet globally unrecorded, metrics with the old globally recorded ones. As in ordinary software engineering, local environment variations and temporary uncommitted code changes can be a threat to reproducibility. One simple hack I have encountered to avoid producing models and metrics from uncommitted code changes is a script that simply checks that Git is in a clean state. Another hacky solution could be to add a "clean commit" label as a metric. But to my knowledge no tool lets you keep your local changes to yourself while still enabling you to compare with older metrics. A solution that achieves a higher level of reproducibility, and also solves the problem of environment variation, is to run everything in Docker. To my knowledge no great tool exists that can handle caching of parameterised pipeline steps. In my view DVC actually makes reproducibility a little harder by allowing me to have data and code out of sync.
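The clean-state check can be as small as this sketch: it aborts when `git status` reports uncommitted changes and otherwise returns the exact commit being run, so every recorded artifact maps to a commit.

```python
import subprocess
import sys

# Refuse to produce (and record) models or metrics when there are
# uncommitted changes in the working tree.
def assert_clean_worktree() -> str:
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    if status.strip():
        sys.exit("Refusing to run: uncommitted changes present.\n" + status)
    return subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    commit = assert_clean_worktree()
    print(f"Running pipeline at commit {commit}")
```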


I would love to get your opinions on these concepts.

As mentioned above, the two areas where my views conflict most with DVC are parameterized pipelines and the separation of caching from pipeline definition.

These are the dvc issues I know about regarding parameterization:

These are the dvc issues I know about regarding separation of caching and pipeline:
