Taking a step back and looking at the big picture #3549

Closed
elgehelge opened this issue Mar 29, 2020 · 7 comments
Labels
discussion requires active participation to reach a conclusion

Comments

@elgehelge
Contributor

I would like to discuss my view on how a good ML engineering tool should work. I am not saying that my view is the best one; I am inviting you to discuss it with me so that we can all get one step closer to understanding what the perfect tool would look like.

The two areas where my views conflict most with DVC are parameterized pipelines and everything related to handling caching and pipeline definition as separate things.

Okay, here come my lengthy thoughts, roughly grouped into a handful of concepts:

Pipeline:
What ML engineering is actually about. You might think ML is about building a model, but it is really about building a pipeline that outputs models. This might not be true for a quick ad-hoc data analysis, but in academia and in production software, reproducibility is a requirement. Reproducibility boils down to being able to create a new model of the same kind. When we can do this, we can also begin to iteratively build better models or update models with newer data. For this reason it is important that pipelines are "deterministic", meaning that you should be able to reproduce the same-ish model over and over again. This also means that pipelines should have no side effects and that all steps should depend on each other in a Directed Acyclic Graph (DAG). Everything we are building is part of the DAG. Also, since many models are stochastic and we do not want to risk overfitting to a random seed, we need a loose definition of "deterministic" which allows models to vary slightly. Validating that it is the "same" model can be done by verifying that specific metrics are within an acceptable tolerance. A pipeline consists of code and raw data.
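
One way to make the loose notion of the "same" model concrete is a tolerance check on metrics. The following is a minimal Python sketch, not something DVC provides; the function name `within_tolerance`, the metric names, and the tolerance values are assumptions chosen for illustration.

```python
def within_tolerance(baseline, candidate, tolerances):
    """Return True if every tracked metric of `candidate` lies within
    its absolute tolerance of the corresponding `baseline` metric."""
    for name, tolerance in tolerances.items():
        # A missing metric counts as a failed comparison.
        if name not in baseline or name not in candidate:
            return False
        if abs(candidate[name] - baseline[name]) > tolerance:
            return False
    return True


# Hypothetical metrics from two runs of the same pipeline version.
baseline = {"accuracy": 0.912, "auc": 0.954}
rerun = {"accuracy": 0.909, "auc": 0.957}
tolerances = {"accuracy": 0.01, "auc": 0.01}

print(within_tolerance(baseline, rerun, tolerances))  # prints True
```

Absolute tolerances are used here only for simplicity; a relative tolerance or a statistical test over several runs would be equally valid choices.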

Experiments:
An experiment is a specific version of a pipeline (code and raw data). The goal of an experiment is to learn something: code up an alternative reality, produce an output, study it, and decide whether this alternative reality is the preferred one. Since you might want to abandon an unsuccessful experiment, Git branches work well for representing experiments. This enables side-by-side comparison of code and is already the default in software engineering. But since the world is much more predictable in software engineering, experiments there succeed often, and we might not realise that they can be understood as experiments.

Raw data:
Raw data is the raw material which, combined with code, produces our model. So just like code, raw data needs to be version controlled if we want reproducibility. If the code and the data get out of sync, either by updating the data (schema or distribution) or by rolling back the code to a previous point in time, we might end up in a broken state. Broken could mean errors in the code, or it could mean a (maybe unnoticed) decrease in model performance. Raw data is often big, so ordinary version control systems like Git do not handle it well. A simple solution to data versioning (which I suppose works in many companies today) is an append-only data store. This could be as simple as creating a new subfolder called "v2" and then editing the code to read the data from the new path, or it could be querying an ever-growing append-only database using a timestamp to get the same dataset each time, and updating the timestamp in a controlled manner whenever a newer model needs to be trained. A better solution is to use tools that exist today, like DVC.
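
The timestamp variant of the append-only idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ROWS` stands in for a real append-only table, and `DATA_CUTOFF`, `training_snapshot`, and the field names are hypothetical.

```python
from datetime import datetime

# Hypothetical rows in an append-only store: rows are added,
# never edited or deleted.
ROWS = [
    {"id": 1, "inserted_at": datetime(2020, 1, 5), "value": 3.2},
    {"id": 2, "inserted_at": datetime(2020, 2, 17), "value": 1.8},
    {"id": 3, "inserted_at": datetime(2020, 3, 20), "value": 4.1},
]

# Assumption: the cutoff lives in version-controlled code, so changing
# the dataset version is an explicit, reviewable commit.
DATA_CUTOFF = datetime(2020, 3, 1)


def training_snapshot(rows, cutoff):
    """Return only the rows that already existed at `cutoff`."""
    return [row for row in rows if row["inserted_at"] <= cutoff]


snapshot_before = training_snapshot(ROWS, DATA_CUTOFF)

# New data arrives later; the snapshot for the same cutoff is unchanged.
ROWS.append({"id": 4, "inserted_at": datetime(2020, 3, 28), "value": 0.7})
snapshot_after = training_snapshot(ROWS, DATA_CUTOFF)

print(snapshot_before == snapshot_after)  # prints True
```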

Artifacts:
Artifacts are everything produced by a run of the pipeline, including modified data, models, metrics and logs. Artifacts are deterministic(ish) outputs that can be reproduced at any time. For this reason they should not be part of any version control system, since they are inherently versioned by their input dependencies. We want to cache the output of each pipeline step to save time when reproducing artifacts, and we might want a shared cache that can be accessed by every compute instance running the pipelines: each team member, the CI build server, and the cloud compute instances used for model training. Caching becomes increasingly important as the time needed to compute pipeline steps increases.
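
The idea that artifacts are "inherently versioned by their input dependencies" can be illustrated with a cache keyed on those dependencies. The sketch below is an assumption-laden toy, not how DVC implements its cache: `step_cache_key`, `run_step`, the in-memory dictionary, and the string `"preprocess-v1"` (a stand-in for the step's source code) are all hypothetical.

```python
import hashlib
import json


def step_cache_key(code_text, input_hashes, params):
    """Derive a key from everything a step depends on:
    its code, the hashes of its inputs, and its parameters."""
    payload = json.dumps(
        {"code": code_text, "inputs": list(input_hashes), "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


_cache = {}            # in-memory stand-in for a shared artifact cache
calls = {"count": 0}   # counts how often the step really executes


def run_step(func, code_text, input_hashes, params):
    """Run `func` only if no artifact exists for these exact dependencies."""
    key = step_cache_key(code_text, input_hashes, params)
    if key not in _cache:
        _cache[key] = func(**params)
    return _cache[key]


def preprocess(scale):
    calls["count"] += 1
    return [x * scale for x in (1, 2, 3)]


first = run_step(preprocess, "preprocess-v1", ["abc123"], {"scale": 2})
second = run_step(preprocess, "preprocess-v1", ["abc123"], {"scale": 2})  # cache hit
third = run_step(preprocess, "preprocess-v1", ["abc123"], {"scale": 3})   # new params

print(calls["count"])  # prints 2
```

Because the parameters are part of the key, the same step can be cached once per parameter set, which is the combination of caching and parameterization discussed in this issue.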

Run (of a pipeline):
A run of a pipeline is like a function call. In this metaphor, the function represents everything that is static and version controlled: the pipeline (code and raw data). The input to the function represents everything that is variable: the pipeline parameters (which could include model parameters) as well as the environment. Keeping track of which pipeline parameters (and sometimes environment variables) were used to produce a specific artifact is absolutely crucial for reproducibility, as well as for making the caching of artifacts work properly. An important point here is that multiple runs, and thus multiple versions of the artifacts, can be created from the same version of a pipeline/experiment, e.g. if you want to assess how much your produced models vary in prediction quality. To assess this variance, you want to run the exact same pipeline several times and record all metrics for comparison against each other. A simple way to solve the problem of reproducibility is to dump the arguments/parameters to a file so that the information gets included in the output artifacts. Another way is to track the pipeline parameters like any other metric, for example by sending them off to a metric server. Both are hacks, but to my knowledge no great tool exists that can handle this. I would love DVC to go in this direction.
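
The "dump the parameters to a file" hack could look roughly like the following Python sketch. The file name `run_record.json`, the helper names, and the example parameters are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path


def save_run_record(output_dir, params, metrics):
    """Write the run's parameters and metrics next to its other artifacts."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    record_path = output_dir / "run_record.json"
    record = {"params": params, "metrics": metrics}
    record_path.write_text(
        json.dumps(record, indent=2, sort_keys=True), encoding="utf-8"
    )
    return record_path


def load_run_record(output_dir):
    """Read back the record to see how an artifact was produced."""
    path = Path(output_dir) / "run_record.json"
    return json.loads(path.read_text(encoding="utf-8"))


# A temporary directory stands in for the run's output folder.
with tempfile.TemporaryDirectory() as tmp:
    save_run_record(tmp, {"learning_rate": 0.01, "seed": 42}, {"accuracy": 0.91})
    restored = load_run_record(tmp)

print(restored["params"])
```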

Model and metric overview:
We need some way to keep track of models and their metrics. Metrics are a special kind of artifact. They are also the product of a pipeline, but they are special in the sense that they represent knowledge about a previous step in the pipeline. Being able to compare these is what matters. Often the comparison is against the commit from which the current branch was branched, but sometimes you might want to compare with other branches as well. DVC handles simple metric comparison but does not come with a nice UI; MLFlow, however, does not support the Git tree well and quickly becomes a mess.

Local development while maintaining reproducibility:
Sometimes you want to iterate locally without requiring review from your peers and without wanting to pollute your Git history. Ideally this should be possible while still making use of caching both globally and locally. For instance, you might still use the old preprocessing in your new uncommitted local experimentation, so a cache of that would still be useful. You might also want to compare your local, not yet globally recorded metrics with the old globally recorded ones. As in ordinary software engineering, local environment variations and temporary uncommitted code changes can be a threat to reproducibility. One simple hack I have encountered to avoid producing models and metrics from uncommitted code changes is a script that simply checks that Git is in a clean state. Another hacky solution could be to add a "clean commit" label as a metric. But to my knowledge no tool lets you keep your local changes to yourself while still enabling you to compare with older metrics. A solution that gains a higher level of reproducibility and also solves the problem of environment variation is to run everything in Docker. To my knowledge no great tool exists that can handle caching of parameterized pipeline steps. In my view DVC actually makes reproducibility a little harder by allowing me to have data and code out of sync.
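
The "check that Git is in a clean state" hack could be sketched in Python as follows. It relies on `git status --porcelain`, which prints nothing when the working tree is clean; the function names are hypothetical, and the parsing is split from the Git call so it can be tried without a repository.

```python
import subprocess


def is_clean(porcelain_output):
    """`git status --porcelain` prints one line per changed or
    untracked file, so empty output means a clean working tree."""
    return porcelain_output.strip() == ""


def git_status_porcelain(repo_dir="."):
    """Return the raw porcelain status of the repository at `repo_dir`."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def require_clean_worktree(repo_dir="."):
    """Abort before training if there are uncommitted changes."""
    if not is_clean(git_status_porcelain(repo_dir)):
        raise SystemExit("Refusing to run: the Git working tree is not clean.")


# Examples on canned output, so no repository is needed here.
print(is_clean(""))               # prints True
print(is_clean(" M train.py\n"))  # prints False
```

Calling `require_clean_worktree()` at the top of a training script would enforce the check; note that untracked files also count as "not clean" in this sketch.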


I would love to get your opinions on these concepts.

The two areas where my views conflict most with DVC are parameterized pipelines and everything related to handling caching and pipeline definition as separate things.

These are the dvc issues I know about regarding parameterization:

These are the dvc issues I know about regarding separation of caching and pipeline:

@Pierre-Bartet

> Also, since many models are stochastic and we do not want to risk overfitting to a random seed, we need a loose definition of "deterministic" which allows models to vary slightly.

I strongly disagree on this one, if you run the same thing on the same input, you want exactly the same output. If you are afraid of overfitting on a specific single seed, then try multiple seeds, but that is not a version control issue.

> Sometimes you want to iterate locally without requiring review from your peers and without wanting to pollute your Git history. [...] For instance, you might still use the old preprocessing in your new uncommitted local experimentation

Then I would say make a branch and checkout the old preprocessing you want to use.

> Model and metric overview

Nice to have, but I haven't yet found what I need just for the data version control part, so I'm afraid shiny features could hide useful (in my opinion) ones.

IMHO DVC or any maintainable equivalent should be as close as possible to the simplest possible combination of:

  1. Git
  2. The ability to use hashes instead of the whole data for objects that are too large
  3. The ability to smartly cache these objects

I think all the complexity stems from these simple points. For example, they automatically create a distinction between raw data, which would be lost forever if you had only its hash, and what you call artifacts.
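
The second and third points of the list above (a hash standing in for a large object, plus a cache holding the object itself) can be sketched in Python. This is a toy under stated assumptions, not DVC's implementation: SHA-256, the two-level cache layout, and the helper names are choices made for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never sit fully in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def add_to_cache(path, cache_dir):
    """Copy the file into a content-addressed cache and return its hash.
    Only this small hash would need to be committed to Git."""
    digest = file_digest(path)
    target = Path(cache_dir) / digest[:2] / digest[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return digest


def restore_from_cache(digest, cache_dir, destination):
    """Recreate the file from the cache using only its hash."""
    source = Path(cache_dir) / digest[:2] / digest[2:]
    shutil.copy2(source, destination)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "raw.csv"
    data.write_bytes(b"id,value\n1,3.2\n")
    pointer = add_to_cache(data, Path(tmp) / "cache")
    data.unlink()  # the working copy is gone, only the hash remains
    restore_from_cache(pointer, Path(tmp) / "cache", data)
    restored_bytes = data.read_bytes()

print(restored_bytes == b"id,value\n1,3.2\n")  # prints True
```

The sketch also shows the distinction drawn above: if the cache entry for raw data is deleted, the hash alone cannot bring the data back, whereas an artifact could be regenerated by rerunning the pipeline.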

@elgehelge
Contributor Author

elgehelge commented Jun 10, 2020

Thanks for sharing @Pierre-Bartet

> > Also, since many models are stochastic and we do not want to risk overfitting to a random seed, we need a loose definition of "deterministic" which allows models to vary slightly.

> I strongly disagree on this one, if you run the same thing on the same input, you want exactly the same output. If you are afraid of overfitting on a specific single seed, then try multiple seeds, but that is not a version control issue.

You might be right. But I like the way Sacred generates a new seed for you for each run, which is then tracked. DVC also tracks parameter values, so something similar could be thought of.
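
The idea of generating a fresh seed per run and tracking it as a parameter can be sketched in plain Python. This is not Sacred's or DVC's API; `start_run`, `train`, and the fake metric are hypothetical stand-ins.

```python
import random


def start_run(params, seed=None):
    """Return the full parameter set for a run. If no seed is given,
    draw a fresh one and record it so the run can be replayed exactly."""
    if seed is None:
        seed = random.SystemRandom().randrange(2**32)
    return {**params, "seed": seed}


def train(run_params):
    """Stand-in for stochastic training: the 'metric' depends only on the seed."""
    rng = random.Random(run_params["seed"])
    return round(0.9 + rng.uniform(-0.01, 0.01), 4)


# A new run gets a fresh seed, which is tracked with the other parameters.
first_run = start_run({"learning_rate": 0.01})

# Replaying with the recorded seed gives exactly the same result.
replay = start_run({"learning_rate": 0.01}, seed=first_run["seed"])

print(train(first_run) == train(replay))  # prints True
```

This reconciles both positions in the thread: each run is exactly reproducible from its recorded inputs, while repeated runs with fresh seeds show how much the metrics vary.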

> > Sometimes you want to iterate locally without requiring review from your peers and without wanting to pollute your Git history. [...] For instance, you might still use the old preprocessing in your new uncommitted local experimentation

> Then I would say make a branch and checkout the old preprocessing you want to use.

Think you might be missing the point, or maybe I am. I am not talking about the preprocessing code, I am talking about the data that is cached.

> > Model and metric overview

> Nice to have, but I haven't yet found what I need just for the data version control part, so I'm afraid shiny features could hide useful (in my opinion) ones.

I agree. But I was trying to dream big. Right now Sacred or MLFlow does an acceptable job at this. I am just longing for something better 😊

@efiop efiop added the discussion requires active participation to reach a conclusion label Aug 6, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Aug 6, 2020
@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Aug 6, 2020
@skshetry skshetry removed the awaiting response we are waiting for your reply, please respond! :) label Jan 4, 2021
@efiop
Contributor

efiop commented Oct 8, 2021

closing as stale.

@efiop efiop closed this as completed Oct 8, 2021
@zmbc

zmbc commented Aug 1, 2024

First of all, I want to say that I completely agree with @elgehelge, except about deterministic pipelines: I believe they should be exact, and seeds etc. should be inputs just like any other.

Fortunately, a lot of the functionality described here is now present in DVC. However, there is one piece that I think is still missing, related to what @elgehelge called "artifacts."

Currently, if I understand correctly, DVC makes no distinction between cached data that was produced by code and cached data that was not (e.g. raw pipeline inputs). That means there is no way to run garbage collection that will always keep original input data. Garbage collecting intermediate or final outputs is a relatively cheap, non-destructive operation; if you really wanted them, you could always go back and generate them again. Garbage collecting raw input data, by contrast, means you have permanently lost something. There should be a --only-generated flag or something to dvc gc.
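
To make the proposal concrete, here is a minimal Python sketch of the selection logic such a flag might apply. It is purely hypothetical: `--only-generated` does not exist in `dvc gc`, and the `CacheEntry` type, its `generated` field, and `collect_garbage` are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheEntry:
    digest: str
    generated: bool  # True if produced by a pipeline step, False for raw input


def collect_garbage(entries, in_use, only_generated=True):
    """Split cache entries into (kept, removed).

    Entries still referenced are always kept. With `only_generated`,
    unreferenced raw inputs are kept as well, because they cannot be
    regenerated by rerunning the pipeline."""
    kept, removed = [], []
    for entry in entries:
        if entry.digest in in_use:
            kept.append(entry)
        elif only_generated and not entry.generated:
            kept.append(entry)
        else:
            removed.append(entry)
    return kept, removed


entries = [
    CacheEntry("raw-old", generated=False),
    CacheEntry("features-old", generated=True),
    CacheEntry("model-current", generated=True),
]

kept, removed = collect_garbage(entries, in_use={"model-current"})
print([entry.digest for entry in removed])  # prints ['features-old']
```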

Does that sound reasonable? Should I make a separate issue?

@shcheklein
Member

@zmbc you could consider using the data registry workflow in such cases (https://dvc.org/doc/use-cases/data-registry). That way, input data, datasets, and even some pre-processing can be detached from the actual consumers of the data, and each can have its own lifecycle policy.

@zmbc

zmbc commented Aug 2, 2024

@shcheklein Thanks, that is really interesting. It does feel like a bit of a hack -- using multiple DVC repositories because you can't have more than one lifecycle within a single one. For example, I could easily imagine this extending to three DVC repositories: a raw data registry that is never garbage collected; a pre-processing repo for data cleaning that is likely to be shared between multiple projects, but doesn't need to be kept forever since it can be easily re-run; and the actual project repo.

That said, it does make sense to me why this wouldn't be a priority, given that there is a viable workaround 👍

@shcheklein
Member

@zmbc yep, also worth noting that you could create a few DVC projects within a single Git repo as well (--subdir option AFAIR), so they don't have to be separated into multiple Git repos. That might make it convenient to work with pre-processing data within the same "consumer" repo.


6 participants