repro: use build cache for deterministic stages #1234

BartekRoszak · 2018-10-17T22:10:55Z

In my experiment I run a few different preprocessing steps, each of which creates a different CSV file. Then I model this data and try different parameters.
I want to run the same experiment on 4 different machines (dvc is connected to the same remote cache on each).
Every type of preprocessing gets run on every machine, which takes a lot of time. This could be avoided by running dvc pull before dvc repro and dvc push after it.
It could work as a single command, like dvc repro --remote

efiop · 2018-10-17T22:43:32Z

Hi @AratorField!

Thank you for sharing your scenario! It would indeed be a very useful feature! I think that dvc checkout would benefit from it as well. How about something like an autofetch option for dvc config, which would tell dvc to fetch cache before performing operations such as repro and checkout? Would that be suitable for you?

Thanks,
Ruslan

dmpetrov · 2018-10-18T05:49:23Z

@efiop the config option will change the API. It won't be clear from the code what exactly dvc repro does. A dvc repro --fetch option might be better than a global config param.

BartekRoszak · 2018-10-18T06:36:29Z

I am running dvc repro in a loop, so something like -f -p / --fetch --push would be OK for me.
Setting autofetch and autopush in the config is OK too.

efiop · 2018-10-18T14:53:13Z

@AratorField If you are running it in a loop, why can't you simply call dvc pull before it and dvc push after it? I was thinking about an interactive scenario, where such new options make sense, but in a loop it is easy enough to just call pull/push in your script.

BartekRoszak · 2018-10-18T17:16:50Z

I am doing it like you said when I run an experiment on a server where I run most of my experiments. But then I am reproducing manually some of the best experiments locally.

efiop · 2018-10-18T17:29:47Z

Ah, I see. So it looks something like:

git pull
dvc pull
dvc repro
git commit ...
git push
dvc push

Right? If so, I think a more suitable approach would be to add git hooks that will call dvc pull after git pull and dvc push after git push, similar to what we currently have with dvc checkout, where you can call dvc install to install a post-checkout hook for you so dvc checkout is automatically called after git checkout. Would that suit you?

Currently dvc install only installs a post-checkout hook, so we will have to add support for post-pull and post-push hooks as well, which totally fits within our current API architecture.
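The manual sequence listed above can also be wrapped in a small script. The following is a minimal sketch, assuming `git` and `dvc` are on the PATH. The function name `sync_and_repro`, the `commit_args` parameter and the injectable `run` callable are hypothetical and exist only for illustration. The thread leaves the `git commit` arguments elided, so the caller has to supply them.

```python
import subprocess


def sync_and_repro(commit_args, run=subprocess.run):
    """Run the pull / repro / commit / push sequence in order.

    commit_args: arguments for `git commit` (elided in the thread,
                 so they are supplied by the caller).
    run:         callable with the same interface as subprocess.run;
                 injectable so the sequence can be dry-run or tested.
    """
    steps = [
        ["git", "pull"],
        ["dvc", "pull"],
        ["dvc", "repro"],
        ["git", "commit", *commit_args],
        ["git", "push"],
        ["dvc", "push"],
    ]
    for cmd in steps:
        # check=True makes subprocess.run raise on a non-zero exit code,
        # so later steps are skipped if an earlier one fails.
        run(cmd, check=True)
```

Passing `run=lambda cmd, check: print(" ".join(cmd))` prints the commands instead of executing them, which is a cheap way to inspect the order before running it for real.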

BartekRoszak · 2018-10-18T18:31:52Z

Oh... So it works totally differently than I thought. Now I know why my algorithm is doing preprocessing every time.

Let me explain a bit more.
I am building about 100 models with different parameters (standard hyperparameter optimization).
Before data modeling there is preprocessing, so in dvc repro there are two steps, preprocessing and modeling.
I am working with text, and in preprocessing I have a script which can do two things:

convert all to lowercase
remove punctuation

So there could be 4 different types of preprocessing:

None
convert all to lowercase
remove punctuation
convert all to lowercase and remove punctuation

I set the type of preprocessing in a config file, which is a dependency of preprocessing.py, which is part of the pipeline.
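To make the four variants concrete, here is a minimal sketch of such a preprocessing step. It is not the actual preprocessing.py from this report. The function name `preprocess` and the flag names `lowercase` and `strip_punct` are assumptions made for illustration.

```python
import string
from itertools import product


def preprocess(text, lowercase=False, strip_punct=False):
    """Apply the two optional steps described above to a piece of text."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text


# Two independent on/off switches give 2 * 2 = 4 preprocessing types.
ALL_VARIANTS = list(product([False, True], repeat=2))
```

Each `(lowercase, strip_punct)` pair in `ALL_VARIANTS` corresponds to one of the four types listed above, and each produces a different output file for the same input.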

This is how I thought DVC worked:
If I run dvc repro with one type of preprocessing, then run it again with another type, and then run it a third time with the first type, it won't redo the preprocessing, because DVC will say:

Hey! We have already done this type of preprocessing and stored the output file in the cache, so now I can just take this file from the cache instead of running preprocessing again!

But it looks like it stores only the last version of the file, right?

efiop · 2018-10-18T19:20:16Z

You are right, dvc doesn't currently remember every dependencies + command = outputs combination that has occurred in the past, so it can't just pull up outputs from the cache this way. That being said, I have thought about this scenario a lot in the past, and it would indeed be extremely useful to have such a feature.

The best way we could implement this is to use git (or any other SCM system that the repository is based on) to get a table of dependencies + command = outputs values. That table will help us quickly identify whether this combination of dependencies has already been processed by this particular command, so we can pull up the appropriate outputs without recomputing. I would call this something like a "build cache", as opposed to the current "data cache", and would enable it with a special option for dvc repro, e.g. something like dvc repro --use-build-cache, which will tell dvc to check every stage to see if this combination (of deps and command) has been built before; if it has, it is okay for dvc to just pull up the result without recomputing.

It is also worth pointing out that since dvc doesn't keep 100% track of the environment you run your pipeline in (e.g. system lib versions and so on), there is always a chance that you won't get the same result if you build something in two environments. That said, with a little bit of care from the user (e.g. probabilistic models will always produce a different result, so if you use the --use-build-cache option, you need to be aware that it will pull up the old result, which will not be equal to the one you would have got by actually rebuilding), this feature should be extremely useful. Is this something that you would be interested in? If you are, we can up the priority for this one.
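The "dependencies + command = outputs" table can be pictured as a plain lookup keyed by a hash. The following is a minimal in-memory sketch of that idea, not DVC's implementation. The names `stage_key` and `BuildCache` are hypothetical, and the choice of SHA-256 over a sorted list of dependency checksums is an assumption.

```python
import hashlib


def stage_key(dep_checksums, command):
    """Hash the dependency checksums together with the command.

    dep_checksums maps a dependency path to its content checksum.
    Paths are sorted so the key does not depend on dict ordering.
    """
    h = hashlib.sha256()
    for path in sorted(dep_checksums):
        h.update(f"{path}={dep_checksums[path]}\n".encode())
    h.update(command.encode())
    return h.hexdigest()


class BuildCache:
    """In-memory table: (dependencies + command) -> output checksums."""

    def __init__(self):
        self._table = {}

    def record(self, dep_checksums, command, out_checksums):
        self._table[stage_key(dep_checksums, command)] = dict(out_checksums)

    def lookup(self, dep_checksums, command):
        # Returns None when this combination has never been built.
        return self._table.get(stage_key(dep_checksums, command))
```

With such a table, switching the config back to an earlier preprocessing type yields the same key as the first run, so `lookup` returns the recorded output checksums and the stage would not need to be recomputed.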

BartekRoszak · 2018-10-18T19:35:57Z

I would love to see it!
As you said, one has to be careful because not every script is deterministic, so maybe a better way to do that is to add a suitable option to dvc run, like --deterministic. Then when you run dvc repro, it will know which parts of the pipeline can be taken from the cache.
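The effect of such a per-stage marker can be sketched as follows. This illustrates the proposal only and does not describe how DVC handles it. The `Stage` class, its `deterministic` field and the `plan` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    # Set by the user, e.g. through a flag such as the proposed --deterministic.
    deterministic: bool = False


def plan(stages, cached_names):
    """Decide for each stage whether to reuse a cached result or run it.

    A stage is reused only if it is marked deterministic AND a cached
    result exists for it; everything else is run again.
    """
    return {
        s.name: "reuse" if s.deterministic and s.name in cached_names else "run"
        for s in stages
    }
```

In the pipeline from this thread, preprocessing would be marked deterministic, while a model-training stage with random initialisation would not, so only the former would ever be taken from the cache.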

efiop · 2018-10-18T19:47:02Z

Makes sense! I will look into it soon. Thank you so much for the feedback!

efiop · 2018-11-28T20:53:49Z

NOTE: we can use git history as a build cache by searching for existing committed dvc files in history. We could also cache that operation to only parse git history once. We also need to store a local build cache for uncommitted changes. Let's start with the latter.

dmpetrov · 2018-12-31T11:25:57Z

This Git (any SCM) history based solution from @efiop seems good. However, it is specific to SCM and won't work if a user does not commit changes (which might be natural for hyper param search).

It might be beneficial to support a more "dynamic" data structure which is not tied to SCM history and stores build\run caches after each run (even without commits). This structure can still be populated from the Git history if the history exists.

One possible solution... support a "build cache" directory with symlinks\hardlinks to outputs in the cache. Link: md5(dependencies)_md5(command)_outputname --> output_in_cache. So, if dvc run finds a command with a corresponding cache entry, it creates the outputs without rerunning the command.
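The link naming scheme above can be sketched in a few lines. This only computes the link name and does not create any symlinks or hardlinks. The function names are hypothetical, and how the individual dependency checksums are combined before hashing (sorted and joined with newlines here) is an assumption the comment does not specify.

```python
import hashlib


def md5_hex(text):
    return hashlib.md5(text.encode()).hexdigest()


def link_name(dep_checksums, command, output_name):
    """Build md5(dependencies)_md5(command)_outputname.

    dep_checksums is a list of checksum strings, one per dependency.
    Sorting them first makes the name independent of their order.
    """
    deps_part = md5_hex("\n".join(sorted(dep_checksums)))
    return f"{deps_part}_{md5_hex(command)}_{output_name}"
```

A lookup then amounts to checking whether a link with this name exists in the build cache directory. Output names containing path separators would need escaping before being used as a file name, which this sketch leaves out.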

vasinkd · 2020-02-02T20:45:16Z

I totally agree with @dmpetrov on that! It would be cool to have the ability to match stage inputs to outputs. I can imagine several scenarios when this feature would be extremely useful:

1. You just want to play around with data locally, and caring about branches/cache is just too much bother.
2. You or your teammate run an experiment which has been done before, but everyone has already forgotten about that.
3. You want to create several models using a dvc pipeline by changing some inputs at the beginning of the pipeline and putting the results aside. BTW, this is a very common use case for me.

I understand that there might be some caveats related to using different environments, but I believe that most people who use DVC as a tool to guarantee reproducibility of experiments do freeze their dependencies. If someone updates their working environment on purpose, they should just reset the build cache manually or run the stage/pipeline with the --force flag.

dmpetrov · 2020-02-02T21:23:49Z

Thank you, @vasinkd, for the insights and the scenarios.

The third scenario is especially interesting. I’d appreciate it if you could provide more details on it and explain how it differs from the 2nd. I understand this scenario and the pain it solves, but a more concrete use case would help to define the requirements.

Yeah, DVC definitely lacks this feature.

vasinkd · 2020-02-03T10:56:21Z

Actually, all three scenarios are the same: we run the same experiment several times and do not want to recalculate outputs if they are available in local/remote cache.

The second option is more about checking data in the remote cache. This is going to be helpful during the experimentation phase.
The third option is related to retraining of models: e.g. I run the same pipeline for different input data each month. I do it sequentially, in a Docker container containing the dvc pipeline and the required source code. Some stage inputs in the middle of the pipeline might happen to be the same for different pipeline input data. A local build cache might be helpful in that situation.

BTW, I think it is not so difficult to implement. We could create a build-cache folder inside the .dvc folder and store full .dvc files (or just the part related to outputs) under a hash of the inputs. It would then be possible to merge branches painlessly, as long as outputs are stored in a deterministic order. Merge conflicts would signal that something is wrong with the experiment setup on one of the machines.
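A minimal sketch of this folder layout, under stated assumptions: the entry format (a JSON list of output records with a `path` field), the MD5 key over the JSON-serialised inputs, and the function names `save_entry` and `load_entry` are all hypothetical and chosen only to illustrate the idea of storing outputs under a hash of the inputs.

```python
import hashlib
import json
from pathlib import Path


def _entry_path(cache_dir, inputs):
    # sort_keys makes the serialisation, and therefore the hash, stable.
    key = hashlib.md5(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    return Path(cache_dir) / key


def save_entry(cache_dir, inputs, outs):
    """Store the outputs section under a hash of the stage inputs."""
    path = _entry_path(cache_dir, inputs)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Deterministic ordering: identical runs on different machines write
    # byte-identical files, so they merge without conflicts.
    ordered = sorted(outs, key=lambda out: out["path"])
    path.write_text(json.dumps(ordered, sort_keys=True) + "\n")
    return path


def load_entry(cache_dir, inputs):
    """Return the stored outputs for these inputs, or None if unseen."""
    path = _entry_path(cache_dir, inputs)
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

If two machines record different outputs for the same inputs, they write different content to the same file name, which is exactly the merge conflict described above.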

efiop · 2020-02-03T23:36:09Z

@vasinkd Yeah, we've thought about filling a similar type of cache by parsing the git repo history to find all previously run dvc-files, plus some uncommitted ones that were run in this local repo. I might be missing something, but I think we could totally start with the approach you've suggested. 👍 Maybe you would like to contribute a patch? 🙂

vasinkd · 2020-02-04T16:59:46Z

Yes, I will definitely give it a try :)

This patch introduces `.dvc/cache/stages` that is used to store previous runs and their results, which could then be reused later when we stumble upon the same command with the same deps and outs. Format of build cache entries is single-line json, which is readable by humans and might also be used for lock files discussed in iterative#1871. Related to iterative#1871 Local part of iterative#1234

Fixes iterative#1234

BartekRoszak changed the title ~~Checking remote whren running repro~~ Checking remote when running repro Oct 17, 2018

efiop added the enhancement Enhances DVC label Oct 17, 2018

efiop added this to the Queue milestone Oct 17, 2018

efiop changed the title ~~Checking remote when running repro~~ repro: use build cache for deterministic stages Oct 19, 2018

vernt mentioned this issue Dec 2, 2018

Add dvc run --deterministic option. #1400

Merged

This was referenced Jan 7, 2019

Reconfigurable pipelines #1462

Closed

Reconfigurable modules #1472

Closed

efiop mentioned this issue Jan 23, 2019

Add dry-run option for garbage collection #1511

Closed

efiop added the p4-not-important label Jul 23, 2019

efiop removed this from the Queue milestone Sep 25, 2019

efiop added p3-nice-to-have It should be done this or next sprint and removed p4 labels Sep 25, 2019

dmpetrov mentioned this issue Feb 7, 2020

store whole DAG in one DVC-file #1871

Closed

weekly-digest bot mentioned this issue May 3, 2020

Weekly Digest (26 April, 2020 - 3 May, 2020) #3723

Closed

efiop added a commit to efiop/dvc that referenced this issue May 8, 2020

push/pull: properly collect run cache

c121c33

Fixes iterative#1234

efiop mentioned this issue May 8, 2020

push/pull: properly collect run cache #3768

Merged

3 tasks

efiop closed this as completed in #3768 May 8, 2020

efiop added a commit that referenced this issue May 8, 2020

push/pull: properly collect run cache (#3768)

d62a54c

Fixes #1234
