repro: use build cache for deterministic stages #1234

Closed
BartekRoszak opened this issue Oct 17, 2018 · 23 comments · Fixed by #3768
Labels
enhancement Enhances DVC p0-critical Critical issue. Needs to be fixed ASAP.

Comments

@BartekRoszak

BartekRoszak commented Oct 17, 2018

In my experiment I run a few different preprocessing steps, each of which creates a different CSV file. Then I model this data and also try different parameters.
I want to run the same experiment on 4 different machines (dvc is connected to the same remote cache on each).
Currently every type of preprocessing is run on every machine, which takes a lot of time and could be avoided by running dvc pull before dvc repro and dvc push after it.
It could work as a single command, like dvc repro --remote.

@BartekRoszak BartekRoszak changed the title Checking remote whren running repro Checking remote when running repro Oct 17, 2018
@efiop
Contributor

efiop commented Oct 17, 2018

Hi @AratorField !

Thank you for sharing your scenario! It would indeed be a very useful feature! I think that dvc checkout would benefit from it as well. How about something like an autofetch option for dvc config that tells dvc to fetch the cache before performing operations such as repro and checkout? Would that be suitable for you?

Thanks,
Ruslan

@efiop efiop added the enhancement Enhances DVC label Oct 17, 2018
@efiop efiop added this to the Queue milestone Oct 17, 2018
@dmpetrov
Member

@efiop the config option would change the API: it would no longer be clear from the code what exactly dvc repro does. A dvc repro --fetch option might be better than a global config param.

@BartekRoszak
Author

I am running dvc repro in a loop, so something like -f -p / --fetch --push would be OK for me.
Setting autofetch and autopush in the config would be OK too.

@efiop
Contributor

efiop commented Oct 18, 2018

@AratorField If you are running it in a loop, why can't you simply call dvc pull before it and dvc push after it? I was thinking about an interactive scenario, where such new options make sense, but in a loop it is easy enough to just call pull/push in your script.

@BartekRoszak
Author

I do it like you said when I run an experiment on the server where I run most of my experiments. But then I manually reproduce some of the best experiments locally.

@efiop
Contributor

efiop commented Oct 18, 2018

Ah, I see. So it looks something like:

git pull
dvc pull
dvc repro
git commit ...
git push
dvc push

Right? If so, I think a more suitable approach would be to add git hooks that will call dvc pull after git pull and dvc push after git push, similar to what we currently have with dvc checkout, where you can call dvc install to install a post-checkout hook for you so dvc checkout is automatically called after git checkout. Would that suit you?

Currently dvc install only installs a post-checkout hook, so we will have to add support for post-pull and post-push hooks as well, which totally fits within our current API architecture.
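
As a rough illustration of the manual sequence listed above (not something DVC provides), a small Python wrapper could run dvc pull, dvc repro and dvc push in order. The function name `synced_repro`, the injectable `run` parameter and the `model.dvc` target are assumptions made for this sketch.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    # Raises CalledProcessError if the command exits with a non-zero status.
    subprocess.run(list(cmd), check=True)


def synced_repro(target: str, run: Runner = _run) -> List[List[str]]:
    """Pull cached data, reproduce `target`, then push the new results."""
    steps = [
        ["dvc", "pull"],
        ["dvc", "repro", target],
        ["dvc", "push"],
    ]
    for cmd in steps:
        run(cmd)  # stops at the first failing step when using _run
    return steps


if __name__ == "__main__":
    # "model.dvc" is a hypothetical stage file name.
    synced_repro("model.dvc")
```

Passing a different `run` callable makes it possible to log or dry-run the commands without invoking dvc.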

@BartekRoszak
Author

BartekRoszak commented Oct 18, 2018

Oh... So it works totally differently than I thought. Now I know why my algorithm is doing preprocessing every time.

Let me explain a bit more.
I am building about 100 models with different parameters (standard hyperparameter optimization).
Before data modeling there is preprocessing, so in dvc repro there are two steps: preprocessing and modeling.
I am working with text, and in preprocessing I have a script which can do two things:

  1. convert all to lowercase
  2. remove punctuation

So there could be 4 different types of preprocessing:

  • None
  • convert all to lowercase
  • remove punctuation
  • convert all to lowercase and remove punctuation

I set the type of preprocessing in a config file, which is a dependency of preprocessing.py, which is in the pipeline.
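
For readers following along, the four preprocessing variants can be sketched as two independent switches. This is only an illustration of the setup described above; the function name and parameter names are assumptions, not the actual contents of preprocessing.py.

```python
import string


def preprocess(text: str, lowercase: bool = False,
               remove_punctuation: bool = False) -> str:
    """Apply the selected preprocessing steps to `text`."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text
```

Two boolean switches give exactly the four combinations listed: none, lowercase only, punctuation removal only, and both.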

How I thought DVC worked:
If I run dvc repro with one type of preprocessing, then run dvc repro again with another type, and then run it a third time with the first type, it won't do the preprocessing again, because DVC will say:

Hey! We have already done this type of preprocessing and stored the output file in the cache, so now I can just take this file from the cache instead of running preprocessing again!

But it looks like it stores only the last version of the file, right?

@efiop
Contributor

efiop commented Oct 18, 2018

You are right, dvc doesn't currently remember every dependencies + command = outputs combination that has occurred in the past, so it can't just pull up outputs from the cache this way. That being said, I have thought about this scenario a lot, and it would indeed be extremely useful to have such a feature.

The best way we could implement this is to use git (or whatever other SCM system the repository is based on) to get a table of dependencies + command = outputs values. That table would let us quickly identify whether this combination of dependencies has already been processed by this particular command, so we could pull up the appropriate outputs without recomputing. I would call this something like a "build cache", as opposed to the current "data cache", and would enable it with a special option for dvc repro, e.g. something like dvc repro --use-build-cache. The option would tell dvc to check every stage to see if this combination (of deps and command) has been built before, and if it has, then it is okay for dvc to just pull up the result without recomputing.

It is also worth pointing out that since dvc doesn't keep 100% track of the environment you run your pipeline in (e.g. system lib versions and so on), there is always a chance that you won't get the same result if you build something in two environments. That said, with a little bit of care from the user, this feature should be extremely useful. For example, probabilistic models will always produce a different result, so if you were to use the --use-build-cache option, you would need to be aware that it will pull up the old result, which will not be equal to the one you would have got by actually rebuilding.

Is this something that you would be interested in? If you are interested, we can up the priority for this one.
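
To make the "dependencies + command = outputs" table concrete, here is a minimal in-memory sketch. It is not DVC's implementation; the class name, method names and the use of SHA-256 for the key are all assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, Mapping, Optional


def stage_key(command: str, dep_hashes: Mapping[str, str]) -> str:
    """Combine the command and the dependency checksums into one key."""
    h = hashlib.sha256()
    h.update(command.encode("utf-8"))
    # Sort by path so the key does not depend on dict ordering.
    for path in sorted(dep_hashes):
        h.update(b"\0" + path.encode("utf-8"))
        h.update(b"\0" + dep_hashes[path].encode("utf-8"))
    return h.hexdigest()


class BuildCache:
    """Maps (command, dependency checksums) to output checksums."""

    def __init__(self) -> None:
        self._table: Dict[str, Dict[str, str]] = {}

    def record(self, command: str, dep_hashes: Mapping[str, str],
               out_hashes: Mapping[str, str]) -> None:
        self._table[stage_key(command, dep_hashes)] = dict(out_hashes)

    def lookup(self, command: str,
               dep_hashes: Mapping[str, str]) -> Optional[Dict[str, str]]:
        # None means this combination has not been built before.
        return self._table.get(stage_key(command, dep_hashes))
```

A hit returns the checksums of the outputs, which could then be fetched from the data cache instead of rerunning the command; any change to the command or to a dependency checksum yields a miss.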

@BartekRoszak
Author

BartekRoszak commented Oct 18, 2018

I would love to see it!
As you said, one has to be careful because not every script is deterministic, so maybe a better way to do that is to add a suitable option to dvc run, like --deterministic. Then when you run dvc repro it will know which parts of the pipeline can be taken from the cache.
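
A small sketch of how such a per-stage flag could gate cache reuse, assuming a hypothetical `Stage` record with a `deterministic` field (none of these names come from DVC):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Stage:
    name: str
    deterministic: bool          # hypothetical flag set when the stage is created
    run: Callable[[str], str]    # maps an inputs key to a result


def reproduce(stage: Stage, inputs_key: str,
              cache: Dict[Tuple[str, str], str]) -> str:
    """Reuse a cached result only for stages marked as deterministic."""
    key = (stage.name, inputs_key)
    if stage.deterministic and key in cache:
        return cache[key]
    result = stage.run(inputs_key)
    if stage.deterministic:
        cache[key] = result
    return result
```

Stages that are not marked deterministic are always rerun and never stored, so a stale probabilistic result cannot be picked up by accident.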

@efiop
Contributor

efiop commented Oct 18, 2018

Makes sense! I will look into it soon. Thank you so much for the feedback!

@efiop efiop changed the title Checking remote when running repro repro: use build cache for deterministic stages Oct 19, 2018
@efiop
Contributor

efiop commented Nov 28, 2018

NOTE: we can use git history as a build cache by searching for existing committed dvc files in history. We could also cache that operation so that git history is only parsed once. We also need to store a local build cache for uncommitted changes. Let's start with the latter.

@dmpetrov
Member

dmpetrov commented Dec 31, 2018

This Git (or any SCM) history based solution from @efiop seems good. However, it is specific to SCM and won't work if a user does not commit changes (which might be natural for a hyperparameter search).

It might be beneficial to support a more "dynamic" data structure which is not tied to SCM history and stores build/run caches after each run (even without commits). This structure can still be populated from the Git history if the history exists.

One possible solution: support a "build cache" directory with symlinks/hardlinks to outputs in the cache. Link: md5(dependencies)_md5(command)_outputname --> output_in_cache. So, if dvc run finds a command with a corresponding cache entry, it creates the outputs without rerunning the command.
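
The link naming scheme above could look roughly like this. It is a sketch under assumptions: dependency checksums are given as strings, the output name is a plain file name without path separators, and the helper names (`link_name`, `remember`, `find`) are invented for illustration.

```python
import hashlib
import os
from pathlib import Path
from typing import Iterable, Optional


def md5_hex(data: str) -> str:
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def link_name(dep_checksums: Iterable[str], command: str, out_name: str) -> str:
    """Build md5(dependencies)_md5(command)_outputname."""
    # Sort so the name does not depend on the order dependencies are listed in.
    deps_digest = md5_hex("".join(sorted(dep_checksums)))
    return f"{deps_digest}_{md5_hex(command)}_{out_name}"


def remember(build_dir: Path, name: str, cached_output: Path) -> Path:
    """Create a symlink in the build cache pointing at a cached output."""
    build_dir.mkdir(parents=True, exist_ok=True)
    link = build_dir / name
    if not link.is_symlink():
        os.symlink(cached_output, link)
    return link


def find(build_dir: Path, name: str) -> Optional[Path]:
    """Return the cached output for `name`, or None if there is no usable link."""
    link = build_dir / name
    # exists() follows the symlink, so a dangling link counts as a miss.
    return link.resolve() if link.exists() else None
```

Because `find` treats a dangling link as a miss, removing an object from the data cache simply causes the stage to be rerun.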

This was referenced Jan 7, 2019
@efiop efiop removed this from the Queue milestone Sep 25, 2019
@efiop efiop added p3-nice-to-have It should be done this or next sprint and removed p4 labels Sep 25, 2019
@vasinkd

vasinkd commented Feb 2, 2020

I totally agree with @dmpetrov on that! It would be cool to have the ability to match stage inputs to outputs. I can imagine several scenarios when this feature would be extremely useful:

  1. You just want to play around with data locally, and caring about branches/cache is too much of a bother.
  2. You or your teammate run an experiment which has been done before, but everyone has already forgotten about it.
  3. You want to create several models using a dvc pipeline by changing some inputs at the beginning of the pipeline and putting the results aside. BTW, this is a very common use case for me.

I understand that there might be some caveats related to using different environments, but I believe that most people who use DVC as a tool that guarantees reproducibility of experiments do freeze their dependencies. If someone updates their working environment on purpose, they should just reset the build-cache manually or run the stage/pipeline with the --force flag.

@dmpetrov
Member

dmpetrov commented Feb 2, 2020

Thank you, @vasinkd, for the insights and the scenarios.

The third scenario is especially interesting. I'd appreciate it if you could provide more details on this scenario and explain how it differs from the 2nd. I can feel this scenario and the pain it solves, but a more solid use case would help to define requirements.

Yeah, DVC definitely lacks this feature.

@vasinkd

vasinkd commented Feb 3, 2020

Actually, all three scenarios are the same: we run the same experiment several times and do not want to recalculate outputs if they are available in local/remote cache.

The second option is more about checking data in the remote cache. This is going to be helpful during the experimentation phase.
The third option is related to retraining of models: e.g. I run the same pipeline on different input data each month. I do it sequentially, in a Docker container containing the dvc pipeline and the required source code. Some stages' inputs in the middle of the pipeline might happen to be the same for different pipeline input data. A local build-cache might be helpful in that situation.

BTW, I think it is not so difficult to implement. We could create a build-cache folder inside the .dvc folder and store full .dvc files (or just the part related to outputs) under a hash of the inputs. It would then be possible to merge branches painlessly, provided outputs are stored in a deterministic order. Merge conflicts would signal that something is wrong with the experiment setup on one of the machines.
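
The merge argument can be illustrated with a tiny sketch. It assumes each build-cache entry is reduced to a sorted list of (output path, checksum) pairs keyed by an inputs hash; the names `CacheTable`, `entry` and `merge` are hypothetical.

```python
from typing import Dict, List, Tuple

# inputs hash -> sorted list of (output path, output checksum)
CacheTable = Dict[str, List[Tuple[str, str]]]


def entry(outputs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Store outputs in a deterministic order so equal results compare equal."""
    return sorted(outputs.items())


def merge(ours: CacheTable, theirs: CacheTable) -> Tuple[CacheTable, List[str]]:
    """Merge two build caches and report keys whose outputs disagree."""
    merged = dict(ours)
    conflicts = []
    for key, outs in theirs.items():
        if key in merged and merged[key] != outs:
            # Same inputs, different outputs: the setups differ somewhere.
            conflicts.append(key)
        else:
            merged[key] = outs
    return merged, sorted(conflicts)
```

Identical entries merge silently, while a conflict on the same inputs hash points at a non-deterministic stage or a difference between environments.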

@efiop
Contributor

efiop commented Feb 3, 2020

@vasinkd Yeah, we've thought about filling a similar type of cache by parsing the git repo history to find all previously run dvc-files, plus some uncommitted ones that were run in this local repo. I might be missing something, but I think we could totally start with the approach you've suggested. 👍 Maybe you would like to contribute a patch? 🙂

@vasinkd

vasinkd commented Feb 4, 2020

Yes, I will definitely give it a try)

efiop added a commit to efiop/dvc that referenced this issue Apr 22, 2020
This patch introduces `.dvc/cache/stages` that is used to store previous
runs and their results, which could then be reused later when we stumble
upon the same command with the same deps and outs.

Format of build cache entries is single-line json, which is readable by
humans and might also be used for lock files discussed in iterative#1871.

Related to iterative#1871
Local part of iterative#1234
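
For illustration only, a single-line JSON entry of the kind the commit message describes might be written and read like this. The field names (`cmd`, `deps`, `outs`) and the example values are assumptions; the actual entry layout used in `.dvc/cache/stages` is not shown in this thread.

```python
import json
from typing import Any, Dict


def dump_entry(entry: Dict[str, Any]) -> str:
    """Serialize an entry as one line of JSON with a stable key order."""
    return json.dumps(entry, sort_keys=True, separators=(",", ":"))


def load_entry(line: str) -> Dict[str, Any]:
    return json.loads(line)


# Hypothetical entry: a command plus checksums of its deps and outs.
example = {
    "cmd": "python preprocessing.py",
    "deps": {"data.csv": "abc123"},
    "outs": {"clean.csv": "def456"},
}
line = dump_entry(example)
```

Sorting the keys means the same run always serializes to the same bytes, which keeps entries easy to compare and diff.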
efiop added a commit to efiop/dvc that referenced this issue Apr 28, 2020
efiop added a commit to efiop/dvc that referenced this issue Apr 29, 2020
efiop added a commit that referenced this issue Apr 29, 2020
efiop added a commit to efiop/dvc that referenced this issue May 8, 2020
efiop added a commit that referenced this issue May 8, 2020