repro: use build cache for deterministic stages #1234
Hi @AratorField! Thank you for sharing your scenario! It would indeed be a very useful feature! I think that… Thanks,
@efiop the config option will change the API. It won't be clear what exactly
I am running
@AratorField If you are running it in a loop, why can't you simply call
I am doing it like you said when I run an experiment on a server, where I run most of my experiments. But then I am manually reproducing some of the best experiments locally.
Ah, I see. So it looks something like:
Right? If so, I think a more suitable approach would be to add git hooks that will call Currently
Oh... So it works totally differently than I thought. Now I know why my algorithm is doing preprocessing every time. Let me explain a bit more.
So there could be 4 different types of preprocessing:
I am setting the type of preprocessing in a config file which is a dependency of preprocessing.py, which is in the pipeline. What I thought DVC was doing is:
But it looks like it stores only the last version of the file, right?
You are right, dvc doesn't currently remember every
I would love to see it!
Makes sense! I will look into it soon. Thank you so much for the feedback!
NOTE: we can use git history as a build cache by searching for existing committed dvc-files in history. We could also cache that operation so we only parse git history once. We also need to store a local build cache for uncommitted changes. Let's start with the latter.
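The git-history idea above can be sketched as an index over previously committed stage files. This is a minimal illustration, not DVC's actual API: it assumes the historical stage files have already been collected and parsed into dicts, and all function names here are hypothetical.

```python
# Sketch: index historical stage files by (cmd, dep checksums) so a repro
# can look up previously computed outs instead of re-running the command.
import json


def stage_key(stage):
    """Deterministic key built from the command and its dependency checksums."""
    deps = sorted((d["path"], d["md5"]) for d in stage.get("deps", []))
    return json.dumps({"cmd": stage["cmd"], "deps": deps}, sort_keys=True)


def build_cache_index(historical_stages):
    """Map each (cmd, deps) key to the outs recorded for that run."""
    index = {}
    for stage in historical_stages:
        index[stage_key(stage)] = stage.get("outs", [])
    return index


# Usage: if the current stage's key is in the index, its outs can be
# restored from cache rather than recomputed.
history = [
    {"cmd": "python preprocess.py",
     "deps": [{"path": "config.yaml", "md5": "aaa"}],
     "outs": [{"path": "data.csv", "md5": "bbb"}]},
]
index = build_cache_index(history)
current = {"cmd": "python preprocess.py",
           "deps": [{"path": "config.yaml", "md5": "aaa"}]}
cached_outs = index.get(stage_key(current))
```

Caching the parsed history (as the note suggests) would just mean persisting `index` so git history is walked only once.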
This Git (any SCM) history based solution from @efiop seems good. However, it is specific to SCM and won't work if a user does not commit changes (which might be natural for hyperparameter search). It might be beneficial to support a more "dynamic" data structure which is not tied to SCM history and stores build/run caches after each run (even without commits). This structure can still be populated from the Git history if the history exists. One possible solution: support a "build cache" directory with symlinks/hardlinks to outputs in the cache.
I totally agree with @dmpetrov on that! It would be cool to have the ability to match stage inputs to outputs. I can imagine several scenarios when this feature would be extremely useful:
I understand that there might be some caveats related to using different environments, but I believe that most people who use DVC as a tool that guarantees reproducibility of experiments do freeze their dependencies. If someone updates their working environment on purpose, they should just reset the build cache manually or run the stage/pipeline with the --force flag.
Thank you, @vasinkd, for the insights and the scenarios. The third scenario is especially interesting. I'd appreciate it if you could provide more details on this scenario and explain the difference from the 2nd. I feel this scenario and the pain it solves, but a more solid use case can help to define requirements. Yeah, DVC definitely lacks this feature.
Actually, all three scenarios are the same: we run the same experiment several times and do not want to recalculate outputs if they are available in the local/remote cache. The second option is more about checking data in the remote cache. This is going to be helpful during the experimentation phase. BTW, I think it is not so difficult to implement. We could create a build-cache folder inside the .dvc folder and store full .dvc files (or just the part related to outputs) under a hash of the inputs. Therefore, it would be possible to merge branches painlessly if outputs are ordered deterministically. Merge conflicts will signal that something is wrong with the experiment setup on one of the machines.
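The "store stage files under a hash of the inputs" proposal can be sketched as follows. The directory layout, field names, and hash choice are illustrative assumptions, not DVC's actual implementation.

```python
# Sketch: store each run's result under build-cache/<hash-of-inputs>,
# and look it up before re-running. Layout and schema are hypothetical.
import hashlib
import json
import os
import tempfile


def inputs_hash(cmd, dep_checksums):
    """Hash the command plus sorted dependency checksums deterministically."""
    payload = json.dumps({"cmd": cmd, "deps": sorted(dep_checksums)},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def save_run(cache_dir, cmd, dep_checksums, outs):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, inputs_hash(cmd, dep_checksums))
    # Sorting outs keeps the file byte-identical across machines, so
    # branches merge cleanly and a conflict signals a broken setup.
    with open(path, "w") as f:
        json.dump({"outs": sorted(outs)}, f, sort_keys=True)
    return path


def lookup_run(cache_dir, cmd, dep_checksums):
    path = os.path.join(cache_dir, inputs_hash(cmd, dep_checksums))
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["outs"]
    return None  # cache miss: the stage must actually run


cache = os.path.join(tempfile.mkdtemp(), "build-cache")
save_run(cache, "python train.py", ["md5:aaa"], ["md5:out1"])
hit = lookup_run(cache, "python train.py", ["md5:aaa"])
miss = lookup_run(cache, "python train.py", ["md5:bbb"])
```

Because the filename is derived only from the inputs and the file contents are sorted, two machines that ran the same experiment produce identical cache files, which is what makes merges painless.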
@vasinkd Yeah, we've thought about filling a similar type of cache by parsing the git repo history to find all previously run dvc-files, plus some uncommitted ones that were run in this local repo. I might be missing something, but I think we could totally start with the approach you've suggested. 👍 Maybe you would like to contribute a patch? 🙂
Yes, I will definitely give it a try)
This patch introduces `.dvc/cache/stages`, which is used to store previous runs and their results, so they can be reused later when we stumble upon the same command with the same deps and outs. The format of build cache entries is single-line JSON, which is human-readable and might also be used for the lock files discussed in iterative#1871. Related to iterative#1871. Local part of iterative#1234.
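The single-line JSON entry format described in the patch might look something like the sketch below. The exact schema and fields are assumptions for illustration, not the patch's actual format.

```python
# Sketch: serialize each run as one line of JSON (append-friendly,
# grep-friendly, still human-readable), and parse a file of such lines.
# The schema here is hypothetical.
import json


def serialize_entry(cmd, deps, outs):
    # json.dumps without indent yields a single line; sort_keys keeps
    # the output deterministic across machines.
    return json.dumps({"cmd": cmd, "deps": deps, "outs": outs},
                      sort_keys=True)


def parse_entries(text):
    """Read back all entries from a file of one-JSON-object-per-line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


line = serialize_entry("python train.py",
                       {"data.csv": "md5:aaa"},
                       {"model.pkl": "md5:bbb"})
entries = parse_entries(line + "\n")
```

One object per line means new runs can simply be appended to the cache file without rewriting it.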
In my experiment I run a few different preprocessing steps which create different CSV files; then I model this data, also checking different parameters. When I want to run the same experiment on 4 different machines (`dvc` is connected to the same remote cache), every type of preprocessing will be run on every machine, which takes a lot of time and could be avoided by running `dvc pull` before `dvc repro` and `dvc push` after it. It could work with one command like `dvc repro --remote`.