exp run: --temp --rev does not properly resume from the target revision
#7813
Comments
I'm able to confirm. @pmrowla @karajan1001 Any thoughts?
@dberenbaum initially I thought this was a bug, but after looking into this a bit more I think the problem is that the expected resume behavior is still kind of ambiguous (this goes back to the old discussion in #5814). The problem is that the current intended behavior is to allow for resuming an experiment with additional modifications (applied from the workspace). In workspace runs this is (mostly) intuitive, but when using
So basically DVC runs the current workspace (and extends the checkpoint experiment specified via So to resume an experiment without modification at all, the DVC expected workflow here is actually
(which is clearly not very intuitive)
We can change the behavior so that
Also related: #6104 (regarding whether or not checkpoints should be resumable at all)
Thanks for the response! @pmrowla if the workspace is always applied even to
Reading the linked conversations, it seems like I have stumbled into a bit of a checkpoint behaviour minefield! I won't pretend to fully understand all the target use cases, but it sounds like you are trying to accommodate a lot of disparate things with this behaviour. Perhaps I can help most at this point by fully explaining our current use case. We want to use
I think we are finding in particular that the combination of parallel exp jobs, checkpoints, and remote instance training is quite tricky!
Yup, I think you are entering the danger zone @mattlbeck 😄 . We are working towards simplifying this workflow ourselves (cc @casperdcl). Would you be interested in running each job on a different spot instance so that failures remain independent?
One job per instance sounds quite sensible, but I think I would have to deploy with TPI for each experiment I want to run, is that correct? It will be a fair bit more engineering to do it this way. The nice thing about the current workflow was that we could let DVC worry about scheduling jobs correctly given a fixed number of parallel jobs.
Makes sense. I was asking because we want to work towards integrating DVC and TPI to make it easier to deploy a TPI instance per DVC experiment. Hopefully we can make your workflow easier in the coming months, and hearing your issues will help!
It may be possible to sync the temp experiment data to avoid needing to push/pull checkpoints. @pmrowla @casperdcl What do you think?
Has this resolved your immediate issue?
Getting back to the actual issue, I think we should make this change. I hit a related issue today:
$ dvc exp run --rev 81de285 --queue
ERROR: Cannot resume from '81de285' as it is not derived from your current workspace.: Experiment derived from 'cee358c', expected 'e24a469'.
The whole point of
It looks like it could be a solution, but I will have to find some time to test it properly.
This sounds ideal for some workflows, but for others it will still be more convenient to run multiple experiments on the same instance, e.g. due to account instance limits, or to make the best use of one instance's CPU capacity when your individual jobs are not multiprocessing.
Would it make sense to simply preserve the state of the specified In either case, it seems the documentation may need to expand on how the
doesn't specify that the state of the current workspace would influence such experiment. I think documentation of
Bug Report
Description
The command appears to use whatever is checked out in the workspace to resume from, despite providing a different --rev.
Reproduce
Clone the worked example repository and run the reproduction script.
Expected
On the second experiment, it should try to resume counting from 3 and exit immediately. The observed behaviour is that it just starts from 0 again.
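To make the expected behaviour concrete, here is a minimal sketch of a resumable counter. It is not the script from the example repository and it does not call DVC at all; the state file name, the `run` function, and the target of 3 are assumptions chosen only to illustrate "resume from 3 and exit immediately" versus "start from 0 again".

```python
import json
import tempfile
from pathlib import Path
from typing import List

TARGET = 3  # hypothetical stopping point, mirroring "resume counting from 3"


def run(state_file: Path, target: int = TARGET) -> List[int]:
    """Count up to `target`, resuming from the value saved in `state_file`.

    Returns the list of counts produced during this run.
    """
    start = 0
    if state_file.exists():
        # Resume from the last saved count instead of starting over.
        start = json.loads(state_file.read_text())["count"]

    produced = []
    for count in range(start + 1, target + 1):
        # Persist progress after every step (a stand-in for a checkpoint).
        state_file.write_text(json.dumps({"count": count}))
        produced.append(count)
    return produced


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "state.json"  # hypothetical checkpoint file
        print("first run:", run(state))   # counts from 1 up to TARGET
        print("second run:", run(state))  # state is already at TARGET, nothing to do
```

If the second run is handed the saved state, it has nothing left to do. If it is handed a state from a different revision (for example, one where the count is still 0), it counts from the beginning again, which matches the observed behaviour.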
Environment information