Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp run: --temp --rev does not properly resume from the target revision #7813

Closed
mattlbeck opened this issue May 26, 2022 · 12 comments
Closed
Labels
A: experiments Related to dvc exp bug Did we break something? p3-nice-to-have It should be done this or next sprint

Comments

@mattlbeck
Copy link
Contributor

Bug Report

Description

The command appears to use whatever is checked out in the workspace to resume from, despite providing a different --rev

Reproduce

Clone the worked example repository and run the reprduction script

git clone https://github.com/mattlbeck/dvc-exp-resume-checkpoint-issue
cd dvc-exp-resume-checkpoint-issue
pip install -r requirements.txt
bash ./reproduce_issue.sh

Expected

On the second experiment, it should try to resume counting from 3 and exit immediately. Observed behaviour is that is just starts from 0 again.

Environment information

DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.10.4 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.2.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.2.0, boto3 = 1.20.24)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git
@mattlbeck mattlbeck changed the title exp run --temp --rev: does not properly resume from the target revision exp run: --temp --rev does not properly resume from the target revision May 26, 2022
@pared pared added A: experiments Related to dvc exp bug Did we break something? labels May 26, 2022
@dberenbaum
Copy link
Collaborator

I'm able to confirm. @pmrowla @karajan1001 Any thoughts?

@pmrowla
Copy link
Contributor

pmrowla commented Jun 7, 2022

@dberenbaum initially I thought this was a bug, but after looking into this a bit more I think the problem is that the expected resume behavior is still kind of ambiguous (this goes back to the old discussion in #5814).

The problem is that the current intended behavior is to allow for resuming an experiment with additional modifications (applied from the workspace). In workspace runs this is (mostly) intuitive, but when using --temp and --rev it is less obvious. What happens when these options are combined is:

  1. DVC git checkout's --rev
  2. DVC applies the user's current workspace state on top of that (as a diff)
  3. DVC then runs that experiment inside a --temp run

So basically DVC runs the current workspace (and extends the checkpoint experiment specified via --rev)

So to resume an experiment without modification at all, the DVC expected workflow here is actually

# run any experiment at all (whether via `--temp` or in the workspace)
git reset --hard # (ensure clean workspace)
dvc exp apply $(dvc exp list --names-only)
dvc exp run --temp --rev $(dvc exp list --names-only)

(which is clearly not very intuitive)

We can change the behavior so that exp run --temp --rev just resumes from --rev directly (which is what @mattlbeck is expecting), but this will mean that you cannot make any modifications when resuming with --rev (so we will have different behavior/feature support for workspace runs vs temp/queue runs)

Also related: #6104 (regarding whether or not checkpoints should be resumable at all)

@mattlbeck
Copy link
Contributor Author

Thanks for the response! @pmrowla if the workspace is always applied even to --temp runs, it's not clear to me why your workaround would be better over the simpler:

dvc exp apply <rev>
dvc exp run --temp

@mattlbeck
Copy link
Contributor Author

Reading the linked conversations, it seems like I have stumbled into a bit of a checkpoint behaviour minefield! I won't pretend to fully understand all the target use cases, but it sounds like you are trying to accomodate a lot of disparate things with this behaviour. Perhaps I can help most at this point by fully explaining our current use case.

We want to use dvc exp functionality with queue and parallel jobs to run hyperparameter tuning experiments on a remote AWS spot instance. We are looking at using CML/TPI to help us with this as well, and our aim is to set up easy deployment of long hyperparameter tuning jobs that overcome spot interruptions. The workflow we are hoping to achieve is:

  • Add functionality to training jobs to create DVC checkpoints routinely
  • make a script to --queue various experiments with different parameters, then --run-all --jobs. The script is run from a deployed spot instance
  • When the spot instance terminates, CML/TPI should be able to re-deploy a new spot instance. They appear to also sync intermediary data between instances, but I am not sure that this is useful in the case of multiple --temp experiments.
  • Instead, we need to pull the latest checkpoint for each job and resume from that experiment revision. This is possible within the same script that we are using to initiate multiple jobs by checking if there are existing experiments at the current git revision.

I think we are finding in particular that the combination of parallel exp jobs, checkpoint, and remote instance training is quite tricky!

@dberenbaum
Copy link
Collaborator

Yup, I think you are entering the danger zone @mattlbeck 😄 . We are working towards simplifying this workflow ourselves (cc @casperdcl).

Would you be interested in running each job on a different spot instance so that failures remain independent?

@mattlbeck
Copy link
Contributor Author

One job per instance sounds quite sensible, but I think I would have to deploy with TPI for each experiment I want to run, is that correct? It will be a fair bit more engineering to do it this way - the nice thing about the current workflow was we could let DVC worry about scheduling jobs correctly given a fixed number of parallel jobs.

@dberenbaum
Copy link
Collaborator

Makes sense. I was asking because we want to work towards integrating DVC and TPI to make it easier to deploy a TPI instance per DVC experiment. Hopefully we can make your workflow easier in the coming months, and hearing your issues will help!

  • When the spot instance terminates, CML/TPI should be able to re-deploy a new spot instance. They appear to also sync intermediary data between instances, but I am not sure that this is useful in the case of multiple --temp experiments.

It may be possible to sync the temp experiment data to avoid needing to push/pull checkpoints. @pmrowla @casperdcl What do you think?

dvc exp apply <rev>
dvc exp run --temp

Has this resolved your immediate issue?

@dberenbaum
Copy link
Collaborator

We can change the behavior so that exp run --temp --rev just resumes from --rev directly (which is what @mattlbeck is expecting), but this will mean that you cannot make any modifications when resuming with --rev (so we will have different behavior/feature support for workspace runs vs temp/queue runs)

Getting back to the actual issue, I think we should make this change. I hit a related issue today:

$ dvc exp run --rev 81de285 --queue
ERROR: Cannot resume from '81de285' as it is not derived from your current workspace.: Experiment derived from 'cee358c', expected 'e24a469'.

The whole point of --rev is to not have to apply it to your current workspace, isn't it? I don't think it matters if behavior is different from workspace runs since --rev isn't supported for workspace runs. Otherwise, we should deprecate --rev because I can't see any other use case for it.

@mattlbeck
Copy link
Contributor Author

dvc exp apply <rev>
dvc exp run --temp

Has this resolved your immediate issue?

It looks like it could be a solution but I will have to find some time to test it properly

@mattlbeck
Copy link
Contributor Author

Makes sense. I was asking because we want to work towards integrating DVC and TPI to make it easier to deploy a TPI instance per DVC experiment. Hopefully we can make your workflow easier in the coming months, and hearing your issues will help!

This sounds ideal for some workflows, but for others it will still be more convenient to run multiple experiments on the same instance e.g. due to account instance limits or to best optimize the CPU capacity of one instance when your indivicual jobs are not multiprocessing.

@aschuh-hf
Copy link

aschuh-hf commented Nov 24, 2022

@dberenbaum initially I thought this was a bug, but after looking into this a bit more I think the problem is that the expected resume behavior is still kind of ambiguous (this goes back to the old discussion in #5814).

The problem is that the current intended behavior is to allow for resuming an experiment with additional modifications (applied from the workspace). In workspace runs this is (mostly) intuitive, but when using --temp and --rev it is less obvious. What happens when these options are combined is:

  1. DVC git checkout's --rev
  2. DVC applies the user's current workspace state on top of that (as a diff)
  3. DVC then runs that experiment inside a --temp run

[...]

We can change the behavior so that exp run --temp --rev just resumes from --rev directly (which is what @mattlbeck is expecting), but this will mean that you cannot make any modifications when resuming with --rev (so we will have different behavior/feature support for workspace runs vs temp/queue runs)

Also related: #6104 (regarding whether or not checkpoints should be resumable at all)

Would it make sense to simply preserve the state of the specified --rev only for stage output files (outs, metrics, and plots) declared in dvc.yaml? These are certainly what should be from the previous revision / checkpoint from which one wants to resume. Code changes would still be fine to apply from the workspace given these are either undeclared in dvc.yaml stage or defined as deps. Maybe the outs would be limited to those which are considered a circular dependency of DVC checkpoints.

In either case, it seems the documentation may need to expand on how the --rev option works, because

resume an experiment from a specific checkpoint name or hash (commit) in --queue or --temp runs.

doesn't specify that the state of the current workspace would influence such experiment.

I think documentation of --rev for other commands such as dvc exp remove is more clear regarding the meaning of this parameter. Similar for dvc exp show. Namely the mention of

experiments derived from the specified < commit > as baseline

@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Feb 17, 2023
@dberenbaum dberenbaum added p3-nice-to-have It should be done this or next sprint and removed p1-important Important, aka current backlog of things to do labels Mar 31, 2023
@dberenbaum
Copy link
Collaborator

--rev is no longer an available option, so closing this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp bug Did we break something? p3-nice-to-have It should be done this or next sprint
Projects
None yet
Development

No branches or pull requests

5 participants