`exp run`: `--temp` `--rev` does not properly resume from the target revision #7813

mattlbeck · 2022-05-26T11:13:51Z

Bug Report

Description

The command appears to use whatever is checked out in the workspace to resume from, despite providing a different --rev

Reproduce

Clone the worked example repository and run the reprduction script

git clone https://github.com/mattlbeck/dvc-exp-resume-checkpoint-issue
cd dvc-exp-resume-checkpoint-issue
pip install -r requirements.txt
bash ./reproduce_issue.sh

Expected

On the second experiment, it should try to resume counting from 3 and exit immediately. Observed behaviour is that is just starts from 0 again.

Environment information

DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.10.4 on macOS-12.3.1-x86_64-i386-64bit
Supports:
        webhdfs (fsspec = 2022.2.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.2.0, boto3 = 1.20.24)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git

The text was updated successfully, but these errors were encountered:

dberenbaum · 2022-06-06T20:06:38Z

I'm able to confirm. @pmrowla @karajan1001 Any thoughts?

pmrowla · 2022-06-07T03:42:25Z

@dberenbaum initially I thought this was a bug, but after looking into this a bit more I think the problem is that the expected resume behavior is still kind of ambiguous (this goes back to the old discussion in #5814).

The problem is that the current intended behavior is to allow for resuming an experiment with additional modifications (applied from the workspace). In workspace runs this is (mostly) intuitive, but when using --temp and --rev it is less obvious. What happens when these options are combined is:

DVC git checkout's --rev
DVC applies the user's current workspace state on top of that (as a diff)
DVC then runs that experiment inside a --temp run

So basically DVC runs the current workspace (and extends the checkpoint experiment specified via --rev)

So to resume an experiment without modification at all, the DVC expected workflow here is actually

# run any experiment at all (whether via `--temp` or in the workspace)
git reset --hard # (ensure clean workspace)
dvc exp apply $(dvc exp list --names-only)
dvc exp run --temp --rev $(dvc exp list --names-only)

(which is clearly not very intuitive)

We can change the behavior so that exp run --temp --rev just resumes from --rev directly (which is what @mattlbeck is expecting), but this will mean that you cannot make any modifications when resuming with --rev (so we will have different behavior/feature support for workspace runs vs temp/queue runs)

Also related: #6104 (regarding whether or not checkpoints should be resumable at all)

mattlbeck · 2022-06-07T09:28:02Z

Thanks for the response! @pmrowla if the workspace is always applied even to --temp runs, it's not clear to me why your workaround would be better over the simpler:

dvc exp apply <rev>
dvc exp run --temp

mattlbeck · 2022-06-07T09:39:20Z

Reading the linked conversations, it seems like I have stumbled into a bit of a checkpoint behaviour minefield! I won't pretend to fully understand all the target use cases, but it sounds like you are trying to accomodate a lot of disparate things with this behaviour. Perhaps I can help most at this point by fully explaining our current use case.

We want to use dvc exp functionality with queue and parallel jobs to run hyperparameter tuning experiments on a remote AWS spot instance. We are looking at using CML/TPI to help us with this as well, and our aim is to set up easy deployment of long hyperparameter tuning jobs that overcome spot interruptions. The workflow we are hoping to achieve is:

Add functionality to training jobs to create DVC checkpoints routinely
make a script to --queue various experiments with different parameters, then --run-all --jobs. The script is run from a deployed spot instance
When the spot instance terminates, CML/TPI should be able to re-deploy a new spot instance. They appear to also sync intermediary data between instances, but I am not sure that this is useful in the case of multiple --temp experiments.
Instead, we need to pull the latest checkpoint for each job and resume from that experiment revision. This is possible within the same script that we are using to initiate multiple jobs by checking if there are existing experiments at the current git revision.

I think we are finding in particular that the combination of parallel exp jobs, checkpoint, and remote instance training is quite tricky!

dberenbaum · 2022-06-07T19:18:30Z

Yup, I think you are entering the danger zone @mattlbeck 😄 . We are working towards simplifying this workflow ourselves (cc @casperdcl).

Would you be interested in running each job on a different spot instance so that failures remain independent?

mattlbeck · 2022-06-08T09:35:56Z

One job per instance sounds quite sensible, but I think I would have to deploy with TPI for each experiment I want to run, is that correct? It will be a fair bit more engineering to do it this way - the nice thing about the current workflow was we could let DVC worry about scheduling jobs correctly given a fixed number of parallel jobs.

dberenbaum · 2022-06-08T12:49:10Z

Makes sense. I was asking because we want to work towards integrating DVC and TPI to make it easier to deploy a TPI instance per DVC experiment. Hopefully we can make your workflow easier in the coming months, and hearing your issues will help!

When the spot instance terminates, CML/TPI should be able to re-deploy a new spot instance. They appear to also sync intermediary data between instances, but I am not sure that this is useful in the case of multiple --temp experiments.

It may be possible to sync the temp experiment data to avoid needing to push/pull checkpoints. @pmrowla @casperdcl What do you think?

dvc exp apply <rev>
dvc exp run --temp

Has this resolved your immediate issue?

dberenbaum · 2022-06-08T19:24:46Z

We can change the behavior so that exp run --temp --rev just resumes from --rev directly (which is what @mattlbeck is expecting), but this will mean that you cannot make any modifications when resuming with --rev (so we will have different behavior/feature support for workspace runs vs temp/queue runs)

Getting back to the actual issue, I think we should make this change. I hit a related issue today:

$ dvc exp run --rev 81de285 --queue
ERROR: Cannot resume from '81de285' as it is not derived from your current workspace.: Experiment derived from 'cee358c', expected 'e24a469'.

The whole point of --rev is to not have to apply it to your current workspace, isn't it? I don't think it matters if behavior is different from workspace runs since --rev isn't supported for workspace runs. Otherwise, we should deprecate --rev because I can't see any other use case for it.

mattlbeck · 2022-06-09T09:24:28Z

dvc exp apply <rev>
dvc exp run --temp
Has this resolved your immediate issue?

It looks like it could be a solution but I will have to find some time to test it properly

mattlbeck · 2022-06-09T09:28:59Z

Makes sense. I was asking because we want to work towards integrating DVC and TPI to make it easier to deploy a TPI instance per DVC experiment. Hopefully we can make your workflow easier in the coming months, and hearing your issues will help!

This sounds ideal for some workflows, but for others it will still be more convenient to run multiple experiments on the same instance e.g. due to account instance limits or to best optimize the CPU capacity of one instance when your indivicual jobs are not multiprocessing.

aschuh-hf · 2022-11-24T14:23:24Z

@dberenbaum initially I thought this was a bug, but after looking into this a bit more I think the problem is that the expected resume behavior is still kind of ambiguous (this goes back to the old discussion in #5814).

The problem is that the current intended behavior is to allow for resuming an experiment with additional modifications (applied from the workspace). In workspace runs this is (mostly) intuitive, but when using --temp and --rev it is less obvious. What happens when these options are combined is:

DVC git checkout's --rev

DVC applies the user's current workspace state on top of that (as a diff)

DVC then runs that experiment inside a --temp run

[...]

We can change the behavior so that exp run --temp --rev just resumes from --rev directly (which is what @mattlbeck is expecting), but this will mean that you cannot make any modifications when resuming with --rev (so we will have different behavior/feature support for workspace runs vs temp/queue runs)

Also related: #6104 (regarding whether or not checkpoints should be resumable at all)

Would it make sense to simply preserve the state of the specified --rev only for stage output files (outs, metrics, and plots) declared in dvc.yaml? These are certainly what should be from the previous revision / checkpoint from which one wants to resume. Code changes would still be fine to apply from the workspace given these are either undeclared in dvc.yaml stage or defined as deps. Maybe the outs would be limited to those which are considered a circular dependency of DVC checkpoints.

In either case, it seems the documentation may need to expand on how the --rev option works, because

resume an experiment from a specific checkpoint name or hash (commit) in --queue or --temp runs.

doesn't specify that the state of the current workspace would influence such experiment.

I think documentation of --rev for other commands such as dvc exp remove is more clear regarding the meaning of this parameter. Similar for dvc exp show. Namely the mention of

experiments derived from the specified < commit > as baseline

dberenbaum · 2023-08-14T12:30:31Z

--rev is no longer an available option, so closing this one

mattlbeck changed the title ~~exp run --temp --rev: does not properly resume from the target revision~~ exp run: --temp --rev does not properly resume from the target revision May 26, 2022

pared added A: experiments Related to dvc exp bug Did we break something? labels May 26, 2022

dberenbaum mentioned this issue Nov 29, 2022

exp: Unable to resume checkpoint after using dvc exp run --queue #8622

Closed

dberenbaum mentioned this issue Feb 14, 2023

Queue: wrap up e2e workflow iterative/vscode-dvc#3178

Closed

dberenbaum added the p1-important Important, aka current backlog of things to do label Feb 17, 2023

dberenbaum added p3-nice-to-have It should be done this or next sprint and removed p1-important Important, aka current backlog of things to do labels Mar 31, 2023

dberenbaum closed this as completed Aug 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`exp run`: `--temp` `--rev` does not properly resume from the target revision #7813

`exp run`: `--temp` `--rev` does not properly resume from the target revision #7813

mattlbeck commented May 26, 2022

dberenbaum commented Jun 6, 2022

pmrowla commented Jun 7, 2022

mattlbeck commented Jun 7, 2022

mattlbeck commented Jun 7, 2022

dberenbaum commented Jun 7, 2022

mattlbeck commented Jun 8, 2022

dberenbaum commented Jun 8, 2022

dberenbaum commented Jun 8, 2022

mattlbeck commented Jun 9, 2022

mattlbeck commented Jun 9, 2022

aschuh-hf commented Nov 24, 2022 •

edited

Loading

dberenbaum commented Aug 14, 2023

exp run: --temp --rev does not properly resume from the target revision #7813

exp run: --temp --rev does not properly resume from the target revision #7813

Comments

mattlbeck commented May 26, 2022

Bug Report

Description

Reproduce

Expected

Environment information

dberenbaum commented Jun 6, 2022

pmrowla commented Jun 7, 2022

mattlbeck commented Jun 7, 2022

mattlbeck commented Jun 7, 2022

dberenbaum commented Jun 7, 2022

mattlbeck commented Jun 8, 2022

dberenbaum commented Jun 8, 2022

dberenbaum commented Jun 8, 2022

mattlbeck commented Jun 9, 2022

mattlbeck commented Jun 9, 2022

aschuh-hf commented Nov 24, 2022 • edited Loading

dberenbaum commented Aug 14, 2023

`exp run`: `--temp` `--rev` does not properly resume from the target revision #7813

`exp run`: `--temp` `--rev` does not properly resume from the target revision #7813

aschuh-hf commented Nov 24, 2022 •

edited

Loading