exp: Unable to resume checkpoint after using dvc exp run --queue #8622

Closed
aschuh-hf opened this issue Nov 24, 2022 · 2 comments · Fixed by #9271
Labels
A: experiments (Related to dvc exp) · bug (Did we break something?) · p2-medium (Medium priority, should be done, but less important)

Comments

@aschuh-hf

Bug Report

Description

I had run an experiment using dvc exp run --queue followed by dvc queue start --jobs 4. The experiment was interrupted with SIGINT and wrote a final checkpoint. Resuming from this final checkpoint works, but PyTorch Lightning then seems to terminate training immediately again, presumably because the SIGINT interruption state is saved to this final checkpoint (the pl.Trainer.interrupted attribute). I then wanted to try to resume from an earlier checkpoint, written before training was interrupted by SIGINT. This, however, fails with the following error message:

ERROR: Cannot resume from 'fdddf49' as it is not derived from your current workspace.: Experiment derived from 'celeryf', expected 'd8060e5'.

Note that I haven't committed any changes to my workspace since running the first experiment. The commands dvc exp list and dvc exp show display the experiments without any --all-tags or --all-commits flags.

The issue seems to be that the base for the experiment is celeryf rather than my workspace HEAD.

Reproduce
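
A sketch of the sequence, reconstructed from the description above (the revision fdddf49 and the error output are taken from the report; the rest is illustrative):

$ dvc exp run --queue                  # queue a checkpoint experiment
$ dvc queue start --jobs 4             # run queued experiments in temp workspaces
$ # interrupt the running experiment with SIGINT (Ctrl-C); a final checkpoint is written
$ dvc exp run --rev fdddf49            # try to resume from an earlier checkpoint
ERROR: Cannot resume from 'fdddf49' as it is not derived from your current workspace.: Experiment derived from 'celeryf', expected 'd8060e5'.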

Expected

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.31.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:

Supports:
        azure (adlfs = None, knack = 0.10.0, azure-identity = 1.11.0),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = None),
        hdfs (fsspec = None, pyarrow = 9.0.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = None, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7),
        webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git

Additional Information (if any):

karajan1001 self-assigned this Nov 25, 2022
karajan1001 added the A: experiments (Related to dvc exp) and bug labels Nov 25, 2022
karajan1001 added this to DVC Nov 25, 2022
karajan1001 moved this to Backlog in DVC Nov 25, 2022
@aschuh-hf
Author

Though not the actual issue reported, in the description of events above I noted:

The experiment was interrupted using SIGINT and wrote a final checkpoint. Resuming from this final checkpoint works, but I seem to have an issue with PyTorch Lightning terminating training immediately again, presumably because the SIGINT interruption state is saved to this final checkpoint (pl.Trainer.interrupted attribute).

The reason training did not proceed was that I was unaware DVC applies the changes in my workspace (when using the --rev option of dvc exp run) after the checkpoint has been restored in the temp folder. That meant the PyTorch Lightning checkpoint file loaded by the training script was different from the expected .ckpt corresponding to the DVC checkpoint: the .ckpt file in my workspace was from an older, completed training run (i.e., the maximum number of epochs had already been reached).
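
A hedged illustration of the mismatch (paths and file names are hypothetical; where the temp workspace lives depends on the setup):

$ # main workspace: a stale Lightning checkpoint left over from an earlier, completed run
$ ls checkpoints/
last.ckpt                              # epoch count already at max_epochs
$ # `dvc exp run --rev <checkpoint>` restores the experiment into a temp workspace, but
$ # then applies the workspace changes on top, so the training script loads this stale
$ # file instead of the .ckpt belonging to the DVC checkpoint and exits immediately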

Nonetheless, the actual reported issue, from which this background information may only distract, remains:

Experiment derived from 'celeryf', expected 'd8060e5'.

@dberenbaum
Collaborator

Related to #7813. I think it should be fixed if we adopt the suggested approach there to resume directly from the specified rev without trying to apply the current workspace. This seems more useful to me than spending time patching the current approach.
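
Under that approach, the same resume invocation would presumably continue from the specified checkpoint as-is, without reapplying the workspace on top of it:

$ dvc exp run --rev fdddf49            # resume directly from the checkpoint's state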

karajan1001 moved this from Backlog to In Progress in DVC Nov 30, 2022
skshetry added the p1-important (Important, aka current backlog of things to do) label and removed the p0-critical label Dec 9, 2022
omesser added the bug (Did we break something?) label and removed the previous bug label Dec 14, 2022
karajan1001 moved this from In Progress to Review In Progress in DVC Dec 16, 2022
karajan1001 moved this from Review In Progress to Backlog in DVC Jan 17, 2023
dberenbaum added the p2-medium (Medium priority, should be done, but less important) label and removed the p1-important label Feb 6, 2023
daavoo removed this from DVC Mar 23, 2023