exp: Unable to resume checkpoint after using dvc exp run --queue #8622

Closed
aschuh-hf opened this issue Nov 24, 2022 · 2 comments · Fixed by #9271
Labels
A: experiments (Related to dvc exp) · bug (Did we break something?) · p2-medium (Medium priority, should be done, but less important)

Comments

@aschuh-hf

Bug Report

Description

I had run an experiment using dvc exp run --queue followed by dvc queue start --jobs 4. The experiment was interrupted with SIGINT and wrote a final checkpoint. Resuming from this final checkpoint works, but PyTorch Lightning then seems to terminate training immediately again, presumably because the SIGINT interruption state is saved to this final checkpoint (the pl.Trainer.interrupted attribute). I then wanted to try to resume from an earlier checkpoint, written before training was interrupted by SIGINT. This, however, fails with the following error message:

ERROR: Cannot resume from 'fdddf49' as it is not derived from your current workspace.: Experiment derived from 'celeryf', expected 'd8060e5'.

Note that I haven't committed any changes to my workspace since running the first experiment. The commands dvc exp list and dvc exp show display the experiments without any --all-tags or --all-commits flags.

The issue seems to be that the base for the experiment is celeryf rather than my workspace HEAD.

Reproduce
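
A sketch of the sequence, reconstructed from the description above (the revision fdddf49 and the error output are taken from the report; the rest is illustrative):

$ dvc exp run --queue                  # queue a checkpoint experiment
$ dvc queue start --jobs 4             # run queued experiments in temp workspaces
$ # interrupt the running experiment with SIGINT (Ctrl-C); a final checkpoint is written
$ dvc exp run --rev fdddf49            # try to resume from an earlier checkpoint
ERROR: Cannot resume from 'fdddf49' as it is not derived from your current workspace.: Experiment derived from 'celeryf', expected 'd8060e5'.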

Expected

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.31.0 (rpm)
---------------------------------
Platform: Python 3.8.3 on Linux-3.10.0-1160.15.2.el7.x86_64-x86_64-with-glibc2.14
Subprojects:

Supports:
        azure (adlfs = None, knack = 0.10.0, azure-identity = 1.11.0),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = None),
        hdfs (fsspec = None, pyarrow = 9.0.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = None, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7),
        webhdfs (fsspec = None)
Cache types: hardlink, symlink
Cache directory: xfs on /dev/md124
Caches: local, s3
Remotes: s3, s3
Workspace directory: xfs on /dev/md124
Repo: dvc (subdir), git

Additional Information (if any):

karajan1001 self-assigned this Nov 25, 2022
karajan1001 added the A: experiments (Related to dvc exp) and bug labels Nov 25, 2022
karajan1001 added this to DVC Nov 25, 2022
karajan1001 moved this to Backlog in DVC Nov 25, 2022
@aschuh-hf
Author

Though not the actual issue reported, in the description of events above I noted:

The experiment was interrupted using SIGINT and wrote a final checkpoint. Resuming from this final checkpoint works, but I seem to have an issue with PyTorch Lightning terminating training immediately again, presumably because the SIGINT interruption state is saved to this final checkpoint (pl.Trainer.interrupted attribute).

The reason training did not proceed was that I was unaware DVC applies the changes in my workspace (when using the --rev option of dvc exp run) after the checkpoint has been restored in the temp folder. That meant the PyTorch Lightning checkpoint file loaded by the training script was different from the expected .ckpt corresponding to the DVC checkpoint: the .ckpt file in my workspace was from an older, completed training run (i.e., the maximum number of epochs had already been reached).
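
A hedged illustration of the mismatch (paths and file names are hypothetical; where the temp workspace lives depends on the setup):

$ # main workspace: a stale Lightning checkpoint left over from an earlier, completed run
$ ls checkpoints/
last.ckpt                              # epoch count already at max_epochs
$ # `dvc exp run --rev <checkpoint>` restores the experiment into a temp workspace, but
$ # then applies the workspace changes on top, so the training script loads this stale
$ # file instead of the .ckpt belonging to the DVC checkpoint and exits immediately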

Nonetheless, the actual reported issue, from which this background information may only distract, remains:

Experiment derived from 'celeryf', expected 'd8060e5'.

@dberenbaum
Collaborator

Related to #7813. I think it should be fixed if we adopt the suggested approach there to resume directly from the specified rev without trying to apply the current workspace. This seems more useful to me than spending time patching the current approach.
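
Under that approach, the same resume invocation would presumably continue from the specified checkpoint as-is, without reapplying the workspace on top of it:

$ dvc exp run --rev fdddf49            # resume directly from the checkpoint's state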

karajan1001 moved this from Backlog to In Progress in DVC Nov 30, 2022
skshetry added the p1-important (Important, aka current backlog of things to do) label and removed the p0-critical label Dec 9, 2022
omesser added the bug (Did we break something?) label and removed the previous bug label Dec 14, 2022
karajan1001 moved this from In Progress to Review In Progress in DVC Dec 16, 2022
karajan1001 moved this from Review In Progress to Backlog in DVC Jan 17, 2023
dberenbaum added the p2-medium (Medium priority, should be done, but less important) label and removed the p1-important label Feb 6, 2023
daavoo removed this from DVC Mar 23, 2023