exp: Unable to resume checkpoint after using dvc exp run --queue
#8622
Comments
Though not the actual issue reported, in the description of events I note:
The reason training did not proceed was that I was unaware that DVC would apply the changes in my workspace when using the
Nonetheless, the actual reported issue, from which this background information may only distract, remains:
Related to #7813. I think it should be fixed if we adopt the suggested approach there to resume directly from the specified rev without trying to apply the current workspace. This seems more useful to me than spending time patching the current approach.
Bug Report
Description
I ran an experiment using `dvc exp run --queue` followed by `dvc queue start --jobs 4`. The experiment was interrupted using SIGINT and wrote a final checkpoint. Resuming from this final checkpoint works, but I seem to have an issue with PyTorch Lightning terminating training again immediately, presumably because the SIGINT interruption state (the `pl.Trainer.interrupted` attribute) is saved to this final checkpoint. I then tried to resume from an earlier checkpoint, written before training was interrupted by SIGINT. This, however, fails with the following error message:
Note that I haven't committed any changes to my workspace since running the first experiment. The commands `dvc exp list` and `dvc exp show` display the experiments without any `--all-tags` or `--all-commits` flags.
The issue seems to be that the base for the experiment is `celeryf` rather than my workspace HEAD.
Reproduce
Expected
Environment information
Output of `dvc doctor`:
Additional Information (if any):