Make temp-dir-executed checkpoint experiments return results to the workspace #8668
Conversation
except Exception:  # pylint: disable=broad-except
    logger.exception(
        "Error running '%s' task, '%s' will be aborted",
        task.name,
        task.stage,
    )
    Monitor.kill(task.proc)
    task.killed.set()
Previously, task.killed would only be set if the checkpoint (not the actual training process) raised an exception. The new task.update is set once at least one checkpoint has been committed; that is the condition we need in order to collect the checkpoint result even if it didn't finish all of the iterations.
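To make the interaction concrete, here is a small hypothetical sketch (not DVC's actual task class) of how the two flags could gate result collection; the _Task container and should_collect helper are illustrative names only:

import threading


class _Task:
    """Hypothetical stand-in for the queue task object in the diff above."""

    def __init__(self, name, stage, proc):
        self.name = name
        self.stage = stage
        self.proc = proc
        self.killed = threading.Event()  # set when the task process is killed
        self.update = threading.Event()  # set once a checkpoint has been committed


def should_collect(task):
    # Collect results on a clean finish, or when at least one checkpoint
    # was committed before the task was killed.
    return (not task.killed.is_set()) or task.update.is_set()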
Codecov Report
Base: 93.52% // Head: 93.52% // No change to project coverage 👍

Additional details and impacted files

@@            Coverage Diff            @@
##             main    #8668   +/-   ##
=======================================
  Coverage   93.52%   93.52%
=======================================
  Files         457      457
  Lines       36139    36139
  Branches     5229     5229
=======================================
  Hits        33800    33800
  Misses       1836     1836
  Partials      503      503

☔ View full report at Codecov.
dvc/repo/reproduce.py (Outdated)
if notrun:
    logger.warning(
        "Some of the stages '%s' were not processed because "
        "something wrong occurred in the previous stages",
        ",".join([stage.addressing for stage in notrun]),
    )
I don't think we actually need to log this, given that we don't log the additional un-run stages when a normal error occurs for any stage.
To follow up on this: if we decide we would like to log the skipped stages, it would be cleaner to just adjust the loop to use
for i, stage in enumerate(steps):
    try:
        ...  # run the stage as before
    except CheckpointKilledError:
        ...
        logger.warning(
            "skipped stages '%s'",
            ", ".join(s.addressing for s in steps[i + 1:]),
        )
        break
rather than creating the additional list of unrun stages and continuing the loop
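For contrast, a rough sketch of the list-based approach being discussed (my reading of the diff, with run_stage and a stubbed exception as placeholders, not the exact PR code):

import logging

logger = logging.getLogger(__name__)


class CheckpointKilledError(Exception):
    """Stub standing in for dvc's exception, only for this sketch."""


def reproduce_all(steps, run_stage):
    # Keep iterating after a checkpoint is killed; every stage after the
    # interrupted one goes into `notrun`, and we warn once at the end.
    notrun = []
    result = []
    interrupted = False
    for stage in steps:
        if interrupted:
            notrun.append(stage)
            continue
        try:
            result.append(run_stage(stage))
        except CheckpointKilledError:
            interrupted = True
    if notrun:
        logger.warning(
            "Some of the stages '%s' were not processed because "
            "something wrong occurred in the previous stages",
            ",".join(stage.addressing for stage in notrun),
        )
    return result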
@karajan1001 I'm actually not sure that the checkpoint handling behavior belongs in …
@karajan1001 Maybe there is someone else in @iterative/dvc who can review while @pmrowla is out so we can get it merged?
LGTM!
One problem remained for the …
fix: iterative#8612

For checkpoint experiments, in some cases users might want to stop them early to reduce variance. But currently, if we interrupt/kill the experiment it is marked as failed, and all of the completed checkpoints are removed as we clean up the running directory directly after the process fails.

1. Raise CheckpointKilledError instead of StageCmdFailedError if at least one checkpoint has been committed.
2. The temp-dir executor will continue collecting data if the checkpoint stage was interrupted.
3. Raise a warning if a checkpoint stage was incomplete and the following stages were not run.
4. Add a new functional test for this.
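A minimal sketch of point 1 above, not the actual DVC implementation: the import path, constructor arguments, and the checkpoint_committed flag are assumptions used only to illustrate choosing the exception type based on whether a checkpoint was committed before the process died.

from dvc.stage.exceptions import StageCmdFailedError  # assumed import path


class CheckpointKilledError(StageCmdFailedError):
    """Checkpoint stage killed after committing at least one checkpoint."""


def raise_for_failed_checkpoint_stage(cmd, retcode, checkpoint_committed):
    # If a checkpoint was committed, raise the more specific error so that
    # callers can still collect the partial results instead of discarding them.
    if checkpoint_committed:
        raise CheckpointKilledError(cmd, retcode)
    raise StageCmdFailedError(cmd, retcode)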
@karajan1001 Is this still an issue?
This behavior is different from …
fix: #8612
For checkpoint experiments, in some cases users might want to stop them early to reduce variance. But currently, if we interrupt/kill the experiment it is marked as failed, and all of the completed checkpoints are removed as we clean up the running directory directly after the process fails.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏