[Tune] `Tuner.restore` from a local directory that has moved #29920
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…new local_checkpoint_dir Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…n trial logdir Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…ets overidden anyways) Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Fix old logdir path Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Fix style Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…efore) Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Small readability fix Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Lint Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Remove unused tempfile Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Legacy test and use Trial stub Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Looks great so far!
Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Change assertion to check for str instead Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Lint Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Looks awesome!
trial = Trial("__fake", checkpoint_config=CheckpointConfig(num_to_keep=2))
trial = Trial(
    "__fake",
    local_dir=tempdir,
I needed to set `local_dir` to be the same as `TrialRunner._local_checkpoint_dir` for these tests to work. This is because `TrialRunner.resume` will set the trial checkpoint paths to be relative to its `_local_checkpoint_dir` and assumes a directory structure of:

    <experiment logdir> == _local_checkpoint_dir
    | experiment-checkpoint.json
    | <trial.logdir>
        | checkpoint_00000

Previously, the trial checkpoints and experiment checkpoints in the test were being saved in two separate places. On resume, the trial checkpoint dirs get updated and point at the wrong location.

    <experiment logdir> == _local_checkpoint_dir
    | experiment-checkpoint.json
    <default trial logdir>
    | checkpoint_00000

Is this trial directory structure okay to assume? Wondering if it will be too easy for future tests to make this mistake.
I think it's fine to hardcode it provided there's a comment explaining it.
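To make the assumed layout concrete, here is a minimal stdlib-only sketch of why a trial logdir outside the experiment directory breaks on resume. It does not call Ray; `rebase_checkpoint` and the directory names `experiment`, `trial_0`, `trial_1`, and `default_trials` are hypothetical and only mimic the structure described above. The checkpoint path is rebuilt under the experiment directory, so it only exists when the trial logdir was nested there in the first place.

```python
import tempfile
from pathlib import Path


def rebase_checkpoint(experiment_dir: Path, old_checkpoint: Path) -> Path:
    """Hypothetical helper: rebuild a checkpoint path under experiment_dir.

    Keeps only the last two components (<trial logdir name>/<checkpoint dir>),
    mirroring the layout <experiment dir>/<trial logdir>/checkpoint_00000.
    """
    trial_name, checkpoint_name = old_checkpoint.parts[-2:]
    return experiment_dir / trial_name / checkpoint_name


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    experiment_dir = root / "experiment"
    experiment_dir.mkdir()
    (experiment_dir / "experiment-checkpoint.json").write_text("{}")

    # Layout the tests use now: trial logdir nested in the experiment dir.
    nested = experiment_dir / "trial_0" / "checkpoint_00000"
    nested.mkdir(parents=True)

    # Layout the tests used before: trial logdir saved somewhere else.
    separate = root / "default_trials" / "trial_1" / "checkpoint_00000"
    separate.mkdir(parents=True)

    # Rebased paths only resolve when the checkpoint already lives
    # under the experiment directory.
    nested_ok = rebase_checkpoint(experiment_dir, nested).exists()
    separate_ok = rebase_checkpoint(experiment_dir, separate).exists()

print(nested_ok, separate_ok)
```

In this sketch the nested layout survives the rebase and the separate one does not, which is the failure mode a comment in the test should call out.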
…eckpoint_dir Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Improve comments Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
LGTM, let's wait for CI to pass
@justinvyu can you merge master again please
…_relative_paths Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…ject#29920) Upon restore, parts of the Tuner state are not updated, including local_dir and the experiment name. In addition, when trials get recreated from the TrialRunner experiment checkpoint, the paths stored inside the trial checkpoints are relative to the old experiment directory. Signed-off-by: Justin Yu <justinvyu@berkeley.edu> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Why are these changes needed?
Problem
Upon restore, parts of the Tuner state are not updated, including `local_dir` and the experiment `name`. In addition, when trials get recreated from the `TrialRunner` experiment checkpoint, the paths stored inside the trial checkpoints are relative to the old experiment directory. See the issues linked below for more context.
Solution
- `Tuner.restore(new_experiment_path)` now properly sets the new `local_dir` and experiment `name` for restored trials based on the path that is passed in.
  - For example, when moving `/original/path/exp_name/` to `/moved/path/new_exp_name/`, `/moved/path` is the new `local_dir`, and `new_exp_name` is the new experiment `name`.
- […] `CheckpointManager` to the correct paths based on what's found in the trial dir.
- `pickled_error_file` and `error_file` are now stored as relative paths with respect to the trial logdir. (They used to be saved/loaded as absolute paths.)
- […] `Tuner.restore` and `tune.run(resume=...)`). Also, added a test that shows loading a `ResultGrid` via `Tuner.restore().get_results()` and accessing checkpoints works even if the directory has moved.
Notes for future
This PR changes `TrialRunner.resume` functionality to reconstruct the trial logdir as a subdirectory of the `local_checkpoint_dir`, which is the experiment directory. The checkpoint paths are then updated to be relative to the trial logdir. In the future, unit tests that restore trial checkpoints should set the trial's `local_dir` to the same directory as the `TrialRunner`'s `local_checkpoint_dir`.
This matches the experiment directory structure that a normal Tune run assumes.
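As a rough illustration of the path handling described in this PR (not the actual Ray implementation), the sketch below splits a moved experiment path into `local_dir` and `name`, then resolves a file stored relative to the trial logdir. The helpers `split_experiment_path` and `resolve_in_logdir`, the trial directory name `trial_0`, and the file name `error.txt` are assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import Tuple


def split_experiment_path(experiment_path: str) -> Tuple[str, str]:
    """Hypothetical helper: return (local_dir, name) for an experiment dir."""
    path = PurePosixPath(experiment_path)
    return str(path.parent), path.name


def resolve_in_logdir(trial_logdir: str, relative_file: str) -> str:
    """Hypothetical helper: resolve a file stored relative to the trial logdir."""
    return str(PurePosixPath(trial_logdir) / relative_file)


# The experiment was moved from /original/path/exp_name/ to this location.
local_dir, name = split_experiment_path("/moved/path/new_exp_name/")

# The trial logdir is reconstructed as a subdirectory of the experiment dir.
# "trial_0" is a made-up trial directory name.
trial_logdir = str(PurePosixPath(local_dir) / name / "trial_0")

# Because the error file is stored relative to the trial logdir, it follows
# the experiment to its new location. "error.txt" is illustrative.
error_file = resolve_in_logdir(trial_logdir, "error.txt")

print(local_dir, name, error_file)
```

Storing only the relative part means nothing inside the trial state has to be rewritten when the experiment directory moves; only the root needs to be supplied at restore time.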
Related issue number
Closes #28082
Closes #28621
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.