Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tune] Tuner.restore from a local directory that has moved #29920

Merged
merged 34 commits into from
Nov 18, 2022

Conversation

justinvyu
Copy link
Contributor

@justinvyu justinvyu commented Nov 1, 2022

Why are these changes needed?

Problem

Upon restore, parts of the Tuner state are not updated, including local_dir, experiment name, and when trials get recreated from the TrialRunner experiment checkpoint, the paths stored inside the the trial checkpoints are relative to the old experiment directory. See the issues linked below for more context.

Solution

  1. For local checkpointing, Tuner.restore(new_experiment_path) now properly sets the new local_dir and experiment name for restored trials based on the path that is passed in.
    • Example: If the experiment directory moves from /original/path/exp_name/ to /moved/path/new_exp_name/, /moved/path is the new local_dir, and new_exp_name is the new experiment name.
  2. Trial restoration now sets the checkpoints stored in the trial CheckpointManager to the correct paths based on what's found in the trial dir.
  3. Other trial state such as pickled_error_file and error_file are now stored as relative paths with respect to the trial logdir. (They used to be saved/loaded as absolute paths.)
  4. Added tests for this workflow (both through Tuner.restore and tune.run(resume=...)). Also, added a test that shows loading a ResultGrid via Tuner.restore().get_results() and accessing checkpoints works even if the directory has moved.

Notes for future

This PR changes TrialRunner.resume functionality to reconstruct the trial logdir as a subdirectory of the local_checkpoint_dir, which is the experiment directory. The checkpoint paths are then updated to be relative to the trial logdir. In the future, unit tests that restore trial checkpoints should set:

experiment_dir = ... # some temp dir
trial = Trial(local_dir=experiment_dir)
trial_runner = TrialRunner(local_checkpoint_dir=experiment_dir)

This matches the experiment directory structure that a normal Tune run assumes.

experiment-directory/
    experiment-checkpoint.json
    trial-directory/
        checkpoint_00000/

Related issue number

Closes #28082
Closes #28621

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Sorry, something went wrong.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…new local_checkpoint_dir

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…n trial logdir

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…ets overidden anyways)

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Fix old logdir path

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Fix style

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…efore)

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Small readability fix

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Lint

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Remove unused tempfile

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Legacy test and use Trial stub

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far!

python/ray/tune/execution/trial_runner.py Outdated Show resolved Hide resolved
python/ray/tune/execution/trial_runner.py Outdated Show resolved Hide resolved
python/ray/tune/experiment/trial.py Outdated Show resolved Hide resolved
python/ray/tune/experiment/trial.py Outdated Show resolved Hide resolved
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Change assertion to check for str instead

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Lint

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Copy link
Member

@Yard1 Yard1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome!

python/ray/tune/execution/checkpoint_manager.py Outdated Show resolved Hide resolved
python/ray/tune/tests/test_tuner_restore.py Outdated Show resolved Hide resolved
python/ray/tune/tests/test_tuner_restore.py Show resolved Hide resolved
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
trial = Trial("__fake", checkpoint_config=CheckpointConfig(num_to_keep=2))
trial = Trial(
"__fake",
local_dir=tempdir,
Copy link
Contributor Author

@justinvyu justinvyu Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed to set local_dir to be the same as TrialRunner._local_checkpoint_dir for these tests to work. This is because TrialRunner.resume will set the trial checkpoint paths to be relative to its _local_checkpoint_dir and assumes a directory structure of:

<experiment logdir> == _local_checkpoint_dir
    | experiment-checkpoint.json
    | <trial.logdir>
        | checkpoint_00000

Previously, the trial checkpoints and experiment checkpoints in the test were being saved in two separate places. On resume, the trial checkpoint dirs get updated and point at the wrong location.

<experiment logdir> == _local_checkpoint_dir
    | experiment-checkpoint.json

<default trial logdir>
    | checkpoint_00000

Is this trial directory structure okay to assume? Wondering if it will be too easy for future tests to make this mistake.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to hardcode it provided there's a comment explaining it.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
…eckpoint_dir

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>

Improve comments

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Copy link
Member

@Yard1 Yard1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's wait for CI to pass

@amogkam
Copy link
Contributor

amogkam commented Nov 17, 2022

@justinvyu can you merge master again please

@amogkam amogkam merged commit 176cee5 into ray-project:master Nov 18, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…ject#29920)

Upon restore, parts of the Tuner state are not updated, including local_dir, experiment name, and when trials get recreated from the TrialRunner experiment checkpoint, the paths stored inside the the trial checkpoints are relative to the old experiment directory.

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@justinvyu justinvyu deleted the tune_relative_paths branch August 8, 2023 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants