
[AIR] Whether or not to change the working dir should be configurable when using Ray Train #29787

Closed
justinvyu opened this issue Oct 27, 2022 · 5 comments
Assignees
Labels
enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments

@justinvyu
Contributor

Description

Context

See #29128 for more context on why this may be needed.

#29258 introduced a should_chdir_to_trial_dir flag in TuneConfig, which allows you to turn off the default behavior of changing the working directory of each trial worker to its own trial-level directory (e.g., trial 0's working directory gets set to ~/ray_results/exp_name/trial_0000.../, and trial 1's gets set to ~/ray_results/exp_name/trial_0001.../).

Ray Train currently also defaults to changing the working directory to the worker-level directory within the trial directory (since Train runs as a trial on Tune). For example, worker 0 of a distributed Train run would have its working directory changed to ~/ray_results/exp_name/trial_0000.../rank_0.

To maintain consistency within AIR, it makes sense to allow this should_chdir configuration to be set for both Tune and Train.

Current behavior

import os

from ray.train.torch import TorchTrainer
from ray.tune import Tuner, TuneConfig

def training_loop_per_worker(config):
    print(os.getcwd())   # will still be changed to `~/ray_results/exp_name/trial_0000.../rank_0`

trainer = TorchTrainer(train_loop_per_worker=training_loop_per_worker)
tuner = Tuner(trainer, tune_config=TuneConfig(should_chdir_to_trial_dir=False))
tuner.fit()

Suggested change

  1. Move the config from TuneConfig to RunConfig so that it can be passed into both Trainer and Tuner.
  2. Introduce a session.get_log_dir() API that would return the corresponding worker-level directory (the trial dir for a Tune trial worker, and the rank dir within the trial dir for a Train worker).
from ray.air import RunConfig, session

def training_loop_per_worker(config):
    print(os.getcwd())   # will be the original working directory
    # proposed API to get the log dir that should be used for saving worker-level outputs
    session.get_log_dir()

trainer = TorchTrainer(train_loop_per_worker=training_loop_per_worker)
tuner = Tuner(trainer, run_config=RunConfig(should_chdir_to_log_dir=False))
tuner.fit()

Use case

No response

@justinvyu justinvyu added the enhancement, triage, and air labels Oct 27, 2022
@hora-anyscale hora-anyscale added the P2 label and removed the triage label Oct 28, 2022
@bveeramani bveeramani added the P1 label and removed the P2 label Oct 28, 2022
@justinvyu justinvyu self-assigned this Oct 31, 2022
@justinvyu justinvyu added the P2 label and removed the P1 label Feb 27, 2023
@einarbmag

I'm having issues with this as well. I'm using HuggingFaceTrainer, and it always changes the working dir to the trial dir, even if I wrap it in a Tuner with the TuneConfig flag as above.

My use case is that I'm starting from a pretrained model, but there's no internet access on my cluster, so I want to push the model weights to the workers using the working_dir.

Is there any way to get the path to the original working dir? The TUNE_ORIG_WORKING_DIR environment variable is not available.
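
For reference, a defensive way to look that variable up with a fallback (a minimal sketch; per this report it may not be set inside Train worker processes):

import os

# TUNE_ORIG_WORKING_DIR is set by Tune in trial processes before it changes the
# working directory; fall back to the current directory if it is missing.
orig_dir = os.environ.get("TUNE_ORIG_WORKING_DIR", os.getcwd())
print(orig_dir)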

@justinvyu justinvyu added the P1 label and removed the P2 label May 11, 2023
@justinvyu
Contributor Author

Hi @einarbmag, this is something that I'll try to get fixed for Ray 2.5. How do you push the working directory to the workers -- through a Ray runtime environment? So, your nodes are able to communicate with each other, but none of them are connected to the internet?

@einarbmag

That's right. I'm using a private corporate cloud: nodes can communicate with each other, but there's no internet connection, so I can't just specify the model name and let it pull from the HF hub. I'm specifying working_dir through the runtime environment in ray.init.

There is also a warning when I push the working_dir containing the pretrained model that the file is large and I should consider excluding it. It's not ideal that it pushes multiple gigabytes every time I run the script... Maybe I should just use a GCS bucket to store the model (I'm on GCP).
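
For context, a minimal sketch of that setup (directory names and patterns are placeholders), including the excludes option the warning suggests:

import ray

# Ship the local project directory (which holds the pretrained weights) to all
# nodes via the job-level runtime environment. "excludes" takes gitignore-style
# patterns and keeps unneeded large files out of the uploaded package.
ray.init(
    runtime_env={
        "working_dir": "./my_project",    # placeholder path
        "excludes": ["data/", "*.ckpt"],  # placeholder patterns
    }
)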

@justinvyu
Contributor Author

@einarbmag I'd recommend doing that for now -- pulling from HF once and uploading to some storage accessible by all nodes. I'll tag this thread in a future PR to fix this working directory issue though.
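
A sketch of that approach, assuming the transformers library, a placeholder model name, and a placeholder shared path that every node can read (an NFS mount or a synced GCS bucket):

from transformers import AutoModel, AutoTokenizer

# One-time step on a machine with internet access: download the model and save
# it to storage reachable by every node ("/mnt/shared/models/..." is a placeholder).
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.save_pretrained("/mnt/shared/models/bert-base-uncased")
tokenizer.save_pretrained("/mnt/shared/models/bert-base-uncased")

# Inside the training loop on the cluster: load from the shared path instead of
# the HF hub, so no internet access is needed on the nodes.
model = AutoModel.from_pretrained("/mnt/shared/models/bert-base-uncased")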

@anyscalesam anyscalesam added the @author-action-required and train labels and removed the @author-action-required and air labels Oct 27, 2023
@justinvyu
Contributor Author

This can be configured via the RAY_CHDIR_TO_TRIAL_DIR=0 environment variable.
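
For example, the variable can be propagated to all Ray worker processes through the job-level runtime environment (a minimal sketch; it can also be exported in the shell before launching the script):

import ray

# Setting the variable via env_vars makes it visible to the Tune trial and
# Train worker processes, which is where the chdir happens.
ray.init(runtime_env={"env_vars": {"RAY_CHDIR_TO_TRIAL_DIR": "0"}})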
