
[AIR] Whether or not to change the working dir should be configurable when using Ray Train #29787

Closed
justinvyu opened this issue Oct 27, 2022 · 5 comments
Assignees
Labels
enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments

@justinvyu
Contributor

Description

Context

See #29128 for more context on why this may be needed.

#29258 introduced a should_chdir_to_trial_dir flag in TuneConfig, which allows you to turn off the default behavior of changing the working directory of each trial worker to its own trial-level directory (e.g., trial 0's working directory gets set to ~/ray_results/exp_name/trial_0000.../, and trial 1's gets set to ~/ray_results/exp_name/trial_0001.../).

Ray Train currently also defaults to changing the working directory to the worker-level directory within the trial directory (since Train runs as a trial on Tune). For example, worker 0 of a distributed Train run would have its working directory changed to ~/ray_results/exp_name/trial_0000.../rank_0.

To maintain consistency within AIR, it makes sense to allow this should_chdir configuration to be set for both Tune and Train.

Current behavior

import os

from ray.train.torch import TorchTrainer
from ray.tune import Tuner, TuneConfig

def training_loop_per_worker(config):
    print(os.getcwd())   # will still be changed to `~/ray_results/exp_name/trial_0000.../rank_0`

trainer = TorchTrainer(train_loop_per_worker=training_loop_per_worker)
tuner = Tuner(trainer, tune_config=TuneConfig(should_chdir_to_trial_dir=False))
tuner.fit()

Suggested change

  1. Move the config from TuneConfig to RunConfig so that it can be passed into both Trainer and Tuner.
  2. Introduce a session.get_log_dir() API that would return the corresponding worker-level directory (the trial dir for a Tune trial worker, and the rank dir within the trial dir for a Train worker).
from ray.air import RunConfig, session

def training_loop_per_worker(config):
    print(os.getcwd())   # will be the original working directory
    # proposed API to get the log dir that should be used for saving worker-level outputs
    session.get_log_dir()

trainer = TorchTrainer(train_loop_per_worker=training_loop_per_worker)
tuner = Tuner(trainer, run_config=RunConfig(should_chdir_to_log_dir=False))
tuner.fit()

Use case

No response

@justinvyu justinvyu added the enhancement, triage, and air labels Oct 27, 2022
@hora-anyscale hora-anyscale added the P2 label and removed the triage label Oct 28, 2022
@bveeramani bveeramani added the P1 label and removed the P2 label Oct 28, 2022
@justinvyu justinvyu self-assigned this Oct 31, 2022
@justinvyu justinvyu added the P2 label and removed the P1 label Feb 27, 2023
@einarbmag

I'm having issues with this as well. I'm using HuggingFaceTrainer, and it always changes the working dir to the trial dir, even if I wrap it in a Tuner with the TuneConfig flag as above.

My use case is that I'm starting from a pretrained model, but there's no internet access on my cluster, so I want to push the model weights to the workers using the working_dir.

Is there any way to get the path to the original working dir? The TUNE_ORIG_WORKING_DIR environment variable is not available.
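
For reference, a defensive way to look that variable up with a fallback (a minimal sketch; per this report it may not be set inside Train worker processes):

import os

# TUNE_ORIG_WORKING_DIR is set by Tune in trial processes before it changes the
# working directory; fall back to the current directory if it is missing.
orig_dir = os.environ.get("TUNE_ORIG_WORKING_DIR", os.getcwd())
print(orig_dir)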

@justinvyu justinvyu added the P1 label and removed the P2 label May 11, 2023
@justinvyu
Contributor Author

Hi @einarbmag, this is something that I'll try to get fixed for Ray 2.5. How do you push the working directory to the workers -- through a Ray runtime environment? So, your nodes are able to communicate with each other, but none of them are connected to the internet?

@einarbmag

That's right. I'm using a private corporate cloud: nodes can communicate with each other, but there's no internet connection, so I can't just specify the model name and let it pull from the HF hub. I'm specifying working_dir through the runtime environment in ray.init.

There is also a warning when I push the working_dir containing the pretrained model that the file is large and I should consider excluding it. It's not ideal that it pushes multiple gigabytes every time I run the script... Maybe I should just use a GCS bucket to store the model (I'm on GCP).
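
For context, a minimal sketch of that setup (directory names and patterns are placeholders), including the excludes option the warning suggests:

import ray

# Ship the local project directory (which holds the pretrained weights) to all
# nodes via the job-level runtime environment. "excludes" takes gitignore-style
# patterns and keeps unneeded large files out of the uploaded package.
ray.init(
    runtime_env={
        "working_dir": "./my_project",    # placeholder path
        "excludes": ["data/", "*.ckpt"],  # placeholder patterns
    }
)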

@justinvyu
Contributor Author

@einarbmag I'd recommend doing that for now -- pulling from HF once and uploading to some storage accessible by all nodes. I'll tag this thread in a future PR to fix this working directory issue though.
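
A sketch of that approach, assuming the transformers library, a placeholder model name, and a placeholder shared path that every node can read (an NFS mount or a synced GCS bucket):

from transformers import AutoModel, AutoTokenizer

# One-time step on a machine with internet access: download the model and save
# it to storage reachable by every node ("/mnt/shared/models/..." is a placeholder).
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.save_pretrained("/mnt/shared/models/bert-base-uncased")
tokenizer.save_pretrained("/mnt/shared/models/bert-base-uncased")

# Inside the training loop on the cluster: load from the shared path instead of
# the HF hub, so no internet access is needed on the nodes.
model = AutoModel.from_pretrained("/mnt/shared/models/bert-base-uncased")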

@anyscalesam anyscalesam added the @author-action-required and train labels and removed the @author-action-required and air labels Oct 27, 2023
@justinvyu
Contributor Author

This can be configured via the RAY_CHDIR_TO_TRIAL_DIR=0 environment variable.
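
For example, the variable can be propagated to all Ray worker processes through the job-level runtime environment (a minimal sketch; it can also be exported in the shell before launching the script):

import ray

# Setting the variable via env_vars makes it visible to the Tune trial and
# Train worker processes, which is where the chdir happens.
ray.init(runtime_env={"env_vars": {"RAY_CHDIR_TO_TRIAL_DIR": "0"}})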
