[AIR] Whether or not to change the working dir should be configurable when using Ray Train #29787
Comments
Having issues with this as well. I'm using HuggingFaceTrainer and it always changes the working dir to the trial dir, even if I wrap it in a Tuner with the TuneConfig as above. My use case is that I'm starting from a pretrained model, but there's no internet access on my cluster, so I want to push the model weights to the workers using the working_dir. Is there any way to get the path to the original working dir? The TUNE_ORIG_WORKING_DIR environment variable is not available.
Hi @einarbmag, this is something that I'll try to get fixed for Ray 2.5. How do you push the working directory to the workers -- through a Ray runtime environment? So your nodes are able to communicate with each other, but none of them are connected to the internet?
That's right, I'm using a private corporate cloud: nodes can communicate, but there's no internet connection, so I can't just specify the model name and let it pull from the HF Hub. I'm specifying working_dir through the runtime environment in ray.init. There is also a warning when I push the working_dir containing the pretrained model that the file is big and I should consider excluding it. It's not ideal that it pushes multiple gigs every time I run the script... Maybe I should just use a GCS bucket to store the model (I'm on GCP).
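For reference, a minimal sketch of the setup described above; the paths and the excludes entry are illustrative, not taken from the original report:

```python
import ray

# Ship the current project directory -- including the pretrained weights --
# to every worker via the Ray runtime environment.
ray.init(
    runtime_env={
        "working_dir": ".",              # local dir containing the model weights (illustrative)
        # "excludes": ["checkpoints/"],  # optional: skip large files that aren't needed
    }
)
```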
@einarbmag I'd recommend doing that for now -- pulling from HF once and uploading to some storage accessible by all nodes. I'll tag this thread in a future PR to fix this working directory issue though. |
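A rough sketch of the suggested workaround, assuming the transformers package, a bucket named `my-bucket`, and the `bert-base-uncased` model purely as placeholders:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# One-time step on a machine with internet access: download the model once
# and save it locally, then upload it to storage all nodes can read.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.save_pretrained("./bert-base-uncased")
tokenizer.save_pretrained("./bert-base-uncased")
# e.g.  gsutil -m cp -r ./bert-base-uncased gs://my-bucket/models/

# On each worker: copy the model from the bucket to a local path and
# load from there instead of pulling from the HF Hub.
local_path = "/tmp/bert-base-uncased"
# gsutil -m cp -r gs://my-bucket/models/bert-base-uncased /tmp/
model = AutoModelForSequenceClassification.from_pretrained(local_path)
```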
Can be configured via |
Description
Context
See #29128 for more context on why this may be needed.
#29258 introduced a `should_chdir_to_trial_dir` flag in `TuneConfig`, which allows you to turn off the default of changing the working directory of each trial worker to its own trial-level directory (e.g., trial 0's working directory gets set to `~/ray_results/exp_name/trial_0000.../`, trial 1's to `~/ray_results/exp_name/trial_0001.../`).

Ray Train currently also defaults to changing the working directory to the worker-level directory within the trial directory (since Train runs as a trial on Tune). For example, worker 0 of a distributed Train run would have its working directory changed to `~/ray_results/exp_name/trial_0000.../rank_0`.
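A minimal sketch of the existing Tune-side flag for context; the released parameter is assumed here to be `chdir_to_trial_dir` (this issue refers to it as `should_chdir_to_trial_dir`), and the exact name may vary across Ray versions:

```python
from ray.tune import TuneConfig, Tuner

def trainable(config):
    # With the chdir flag disabled, this runs in the original working
    # directory instead of ~/ray_results/exp_name/trial_.../
    ...

tuner = Tuner(
    trainable,
    tune_config=TuneConfig(
        num_samples=2,
        chdir_to_trial_dir=False,  # assumed parameter name; may differ by Ray version
    ),
)
tuner.fit()
```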
To maintain consistency within AIR, it makes sense to allow this `should_chdir` configuration to be set for both Tune and Train.

Current behavior
Suggested change

- Move the `should_chdir_to_trial_dir` flag from `TuneConfig` to `RunConfig` so that it can be passed into both the Trainer and the Tuner.
- Add a `session.get_log_dir()` API that would return the corresponding worker-level directory (the trial dir for a Tune worker, and the rank dir within the trial dir for a Train worker).

(A hypothetical usage sketch of this change appears at the end of this description.)

Use case
No response
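Purely hypothetical sketch of what the suggested change could look like from the user's side; neither the `RunConfig` flag nor `session.get_log_dir()` exists today, and all names are illustrative:

```python
from ray.air import RunConfig, ScalingConfig, session
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # Proposed API (does not exist today): returns the worker-level
    # directory -- the rank dir within the trial dir for a Train worker.
    log_dir = session.get_log_dir()
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
    run_config=RunConfig(
        # Proposed flag (name illustrative): keep each worker in the
        # original working directory instead of chdir-ing to the rank dir.
        should_chdir=False,
    ),
)
result = trainer.fit()
```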