Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Option to not override working_dir #29128

Closed
ericl opened this issue Oct 6, 2022 · 16 comments · Fixed by #29258
Closed

[tune] Option to not override working_dir #29128

ericl opened this issue Oct 6, 2022 · 16 comments · Fixed by #29258
Assignees
Labels
P0 Issues that should be fixed in short order ray-team-created Ray Team created tune Tune-related issues UX The issue is not only about technical bugs

Comments

@ericl
Copy link
Contributor

ericl commented Oct 6, 2022

Currently, Tune always sets the working dir to ~/ray_results/<trial_scoped_dir>. This can be very confusing for users that depend on relative paths.

We should (1) Add a top-level run config option like should_chdir to make this very visible, (2) improve docs here, and (3) consider making this the default.

Another issue is that TUNE_ORIG_WORKING_DIR isn't pointing to the runtime_env dir, but the dir on the driver, which isn't the right one in this case. It should point to the runtime_env dir (dir on the worker before the chdir).

@ericl ericl added P1 Issue that should be fixed within a few weeks tune Tune-related issues UX The issue is not only about technical bugs labels Oct 6, 2022
@richardliaw richardliaw added P0 Issues that should be fixed in short order and removed P1 Issue that should be fixed within a few weeks labels Oct 6, 2022
@richardliaw
Copy link
Contributor

Labeling this as high prio due to internal needs.

This does not need to land in 2.1 release.

Assignee should take this issue and get consensus on the API on this thread.

@matthewdeng
Copy link
Contributor

@justinvyu can you take a look at this (cc @krfricke)

Some original context: #9571

Would be helpful to collect use-cases and possible considerations - of the top of my head there are a few directories the user may want to use (though I haven't verified which of these are practical):

  1. Driver
  2. Tuner (will be different from 1 when using Ray Client)
  3. Trial

@justinvyu
Copy link
Contributor

justinvyu commented Oct 11, 2022

Here's a summary of my understanding of the problem and my proposed solution:

Problem

  1. I try to use relative paths from the directory that I launched my script to load data, saved model weights, etc.
  2. When trying to load, I get an error that doesn't make sense because I'm not aware that the working dir has changed to the trial directory:
print(open("./hello.txt").read())
FileNotFoundError: [Errno 2] No such file or directory: './hello.txt'
  1. If I do manage to find the TUNE_ORIG_WORKING_DIR env variable (which is very hard to do right now, since it's under the Trainable class docstring), it still doesn't work for multi-node training, because the TUNE_ORIG_WORKING_DIR gets set to the driver node's working directory and mismatches with the copied working dir on a separate worker node (where it's at some /tmp/ray/.../working_dir_files/).

Proposed Solution

  1. Fix TUNE_ORIG_WORKING_DIR to pull the correct current working directory within the worker process when the Tune trial logdir is being created.
  2. Introduce a change_working_dir_to_trial_dir flag in the Tune config that defaults to True.
    • Include an explicit warning telling the user to only perform READ operations with relative paths.
    • If this flag is set to False, the user may still want to access the Tune trial directory. We should store this as another environment variable TUNE_TRIAL_DIR. We should prompt the user to perform WRITE operations to this TUNE_TRIAL_DIR if needed.
  3. Add a better error message that intercepts the FileNotFoundError and checks if it exists at {TUNE_ORIG_WORKING_DIR}/<relative-path> and suggest that the user modifies their paths to use TUNE_ORIG_WORKING_DIR, OR turn off the change_working_dir_to_trial_dir flag.
  4. Make the TUNE_ORIG_WORKING_DIR and TUNE_TRIAL_DIR environment variables more visible in the documentation.

Discussion

Example Usage

Option 1: Setting change_working_dir_to_trial_dir=False

def train_func(config):
    # Read from relative paths
    print(open("./hello.txt").read())

    # Access working dir, which should == os.getcwd()
    print(os.environ["TUNE_ORIG_WORKING_DIR"])

    # Write to the Tune trial directory
    tune_trial_dir = Path(os.environ["TUNE_TRIAL_DIR"])
    with open(tune_trial_dir / "write.txt", "w") as f:
        f.write("test write")

tuner = Tuner(
    train_func,
    tune_config=tune.TuneConfig(..., change_working_dir_to_trial_dir=False),
)

Option 2: Manual control with environment variables

def train_func(config):
    orig_working_dir = Path(os.environ["TUNE_ORIG_WORKING_DIR"])
    print(open(orig_working_dir / "hello.txt").read())

    # Write to the Tune trial directory
    with open("./write.txt", "w") as f:
        f.write("test write")

@ericl
Copy link
Contributor Author

ericl commented Oct 11, 2022

Sounds good to me. The flag is a bit long though, how about chdir_to_trial_dir?

@matthewdeng
Copy link
Contributor

Looks good. Any reason not to just always store both TUNE_ORIG_WORKING_DIR and TUNE_TRIAL_DIR?

@justinvyu
Copy link
Contributor

I thought the "original" working dir concept would be a bit confusing if the working directory is never changed (from setting the flag to false).

@justinvyu
Copy link
Contributor

Note: Updated proposal to set the flag in TuneConfig instead of RunConfig.

@ericl
Copy link
Contributor Author

ericl commented Oct 11, 2022

Agreed we should always set both flags. Sometimes present env vars are not something users will have the patience to understand.

@krfricke
Copy link
Contributor

This looks good to me. IMO we should just point users to use session.logdir or session.trial_dir to read/write from/to the trial dir in the future.

This is a breaking API change, so should we default the flag to True for now and move to False after 2.2?

@justinvyu
Copy link
Contributor

Yeah, I think session.get_trial_dir() makes sense, and we should make the AIR session expose this from the underlying Train/Tune session.

Currently defaulting to True. Do you think it makes sense to set it to False by default for 2.2 based on the discussion on #9571? I was proposing just keeping it True.

@krfricke
Copy link
Contributor

cc @jiaodong

@krfricke
Copy link
Contributor

The main question I have is what we should default to if we don't use a runtime env and the worker node does not have the current working dir locally available

@justinvyu
Copy link
Contributor

justinvyu commented Oct 12, 2022

Regarding @krfricke's comment: the problem with the flag is that someone might run into the case where they have multiple nodes, and the working directory with the driver script may not exist on each node. This would not happen if runtime_env={"working_dir": "."} is set, since the working directory would get zipped and sent over to each node.

Updated Proposal:

Instead of setting a flag, we should control whether or not to change the working directory by checking if the user passed in a working_dir to the runtime env. If they do, we should not change the working directory to the Tune trial dir. If they do not specify a working_dir, we should switch as before. This way, we can guarantee that the working directory exists in both cases.

def train_func(config):
    # Read from relative paths
    print(open("./hello.txt").read())

    # Still need to tell users to write to the Tune logdir
    # This would be a new API that gets the trial dir for Tune session
    # and trial_dir/rank_x for Train sessions within a Tune session
    tune_log_dir = Path(session.get_log_dir())
    with open(tune_log_dir / "write.txt", "w") as f:
        f.write("test write")

ray.init(runtime_env={"working_dir": "."})
tuner = Tuner(
    train_func,
    ..., # No flag anymore
)

New session.get_log_dir() API

This would allow you to get the logdir to write to from the session API.

For Tune session: session.get_log_dir() -> trial_dir
For Train session: session.get_log_dir() -> trial_dir/rank_x, where x is the rank of the worker.

(Sidenote: also a popular API request https://discuss.ray.io/t/how-to-get-the-logdir-of-current-run-when-using-a-function-together-with-tuner/7823)

Comments/Questions

Now, you don't have to deal with environment variables, and the recommended usage aligns more with this guide.

This would change the default functionality if specifying a runtime_env, so should we still add a flag to enable the behavior before 2.2?

@krfricke
Copy link
Contributor

I'd be happy with that. We should make sure to document this cleanly and encourage users to always write to session.get_log_dir() (or maybe just session.log_dir) and not rely on relative paths.

@ericl
Copy link
Contributor Author

ericl commented Oct 12, 2022

Instead of setting a flag, we should control whether or not to change the working directory by checking if the user passed in a working_dir to the runtime env. If they do, we should not change the working directory to the Tune trial dir. If they do not specify a working_dir, we should switch as before. This way, we can guarantee that the working directory exists in both cases.

This will be confusing to users since runtime_env is a config set outside of the Tune job (spooky action at a distance). Actually, I don't see a huge problem with the working dir not existing.

Why not do the following:

  • Have the flag.
  • If the flag is True, the behavior is as current (chdir).
  • If the flag is False, we don't chdir, and the cwd is whatever the worker happens to be (either runtime_env working dir, or... whatever dir Ray was started in).
  • TUNE_ORIG_WORKING_DIR is always set to the worker original working dir.
  • TUNE_TRIAL_WORKING_DIR is always set to the trial working dir.

The main question I have is what we should default to if we don't use a runtime env and the worker node does not have the current working dir locally available

In other words, no action is required here. We just don't chdir to anything, it's whatever it happens to be.

@krfricke
Copy link
Contributor

Sounds good to me!

krfricke pushed a commit that referenced this issue Oct 18, 2022
See #29128 for more context on the problem.

This PR does the following:
1. **Fix `TUNE_ORIG_WORKING_DIR`** to pull the correct current working directory within the worker process when the Tune trial logdir is being created. This PR deprecates this environment variable as it's confusing to get Tune metadata from sources other than `session`.
2. **Introduce a `chdir_to_trial_dir` flag** in the TuneConfig that defaults to `True`, which configures whether or not Tune should change the working directory of each worker to its corresponding trial directory.
    - If this flag is set to False, the user may still want to access the Tune trial directory. This can be done with a **newly added `session.get_trial_dir()` API.**
3. Make the `TUNE_ORIG_WORKING_DIR` deprecation, `chdir_to_trial_dir` flag, and `session.get_trial_dir()` more visible in the documentation with an **example in the Tune FAQ.**

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022
…29258)

See ray-project#29128 for more context on the problem.

This PR does the following:
1. **Fix `TUNE_ORIG_WORKING_DIR`** to pull the correct current working directory within the worker process when the Tune trial logdir is being created. This PR deprecates this environment variable as it's confusing to get Tune metadata from sources other than `session`.
2. **Introduce a `chdir_to_trial_dir` flag** in the TuneConfig that defaults to `True`, which configures whether or not Tune should change the working directory of each worker to its corresponding trial directory.
    - If this flag is set to False, the user may still want to access the Tune trial directory. This can be done with a **newly added `session.get_trial_dir()` API.**
3. Make the `TUNE_ORIG_WORKING_DIR` deprecation, `chdir_to_trial_dir` flag, and `session.get_trial_dir()` more visible in the documentation with an **example in the Tune FAQ.**

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
@richardliaw richardliaw added the ray-team-created Ray Team created label Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P0 Issues that should be fixed in short order ray-team-created Ray Team created tune Tune-related issues UX The issue is not only about technical bugs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants