Skip to content

Conversation

@rynewang
Copy link
Contributor

@rynewang rynewang commented Feb 16, 2024

In #31383 we add _py_driver_sys_path to all workers, making them search from the driver's path. This makes the workers able to find modules near the driver file. However, in these cases it's not working:

  • if there's working_dir. The code should search from working_dir, not from the driver's dir. e.g. lib code in driver's dir can update after job submission and before worker starts, making the runs inconsistent among workers.
  • if the worker is in a different node than the driver's node. The code would search "the driver's dir as in the node where the driver runs in", but in a different node. If there happen to be a stale code file in the worker node, the stale code is used.
  • (already handled) if it's dashboard, it's client mode, or it's interactive mode.

This PR addresses the first issue. Left a TODO for the second issue.

Part of #42863

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rkooo567
Copy link
Contributor

ETA for review this sun

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an associated issue?

Also, this happens only when a driver is submitted to a worker node?

# (5) the driver is at the same node (machine) as the worker.
#
# We only do the first 4 checks here.
# TODO: do the (5) check by also passing node id.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we just do it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to compare node_id of the driver vs node_id of this worker. We don't have the former value IIUC.

Copy link
Contributor

@rkooo567 rkooo567 Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I prefer to just fix 100% in this PR rather than leaving it as tech debt if the fix is easy. Isn't this pretty easy to do it? (just add driver node id to job config?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

namespace and namespace == ray_constants.RAY_INTERNAL_DASHBOARD_NAMESPACE
# paths of the workers. Also add the current directory.
#
# Note that this is only meaningful when all of these are true:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also mention why here briefly (to the code comment)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

runtime_env.pop("excludes", None)
job_config.set_runtime_env(runtime_env)

if mode == SCRIPT_MODE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, this code path is added only when the process is a driver. What's the reason why it is set for worker processes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the driver sets the _py_driver_sys_path here to record the driver's current path, then the worker adds it to its search path in maybe_initialize_job_config. If this worker is not a driver there's no point setting that.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang
Copy link
Contributor Author

Test (basic_3.py) passing. There are some failures in Serve docs which should be irrevant.

rynewang and others added 4 commits February 22, 2024 11:36
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…ath-workingdir

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
… to work because we checked the local dir, defying the point of the whole test.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, but I'd like to discuss a little bit more about #43214 (comment) before merging it.

@rynewang rynewang assigned rynewang and unassigned scv119 and rkooo567 Feb 26, 2024
…init. Added tests

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang
Copy link
Contributor Author

Added the 5th check, and a unit test. It should work now, @rkooo567

Get the driver node id via Job Info. Populates the cached driver node id if
found. The fetching can be slow if there are too many jobs.
@raises Exception if the gcs client times out.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the consistent docstring format as other functions here?

Args:

Raises:

Returns:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def get_driver_node_id(self, timeout=None) -> ray.NodeID:
"""
Get the driver node id via Job Info. Populates the cached driver node id if
found. The fetching can be slow if there are too many jobs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually fetching can be slow if there are so many workers started at the same time because GCS main thread is overloaded. I've seen it is delayed up to 10~30 seconds.

Ideally, we should not query GCS when a new worker starts, but we are doing this already (each worker pings GCS like 5~6 times when a new worker starts). We can fix this holistically when we address large scale cluster problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to comment as TODO

return self._cached_driver_node_id
all_infos = self.gcs_client.get_all_job_info(timeout=timeout)
# Type: src.ray.protobuf.gcs_pb2.JobTableData
my_job_info = all_infos[self.current_job_id.binary()]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use type annotation instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# should simply set "global_worker" to equal "None" or something like that.
global_worker.set_mode(None)
global_worker.set_cached_job_id(None)
global_worker._cached_driver_node_id = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add type annotation here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in init

@rkooo567
Copy link
Contributor

Please make sure to change docstring to the consistent style!

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rkooo567
Copy link
Contributor

lmk when it is mergeable!

@rkooo567
Copy link
Contributor

premerge failure

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 28, 2024
rynewang and others added 2 commits February 28, 2024 23:29
…on). Instead, populate it to JobConfig in core worker ctor

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang requested review from a team, ericl, pcmoritz and raulchen as code owners March 1, 2024 04:15
@rynewang
Copy link
Contributor Author

rynewang commented Mar 1, 2024

Premerge failure due to it adds a read to gcs on core worker creation. It can happen very slow (as slow as 3 second in bad cases) and makes some unit tests to timeout. Fixed by removing the read, instead, populate driver_node_id in JobConfig in driver's core worker ctor.

rynewang and others added 2 commits February 29, 2024 20:35
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rkooo567 rkooo567 merged commit b7a6d6d into ray-project:master Mar 1, 2024
@rynewang rynewang deleted the no-syspath-workingdir branch March 2, 2024 00:39
edoakes pushed a commit that referenced this pull request Mar 14, 2024
…ath (#43899)

Re: #43214 There is a core change that stopped adding script directory and current directory for different conditions. This accidentally stopped adding current directory when it's coming from the dashboard where in that case we only want to stop adding the script directory. This PR fixes the issue and allow serve deploy to continue working with the current directory.

---------

Signed-off-by: Gene Su <e870252314@gmail.com>
GeneDer added a commit to GeneDer/ray that referenced this pull request Mar 14, 2024
…ath (ray-project#43899)

Re: ray-project#43214 There is a core change that stopped adding script directory and current directory for different conditions. This accidentally stopped adding current directory when it's coming from the dashboard where in that case we only want to stop adding the script directory. This PR fixes the issue and allow serve deploy to continue working with the current directory.

---------

Signed-off-by: Gene Su <e870252314@gmail.com>
khluu pushed a commit that referenced this pull request Mar 15, 2024
…ath (#43899) (#44011)

Cherry-pick #43899

Re: #43214 There is a core change that stopped adding script directory and current directory for different conditions. This accidentally stopped adding current directory when it's coming from the dashboard where in that case we only want to stop adding the script directory. This PR fixes the issue and allow serve deploy to continue working with the current directory.

---------

Signed-off-by: Gene Su <e870252314@gmail.com>
rynewang added a commit to rynewang/ray that referenced this pull request Apr 2, 2024
…ray-project#43214)"

This reverts commit b7a6d6d.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
jjyao pushed a commit that referenced this pull request Apr 3, 2024
…#43214)" (#44435)

This reverts commit b7a6d6d.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants