[core] If there's working_dir, don't set _py_driver_sys_path. #43214
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
ETA for review: this Sunday.
rkooo567
left a comment
Do we have an associated issue?
Also, this happens only when a driver is submitted to a worker node?
python/ray/_private/worker.py
Outdated
# (5) the driver is at the same node (machine) as the worker.
#
# We only do the first 4 checks here.
# TODO: do the (5) check by also passing node id.
why don't we just do it here?
We want to compare node_id of the driver vs node_id of this worker. We don't have the former value IIUC.
Hmm, I'd prefer to fix this 100% in this PR rather than leave it as tech debt if the fix is easy. Isn't this pretty easy to do? (Just add the driver node id to the job config?)
done
python/ray/_private/worker.py
Outdated
namespace and namespace == ray_constants.RAY_INTERNAL_DASHBOARD_NAMESPACE
# paths of the workers. Also add the current directory.
#
# Note that this is only meaningful when all of these are true:
Can you also briefly mention why here (in the code comment)?
done
runtime_env.pop("excludes", None)
job_config.set_runtime_env(runtime_env)

if mode == SCRIPT_MODE:
Btw, this code path is added only when the process is a driver. What's the reason why it is set for worker processes?
The driver sets _py_driver_sys_path here to record the driver's current path, and the worker then adds it to its search path in maybe_initialize_job_config. If this worker is not a driver, there's no point in setting it.
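To illustrate the handoff described in this reply, here is a minimal sketch, not Ray's actual implementation. `JobConfigSketch`, `record_driver_paths`, and `extend_worker_search_path` are hypothetical names, and prepending (rather than appending) the recorded paths is an assumption. A plain list stands in for `sys.path` so the example has no side effects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobConfigSketch:
    """Hypothetical stand-in for the job config; not Ray's JobConfig."""

    py_driver_sys_path: List[str] = field(default_factory=list)


def record_driver_paths(
    job_config: JobConfigSketch, script_dir: str, current_dir: str
) -> None:
    """Driver side: remember the script directory and the current directory."""
    for path in (script_dir, current_dir):
        if path not in job_config.py_driver_sys_path:
            job_config.py_driver_sys_path.append(path)


def extend_worker_search_path(
    job_config: JobConfigSketch, search_path: List[str]
) -> List[str]:
    """Worker side: put the recorded paths first, skipping duplicates."""
    new_paths = [p for p in job_config.py_driver_sys_path if p not in search_path]
    search_path[:0] = new_paths  # assumption: driver paths take priority
    return search_path


config = JobConfigSketch()
record_driver_paths(config, "/home/user/project", "/home/user")
worker_path = extend_worker_search_path(config, ["/usr/lib/python3"])
```

Only the driver calls the recording step, which matches the point made in the reply: a non-driver process has nothing useful to record.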
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Test (basic_3.py) passing. There are some failures in the Serve docs, which should be irrelevant.
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…ath-workingdir Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
… to work because we checked the local dir, defying the point of the whole test. Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
rkooo567
left a comment
Changes look good, but I'd like to discuss a little bit more about #43214 (comment) before merging it.
…init. Added tests Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Added the 5th check and a unit test. It should work now, @rkooo567
python/ray/_private/worker.py
Outdated
Get the driver node id via Job Info. Populates the cached driver node id if
found. The fetching can be slow if there are too many jobs.
@raises Exception if the gcs client times out.
Can you use the same docstring format as the other functions here?
Args:
Raises:
Returns:
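For reference, a docstring laid out with the three sections the reviewer lists might look like the sketch below. The function body, the `str` return type, and the `TimeoutError` are placeholders chosen for illustration; they are not taken from the PR.

```python
from typing import Optional


def get_driver_node_id(timeout: Optional[float] = None) -> str:
    """Get the node id of the driver for the current job.

    Args:
        timeout: Seconds to wait for the lookup, or None to wait
            without a limit.

    Raises:
        TimeoutError: If the lookup does not finish within `timeout`.

    Returns:
        The driver's node id (a placeholder string in this sketch).
    """
    if timeout is not None and timeout <= 0:
        raise TimeoutError("lookup timed out")
    return "placeholder-node-id"
```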
done
python/ray/_private/worker.py
Outdated
def get_driver_node_id(self, timeout=None) -> ray.NodeID:
    """
    Get the driver node id via Job Info. Populates the cached driver node id if
    found. The fetching can be slow if there are too many jobs.
Actually, fetching can be slow when many workers start at the same time, because the GCS main thread is overloaded. I've seen it delayed by 10~30 seconds.
Ideally, we should not query GCS when a new worker starts, but we are doing this already (each worker pings GCS like 5~6 times when a new worker starts). We can fix this holistically when we address large scale cluster problems.
added to comment as TODO
python/ray/_private/worker.py
Outdated
return self._cached_driver_node_id
all_infos = self.gcs_client.get_all_job_info(timeout=timeout)
# Type: src.ray.protobuf.gcs_pb2.JobTableData
my_job_info = all_infos[self.current_job_id.binary()]
use type annotation instead?
done
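A rough sketch of what replacing the `# Type:` comment with annotations could look like, including the cached attribute annotated once in `__init__`. Everything here is hypothetical: `JobInfoSketch` stands in for the protobuf message, `WorkerSketch` for the worker class, and the injected `fetch_all_job_info` callable for the GCS client call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class JobInfoSketch:
    """Hypothetical stand-in for the JobTableData protobuf message."""

    driver_node_id: bytes


class WorkerSketch:
    def __init__(
        self,
        job_id: bytes,
        fetch_all_job_info: Callable[[], Dict[bytes, JobInfoSketch]],
    ) -> None:
        self._job_id = job_id
        self._fetch_all_job_info = fetch_all_job_info
        # Annotated once here, so a later reset to None needs no comment.
        self._cached_driver_node_id: Optional[bytes] = None

    def get_driver_node_id(self) -> bytes:
        if self._cached_driver_node_id is not None:
            return self._cached_driver_node_id
        all_infos: Dict[bytes, JobInfoSketch] = self._fetch_all_job_info()
        my_job_info: JobInfoSketch = all_infos[self._job_id]
        self._cached_driver_node_id = my_job_info.driver_node_id
        return self._cached_driver_node_id


calls: List[int] = []


def fake_fetch() -> Dict[bytes, JobInfoSketch]:
    """Test double that counts how often the lookup runs."""
    calls.append(1)
    return {b"job-1": JobInfoSketch(driver_node_id=b"node-a")}


worker = WorkerSketch(b"job-1", fake_fetch)
first = worker.get_driver_node_id()
second = worker.get_driver_node_id()  # served from the cache
```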
python/ray/_private/worker.py
Outdated
# should simply set "global_worker" to equal "None" or something like that.
global_worker.set_mode(None)
global_worker.set_cached_job_id(None)
global_worker._cached_driver_node_id = None
Can you add type annotation here?
done in init
Please make sure to change the docstring to the consistent style!
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
lmk when it is mergeable!
premerge failure
…on). Instead, populate it to JobConfig in core worker ctor Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
The premerge failure happened because this PR added a read to GCS on core worker creation. That read can be very slow (as slow as 3 seconds in bad cases) and made some unit tests time out. Fixed by removing the read and instead populating driver_node_id in JobConfig in the driver's core worker ctor.
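The shape of that fix can be sketched as follows, under stated assumptions: `JobConfigWithNode` and `CoreWorkerSketch` are hypothetical classes, node ids are plain strings, and the shared config object stands in for however the job config actually reaches the workers. The point is only that the same-node check becomes a local comparison with no lookup at worker startup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobConfigWithNode:
    """Hypothetical job config carrying the driver's node id."""

    driver_node_id: Optional[str] = None


class CoreWorkerSketch:
    def __init__(
        self, is_driver: bool, node_id: str, job_config: JobConfigWithNode
    ) -> None:
        self.node_id = node_id
        self.job_config = job_config
        if is_driver:
            # The driver records its own node when it is constructed.
            job_config.driver_node_id = node_id

    def is_on_driver_node(self) -> bool:
        # Local comparison only; nothing is fetched at worker startup.
        return self.job_config.driver_node_id == self.node_id


shared_config = JobConfigWithNode()
driver = CoreWorkerSketch(True, "node-a", shared_config)
same_node_worker = CoreWorkerSketch(False, "node-a", shared_config)
remote_worker = CoreWorkerSketch(False, "node-b", shared_config)
```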
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
…ath (#43899) Re: #43214 There is a core change that stopped adding script directory and current directory for different conditions. This accidentally stopped adding current directory when it's coming from the dashboard where in that case we only want to stop adding the script directory. This PR fixes the issue and allow serve deploy to continue working with the current directory. --------- Signed-off-by: Gene Su <e870252314@gmail.com>
…ath (ray-project#43899) Re: ray-project#43214 There is a core change that stopped adding script directory and current directory for different conditions. This accidentally stopped adding current directory when it's coming from the dashboard where in that case we only want to stop adding the script directory. This PR fixes the issue and allow serve deploy to continue working with the current directory. --------- Signed-off-by: Gene Su <e870252314@gmail.com>
…ath (#43899) (#44011) Cherry-pick #43899 Re: #43214 There is a core change that stopped adding script directory and current directory for different conditions. This accidentally stopped adding current directory when it's coming from the dashboard where in that case we only want to stop adding the script directory. This PR fixes the issue and allow serve deploy to continue working with the current directory. --------- Signed-off-by: Gene Su <e870252314@gmail.com>
…ray-project#43214)" This reverts commit b7a6d6d. Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
In #31383 we added _py_driver_sys_path to all workers, making them search from the driver's path. This lets the workers find modules near the driver file. However, it does not work in these cases:
This PR addresses the first issue. Left a TODO for the second issue.
Part of #42863