Join Ray Jobs API JobInfo with GCS JobTableData #31046
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
…-job-info-gcs Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@iycheng Could you please give a preliminary review of the change in the GCS Job manager? Currently all it's doing is to pull Ray Job API
LG!
@@ -146,13 +147,76 @@ void GcsJobManager::HandleGetAllJobInfo(rpc::GetAllJobInfoRequest request,
                                        rpc::GetAllJobInfoReply *reply,
                                        rpc::SendReplyCallback send_reply_callback) {
  RAY_LOG(INFO) << "Getting all job info.";
  auto on_done = [reply, send_reply_callback](

  int limit = std::numeric_limits<int>::max();
A more fundamental problem is that we need pagination for the GetAll APIs; otherwise, the backend is still doing a lot of work.
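To make the pagination suggestion concrete, here is a minimal Python sketch of offset-based paging over a list of job records. It is not Ray's API: `Page`, `get_page`, `offset`, and the default `limit` of 100 are all hypothetical names and values chosen for illustration. The idea is that the server only materializes `limit` records per request and tells the caller where to resume.

```python
from dataclasses import dataclass
from typing import Generic, List, Optional, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class Page(Generic[T]):
    """One page of results plus the offset to request next (hypothetical type)."""
    items: List[T]
    next_offset: Optional[int]  # None when there is nothing left to fetch


def get_page(records: Sequence[T], offset: int = 0, limit: int = 100) -> Page[T]:
    """Return at most `limit` records starting at `offset`."""
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    end = min(offset + limit, len(records))
    items = list(records[offset:end])
    # Only hand back a resume point if more records remain after this page.
    next_offset = end if end < len(records) else None
    return Page(items=items, next_offset=next_offset)
```

A caller would loop, passing `next_offset` back as `offset` until it is `None`. A real endpoint would more likely use an opaque continuation token than a raw offset, since the underlying table can change between calls.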
Test failures:
Java test failed, but it isn't failing on the flaky test tracker. Retrying it to be safe.
Java test passed, merging.
…ay-project#31046)" This reverts commit b2c5e63.
…GetAllJobInfo endpoint (#32388)

The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen.

This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR.

Related issue number
Closes #32213
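The failure mode described in this commit message can be illustrated with a small Python sketch. It is a model of the idea only, not the GCS C++ code: `DriverJob`, `join_job_info`, and the `fetch_job_info` callback are hypothetical names. Drivers are grouped by `submission_id`, each distinct id is looked up once, and the result is fanned out to every driver in the group, so nothing assumes one driver per submission.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class DriverJob:
    """Hypothetical stand-in for one job table entry (one driver)."""
    job_id: str
    submission_id: Optional[str] = None  # None: not submitted via the Ray Job API
    job_info: Optional[dict] = None


def join_job_info(
    drivers: List[DriverJob],
    fetch_job_info: Callable[[str], Optional[dict]],
) -> List[DriverJob]:
    """Attach Job API info to every driver that shares a submission_id."""
    by_submission: Dict[str, List[DriverJob]] = defaultdict(list)
    for driver in drivers:
        if driver.submission_id is not None:
            by_submission[driver.submission_id].append(driver)

    # One lookup per distinct submission_id, however many drivers use it.
    for submission_id, group in by_submission.items():
        info = fetch_job_info(submission_id)
        for driver in group:
            driver.job_info = info
    return drivers
```

With two drivers created by two `ray.init()` calls in the same entrypoint, the input has two entries but only one distinct `submission_id`, so a completion check has to count lookups against distinct ids rather than against the number of jobs.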
…GetAllJobInfo endpoint (#32388) (#32426)

The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen.

This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR.

Related issue number
Closes #32213

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
…ct#31046)

Why are these changes needed?
- Add a new protobuf for JobInfo from the Ray Job API
- Augment the existing GCS GetAllJobInfo endpoint to return this information, if available (not all GCS jobs were submitted via the Ray Job API; these jobs won't have this extra JobInfo.)

Related issue number
Closes ray-project#29621

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end to end Python test.

Related issue number
Followup to PR #31046. Closes #30436
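As a rough illustration of the data flow described above, the following Python sketch derives a boolean `is_running_tasks` from a per-driver pending-task count and then answers the "is the cluster active" question. It is a model only, not the GCS implementation: `JobEntry`, `refresh_is_running_tasks`, `cluster_is_active`, and the `query_num_pending_tasks` callback (standing in for the NumPendingTasks RPC) are hypothetical, and leaving the field as `None` when a driver is unreachable is an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]  # (ip, port); the port is needed to reach the driver


@dataclass
class JobEntry:
    """Hypothetical stand-in for one driver's row in the job table."""
    job_id: str
    driver_address: Address
    is_running_tasks: Optional[bool] = None  # None: not known yet


def refresh_is_running_tasks(
    jobs: Iterable[JobEntry],
    query_num_pending_tasks: Callable[[Address], int],
) -> None:
    """Ask each driver for its pending-task count and store it as a boolean."""
    for job in jobs:
        try:
            pending = query_num_pending_tasks(job.driver_address)
        except ConnectionError:
            # Assumption for this sketch: an unreachable driver is "unknown".
            job.is_running_tasks = None
            continue
        job.is_running_tasks = pending > 0


def cluster_is_active(jobs: Iterable[JobEntry]) -> bool:
    """A cluster counts as active if any driver reports running tasks."""
    return any(job.is_running_tasks is True for job in jobs)
```

An external cluster manager could poll something like `cluster_is_active` to decide whether a cluster is idle, for example before scaling it down.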
Signed-off-by: Archit Kulkarni architkulkarni@users.noreply.github.com
Why are these changes needed?
- Add a new protobuf for JobInfo from the Ray Job API
- Augment the existing GCS GetAllJobInfo endpoint to return this information, if available (not all GCS jobs were submitted via the Ray Job API; these jobs won't have this extra JobInfo.)

Related issue number
Closes #29621
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.