Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join Ray Jobs API JobInfo with GCS JobTableData #31046

Merged
merged 54 commits into from
Feb 1, 2023

Conversation

architkulkarni
Copy link
Contributor

@architkulkarni architkulkarni commented Dec 12, 2022

Signed-off-by: Archit Kulkarni architkulkarni@users.noreply.github.com

Why are these changes needed?

  • Add a new protobuf for JobInfo from the Ray Job API
  • Augment the existing GCS GetAllJobInfo endpoint to return this information, if available (not all GCS jobs were submitted via the Ray Job API; these jobs won't have this extra JobInfo.)

Related issue number

Closes #29621

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@stale
Copy link

stale bot commented Jan 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 15, 2023
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 23, 2023
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
…-job-info-gcs

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni
Copy link
Contributor Author

architkulkarni commented Jan 25, 2023

@iycheng Could you please help give a preliminary review of the change in the GCS Job manager?

Currently all it's doing is to pull Ray Job API JobInfo from the internal KV for each job submitted via the Ray Job API. Because the internal KV get is async, we need some custom logic here involving a counter, but perhaps you know a better pattern to use. I worked on this with @rickyyx and @edoakes.

Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Signed-off-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
@architkulkarni architkulkarni changed the title [WIP] Join Ray Jobs API JobInfo with GCS JobTableData Join Ray Jobs API JobInfo with GCS JobTableData Jan 26, 2023
@architkulkarni architkulkarni marked this pull request as ready for review January 26, 2023 15:44
@architkulkarni architkulkarni added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Jan 31, 2023
Copy link
Contributor

@fishbone fishbone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG!

@@ -146,13 +147,76 @@ void GcsJobManager::HandleGetAllJobInfo(rpc::GetAllJobInfoRequest request,
rpc::GetAllJobInfoReply *reply,
rpc::SendReplyCallback send_reply_callback) {
RAY_LOG(INFO) << "Getting all job info.";
auto on_done = [reply, send_reply_callback](

int limit = std::numeric_limits<int>::max();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more fundamental problem is that we need pagination for GetAll apis, otherwise, the backend still doing a lot of work.

@architkulkarni
Copy link
Contributor Author

Test failures:

  • rllib multi_agent_independent_learning broken on master, unrelated
  • Datasets air:test_tensor_extension was broken on master, unrelated
  • Workflows test_basic_workflows_2 flaky on master, unrelated
  • serve:tutorial_rllib was broken on master

Java test failed, but it isn't failing on the flaky test tracker. Retrying it to be safe

@architkulkarni
Copy link
Contributor Author

Java test passed, merging

@architkulkarni architkulkarni merged commit b2c5e63 into ray-project:master Feb 1, 2023
architkulkarni added a commit to architkulkarni/ray that referenced this pull request Feb 7, 2023
scv119 pushed a commit that referenced this pull request Feb 10, 2023
…GetAllJobInfo endpoint (#32388)

The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen.

This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR.

Related issue number
Closes #32213
cadedaniel pushed a commit to cadedaniel/ray that referenced this pull request Feb 10, 2023
…GetAllJobInfo endpoint (ray-project#32388)

The changes to the GetAllJobInfo endpoint in ray-project#31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen.

This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR.

Related issue number
Closes ray-project#32213
cadedaniel added a commit that referenced this pull request Feb 10, 2023
…GetAllJobInfo endpoint (#32388) (#32426)

The changes to the GetAllJobInfo endpoint in #31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen.

This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR.

Related issue number
Closes #32213

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…ct#31046)

Why are these changes needed?
Add a new protobuf for JobInfo from the Ray Job API
Augment the existing GCS GetAllJobInfo endpoint to return this information, if available (not all GCS jobs were submitted via the Ray Job API; these jobs won't have this extra JobInfo.)
Related issue number
Closes ray-project#29621

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…GetAllJobInfo endpoint (ray-project#32388)

The changes to the GetAllJobInfo endpoint in ray-project#31046 did not handle the possibility that multiple job table jobs (drivers) could have the same submission_id. This can actually happen, for example if there are multiple ray.init() calls in a Ray Job API entrypoint command. The GCS would crash in this case due to failing a RAY_CHECK that the number of jobs equaled the number of submission_ids seen.

This PR updates the endpoint to handle the above possibility, and adds a unit test which fails without this PR.

Related issue number
Closes ray-project#32213

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
architkulkarni added a commit that referenced this pull request Jun 21, 2023
This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end to end Python test.

Related issue number
Followup to PR #31046.

Closes #30436
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
This PR propagates the existing num_pending_tasks property of a driver's core worker process to a new boolean is_running_tasks field for that driver in the GCS Job table. Exposing this field is useful for external cluster managers to determine whether a cluster is "active" or not.

The code for the new NumPendingTasks RPC from the GCS job manager to the driver's core worker process mimics the existing PushTask RPC made from the GCS actor manager to a worker's core worker process. The core worker client factory pattern is reused as well.

To make the connection to the core worker from the Job manager, we update the Job table proto to include the full Address object (including the port) instead of just the driver IP address.

This PR also adds unit tests in C++ and an end to end Python test.

Related issue number
Followup to PR ray-project#31046.

Closes ray-project#30436

Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[jobs] Define a proto API for checking job status
5 participants