[core][state] Proper report of failure when job finishes #31761

Merged
rkooo567 merged 24 commits into ray-project:master from task-backend-job-fail on Jan 23, 2023

Contributor

@rickyyx rickyyx commented Jan 18, 2023

Why are these changes needed?

This PR handles the case where a job finishes while some of its tasks are still running: those tasks should be marked as failed.

  • It registers a handler function, OnJobFinished, as a job-finish listener in the GcsJobManager. When a job is marked as finished, OnJobFinished is called and marks any non-terminated tasks of that job as failed.
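
For readers unfamiliar with the listener pattern described above, here is a minimal Python sketch of the idea. It is not the code in this PR (the real change is in the C++ GCS server), and every name in it (JobManager, TaskManager, add_job_finished_listener, on_job_finished, TaskStatus) is a hypothetical stand-in chosen for illustration, not a Ray API.

```python
from enum import Enum
from typing import Callable, Dict, List


class TaskStatus(Enum):
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"


# Statuses from which a task no longer changes state.
TERMINAL_STATUSES = {TaskStatus.FINISHED, TaskStatus.FAILED}


class TaskManager:
    """Hypothetical stand-in for the task-event store (not Ray's API)."""

    def __init__(self) -> None:
        # job_id -> {task_id -> latest known status}
        self.tasks: Dict[str, Dict[str, TaskStatus]] = {}

    def record(self, job_id: str, task_id: str, status: TaskStatus) -> None:
        self.tasks.setdefault(job_id, {})[task_id] = status

    def on_job_finished(self, job_id: str) -> None:
        # Mark every task of this job that has not terminated as failed.
        job_tasks = self.tasks.get(job_id, {})
        for task_id, status in list(job_tasks.items()):
            if status not in TERMINAL_STATUSES:
                job_tasks[task_id] = TaskStatus.FAILED


class JobManager:
    """Hypothetical stand-in for a job manager that notifies listeners."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str], None]] = []

    def add_job_finished_listener(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def mark_job_finished(self, job_id: str) -> None:
        # Notify every registered listener that the job has finished.
        for listener in self._listeners:
            listener(job_id)


if __name__ == "__main__":
    task_manager = TaskManager()
    job_manager = JobManager()
    job_manager.add_job_finished_listener(task_manager.on_job_finished)

    task_manager.record("job-1", "task-a", TaskStatus.RUNNING)
    task_manager.record("job-1", "task-b", TaskStatus.FINISHED)
    job_manager.mark_job_finished("job-1")
    print(task_manager.tasks["job-1"])
```

In this sketch the task store only learns about job completion through the registered callback, so the job manager does not need to know how task state is stored. Tasks that already reached a terminal status are left untouched.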

Related issue number

Checks

rickyyx added 17 commits January 7, 2023 22:58
Signed-off-by: rickyyx <rickyx@anyscale.com>
python/ray/tests/test_task_events.py (outdated, resolved)
src/ray/common/ray_config_def.h (outdated, resolved)
src/ray/gcs/gcs_server/gcs_server.cc (outdated, resolved)
src/ray/gcs/gcs_server/test/gcs_task_manager_test.cc (outdated, resolved)
src/ray/gcs/gcs_server/gcs_task_manager.cc (resolved)
src/ray/gcs/gcs_server/test/gcs_task_manager_test.cc (outdated, resolved)
src/ray/gcs/gcs_server/gcs_task_manager.cc (outdated, resolved)
@rkooo567 rkooo567 added the @author-action-required label Jan 19, 2023
@rickyyx rickyyx removed the @author-action-required label Jan 20, 2023
@rkooo567 rkooo567 added the @author-action-required label Jan 20, 2023
@rickyyx rickyyx removed the @author-action-required label Jan 20, 2023
Contributor

@rkooo567 rkooo567 left a comment

LGTM. Two last requests!

python/ray/tests/test_task_events.py (resolved)
src/ray/gcs/gcs_server/gcs_task_manager.cc (resolved)
@rkooo567 rkooo567 merged commit 86bd6c6 into ray-project:master Jan 23, 2023
@rkooo567
Contributor

Btw, note: the PR description may be wrong. We only mark children as failed (not grandchildren).

@rkooo567
Contributor

Let's also close the P0 issue!

@rickyyx rickyyx changed the title from "[core][state] Proper report of failure when job finishes and for finished tasks" to "[core][state] Proper report of failure when job finishes" Jan 23, 2023
@rickyyx
Contributor Author

rickyyx commented Jan 23, 2023

Updated PR description and title.

@rkooo567
Contributor

Let's also make sure not to forget the follow-up test!

@rickyyx
Contributor Author

rickyyx commented Jan 24, 2023

Yep here: #31875

@rickyyx rickyyx deleted the task-backend-job-fail branch January 24, 2023 04:44