[core] surface OOM error when actor is killed due to OOM #29736

Closed
clarng opened this issue Oct 26, 2022 · 0 comments


clarng commented Oct 26, 2022

What happened + What you expected to happen

Expose an OOM error for an actor instead of a generic worker crash, using the existing actor protocol. This is an observability improvement.

The actor should throw an OOM error, and also use the OOM retry (P1), when it is killed by the Ray OOM killer. One challenge is that there is currently a race between the worker-failure path reporting the actor failure and the push-task path reporting a gRPC failure.
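To illustrate the race, here is a minimal sketch of one way to make the outcome independent of which report arrives first: rank the possible causes and always keep the most specific one. This is not Ray's implementation. `DeathCause`, `FailureReport`, and `resolve_death_cause` are hypothetical names invented for this example, and the ranking itself is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable


class DeathCause(IntEnum):
    """Hypothetical causes, ordered from least to most specific."""
    UNKNOWN = 0
    GRPC_FAILURE = 1     # push task saw the connection to the worker drop
    WORKER_CRASHED = 2   # worker failure reported without a reason
    OUT_OF_MEMORY = 3    # worker failure reported after an OOM kill


@dataclass(frozen=True)
class FailureReport:
    source: str          # which path produced the report (illustrative only)
    cause: DeathCause


def resolve_death_cause(reports: Iterable[FailureReport]) -> DeathCause:
    """Keep the most specific cause seen so far, regardless of arrival order."""
    resolved = DeathCause.UNKNOWN
    for report in reports:
        resolved = max(resolved, report.cause)
    return resolved


grpc_report = FailureReport("push_task", DeathCause.GRPC_FAILURE)
oom_report = FailureReport("worker_failure", DeathCause.OUT_OF_MEMORY)

# Both arrival orders resolve to the same cause.
first = resolve_death_cause([grpc_report, oom_report])
second = resolve_death_cause([oom_report, grpc_report])
```

With a ranking like this, a late-arriving gRPC failure cannot overwrite an OOM cause that was already recorded, and an early gRPC failure is upgraded once the worker-failure report lands.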

#14211

https://github.com/ray-project/ray/pull/26898/files

Versions / Dependencies

master

Reproduction script

n/a

Issue Severity

No response

clarng added the bug, triage, and core-oom-killer labels Oct 26, 2022
clarng self-assigned this Oct 26, 2022
clarng added the feature request label and removed the bug label Oct 26, 2022
hora-anyscale removed the triage label Oct 28, 2022
clarng added and removed the Ray 2.2 label Nov 4, 2022
clarng changed the title from "[core] actor oom retry" to "[core] actor oom error" Jan 29, 2023
clarng changed the title from "[core] actor oom error" to "[core] surface OOM error when actor is killed due to OOM" Jan 29, 2023
scv119 pushed a commit that referenced this issue Feb 1, 2023
Right now we show an Actor error if the actor is killed due to OOM. This PR changes that so an OOM error is surfaced instead.

It does not support actor / actor task OOM retry, as the goal of this PR is to improve observability by setting the death cause of the actor to OOM.
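As a rough sketch of what "setting the death cause" buys the caller, the example below maps a recorded death cause to the error type that gets raised, with no retry attempted. All names here (`SketchActorDiedError`, `SketchActorOomError`, `error_for_death_cause`, `call_dead_actor`) are hypothetical and are not Ray APIs. Making the OOM error a subclass of the generic actor error is also an assumption made for this example.

```python
class SketchActorDiedError(Exception):
    """Hypothetical generic error for a dead actor."""


class SketchActorOomError(SketchActorDiedError):
    """Hypothetical error for an actor killed because the node ran out of memory."""


def error_for_death_cause(cause: str, actor_name: str) -> SketchActorDiedError:
    """Pick the error to surface based on the recorded death cause."""
    if cause == "OOM":
        return SketchActorOomError(
            f"Actor {actor_name} was killed because the node ran out of memory."
        )
    return SketchActorDiedError(f"Actor {actor_name} died unexpectedly.")


def call_dead_actor(cause: str, actor_name: str = "example_actor") -> None:
    """Simulate a call to a dead actor: the error is raised once, never retried."""
    raise error_for_death_cause(cause, actor_name)


try:
    call_dead_actor("OOM")
except SketchActorOomError as err:
    oom_message = str(err)
```

In this sketch, code that already catches the generic error keeps working, while code that cares about memory pressure can catch the more specific type.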

Related issue number
#29736
Signed-off-by: Aviv Haber <aviv@anyscale.com>
Signed-off-by: Clarence Ng <clarence@anyscale.com>
clarng closed this as completed Feb 1, 2023
edoakes pushed a commit to edoakes/ray that referenced this issue Mar 22, 2023
…#32107)

Right now we show an Actor error if the actor is killed due to OOM. This PR changes that so an OOM error is surfaced instead.

It does not support actor / actor task OOM retry, as the goal of this PR is to improve observability by setting the death cause of the actor to OOM.

Related issue number
ray-project#29736
Signed-off-by: Aviv Haber <aviv@anyscale.com>
Signed-off-by: Clarence Ng <clarence@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>