Expose OOM error for actor instead of worker crash (use existing actor protocol) - observability improvement
An actor should throw an OOM error, and also use the OOM retry (P1), when it is killed by the Ray OOM killer. One challenge is that there is currently a race between the worker failure path reporting the actor failure and the push task path reporting a gRPC failure.
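The race described above can be illustrated with a small standalone model. This is only a sketch and does not use Ray's code: `DeathCause`, `ActorRecord`, and `report_failure` are hypothetical names, and the "keep the most specific cause" policy is an assumption about one way to make the result independent of arrival order, not a description of how Ray resolves it.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class DeathCause(IntEnum):
    # Hypothetical ordering: a higher value means a more specific cause.
    GRPC_FAILURE = 1    # reported by the push-task path
    WORKER_CRASHED = 2  # generic worker failure report
    OOM_KILLED = 3      # worker failure report tagged as an OOM kill


@dataclass
class ActorRecord:
    actor_id: str
    death_cause: Optional[DeathCause] = None

    def report_failure(self, cause: DeathCause) -> None:
        # Keep the most specific cause seen so far, so the recorded
        # cause does not depend on which report arrives first.
        if self.death_cause is None or cause > self.death_cause:
            self.death_cause = cause


def final_cause(reports: Iterable[DeathCause]) -> Optional[DeathCause]:
    """Apply the reports in the given arrival order and return the result."""
    record = ActorRecord("actor-1")  # hypothetical actor id
    for cause in reports:
        record.report_failure(cause)
    return record.death_cause
```

With this policy, `final_cause([DeathCause.GRPC_FAILURE, DeathCause.OOM_KILLED])` and the reverse order both yield `DeathCause.OOM_KILLED`.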
Right now we show an actor error if the actor is killed due to OOM. This PR changes that so it surfaces an OOM error instead.
It does not add support for actor or actor task OOM retry; the goal of this PR is to improve observability by setting the actor's death cause to OOM.
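As a rough illustration of what "setting the death cause to OOM" means for a caller, the sketch below maps a recorded death cause to the error that gets surfaced. The class names, the cause strings, and `error_for_death_cause` are hypothetical and are not Ray's actual exception types or protocol fields. Consistent with the PR description, nothing here retries; the cause is only reported.

```python
class ActorDiedError(Exception):
    """Hypothetical generic error for an actor that died."""


class ActorOutOfMemoryError(ActorDiedError):
    """Hypothetical error for an actor killed by the OOM killer."""


def error_for_death_cause(actor_id: str, cause: str) -> ActorDiedError:
    # `cause` is an assumed string tag stored on the actor's death record.
    if cause == "OOM":
        return ActorOutOfMemoryError(
            f"Actor {actor_id} was killed because the node ran out of memory."
        )
    # Any other cause falls back to the generic actor error.
    return ActorDiedError(f"Actor {actor_id} died unexpectedly (cause: {cause}).")
```

Because the OOM error subclasses the generic one in this sketch, existing handlers that catch the generic actor error would keep working while new code can match the more specific type.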
Related issue number
#29736
Signed-off-by: Aviv Haber <aviv@anyscale.com>
Signed-off-by: Clarence Ng <clarence@anyscale.com>
…#32107)
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
What happened + What you expected to happen
#14211
https://github.com/ray-project/ray/pull/26898/files
Versions / Dependencies
master
Reproduction script
n/a
Issue Severity
No response