[Core] Exit the Core Worker Early When Error Received from Plasma Store (#53679)
What's the issue:
- During node shutdown, if the raylet is killed before its core
workers while tasks on those workers are reading or writing objects in
the plasma store, the tasks hit a broken pipe error. Each task then
fails with a Ray task error whose reason is the broken pipe, and the
whole job fails.
- This is not the desired behavior: a task failure caused by node
shutdown should be treated as a system failure, and the core worker
shouldn't continue executing tasks once the raylet is down.
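
Why the classification matters can be sketched with a toy retry loop. Ray's
`max_retries` covers system failures, while exceptions raised by task code
reach the caller unless `retry_exceptions` is set. The names below
(`ApplicationError`, `SystemFailure`, `run_with_retries`) are hypothetical and
only illustrate that distinction; they are not Ray APIs.

```python
class ApplicationError(Exception):
    """Hypothetical: an exception raised by the task's own code."""


class SystemFailure(Exception):
    """Hypothetical: the worker or its node went away mid-task."""


def run_with_retries(task, max_retries):
    """Retry only on SystemFailure; let ApplicationError reach the caller."""
    for attempt in range(max_retries + 1):
        try:
            return task(attempt)
        except SystemFailure:
            # Out of retries: surface the system failure itself.
            if attempt == max_retries:
                raise
```

In this sketch a broken pipe reported as an `ApplicationError` fails on the
first attempt, which mirrors the job failure described above.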
This PR makes the following change to mitigate the issue:
- In the plasma store client, add logic to make the core worker exit
quickly when an error occurs while reading or writing a buffer and the
plasma store client is on the core worker side.
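
The pattern can be illustrated with a minimal Python sketch. This is not the
code in the PR (the plasma store client is not written in Python), and
`SketchPlasmaClient`, `RayletDiedExit`, `BrokenSocket`, `quick_exit`, and the
`exit_on_connection_failure` flag are all hypothetical names chosen for
illustration.

```python
class RayletDiedExit(SystemExit):
    """Hypothetical marker for a quick exit caused by a dead raylet."""


class SketchPlasmaClient:
    """Toy stand-in for a plasma store client; not Ray's real class."""

    def __init__(self, sock, exit_on_connection_failure, exit_fn):
        self._sock = sock
        # Assumed flag: True only when the client lives in a core worker.
        self._exit_on_connection_failure = exit_on_connection_failure
        self._exit_fn = exit_fn

    def write_buffer(self, data):
        try:
            self._sock.sendall(data)
        except (BrokenPipeError, ConnectionResetError) as err:
            if self._exit_on_connection_failure:
                # Stop the worker instead of failing the task with an IO error.
                self._exit_fn(err)
            raise


class BrokenSocket:
    """Fake socket whose peer (the raylet) is already gone."""

    def sendall(self, data):
        raise BrokenPipeError("Broken pipe")


def quick_exit(err):
    # Stand-in for the worker's fast shutdown path.
    raise RayletDiedExit(1)
```

With the flag set, `SketchPlasmaClient(BrokenSocket(), True, quick_exit).write_buffer(b"x")`
raises `RayletDiedExit`; with it unset, the original `BrokenPipeError`
propagates to the caller as before.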
Test the logic manually to verify the behavior:
- With the following test code:
```
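import os
import signal
import time
import ray

ray.init()

@ray.remote(max_retries=2)
def test_task(obj_ref):
    time.sleep(1)
    # Kill the local raylet while this worker is still running.
    raylet_pid = int(os.environ["RAY_RAYLET_PID"])
    os.kill(raylet_pid, signal.SIGKILL)
    # Access the object store after the raylet is gone.
    ray.put(obj_ref)

a = ray.put([0] * 250000)
ray.get(test_task.remote(a))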
```
- Without the change:
```
ray.exceptions.RayTaskError(OSError): ray::test_task() (pid=30681, ip=127.0.0.1)
  File "/Users/myan/ray-core-quickstart/test-tasks/test-tasks.py", line 18, in test_task
    ray.get(test_ref)
  File "python/ray/includes/common.pxi", line 93, in ray._raylet.check_status
    raise IOError(message)
OSError: Failed to read data from the socket: End of file
```
- With the change in the PR:
```
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
```
---------
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>