[core] Clean up NODE_DIED task error message
#58638
Code Review
This pull request cleans up the error message for NODE_DIED events, making it more user-friendly and providing clearer debugging instructions. The changes in the C++ source and corresponding test updates look good. However, I found a potential bug in one of the test files that could cause it to fail.
Minor follow ups from: ray-project#58539

Example error message:

```
Task failed because the node it was running on is dead or unavailable. Node IP: 127.0.0.1, node ID: e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646. This can happen if the node was preempted, had a hardware failure, or its raylet crashed unexpectedly. To see node death information, use `ray list nodes --filter node_id=e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646`, check the Ray dashboard cluster page, search the node ID in the GCS logs, or use `ray logs raylet.out -ip 127.0.0.1`.
```

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
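For tooling that post-processes this error text, the node IP and node ID can be pulled back out of the message to rebuild the two suggested commands. The snippet below is a minimal sketch, assuming the message keeps the exact `Node IP: ..., node ID: ....` layout shown in the example. `NODE_INFO_RE` and `debug_commands` are hypothetical names for illustration and are not part of Ray.

```python
import re

# Hypothetical helper, not part of Ray. Assumes the message follows the
# "Node IP: <ip>, node ID: <hex id>." layout from the example error message.
NODE_INFO_RE = re.compile(
    r"Node IP: (?P<ip>[^,\s]+), node ID: (?P<node_id>[0-9a-f]+)\."
)


def debug_commands(message: str) -> list[str]:
    """Return the CLI commands the error message suggests, or [] if no match."""
    match = NODE_INFO_RE.search(message)
    if match is None:
        return []
    return [
        f"ray list nodes --filter node_id={match['node_id']}",
        f"ray logs raylet.out -ip {match['ip']}",
    ]


# Opening portion of the example error message quoted in this PR.
example = (
    "Task failed because the node it was running on is dead or unavailable. "
    "Node IP: 127.0.0.1, node ID: "
    "e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646. "
    "This can happen if the node was preempted, had a hardware failure, "
    "or its raylet crashed unexpectedly."
)

for command in debug_commands(example):
    print(command)
```

Returning an empty list on a non-matching message keeps the helper safe to call on other error types; a format change in a later Ray version would also land in that branch.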
Minor follow ups from: #58539
Example error message: