@edoakes (Collaborator) commented Nov 14, 2025

Minor follow-ups from #58539.

Example error message:

Task failed because the node it was running on is dead or unavailable. Node IP: 127.0.0.1, node ID: e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646. This can happen if the node was preempted, had a hardware failure, or its raylet crashed unexpectedly. To see node death information, use `ray list nodes --filter node_id=e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646`, check the Ray dashboard cluster page, search the node ID in the GCS logs, or use `ray logs raylet.out -ip 127.0.0.1`.
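
For anyone who wants to act on this message programmatically, here is a minimal sketch that pulls the node IP and node ID out of the text and rebuilds the two CLI commands the message suggests. The regex is inferred only from the example above and is an assumption, not a stable Ray interface. The names `NodeDeathInfo`, `parse_node_death`, and `debug_commands` are hypothetical helpers and are not part of Ray.

```python
import re
from typing import List, NamedTuple, Optional


class NodeDeathInfo(NamedTuple):
    """Node details recovered from the error text (hypothetical helper type)."""
    node_ip: str
    node_id: str


# Assumption: the wording "Node IP: <ip>, node ID: <hex>." matches the example
# message in this PR. The format may change between Ray versions.
_NODE_RE = re.compile(r"Node IP: (?P<ip>[^,\s]+), node ID: (?P<id>[0-9a-f]+)\.")


def parse_node_death(message: str) -> Optional[NodeDeathInfo]:
    """Return the node IP and ID if the message matches, else None."""
    match = _NODE_RE.search(message)
    if match is None:
        return None
    return NodeDeathInfo(node_ip=match.group("ip"), node_id=match.group("id"))


def debug_commands(info: NodeDeathInfo) -> List[str]:
    """Rebuild the two CLI commands quoted in the error message."""
    return [
        f"ray list nodes --filter node_id={info.node_id}",
        f"ray logs raylet.out -ip {info.node_ip}",
    ]
```

Calling `debug_commands(parse_node_death(str(exc)))` on a caught exception would give the same commands the message already spells out, which is mainly useful when collecting them across many failed tasks.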

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
@edoakes edoakes requested a review from a team as a code owner November 14, 2025 18:04
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Nov 14, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request cleans up the error message for NODE_DIED events, making it more user-friendly and providing clearer debugging instructions. The changes in the C++ source and corresponding test updates look good. However, I found a potential bug in one of the test files that could cause it to fail.

@edoakes edoakes enabled auto-merge (squash) November 14, 2025 18:20
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Nov 14, 2025
@edoakes edoakes merged commit cc02220 into ray-project:master Nov 14, 2025
7 checks passed
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Nov 16, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025