Conversation

@Sparks0219 Sparks0219 commented Oct 30, 2025

Description

A hang was reported in a video object detection Ray Data workload.

An initial investigation by @jjyao and @dayshah observed that it was due to an actor restart: the actor creation task was being spilled to a raylet that had an outdated resource view. This was found by looking at the raylet state dump. The actor creation task required 1 GPU and 1 CPU, but the raylet it was being spilled to had a cluster view that reported no available GPUs. In reality there were many available GPUs, and all of the other raylet state dumps correctly reported them. Furthermore, the logs for the outdated raylet contained a "Failed to send a message to node: " error originating from the ray syncer. Hence an initial hypothesis was formed that the ray syncer retry policy was not working as intended.

A follow-up investigation by @edoakes and me revealed an incorrect usage of the grpc streaming callback API.
Currently, retries in the ray syncer after a failed send/write work as follows (see the sketch after this list):

  • OnWriteDone/OnReadDone(ok = false) is called after a failed write/read
  • Disconnect() (the one in *_bidi_reactor.h!) is called, which flips disconnected_ to true and calls DoDisconnect()
  • DoDisconnect() notifies grpc that we will no longer write to the channel via StartWritesDone() and removes the hold via RemoveHold()
  • grpc sees that the channel is idle and has no hold, so it calls OnDone()
  • we've overridden OnDone() to run a cleanup_cb containing the retry policy, which reinitializes the bidi reactor and reconnects to the same server every 2 seconds until it succeeds
  • fault tolerance accomplished! :)
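
To make the intended lifecycle concrete, here is a minimal sketch of this disconnect-and-retry pattern built on the grpc::ClientBidiReactor callback API. This is not Ray's actual code: Msg stands in for the syncer's protobuf message type, and ProcessMessage() and the reconnect callback are hypothetical placeholders.

#include <atomic>
#include <functional>
#include <utility>

#include <grpcpp/grpcpp.h>

template <typename Msg>
class RetryingBidiReactor : public grpc::ClientBidiReactor<Msg, Msg> {
 public:
  // `reconnect` re-creates the reactor, e.g. on a 2-second backoff timer.
  explicit RetryingBidiReactor(std::function<void()> reconnect)
      : reconnect_(std::move(reconnect)) {}

  // Call after binding this reactor to an RPC, e.g. via
  // stub->async()->Sync(&context, this) (binding elided in this sketch).
  void Start() {
    this->AddHold();               // keep the RPC open for future writes
    this->StartCall();
    this->StartRead(&incoming_);
  }

  void OnWriteDone(bool ok) override {
    if (!ok) Disconnect();         // a failed write begins teardown
  }

  void OnReadDone(bool ok) override {
    if (!ok) {                     // read stream failed or closed
      Disconnect();
      return;
    }
    // ProcessMessage(incoming_);  // hypothetical message handler
    this->StartRead(&incoming_);   // request the next read
  }

  // grpc calls this once all operations on the RPC have completed and all
  // holds have been removed; the retry policy lives here.
  void OnDone(const grpc::Status& /*status*/) override {
    auto reconnect = std::move(reconnect_);
    delete this;                   // reactors self-delete in the callback API
    reconnect();                   // e.g. re-create and reconnect after 2s
  }

 private:
  void Disconnect() {
    if (disconnected_.exchange(true)) return;  // make teardown idempotent
    this->StartWritesDone();       // tell grpc no further writes will come
    this->RemoveHold();            // release the hold so OnDone() can fire
  }

  Msg incoming_;
  std::atomic<bool> disconnected_{false};
  std::function<void()> reconnect_;
};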

However, from logs that we added, we weren't seeing OnDone() being called after DoDisconnect(). The grpc streaming callback best practices
(https://grpc.io/docs/languages/cpp/best_practices/#callback-streaming-api)
state: "The best practice is always to read until ok=false on the client side."
The OnDone grpc documentation (https://grpc.github.io/grpc/cpp/classgrpc_1_1_client_bidi_reactor.html#a51529f76deeda6416ce346291577ffa9)
states that OnDone "Notifies the application that all operations associated with this RPC have completed and all Holds have been removed."

Since we call StartWritesDone() and remove the hold, this should notify grpc that all operations associated with this bidi reactor are complete. HOWEVER, reads may not be finished, i.e. we may not have read all incoming data.
Consider the following scenario:
1.) We receive a bunch of resource view messages from the GCS and have not processed all of them
2.) OnWriteDone(ok = false) is called => Disconnect() => disconnected_ = true
3.) OnReadDone(ok = true) is called, but because disconnected_ = true we early return and STOP processing any more reads, as shown below:

// Early return in OnReadDone() (ray_syncer_bidi_reactor_base.h): once the
// reactor is marked disconnected, incoming messages are silently dropped.
if (*disconnected) {
  return;
}

4.) Pending reads are left in the queue, which prevents grpc from calling OnDone() since not all operations are done
5.) Hang: we're stuck in a zombie state where we drop all incoming resource view messages and, due to the disconnected check, never send any resource view updates

Hence the solution is to remove the disconnected check in OnReadDone() and simply allow all incoming data to be read; a sketch of the fixed read path is shown below.
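
As a sketch, again under the assumptions of the earlier reactor sketch (with ProcessMessage() as a hypothetical handler), the fixed read path keeps draining messages until grpc itself reports ok = false, letting every read operation retire so OnDone() can fire and trigger the retry:

// Fixed OnReadDone(): drain every incoming message until grpc reports
// ok = false, so all read operations complete and OnDone() can be delivered.
void OnReadDone(bool ok) override {
  if (!ok) {                       // grpc signals end of reads
    Disconnect();
    return;
  }
  // The old `if (*disconnected) { return; }` early return is removed:
  // even after a failed write we keep consuming the read stream.
  // ProcessMessage(incoming_);    // hypothetical handler
  this->StartRead(&incoming_);     // always request the next read
}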

There are a couple of interesting observations/questions remaining:
1.) The raylet with the outdated view is the raylet local to the GCS, yet we're seeing read/write errors despite the two being on the same node
2.) From the logs, the GCS syncer thinks the channel to the raylet syncer is still available. There are no error logs on the GCS side, and it's still sending messages to the raylet. So even though the raylet gets the "Failed to write error: ", we don't see a corresponding error log on the GCS side.

Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 force-pushed the joshlee/fix-syncer-bug branch from 3583b62 to 0783d7d on November 1, 2025 00:15
@Sparks0219 Sparks0219 changed the title [core][WIP] Ray Syncer Triage [core] Fix incorrect usage of grpc streaming API in ray syncer Nov 1, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
@Sparks0219 Sparks0219 added the go add ONLY when ready to merge, run all tests label Nov 1, 2025
@Sparks0219 Sparks0219 marked this pull request as ready for review November 1, 2025 01:09
@Sparks0219 Sparks0219 requested a review from a team as a code owner November 1, 2025 01:09
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Nov 1, 2025
Signed-off-by: joshlee <joshlee@anyscale.com>
@edoakes edoakes merged commit f229d86 into ray-project:master Nov 4, 2025
6 checks passed
@Sparks0219 Sparks0219 deleted the joshlee/fix-syncer-bug branch November 7, 2025 21:17
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025