Open
Labels
P0: Issues that should be fixed in short order
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
Description
What happened + What you expected to happen
Currently, arguments for actor creation tasks stored in plasma do not have the correct lineage_ref_count. This means the arguments themselves, or the task specs needed to reconstruct them, can be evicted. As a result, the max_restarts setting on the actor is not honored, and we hit a ray.exceptions.ReferenceCountingAssertionError when the actor tries to restart.
The solution will be to add an actor manager on the owner worker that handles actors owned by that worker. Behavior for detached actors will remain the same as it is today. Context for the solution decision: #51653 (comment)
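To make the failure mode concrete, here is a toy sketch of lineage reference counting. All names and structure here are illustrative assumptions, not Ray's actual implementation: a lineage entry keeps the task spec needed to reconstruct an object, and a count of zero makes both eligible for eviction.

```python
# Toy model (assumption: illustrative only, not Ray's code).
class LineageEntry:
    def __init__(self, task_spec):
        self.task_spec = task_spec
        # Bug described in this issue: actor-creation-task arguments are
        # recorded with a lineage_ref_count of 0, so nothing pins them.
        self.lineage_ref_count = 0

class LineageTable:
    def __init__(self):
        self.entries = {}

    def add(self, obj_id, task_spec):
        self.entries[obj_id] = LineageEntry(task_spec)

    def pin(self, obj_id):
        self.entries[obj_id].lineage_ref_count += 1

    def evict_if_unreferenced(self, obj_id):
        # With a count of 0, the entry (including the task spec needed for
        # reconstruction) is dropped, so a restarting actor can no longer
        # recover its constructor argument.
        if self.entries[obj_id].lineage_ref_count == 0:
            del self.entries[obj_id]
            return True
        return False

table = LineageTable()
table.add("config_arg", task_spec="Actor.__init__(np.zeros(...))")
# The actor may still need "config_arg" to restart, but nothing pinned it:
evicted = table.evict_if_unreferenced("config_arg")  # True: restart would fail
```

Under this model, the fix amounts to ensuring the owner pins lineage entries for actor-creation arguments (the `pin` call above) for as long as the actor can still restart.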
Versions / Dependencies
Any Ray version
Reproduction script
import numpy as np

import ray
from ray.cluster_utils import Cluster

cluster = Cluster()
cluster.add_node(num_cpus=0)  # head
ray.init(address=cluster.address)
worker1 = cluster.add_node(num_cpus=1)

@ray.remote(num_cpus=1, max_restarts=1)
class Actor:
    def __init__(self, config):
        self.config = config

    def ping(self):
        return self.config

# Arg is >100KB so it will go in the object store
actor = Actor.remote(np.zeros(100 * 1024 * 1024, dtype=np.uint8))
ray.get(actor.ping.remote())

worker2 = cluster.add_node(num_cpus=1)
cluster.remove_node(worker1, allow_graceful=True)

# This line will break
ray.get(actor.ping.remote())
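For reference, the constructor argument above is far larger than the roughly 100 KB cutoff the script's comment alludes to, below which small arguments are passed inline rather than through the object store (the exact cutoff value here is an assumption). A quick check of the sizes involved:

```python
import numpy as np

# Same argument as in the reproduction script: 100 MiB of zeros.
arg = np.zeros(100 * 1024 * 1024, dtype=np.uint8)
print(arg.nbytes)  # 104857600 bytes, i.e. 100 MiB

# Assumed inline-argument cutoff of ~100 KB; anything larger is promoted
# to the plasma object store, where this eviction bug can bite.
inline_threshold = 100 * 1024
print(arg.nbytes > inline_threshold)  # True
```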
Issue Severity
Medium: It is a significant difficulty but I can work around it.