
[core] Actor restarts don't work when an actor creation arg is evicted from plasma #53727

@dayshah

What happened + What you expected to happen

Currently, plasma-stored arguments to actor creation tasks don't have the correct lineage_ref_count. As a result, the arguments themselves, or the task specs needed to reconstruct them, can be evicted. When that happens, the actor's max_restarts setting is not honored, and a ray.exceptions.ReferenceCountingAssertionError is raised when the actor tries to restart.
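
To make the failure mode concrete, here is a minimal toy model in plain Python. It is not Ray's implementation: `ToyObjectStore`, `pin_for_lineage`, `evict_unpinned`, and `restart_actor` are hypothetical names invented for illustration. It only shows the general idea that an object with a lineage count of zero may be evicted, after which a restart that needs it cannot proceed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyObjectStore:
    """Toy stand-in for an object store with lineage pinning (NOT Ray's code)."""

    objects: dict = field(default_factory=dict)
    lineage_ref_count: dict = field(default_factory=dict)

    def put(self, object_id, value):
        # New objects start with no lineage references.
        self.objects[object_id] = value
        self.lineage_ref_count.setdefault(object_id, 0)

    def pin_for_lineage(self, object_id):
        # Hypothetical: record that some restartable task still needs this object.
        self.lineage_ref_count[object_id] += 1

    def evict_unpinned(self):
        # Drop every object that no lineage reference is keeping alive.
        unpinned = [oid for oid, n in self.lineage_ref_count.items() if n == 0]
        for object_id in unpinned:
            self.objects.pop(object_id, None)
            del self.lineage_ref_count[object_id]


def restart_actor(store, creation_arg_id):
    """Hypothetical restart: re-running the constructor needs the creation arg."""
    if creation_arg_id not in store.objects:
        raise LookupError(f"creation arg {creation_arg_id!r} was evicted")
    return store.objects[creation_arg_id]


# Without a lineage reference, the arg is evicted and the restart fails.
unpinned_store = ToyObjectStore()
unpinned_store.put("config", b"large-arg")
unpinned_store.evict_unpinned()

# With a lineage reference, the arg survives eviction and the restart succeeds.
pinned_store = ToyObjectStore()
pinned_store.put("config", b"large-arg")
pinned_store.pin_for_lineage("config")
pinned_store.evict_unpinned()
restored = restart_actor(pinned_store, "config")
```

In this sketch the fix corresponds to calling `pin_for_lineage` for as long as the actor may still restart; how Ray actually tracks this is outside the scope of the toy model.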

The solution will be to add an actor manager on the owner worker that handles the actors owned by that worker. Behavior for detached actors will remain the same as it is today. For context on this decision, see #51653 (comment).

Related:
#53713
#51653

Versions / Dependencies

Any Ray version

Reproduction script

import numpy as np
import ray
from ray.cluster_utils import Cluster

cluster = Cluster()
cluster.add_node(num_cpus=0)  # head
ray.init(address=cluster.address)
worker1 = cluster.add_node(num_cpus=1)

@ray.remote(num_cpus=1, max_restarts=1)
class Actor:
    def __init__(self, config):
        self.config = config

    def ping(self):
        return self.config

# The arg is 100 MiB, well over 100 KB, so it goes in the object store (plasma)
actor = Actor.remote(np.zeros(100 * 1024 * 1024, dtype=np.uint8))
ray.get(actor.ping.remote())

worker2 = cluster.add_node(num_cpus=1)
cluster.remove_node(worker1, allow_graceful=True)

# This call fails: the restart needs the creation arg, which has been evicted
ray.get(actor.ping.remote())

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P0: Issues that should be fixed in short order
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
