Remote actors don't appear in the local registry #93

Open
calebfletcher opened this issue Apr 24, 2023 · 8 comments
Labels
cluster Related to actor clusters

Comments

@calebfletcher

I was looking for a way to communicate with a remote actor without having to join it to a process group, and found this section in the code that prevents RemoteActors from being registered under the remote actor's name in the local registry:

```rust
// TODO: remote actors don't appear in the name registry
// if let Some(r_name) = name {
//     crate::registry::register(r_name, cell.clone())?;
// }
```

Uncommenting this code (and cloning the name to get around the move into ActorProperties::new_remote) seems to work as expected, with all the remote actors successfully being registered.

Is there an issue that this approach causes, or could this be enabled?

@calebfletcher
Author

Ran into one issue: when an actor is stopped on the remote node, it causes errors on the local node:

```
[2023-04-23T15:17:15Z DEBUG ractor_cluster::net::session] RECEIVE 127.0.0.1:54596 <- 127.0.0.1:4697 - 'NetworkMessage { message: Some(Control(ControlMessage { msg: Some(Terminate(Terminate { ids: [2] })) })) }'
[2023-04-23T15:17:15Z DEBUG ractor_cluster::net::session] RECEIVE 127.0.0.1:54596 <- 127.0.0.1:4697 - 'NetworkMessage { message: Some(Control(ControlMessage { msg: Some(PgLeave(PgLeave { group: "test", actors: [Actor { pid: 2, name: Some("actorname") }] })) })) }'
[2023-04-23T15:17:15Z DEBUG ractor_cluster::node::node_session] Actor 2 on node 0 exited, terminating local `RemoteActor` 0.2
[2023-04-23T15:17:15Z DEBUG ractor::registry] registering actor with name actorname
[2023-04-23T15:17:15Z ERROR ractor_cluster::node::node_session] Failed to spawn remote actor with 'Actor 'actorname' is already registered in the actor registry'
[2023-04-23T15:17:15Z WARN  ractor_cluster::node::node_session] NodeSession Some(NameMessage { name: "node_b@localhost", flags: Some(NodeFlags { version: 1 }), connection_string: "localhost:4697" }) received an unknown child actor exit event from 0.2 - 'Some("remote")'
```

The name of the remote actor is 'actorname', and node_a is the remote node.

I'm not sure why terminating the actor triggers a re-registration; I'll have to look into it.
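For context, the failure mode in the log is the name registry rejecting a duplicate name. Here is a minimal sketch of that behaviour (this is not ractor's actual implementation; the types and method names are hypothetical stand-ins):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a name registry: each name maps to at most
// one actor id, so registering an already-taken name is an error.
#[derive(Debug, PartialEq)]
enum RegistryError {
    AlreadyRegistered(String),
}

#[derive(Default)]
struct Registry {
    by_name: HashMap<String, u64>, // name -> actor id
}

impl Registry {
    fn register(&mut self, name: &str, id: u64) -> Result<(), RegistryError> {
        if self.by_name.contains_key(name) {
            return Err(RegistryError::AlreadyRegistered(name.to_string()));
        }
        self.by_name.insert(name.to_string(), id);
        Ok(())
    }

    fn unregister(&mut self, name: &str) {
        self.by_name.remove(name);
    }
}

fn main() {
    let mut reg = Registry::default();
    assert!(reg.register("actorname", 2).is_ok());

    // A second registration under the same name fails, which is the
    // "already registered in the actor registry" error from the log.
    assert_eq!(
        reg.register("actorname", 3),
        Err(RegistryError::AlreadyRegistered("actorname".to_string()))
    );

    // If the first entry had been unregistered on termination, the
    // second registration would have succeeded.
    reg.unregister("actorname");
    assert!(reg.register("actorname", 3).is_ok());
}
```

So something in the termination path must be spawning (and thus registering) a fresh RemoteActor under a name that was never unregistered.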

@calebfletcher
Author

calebfletcher commented Apr 24, 2023

The problem seems to occur due to NodeSession receiving ActorTerminated prior to PgLeave. The handler for ActorTerminated removes the actor from state.remote_actors, and then when the PgLeave handler calls NodeSession::get_or_spawn_remote_actor in order to get the ActorCell to remove them from the group, it inadvertently creates the remote actor again:

```rust
.get_or_spawn_remote_actor(&myself, name, pid, state)
```

I believe we need to be able to remove terminated actors from process groups without recreating them, but currently unregistering from a process group uses the ActorCell in order to notify any supervisors with the SupervisionEvent::ProcessGroupChanged event. ActorCell gets around this in the local case by first removing all group event listeners for that actor, and then leaving the group:

```rust
crate::pg::demonitor_all(self.get_id());
crate::pg::leave_all(self.get_id());
```

This approach wouldn't work for this situation, since we don't control the entire terminate+pgleave process.

I'm open to ideas for solving this. One approach: when the NodeSession receives a Terminate event, manually unregister the actor from the process group (notifying any monitoring supervisors in the process), and then, in the PgLeave event handler, only unregister actors that still exist in state.remote_actors. Testing this approach does seem to fix the issue, but it may not be the cleanest solution.

@slawlor
Owner

slawlor commented Apr 24, 2023

Hmm, so the local RemoteActor isn't fully shut down, perhaps? And therefore couldn't self-leave the process group? Yeah, it might be something we need to look at. Let me see if I can put a repro scenario together, replicate it, and figure out what might be good solutions. Thanks for reporting!

@slawlor slawlor added the bug Something isn't working label Apr 24, 2023
@calebfletcher
Author

I have a minimum reproducible example available here: https://gist.github.com/calebfletcher/057e9ca35acf3aaa181c1f0a5f679357
If you start server.rs first, then client.rs, and then look at the logs on the client, there should be a warning and an error: the same messages that appear in the second comment of this thread.

@slawlor
Owner

slawlor commented May 1, 2023

I need to re-run this after the merged PR, but I'm hoping it solved your issue. If so, I'll add an integration test for this scenario to make sure we don't regress on it.

@slawlor slawlor added the cluster Related to actor clusters label May 2, 2023
@calebfletcher
Author

Reran my test case with the upstream changes, and it seems to have fixed the issue, so thanks for that. I've opened PR #103 to implement the change and add an integration test.

@calebfletcher
Author

Additionally (although this may be better separated out into a new issue), calling .stop() on a RemoteActor kills the RemoteActor, not the actor that it represents. This may be the intended behaviour, but it is a bit of a footgun, so it should either be documented or have its behaviour changed. I got around this in the integration test by sending a message to the remote actor that then causes it to stop itself.
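The workaround can be illustrated generically (this sketch uses plain threads and channels rather than ractor's API): instead of stopping the local proxy, send the actor a message that makes it stop itself on its own node.

```rust
use std::sync::mpsc;
use std::thread;

// Messages the "remote" actor understands; StopYourself is the
// workaround message that makes the actor terminate itself.
enum Msg {
    Work(u32),
    StopYourself,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Msg>();

    // Stand-in for the remote actor: processes messages until it is
    // asked to stop itself, then returns how much work it did.
    let handle = thread::spawn(move || {
        let mut processed = 0u32;
        for msg in rx {
            match msg {
                Msg::Work(n) => processed += n,
                Msg::StopYourself => break, // the actor exits cleanly on its own side
            }
        }
        processed
    });

    tx.send(Msg::Work(1)).unwrap();
    tx.send(Msg::Work(2)).unwrap();
    // Rather than killing the local handle, ask the actor to stop.
    tx.send(Msg::StopYourself).unwrap();

    let processed = handle.join().unwrap();
    assert_eq!(processed, 3);
}
```

The key point is that the shutdown decision is made by the actor itself, so it goes through the actor's normal exit path instead of only tearing down the local proxy.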

@slawlor
Owner

slawlor commented May 8, 2023

> Additionally (although this may be better separated out into a new issue), calling .stop() on a RemoteActor kills the RemoteActor, not the actor that it represents. This may be the intended behaviour, but it is a bit of a footgun, so it should either be documented or have its behaviour changed. I got around this in the integration test by sending a message to the remote actor that then causes it to stop itself.

Yes, for this I don't have a good solution for remote supervision, which is why for now stop() doesn't propagate over the wire. I suppose it does make sense to allow a remote process to stop an actor, and we should propagate the stop signal (and kill, for that matter); however, I still think remote supervision is a large task to tackle here.

@slawlor slawlor removed the bug Something isn't working label Aug 15, 2023