
Unrecognized cast: {:untrack, pid} #127

Open
jesseshieh opened this issue May 14, 2019 · 7 comments · May be fixed by #128

Comments

@jesseshieh

I'm not sure whether it actually causes a problem, but my logs sometimes show lines like these:

2019-02-25 23:55:35.676 [warn] [swarm on web@100.96.20.105] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.583.0>}
2019-02-25 23:55:35.676 [warn] [swarm on web@100.96.20.105] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.TaskRegistry, #PID<39255.580.0>, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: MyApp.Scheduler.TaskRegistry}]}}}
2019-02-25 23:55:35.676 [warn] [swarm on web@100.96.20.105] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.584.0>}
2019-02-25 23:55:35.677 [warn] [swarm on web@100.96.20.105] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.JobBroadcaster, #PID<39255.581.0>, %{mfa: {Quantum.JobBroadcaster, :start_link, [%Quantum.JobBroadcaster.StartOpts{debug_logging: true, jobs: [...jobs here...]}}}

Taken from quantum-elixir/quantum-core#398

My guess is that something is sending the :track and :untrack messages as casts instead of calls. This line expects the message to be a call, and this line sends the message as a call.

This line, however, sends the message as a cast, so my guess is that this is the culprit. Perhaps simply changing it to a call would fix the issue?
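
To make the symptom concrete, here is a minimal sketch (assuming a :gen_statem-based tracker roughly like Swarm's, not Swarm's actual code) of how a message that is only handled as a call ends up logged as an unrecognized cast when it arrives as a cast:

```elixir
# Minimal sketch, NOT Swarm's actual tracker: {:untrack, pid} is recognized
# only as a call; the same tuple sent as a cast falls through to a catch-all
# clause that just logs it, which matches the warnings above.
defmodule TrackerSketch do
  @behaviour :gen_statem
  require Logger

  def start_link(), do: :gen_statem.start_link(__MODULE__, [], [])

  @impl true
  def callback_mode(), do: :handle_event_function

  @impl true
  def init([]), do: {:ok, :tracking, %{}}

  @impl true
  def handle_event({:call, from}, {:untrack, pid}, :tracking, registry) do
    # Recognized: remove the pid and reply to the caller.
    {:keep_state, Map.delete(registry, pid), [{:reply, from, :ok}]}
  end

  def handle_event(:cast, msg, _state, _registry) do
    # Unrecognized: the cast is dropped after logging a warning.
    Logger.warn("unrecognized cast: #{inspect(msg)}")
    :keep_state_and_data
  end
end
```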

I'm on swarm 3.4.0.

@jesseshieh (Author) commented May 14, 2019

FWIW, I also see these messages when running mix test on swarm itself. Specifically:

mix test test/integration_test.exs:63

And the test passes. When I change the cast to a call, the messages are gone, but the test fails.

Also, interestingly, when I delete the lines here, the tests still pass.

@jesseshieh (Author)

More observations

In the passing case, I see the following logs


*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_recv,<7155.141.0>,
                {{0,{0,1}},0},
                [{entry,{worker,8},
                        <7155.158.0>,#Ref<7155.3024736589.70516738.97011>,
                        #{group1 => true,
                          mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                        {0,{0,0,1}}},
...
                        {0,{0,0,1}}}]} in state syncing => tracking
*DBG* 'Elixir.Swarm.Tracker' postpone cast {untrack,<0.158.0>} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_ack,<7155.141.0>,
               {{0,{0,1}},0},
               [{entry,{worker,8},
                       <7155.159.0>,#Ref<7155.608380958.70516737.97926>,
                       #{group1 => true,
                         mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                       {0,{0,2,0}}},
...
                       {0,{0,2,0}}}]} in state awaiting_sync_ack
08:06:14.211 [info]  [swarm on node2@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from node1@127.0.0.1, syncing with remote registry
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<7155.141.0>,{0,{0,0,2}},{untrack,<7155.158.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_ack,<7155.141.0>,
               {{0,{0,1}},0},
               [{entry,{worker,8},
                       <7155.159.0>,#Ref<7155.608380958.70516737.97926>,
                       #{group1 => true,
                         mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                       {0,{0,2,0}}},
...
                       {0,{0,2,0}}}]} in state awaiting_sync_ack => tracking

Note that both nodes receive a sync_ack and the untrack is "postponed".
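
For context, the "postpone" in these logs is :gen_statem's standard deferral mechanism: an event that arrives in a state that cannot handle it yet is held back and re-delivered after the next state transition. As an illustrative extra clause in the shape of the sketch above (again, not Swarm's actual code):

```elixir
# Illustrative only: while still awaiting the sync acknowledgement, defer an
# untrack cast and let :gen_statem re-deliver it after the next state change.
def handle_event(:cast, {:untrack, _pid}, :awaiting_sync_ack, _registry) do
  {:keep_state_and_data, [:postpone]}
end
```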

In the failing case, I see the following logs.

*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_recv,<7155.143.0>,
                {{{0,1},0},0},
                [{entry,{worker,8},
                        <7155.160.0>,#Ref<7155.310440809.337641473.52450>,
                        #{group1 => true,
                          mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                        {0,{0,1,0}}},
...
                        {0,{0,1,0}}}]} in state syncing
08:02:14.404 [info]  [swarm on node1@127.0.0.1] [tracker:syncing] received registry from node2@127.0.0.1, merging..
*DBG* 'Elixir.Swarm.Tracker' receive call {untrack,<0.160.0>} from <7155.143.0> in state awaiting_sync_ack

Note that this time no sync_ack messages appear to be received, but the untrack is received. I'm guessing that the untrack message somehow prevents the sync_ack messages from being sent.

@jesseshieh (Author)

It turns out that changing the cast to a call doesn't work because it causes a deadlock. The pull request I submitted does not change it to a call; it just makes the cast recognized.
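
In the shape of the sketch above (a hedged illustration, not the actual change in #128), that means adding a cast clause that does the same cleanup as the call clause, so the message is handled without anyone blocking on a reply:

```elixir
# Illustrative only: accept {:untrack, pid} as a cast as well, reusing the
# same cleanup as the call clause. No caller blocks waiting for a reply.
def handle_event(:cast, {:untrack, pid}, :tracking, registry) do
  {:keep_state, Map.delete(registry, pid)}
end
```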

@x-ji commented Feb 17, 2020

@jesseshieh Hi, we're also seeing this error in our app, and it causes some workers not to be registered properly. It seems to happen when nodes lose their connection to each other (maybe because libcluster's Kubernetes strategy isn't totally stable?) and then try to reconnect.

Did your patch fix the issue? If so, we'll just host the patched version in our repo and point the library dependency at that version.

Thanks.

@jesseshieh (Author)

I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.

@bernardo-martinez

> I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.

@jesseshieh what did you use instead?

@huruk commented Aug 29, 2022

@bernardo-martinez I still saw the issue on every deployment.
I ended up using horde as a replacement.
