
Unrecognized cast: {:untrack, pid} #127

Open
jesseshieh opened this issue May 14, 2019 · 7 comments · May be fixed by #128

Comments

@jesseshieh

I'm not sure whether it actually causes a problem, but my logs sometimes show lines like these:

2019-02-25 23:55:35.676 [warn] [swarm on web@100.96.20.105] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.583.0>}
2019-02-25 23:55:35.676 [warn] [swarm on web@100.96.20.105] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.TaskRegistry, #PID<39255.580.0>, %{mfa: {Quantum.TaskRegistry, :start_link, [%Quantum.TaskRegistry.StartOpts{name: MyApp.Scheduler.TaskRegistry}]}}}
2019-02-25 23:55:35.676 [warn] [swarm on web@100.96.20.105] [tracker:handle_cast] unrecognized cast: {:untrack, #PID<0.584.0>}
2019-02-25 23:55:35.677 [warn] [swarm on web@100.96.20.105] [tracker:handle_cast] unrecognized cast: {:track, MyApp.Scheduler.JobBroadcaster, #PID<39255.581.0>, %{mfa: {Quantum.JobBroadcaster, :start_link, [%Quantum.JobBroadcaster.StartOpts{debug_logging: true, jobs: [...jobs here...]}}}

Taken from quantum-elixir/quantum-core#398

My guess is that something is sending the :track and :untrack messages as casts instead of calls. This line expects the message to be a call, and this line sends the message as a call.

This line, however, sends the message as a cast, so my guess is that this is the culprit. Perhaps simply changing it to a call would fix the issue?
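
To make the symptom concrete, here is a minimal sketch (assuming a :gen_statem-based tracker roughly like Swarm's, not Swarm's actual code) of how a message that is only handled as a call ends up logged as an unrecognized cast when it arrives as a cast:

```elixir
# Minimal sketch, NOT Swarm's actual tracker: {:untrack, pid} is recognized
# only as a call; the same tuple sent as a cast falls through to a catch-all
# clause that just logs it, which matches the warnings above.
defmodule TrackerSketch do
  @behaviour :gen_statem
  require Logger

  def start_link(), do: :gen_statem.start_link(__MODULE__, [], [])

  @impl true
  def callback_mode(), do: :handle_event_function

  @impl true
  def init([]), do: {:ok, :tracking, %{}}

  @impl true
  def handle_event({:call, from}, {:untrack, pid}, :tracking, registry) do
    # Recognized: remove the pid and reply to the caller.
    {:keep_state, Map.delete(registry, pid), [{:reply, from, :ok}]}
  end

  def handle_event(:cast, msg, _state, _registry) do
    # Unrecognized: the cast is dropped after logging a warning.
    Logger.warn("unrecognized cast: #{inspect(msg)}")
    :keep_state_and_data
  end
end
```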

I'm on swarm 3.4.0.

@jesseshieh (Author) commented May 14, 2019

FWIW, I also see these messages when running mix test on swarm itself. Specifically:

mix test test/integration_test.exs:63

And the test passes. When I change the cast to a call, the messages are gone, but the test fails.

Also, interestingly, when I delete the lines here, the tests still pass.

@jesseshieh (Author)

More observations

In the passing case, I see the following logs


*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_recv,<7155.141.0>,
                {{0,{0,1}},0},
                [{entry,{worker,8},
                        <7155.158.0>,#Ref<7155.3024736589.70516738.97011>,
                        #{group1 => true,
                          mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                        {0,{0,0,1}}},
...
                        {0,{0,0,1}}}]} in state syncing => tracking
*DBG* 'Elixir.Swarm.Tracker' postpone cast {untrack,<0.158.0>} in state awaiting_sync_ack
*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_ack,<7155.141.0>,
               {{0,{0,1}},0},
               [{entry,{worker,8},
                       <7155.159.0>,#Ref<7155.608380958.70516737.97926>,
                       #{group1 => true,
                         mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                       {0,{0,2,0}}},
...
                       {0,{0,2,0}}}]} in state awaiting_sync_ack
08:06:14.211 [info]  [swarm on node2@127.0.0.1] [tracker:awaiting_sync_ack] received sync acknowledgement from node1@127.0.0.1, syncing with remote registry
*DBG* 'Elixir.Swarm.Tracker' receive info {event,<7155.141.0>,{0,{0,0,2}},{untrack,<7155.158.0>}} in state tracking
*DBG* 'Elixir.Swarm.Tracker' consume cast {sync_ack,<7155.141.0>,
               {{0,{0,1}},0},
               [{entry,{worker,8},
                       <7155.159.0>,#Ref<7155.608380958.70516737.97926>,
                       #{group1 => true,
                         mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                       {0,{0,2,0}}},
...
                       {0,{0,2,0}}}]} in state awaiting_sync_ack => tracking

Note that both nodes receive a sync_ack and the untrack is "postponed".
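
For context, the "postpone" in these logs is :gen_statem's standard deferral mechanism: an event that arrives in a state that cannot handle it yet is held back and re-delivered after the next state transition. As an illustrative extra clause in the shape of the sketch above (again, not Swarm's actual code):

```elixir
# Illustrative only: while still awaiting the sync acknowledgement, defer an
# untrack cast and let :gen_statem re-deliver it after the next state change.
def handle_event(:cast, {:untrack, _pid}, :awaiting_sync_ack, _registry) do
  {:keep_state_and_data, [:postpone]}
end
```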

In the failing case, I see the following logs.

*DBG* 'Elixir.Swarm.Tracker' receive cast {sync_recv,<7155.143.0>,
                {{{0,1},0},0},
                [{entry,{worker,8},
                        <7155.160.0>,#Ref<7155.310440809.337641473.52450>,
                        #{group1 => true,
                          mfa => {'Elixir.MyApp.Worker',start_link,[]}},
                        {0,{0,1,0}}},
...
                        {0,{0,1,0}}}]} in state syncing
08:02:14.404 [info]  [swarm on node1@127.0.0.1] [tracker:syncing] received registry from node2@127.0.0.1, merging..
*DBG* 'Elixir.Swarm.Tracker' receive call {untrack,<0.160.0>} from <7155.143.0> in state awaiting_sync_ack

Note that this time no sync_ack messages appear to be received, but the untrack is received. I'm guessing that the untrack message somehow prevents the sync_ack messages from being sent.

@jesseshieh (Author)

It turns out that changing the cast to a call doesn't work because it causes a deadlock. The pull request I submitted does not change it to a call; it just makes the cast recognized.
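
In the shape of the sketch above (a hedged illustration, not the actual change in #128), that means adding a cast clause that does the same cleanup as the call clause, so the message is handled without anyone blocking on a reply:

```elixir
# Illustrative only: accept {:untrack, pid} as a cast as well, reusing the
# same cleanup as the call clause. No caller blocks waiting for a reply.
def handle_event(:cast, {:untrack, pid}, :tracking, registry) do
  {:keep_state, Map.delete(registry, pid)}
end
```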

@x-ji commented Feb 17, 2020

@jesseshieh Hi, we're also seeing this error in our app, and it causes some workers not to be registered properly. It seems to happen when nodes lose their connection to each other (maybe because libcluster's Kubernetes strategy isn't totally stable?) and then try to reconnect.

Did your patch fix the issue? If so, we'll just host the patched version in our repo and point the library dependency at that version.

Thanks.

@jesseshieh (Author)

I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.

@bernardo-martinez

> I can't remember if the patch fixed it or not since it was so long ago, but I've since decided to use something other than swarm so I'm not sure I'll be of much help.

@jesseshieh what did you use instead?

@huruk commented Aug 29, 2022

@bernardo-martinez I still saw the issue on every deployment.
I ended up using horde as a replacement.
