target module not available on remote node #122

Open
alexferreira opened this issue Jan 11, 2019 · 9 comments

@alexferreira

I have two services, calculator and investigator, clustered through libcluster.

When I register my service, something strange sometimes happens.

For example:

When I start the calculator service, it logs the following message.

[warn] [swarm on calculator@127.0.0.1] [tracker: start_pid_remotely] "a475b420-e5f8-4528-9b72-766b7e75d177" could not be started on investigator@127.0.0.1: target module not available on remote node, retrying operation after 1000ms ..

And in the investigator service I get the following:

[warn] [swarm on investigator@127.0.0.1] [tracker: do_track] ** (UndefinedFunctionError) function Calculator.Supervisor.register/1 is undefined (module Calculator.Supervisor is not available)
    Calculator.Supervisor.register("a475b420-e5f8-4528-9b72-766b7e75d177")
    (swarm) lib/swarm/tracker/tracker.ex:1082: Swarm.Tracker.do_track/2
    (stdlib) gen_statem.erl:1660: :gen_statem.call_state_function/5
    (stdlib) gen_statem.erl:1023: :gen_statem.loop_event_state_function/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

But sometimes, when I start it, it works without problems.

Can anybody help me?

@arjan
Collaborator

arjan commented Jan 11, 2019

To me this looks like you have a cluster with heterogeneous OTP apps.
For Swarm to work, the OTP application that you are going to distribute processes for (e.g. with Swarm.register_name/4) needs to be available on all the nodes participating in the Swarm cluster.
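One way to verify this requirement is to ask each node whether it can load the module in question. The following is a minimal sketch, not part of Swarm: `RemoteCodeCheck` and `missing_on/2` are hypothetical names, and it relies only on `:rpc.call/4` and `Code.ensure_loaded?/1`.

```elixir
defmodule RemoteCodeCheck do
  @moduledoc "Hypothetical helper for illustration; not part of Swarm."

  # Returns the subset of `nodes` on which `module` cannot be loaded.
  # A node that is unreachable ({:badrpc, reason}) also counts as missing.
  def missing_on(module, nodes \\ [node() | Node.list()]) do
    Enum.reject(nodes, fn n ->
      :rpc.call(n, Code, :ensure_loaded?, [module]) == true
    end)
  end
end
```

For the case in this thread, `RemoteCodeCheck.missing_on(Calculator.Supervisor)` run on the calculator node would list investigator@127.0.0.1 if that node cannot load the module.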

@alexferreira
Author

@arjan this happens right after running Swarm.register_name/4; it only works after a few attempts.

@arjan
Collaborator

arjan commented Jan 11, 2019

So are the same OTP applications started on both nodes?

@alexferreira
Author

Yes, the same applications were started on both nodes.

It's working right now. However, if I stop one of the applications and start it again several times, the problem mentioned above happens.

@arjan
Collaborator

arjan commented Jan 11, 2019

Do you mean stopping the node or just stopping the application (Application.stop)?
Maybe the cluster is already formed before all application code is loaded, and tracker requests are already coming in. However, I cannot imagine that this takes very long...

@alexferreira
Author

In this first GIF, I start the applications and the error quoted above appears shortly afterwards.

[GIF: swarm]

In the second GIF, the error does not occur.

[GIF: swarm1]

@bitwalker
Owner

The problem seems to be that the second node is still loading code when Swarm on the first node tells Swarm on the second node to start a process (resulting in the crash, because the code isn't loaded yet). This is happening because when running with Mix, applications and their code are loaded and started sequentially, while in a release, all application code is first loaded, then applications are started.

My guess is that Mix starts Swarm before it starts the part of the system which invokes register_name, so Swarm on the second node starts and is able to communicate with the first node and accept registration requests before the code for the registration callback is loaded - since this is inherently racy, that's why it works only some of the time.

@arjan @beardedeagle Until we get the refactoring implemented so that Swarm can be started under the supervision tree rather than as its own tree, we could provide a configuration option which allows specifying an application that needs to be started before Swarm will start serving requests, and then basically just loop until the application status (via :application_controller.info/0) shows that it is started. Thoughts? The refactor is really the fix, but having a short term solution to this would be nice.

@alexferreira
Author

@bitwalker I worked around the problem using a DynamicSupervisor.

@beardedeagle
Collaborator

@bitwalker I think that's a workable temp solution, though I'd take it a step further and allow it to accept a list of applications.
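The short-term idea discussed here (wait until a list of applications has started before serving requests) could look roughly like the sketch below. This is not Swarm's implementation and Swarm has no such option as far as this thread shows: `WaitForApps`, `started?/1`, and `await/3` are hypothetical names, and the sketch polls the documented `Application.started_applications/0` rather than `:application_controller.info/0`.

```elixir
defmodule WaitForApps do
  @moduledoc "Hypothetical helper sketching the proposed workaround; not part of Swarm."

  # True when every app in `apps` is in the list of started applications.
  def started?(apps) do
    running = for {app, _desc, _vsn} <- Application.started_applications(), do: app
    Enum.all?(apps, &(&1 in running))
  end

  # Polls every `interval` ms until all `apps` are started,
  # giving up once roughly `timeout` ms have elapsed.
  def await(apps, timeout \\ 5_000, interval \\ 100) do
    cond do
      started?(apps) ->
        :ok

      timeout <= 0 ->
        {:error, :timeout}

      true ->
        Process.sleep(interval)
        await(apps, timeout - interval, interval)
    end
  end
end
```

A caller would run something like `WaitForApps.await([:calculator])` before registering processes, assuming `:calculator` is the OTP application name. Note that a started application only implies its code is loaded under a release or once Mix has started it; the polling loop does not remove the race, it only narrows it.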
