Swarm gets blocked/deadlocked during tests #96

Open
sashaafm opened this issue Jul 30, 2018 · 3 comments

@sashaafm

Hello @bitwalker, we've been using Swarm for some time now, and recently we started having a problem where our test suite seems to get blocked or deadlocked when it runs. It always gets stuck at the same point, printing the same logs and then blocking there.

Below are the logs printed just before the suite gets stuck:

[warn] [swarm on primary@127.0.0.1] [tracker:start_pid_remotely] failed to start {:tracker, "user_id", "acc_id"} on nonode@nohost: nodedown, retrying operation.. pid=<0.3704.0> application=swarm module=Swarm.Logger function=warn/1 file=/presence-api/deps/swarm/lib/swarm/logger.ex line=15 
[error] [swarm on primary@127.0.0.1] [tracker:start_pid_remotely] ** (ErlangError) Erlang error: {:function_clause, {:gen_statem, :call, [{Swarm.Tracker, {:error, {:invalid_ring, :no_nodes}}}, {:track, {:tracker, "user_id", "acc_id"}, %{mfa: {Presence.Users.Swarm, :start, [{:user, "user_id", "acc_id"}, {:controller, "user_id", "acc_id"}, {:user_supervisor, "user_id", "acc_id"}, {:tracker, "user_id", "acc_id"}, %Presence.Models.User{account_id: "acc_id", user_id: "user_id"}]}}}, :infinity]}}
    (stdlib) gen.erl:253: :gen.do_for_proc({Swarm.Tracker, {:error, {:invalid_ring, :no_nodes}}}, #Function<0.94381411/1 in :gen.call/4>)
    (stdlib) gen_statem.erl:591: :gen_statem.call_dirty/4
    (swarm) lib/swarm/tracker/tracker.ex:1115: Swarm.Tracker.start_pid_remotely/6
    (elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
 pid=<0.3704.0> application=swarm module=Swarm.Logger function=error/1 file=/presence-api/deps/swarm/lib/swarm/logger.ex line=27 
[error] Task #PID<0.3704.0> started from Swarm.Tracker terminating
** (Protocol.UndefinedError) protocol String.Chars not implemented for {:error, {:invalid_ring, :no_nodes}}. This protocol is implemented for: Atom, BitString, Date, DateTime, Float, Integer, List, NaiveDateTime, Time, URI, Version, Version.Requirement
    (elixir) /usr/local/src/elixir/lib/elixir/lib/string/chars.ex:3: String.Chars.impl_for!/1
    (elixir) /usr/local/src/elixir/lib/elixir/lib/string/chars.ex:22: String.Chars.to_string/1
    (swarm) lib/swarm/tracker/tracker.ex:1152: Swarm.Tracker.start_pid_remotely/6
    (elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Function: #Function<1.98420104/0 in Swarm.Tracker.do_track/2>
    Args: [] pid=<0.3704.0> error_logger=format 
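
As an aside, the second error in that trace (the Protocol.UndefinedError) looks like a logging side effect rather than the root cause: tracker.ex:1152 apparently interpolates the {:error, {:invalid_ring, :no_nodes}} tuple into a string, and String.Chars isn't implemented for tuples. A minimal, purely illustrative reproduction of just that part (not Swarm code):

reason = {:error, {:invalid_ring, :no_nodes}}

# String interpolation calls String.Chars.to_string/1, which has no tuple implementation:
"failed to start remotely: #{reason}"
#=> ** (Protocol.UndefinedError) protocol String.Chars not implemented for {:error, {:invalid_ring, :no_nodes}}

# Interpolating inspect(reason) instead would log the tuple without raising:
"failed to start remotely: #{inspect(reason)}"
#=> "failed to start remotely: {:error, {:invalid_ring, :no_nodes}}"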

So far we have no idea why it is getting stuck. To make things stranger, I had a hunch about what might be causing the problem: I altered the code I thought was problematic and the suite no longer got stuck locally (running in a Docker container). However, the same problem persists in our CI pipeline, which runs the exact same Docker image and test script. I've even executed the tests with the --trace flag, which serialises all tests and should prevent deadlocks (I think).

I'm not sure whether this is a bug in Swarm, or how we should dig deeper. I also find it quite odd that ExUnit isn't timing out... Maybe it's getting stuck between test cases?

Also, what could cause this behaviour?
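
In case it helps with reproducing this, ExUnit's per-test timeout can be tightened so a hang shows up as a timeout failure instead of blocking the whole run; a sketch with an illustrative value and a hypothetical module name (note that, if I read the mix test docs right, --trace sets the timeout to infinity, so this wouldn't apply to trace runs):

# test/test_helper.exs -- illustrative 60s default timeout for every test
ExUnit.start(timeout: 60_000)

# or per module, using a hypothetical test module:
defmodule Presence.TimeoutExampleTest do
  use ExUnit.Case

  # each test in this module fails after 60s instead of hanging forever
  @moduletag timeout: 60_000

  test "placeholder" do
    assert true
  end
end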

@arjan
Collaborator

arjan commented Jul 30, 2018

The :no_nodes error suggests to me that the internal ring does not have any nodes to connect to. Are you using a blacklist or whitelist in your tests?

@sashaafm
Author

We do have a blacklist:

config :swarm,
  node_blacklist: [
    ~r/^primary@.+$/
  ]

I've noticed these exact logs also pop up at other points during the tests, and things get stuck there for a little while too. However, those soon pass and the tests continue, while this one test case stays stuck forever.
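
For what it's worth, the node the tests run on (primary@127.0.0.1, per the logs above) does match that blacklist pattern; a quick check in iex:

iex> Regex.match?(~r/^primary@.+$/, "primary@127.0.0.1")
true

So, if Swarm also applies the blacklist to the local node (I'm not sure whether it does), the ring could end up with no eligible nodes at all, which would line up with the {:invalid_ring, :no_nodes} error above.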

@arjan
Collaborator

arjan commented Jul 31, 2018

Could you check my branch on PR #94? It should fix all the current issues with the test cases.
