Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swarm tracker clock gets weird, eats memory #106

Open
morganatwishpond opened this issue Sep 20, 2018 · 1 comment
Open

Swarm tracker clock gets weird, eats memory #106

morganatwishpond opened this issue Sep 20, 2018 · 1 comment

Comments

@morganatwishpond
Copy link

We're running a small application in our kubernetes cluster and we're encountering a few problems.

Specifically, we've got nodes that will deadlock on startup like #91 and clocks getting out of sync left and right as in #21 (comment),

These problems don't occur reliably, and I haven't been able to reproduce them locally or even with an isolated version of the application in the production cluster, so tracking them down and figuring out what is going on has proven so far impossible.

One thing that we have reliably noticed is that the tracker clocks get kinda wild. See here:

iex(default@100.96.20.17)1> :sys.get_state(Swarm.Tracker)
{:tracking,
 %Swarm.Tracker.TrackerState{
   clock: {{0,
     {0,
      {0,
       {0,
        {0,
         {0,
          {0,
           {0,
            {0,
             {0,
              {0,
               {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {...}}}}}}}}}}}}}}}}}}}}}}}},
    0},
   nodes: [:"default@100.96.18.13", :"default@100.96.20.18"],
   pending_sync_reqs: [],
   self: :"default@100.96.20.17",
   strategy: #<Ring[:"default@100.96.20.18", :"default@100.96.20.17", :"default@100.96.18.13"]>,
   sync_node: nil,
   sync_ref: nil
 }}

That example is after the node had been up approximately 8 minutes.

I killed a node and reconnected as soon as it was available (~10s or so) and the state looked like this:

iex(default@100.96.20.19)1> :sys.get_state(Swarm.Tracker)
{:tracking,
 %Swarm.Tracker.TrackerState{
   clock: {{0,
     {0,
      {0,
       {0,
        {0,
         {0,
          {0,
           {0,
            {0,
             {0,
              {0,
               {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {0, {...}}}}}}}}}}}}}}}}}}}}}}}},
    0},
   nodes: [:"default@100.96.20.18", :"default@100.96.18.13"],
   pending_sync_reqs: [],
   self: :"default@100.96.20.19",
   strategy: #<Ring[:"default@100.96.20.18", :"default@100.96.20.19", :"default@100.96.18.13"]>,
   sync_node: nil,
   sync_ref: nil
 }}

The other nodes were none to happy about this restart:

21:30:24.493 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.18.12": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "alias:1223585:83c9eaa2540ffb4f"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:24.586 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.18.12": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "lead:26951:f5nil3ngubszyxp9"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:24.658 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.18.12": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2314846:1617034e75bf1637"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:24.699 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.18.12": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "alias:1223585:83c9eaa2540ffb4f"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:24.763 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.18.12": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "alias:1223585:83c9eaa2540ffb4f"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:25.026 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.18.12": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "alias:1223585:83c9eaa2540ffb4f"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:25.036 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.18.12": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2268963:ry8i9l30ncx9gsbm"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:30:25.425 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.18.12": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2302935:elgx5nl1mdnciy6m"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:41:06.042 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.20.17": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2271277:wjtwl0qpbn1eeyzk"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}
21:41:06.437 [warn] [swarm on default@100.96.20.18] [tracker:handle_call] failed to execute remote get_by_name on :"default@100.96.20.17": {:EXIT, {:badarg, [{:ets, :lookup, [:swarm_registry, "entrant:2271277:y5j7xaxbqvabkwd3"], []}, {Swarm.Registry, :get_by_name, 1, [file: 'lib/swarm/registry.ex', line: 148]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}}

We're running 7ed7b2 right now and haven't been able improve the behaviour of the application.

Is anyone able to shed some light on these issues?

@beardedeagle
Copy link
Collaborator

I've encountered some weird clock drift as well and can reproduce somewhat reliably. It is an issue what @bitwalker and I are aware of and are working to address.

tschmittni added a commit to tschmittni/swarm that referenced this issue Nov 14, 2018
This is a follow up of bitwalker#85

I forgot to change the handle_topology_change method which is
as far as I can tell also the root cause for
bitwalker#106
tschmittni added a commit to tschmittni/swarm that referenced this issue Nov 16, 2018
- Added test for inspecting the node state after sync to make sure
  that clocks are still correct.
- Fix sync test which was not testing concurrent clocks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants