
Dead server not removed from consul manager's servers list when dead server's IP address alive as client #5650

Closed
rammohanganap opened this issue Apr 11, 2019 · 6 comments
Labels
type/bug Feature does not function as expected



rammohanganap commented Apr 11, 2019


Overview of the Issue

A dead server is not removed from the Consul manager's servers list when the dead server's IP address comes back alive as a client. A new server was added, killed, and force-left (consul force-leave). The same IP address was then assigned to a new client, and we see the Consul manager rebalancing across 4 servers even though we only have 3 servers, even after 24h.

Reproduction Steps

Steps to reproduce this issue:

1. Start with a healthy 3-server cluster.
2. Add a new server node (ezk-0760588ca5d7c54e6-a-wo), then kill its consul process.
3. From one of the remaining servers, run consul force-leave ezk-0760588ca5d7c54e6-a-wo.
4. Assign the dead server's IP address (10.16.1.81) to a new client agent and let it join.
5. Observe the manager still rebalancing across 4 servers.

Server log before adding a server:

2019/04/09 18:50:58 [DEBUG] manager: Rebalanced 1 servers, next active server is econ-52485095260256623-c-gwo1.us-west1 (Addr: tcp/10.132.2.48:8300) (DC: us-west1)
2019/04/09 18:51:03 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-0d5059ea0ad72e4f2-b-ea.us-east-1 (Addr: tcp/10.16.37.45:8300) (DC: us-east-1)
2019/04/09 18:51:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301

Server Debug logs after adding new server node (ezk-0760588ca5d7c54e6-a-wo) and killing the consul process ezk-0760588ca5d7c54e6-a-wo:

2019/04/09 18:50:18 [DEBUG] memberlist: Stream connection from=10.16.37.45:44530
2019/04/09 18:50:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.37.45:8302
2019/04/09 18:50:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.486555ms) from=127.0.0.1:58348
2019/04/09 18:50:30 [DEBUG] http: Request GET /v1/catalog/services (1.323442ms) from=127.0.0.1:58350
2019/04/09 18:50:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.492054ms) from=127.0.0.1:58352
2019/04/09 18:50:30 [DEBUG] http: Request GET /v1/agent/members (93.618µs) from=127.0.0.1:58354
2019/04/09 18:50:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:50:35 [DEBUG] memberlist: Stream connection from=10.16.3.52:39958
2019/04/09 18:50:46 [DEBUG] memberlist: Stream connection from=10.16.3.238:39546
2019/04/09 18:50:58 [DEBUG] manager: Rebalanced 1 servers, next active server is econ-52485095260256623-c-gwo1.us-west1 (Addr: tcp/10.132.2.48:8300) (DC: us-west1)
2019/04/09 18:51:03 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-0d5059ea0ad72e4f2-b-ea.us-east-1 (Addr: tcp/10.16.37.45:8300) (DC: us-east-1)
2019/04/09 18:51:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:51:18 [DEBUG] memberlist: Stream connection from=10.16.37.45:44600
2019/04/09 18:51:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.33.185:8302
2019/04/09 18:51:26 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-00dfa4c5521dace34-a-wo.us-west-2 (Addr: tcp/10.16.3.52:8300) (DC: us-west-2)
2019/04/09 18:51:29 [DEBUG] memberlist: Stream connection from=10.16.3.52:34084
2019/04/09 18:51:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.41693ms) from=127.0.0.1:58426
2019/04/09 18:51:30 [DEBUG] http: Request GET /v1/catalog/services (1.294127ms) from=127.0.0.1:58428
2019/04/09 18:51:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.281487ms) from=127.0.0.1:58430
2019/04/09 18:51:30 [DEBUG] http: Request GET /v1/agent/members (100.231µs) from=127.0.0.1:58432
2019/04/09 18:51:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:51:58 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 18:51:58 [DEBUG] agent: Node info in sync
2019/04/09 18:52:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:52:05 [DEBUG] memberlist: Stream connection from=10.16.3.52:40032
2019/04/09 18:52:17 [INFO] serf: EventMemberJoin: ezk-0760588ca5d7c54e6-a-wo 10.16.1.81
2019/04/09 18:52:17 [INFO] consul: Adding LAN server ezk-0760588ca5d7c54e6-a-wo (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)
2019/04/09 18:52:17 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.1.81:8302
2019/04/09 18:52:17 [INFO] serf: EventMemberJoin: ezk-0760588ca5d7c54e6-a-wo.us-west-2 10.16.1.81
2019/04/09 18:52:17 [DEBUG] consul: Successfully performed flood-join for "ezk-0760588ca5d7c54e6-a-wo" at 10.16.1.81:8302
2019/04/09 18:52:17 [INFO] consul: Handled member-join event for server "ezk-0760588ca5d7c54e6-a-wo.us-west-2" in area "wan"
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:19 [DEBUG] memberlist: Stream connection from=10.16.37.45:44672
2019/04/09 18:52:19 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo.us-west-2
2019/04/09 18:52:19 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo.us-west-2
2019/04/09 18:52:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.33.185:8302
2019/04/09 18:52:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.396631ms) from=127.0.0.1:58498
2019/04/09 18:52:30 [DEBUG] http: Request GET /v1/catalog/services (1.349508ms) from=127.0.0.1:58500
2019/04/09 18:52:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.27465ms) from=127.0.0.1:58502
2019/04/09 18:52:30 [DEBUG] http: Request GET /v1/agent/members (98.878µs) from=127.0.0.1:58504
2019/04/09 18:52:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.1.81:8301
2019/04/09 18:52:35 [DEBUG] memberlist: Stream connection from=10.16.3.52:40112
2019/04/09 18:53:00 [DEBUG] manager: Rebalanced 1 servers, next active server is econ-52485095260256623-c-gwo1.us-west1 (Addr: tcp/10.132.2.48:8300) (DC: us-west1)
2019/04/09 18:53:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:53:11 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-0d5059ea0ad72e4f2-b-ea.us-east-1 (Addr: tcp/10.16.37.45:8300) (DC: us-east-1)
2019/04/09 18:53:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.37.45:8302
2019/04/09 18:53:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.599572ms) from=127.0.0.1:58572
2019/04/09 18:53:30 [DEBUG] http: Request GET /v1/catalog/services (1.711998ms) from=127.0.0.1:58574
2019/04/09 18:53:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.321894ms) from=127.0.0.1:58576
2019/04/09 18:53:30 [DEBUG] http: Request GET /v1/agent/members (96.184µs) from=127.0.0.1:58578
2019/04/09 18:53:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:53:44 [DEBUG] memberlist: Stream connection from=10.16.1.81:60388
2019/04/09 18:53:56 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 18:53:56 [DEBUG] agent: Node info in sync
2019/04/09 18:54:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:54:10 [DEBUG] manager: Rebalanced 4 servers, next active server is ezk-0760588ca5d7c54e6-a-wo.us-west-2 (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)
2019/04/09 18:54:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.132.2.48:8302
2019/04/09 18:54:30 [DEBUG] memberlist: Failed ping: ezk-0760588ca5d7c54e6-a-wo (timeout reached)
2019/04/09 18:54:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.372775ms) from=127.0.0.1:58646
2019/04/09 18:54:30 [DEBUG] http: Request GET /v1/catalog/services (1.336123ms) from=127.0.0.1:58648
2019/04/09 18:54:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.281863ms) from=127.0.0.1:58650
2019/04/09 18:54:30 [DEBUG] http: Request GET /v1/agent/members (103.225µs) from=127.0.0.1:58652
2019/04/09 18:54:30 [INFO] memberlist: Suspect ezk-0760588ca5d7c54e6-a-wo has failed, no acks received
2019/04/09 18:54:33 [DEBUG] memberlist: Failed ping: ezk-0760588ca5d7c54e6-a-wo (timeout reached)
2019/04/09 18:54:33 [DEBUG] memberlist: Failed ping: ezk-0760588ca5d7c54e6-a-wo.us-west-2 (timeout reached)
2019/04/09 18:54:33 [INFO] memberlist: Suspect ezk-0760588ca5d7c54e6-a-wo has failed, no acks received
2019/04/09 18:54:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:54:34 [DEBUG] memberlist: Failed ping: ezk-0760588ca5d7c54e6-a-wo (timeout reached)
2019/04/09 18:54:34 [INFO] memberlist: Marking ezk-0760588ca5d7c54e6-a-wo as failed, suspect timeout reached (2 peer confirmations)
2019/04/09 18:54:34 [INFO] serf: EventMemberFailed: ezk-0760588ca5d7c54e6-a-wo 10.16.1.81
2019/04/09 18:54:34 [INFO] consul: Removing LAN server ezk-0760588ca5d7c54e6-a-wo (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)
2019/04/09 18:54:35 [INFO] memberlist: Suspect ezk-0760588ca5d7c54e6-a-wo.us-west-2 has failed, no acks received
2019/04/09 18:54:36 [INFO] memberlist: Suspect ezk-0760588ca5d7c54e6-a-wo has failed, no acks received
2019/04/09 18:54:37 [DEBUG] memberlist: Stream connection from=10.16.33.185:51406
2019/04/09 18:54:40 [INFO] serf: attempting reconnect to ezk-0760588ca5d7c54e6-a-wo 10.16.1.81:8301
2019/04/09 18:54:40 [DEBUG] memberlist: Failed to join 10.16.1.81: dial tcp 10.16.1.81:8301: connect: connection refused
2019/04/09 18:54:43 [DEBUG] serf: messageLeaveType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:54:43 [INFO] serf: EventMemberLeave (forced): ezk-0760588ca5d7c54e6-a-wo 10.16.1.81
2019/04/09 18:54:43 [INFO] consul: Removing LAN server ezk-0760588ca5d7c54e6-a-wo (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)
2019/04/09 18:54:43 [DEBUG] serf: messageLeaveType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:54:43 [DEBUG] serf: messageLeaveType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:55:03 [DEBUG] memberlist: Stream connection from=10.132.2.48:53596
2019/04/09 18:55:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:55:05 [INFO] serf: EventMemberFailed: ezk-0760588ca5d7c54e6-a-wo.us-west-2 10.16.1.81
2019/04/09 18:55:05 [DEBUG] manager: cycled away from server "ezk-0760588ca5d7c54e6-a-wo.us-west-2"
2019/04/09 18:55:05 [INFO] consul: Handled member-failed event for server "ezk-0760588ca5d7c54e6-a-wo.us-west-2" in area "wan"
2019/04/09 18:55:10 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:55:13 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-0d5059ea0ad72e4f2-b-ea.us-east-1 (Addr: tcp/10.16.37.45:8300) (DC: us-east-1)
2019/04/09 18:55:16 [DEBUG] memberlist: Stream connection from=10.16.3.238:39850
2019/04/09 18:55:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.37.152:8302
2019/04/09 18:55:30 [DEBUG] memberlist: Stream connection from=10.16.3.52:34452
2019/04/09 18:55:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.465747ms) from=127.0.0.1:58726
2019/04/09 18:55:30 [DEBUG] http: Request GET /v1/catalog/services (1.313806ms) from=127.0.0.1:58728
2019/04/09 18:55:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.547925ms) from=127.0.0.1:58730
2019/04/09 18:55:30 [DEBUG] http: Request GET /v1/agent/members (103.065µs) from=127.0.0.1:58732
2019/04/09 18:55:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:55:34 [DEBUG] manager: Rebalanced 1 servers, next active server is econ-52485095260256623-c-gwo1.us-west1 (Addr: tcp/10.132.2.48:8300) (DC: us-west1)
2019/04/09 18:55:34 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 18:55:34 [DEBUG] agent: Node info in sync
2019/04/09 18:55:40 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:55:46 [DEBUG] memberlist: Stream connection from=10.16.3.238:39924
2019/04/09 18:56:03 [DEBUG] memberlist: Stream connection from=10.132.2.48:53634
2019/04/09 18:56:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:56:10 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:56:15 [DEBUG] manager: Rebalanced 4 servers, next active server is econ-03b2d0d795a41eb0a-c-wo.us-west-2 (Addr: tcp/10.16.4.17:8300) (DC: us-west-2)
2019/04/09 18:56:16 [DEBUG] memberlist: Stream connection from=10.16.3.238:39934
2019/04/09 18:56:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8302
2019/04/09 18:56:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.352894ms) from=127.0.0.1:58804
2019/04/09 18:56:30 [DEBUG] http: Request GET /v1/catalog/services (1.34292ms) from=127.0.0.1:58806
2019/04/09 18:56:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.283166ms) from=127.0.0.1:58808
2019/04/09 18:56:30 [DEBUG] http: Request GET /v1/agent/members (124.328µs) from=127.0.0.1:58810
2019/04/09 18:56:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:56:40 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:57:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:57:05 [DEBUG] memberlist: Stream connection from=10.16.3.52:40476
2019/04/09 18:57:10 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:57:16 [DEBUG] memberlist: Stream connection from=10.16.3.238:40004
2019/04/09 18:57:17 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 18:57:17 [DEBUG] agent: Node info in sync
2019/04/09 18:57:24 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.37.45:8302
2019/04/09 18:57:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.470712ms) from=127.0.0.1:58874
2019/04/09 18:57:30 [DEBUG] http: Request GET /v1/catalog/services (1.329342ms) from=127.0.0.1:58876
2019/04/09 18:57:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.34464ms) from=127.0.0.1:58878
2019/04/09 18:57:30 [DEBUG] http: Request GET /v1/agent/members (118.731µs) from=127.0.0.1:58880
2019/04/09 18:57:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:57:40 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 19:06:39 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 19:06:39 [DEBUG] agent: Node info in sync
2019/04/09 19:06:40 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 19:06:54 [DEBUG] manager: pinging server "ezk-0760588ca5d7c54e6-a-wo.us-west-2 (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)" failed: rpc error getting client: failed to get conn: dial tcp <nil>->10.16.1.81:8300: connect: connection refused
2019/04/09 19:06:54 [DEBUG] manager: Rebalanced 4 servers, next active server is econ-0aed93bbec0ebdbf5-b-wo.us-west-2 (Addr: tcp/10.16.3.238:8300) (DC: us-west-2)

From one of the server nodes, we called force-leave on the dead node (ezk-0760588ca5d7c54e6-a-wo):

$ consul members
Node                         Address           Status  Type    Build  Protocol  DC         Segment
econ-00dfa4c5521dace34-a-wo  10.16.3.52:8301   alive   server  1.2.3  2         us-west-2  <all>
econ-03b2d0d795a41eb0a-c-wo  10.16.4.17:8301   alive   server  1.2.3  2         us-west-2  <all>
econ-0aed93bbec0ebdbf5-b-wo  10.16.3.238:8301  alive   server  1.2.3  2         us-west-2  <all>
ezk-0760588ca5d7c54e6-a-wo   10.16.1.81:8301   left    server  1.2.3  2         us-west-2  <all>

ezk-0760588ca5d7c54e6-a-wo now joins as a client:

$ consul members
Node                         Address           Status  Type    Build  Protocol  DC         Segment
econ-00dfa4c5521dace34-a-wo  10.16.3.52:8301   alive   server  1.2.3  2         us-west-2  <all>
econ-03b2d0d795a41eb0a-c-wo  10.16.4.17:8301   alive   server  1.2.3  2         us-west-2  <all>
econ-0aed93bbec0ebdbf5-b-wo  10.16.3.238:8301  alive   server  1.2.3  2         us-west-2  <all>
ezk-0760588ca5d7c54e6-a-wo   10.16.1.81:8301   alive   client  1.2.3  2         us-west-2  <default> <---

Debug logs from server after force-leave dead server:

2019/04/09 19:45:42 [DEBUG] manager: pinging server "ezk-0760588ca5d7c54e6-a-wo.us-west-2 (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)" failed: rpc error getting client: failed to get conn: dial tcp <nil>->10.16.1.81:8300: connect: connection refused
2019/04/09 19:45:42 [DEBUG] manager: Rebalanced 4 servers, next active server is econ-03b2d0d795a41eb0a-c-wo.us-west-2 (Addr: tcp/10.16.4.17:8300) (DC: us-west-2)

Autopilot config:

$ consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 24h0m0s
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
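
For reference, these values can be changed at runtime with set-config. A sketch (flag names assume a recent Consul CLI; verify against your version) that keeps dead-server cleanup on and moves the unusually high 24h last-contact threshold back toward Consul's 200ms default:

```shell
# Sketch: adjust autopilot settings (flag names assume a recent Consul CLI).
consul operator autopilot set-config \
  -cleanup-dead-servers=true \
  -last-contact-threshold=200ms
```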

Autopilot health:

{
  "Healthy": true,
  "FailureTolerance": 1,
  "Servers": [
    {
      "ID": "6eb86dd1-007c-4809-f9f5-c334de952aa8",
      "Name": "econ-00dfa4c5521dace34-a-wo",
      "Address": "10.16.3.52:8300",
      "SerfStatus": "alive",
      "Version": "1.2.3",
      "Leader": true,
      "LastContact": "0s",
      "LastTerm": 2,
      "LastIndex": 11721,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2019-04-09T18:09:36Z"
    },
    {
      "ID": "d066550e-4c3e-ed14-7de1-5073e11bede4",
      "Name": "econ-03b2d0d795a41eb0a-c-wo",
      "Address": "10.16.4.17:8300",
      "SerfStatus": "alive",
      "Version": "1.2.3",
      "Leader": false,
      "LastContact": "43.064614ms",
      "LastTerm": 2,
      "LastIndex": 11721,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2019-04-09T18:09:36Z"
    },
    {
      "ID": "3670437a-fe5e-8558-f198-0ad66a3a09a8",
      "Name": "econ-0aed93bbec0ebdbf5-b-wo",
      "Address": "10.16.3.238:8300",
      "SerfStatus": "alive",
      "Version": "1.2.3",
      "Leader": false,
      "LastContact": "10.867376ms",
      "LastTerm": 2,
      "LastIndex": 11721,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2019-04-09T18:09:36Z"
    }
  ]
}

After 24 hours I still see Consul trying to rebalance with 4 servers.

2019/04/10 18:07:33 [DEBUG] memberlist: Failed to join 10.16.1.81: dial tcp 10.16.1.81:8302: connect: connection refused
2019/04/10 18:07:35 [DEBUG] manager: Rebalanced 4 servers, next active server is econ-00dfa4c5521dace34-a-wo.us-west-2 (Addr: tcp/10.16.3.52:8300) (DC: us-west-2)

Consul info for both Client and Server

Server info:

$ consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 48d287ef
	version = 1.2.3
consul:
	bootstrap = false
	known_datacenters = 3
	leader = false
	leader_addr = 10.16.3.52:8300
	server = true
raft:
	applied_index = 24715
	commit_index = 24715
	fsm_pending = 0
	last_contact = 4.435646ms
	last_log_index = 24715
	last_log_term = 2
	last_snapshot_index = 16387
	last_snapshot_term = 2
	latest_configuration = [{Suffrage:Voter ID:6eb86dd1-007c-4809-f9f5-c334de952aa8 Address:10.16.3.52:8300} {Suffrage:Voter ID:d066550e-4c3e-ed14-7de1-5073e11bede4 Address:10.16.4.17:8300} {Suffrage:Voter ID:3670437a-fe5e-8558-f198-0ad66a3a09a8 Address:10.16.3.238:8300}]
	latest_configuration_index = 362
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 85
	max_procs = 4
	os = linux
	version = go1.10.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 12
	members = 4
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2721
	members = 8
	query_queue = 0
	query_time = 1

Operating system and Environment details

Amazon Linux AMI release 2018.03

Log Fragments

Included in reproduction steps

@mkeeler mkeeler added the type/bug Feature does not function as expected label Apr 11, 2019

mkeeler commented Apr 11, 2019

I believe the problem is in a few places:

case serf.EventMemberUpdate: // Ignore

case serf.EventMemberUpdate:

case serf.EventMemberUpdate:

Basically, when member-update events come in (or potentially member-join events), if the members are servers we should ensure they are tracked as such, and if they are not servers we should ensure they are removed.

@rammohanganap

Any update on this issue?

@phamtai97

Any update on this issue?

@rammohanganap

Is this issue fixed? any update?


mkeeler commented May 18, 2020

@rammohanganap Sorry for the extremely delayed response. Does this still occur for you in later releases, 1.4.3+ or 1.7.0?

1.4.3 added this fix: #5317

That mostly resolved the case where a server that had been in the failed/left state and was reaped never got fully removed from the RPC routing system.

1.7.0 added this fix: #6420

That one ensures that the routing infrastructure ignores left servers in the event that they might still be reachable somehow.

However, looking at the code again, it looks like your case might be a little different. Calling force-leave on a server and replacing it with a client might not hit any of the code paths that have since been fixed. I think this function might need updating to remove the "server" from the routing infrastructure when it is no longer a server but rather a client:

func (c *Client) nodeUpdate(me serf.MemberEvent) {
	for _, m := range me.Members {
		ok, parts := metadata.IsConsulServer(m)
		if !ok {
			continue
		}
		if parts.Datacenter != c.config.Datacenter {
			c.logger.Warn("server has joined the wrong cluster: wrong datacenter",
				"server", m.Name,
				"datacenter", parts.Datacenter,
			)
			continue
		}
		c.logger.Info("updating server", "server", parts.String())
		c.routers.AddServer(parts)
	}
}

@jsosulska jsosulska added the waiting-reply Waiting on response from Original Poster or another individual in the thread label May 18, 2020
@rammohanganap

@mkeeler, we don't see this issue anymore. We can close it.

@github-actions github-actions bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Sep 9, 2022