
Dead server not removed from consul manager's servers list when dead server's IP address alive as client #5650

Closed
rammohanganap opened this issue Apr 11, 2019 · 6 comments
Labels
type/bug Feature does not function as expected



rammohanganap commented Apr 11, 2019


Overview of the Issue

A dead server is not removed from the Consul manager's servers list when the dead server's IP address comes back alive as a client. A new server was added, killed, and force-left (consul force-leave). The same IP address was then assigned to a new client, and we see the Consul manager rebalancing across 4 servers even though we only have 3 servers, even after 24h.

Reproduction Steps

Steps to reproduce this issue:

1. Start with a healthy 3-server cluster.
2. Add a new server node (ezk-0760588ca5d7c54e6-a-wo), then kill its consul process.
3. From one of the remaining servers, run consul force-leave ezk-0760588ca5d7c54e6-a-wo.
4. Assign the dead server's IP address (10.16.1.81) to a new client agent and let it join.
5. Observe the manager still rebalancing across 4 servers.

Server log before adding a server:

2019/04/09 18:50:58 [DEBUG] manager: Rebalanced 1 servers, next active server is econ-52485095260256623-c-gwo1.us-west1 (Addr: tcp/10.132.2.48:8300) (DC: us-west1)
2019/04/09 18:51:03 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-0d5059ea0ad72e4f2-b-ea.us-east-1 (Addr: tcp/10.16.37.45:8300) (DC: us-east-1)
2019/04/09 18:51:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301

Server Debug logs after adding new server node (ezk-0760588ca5d7c54e6-a-wo) and killing the consul process ezk-0760588ca5d7c54e6-a-wo:

2019/04/09 18:50:18 [DEBUG] memberlist: Stream connection from=10.16.37.45:44530
2019/04/09 18:50:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.37.45:8302
2019/04/09 18:50:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.486555ms) from=127.0.0.1:58348
2019/04/09 18:50:30 [DEBUG] http: Request GET /v1/catalog/services (1.323442ms) from=127.0.0.1:58350
2019/04/09 18:50:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.492054ms) from=127.0.0.1:58352
2019/04/09 18:50:30 [DEBUG] http: Request GET /v1/agent/members (93.618µs) from=127.0.0.1:58354
2019/04/09 18:50:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:50:35 [DEBUG] memberlist: Stream connection from=10.16.3.52:39958
2019/04/09 18:50:46 [DEBUG] memberlist: Stream connection from=10.16.3.238:39546
2019/04/09 18:50:58 [DEBUG] manager: Rebalanced 1 servers, next active server is econ-52485095260256623-c-gwo1.us-west1 (Addr: tcp/10.132.2.48:8300) (DC: us-west1)
2019/04/09 18:51:03 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-0d5059ea0ad72e4f2-b-ea.us-east-1 (Addr: tcp/10.16.37.45:8300) (DC: us-east-1)
2019/04/09 18:51:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:51:18 [DEBUG] memberlist: Stream connection from=10.16.37.45:44600
2019/04/09 18:51:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.33.185:8302
2019/04/09 18:51:26 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-00dfa4c5521dace34-a-wo.us-west-2 (Addr: tcp/10.16.3.52:8300) (DC: us-west-2)
2019/04/09 18:51:29 [DEBUG] memberlist: Stream connection from=10.16.3.52:34084
2019/04/09 18:51:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.41693ms) from=127.0.0.1:58426
2019/04/09 18:51:30 [DEBUG] http: Request GET /v1/catalog/services (1.294127ms) from=127.0.0.1:58428
2019/04/09 18:51:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.281487ms) from=127.0.0.1:58430
2019/04/09 18:51:30 [DEBUG] http: Request GET /v1/agent/members (100.231µs) from=127.0.0.1:58432
2019/04/09 18:51:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:51:58 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 18:51:58 [DEBUG] agent: Node info in sync
2019/04/09 18:52:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:52:05 [DEBUG] memberlist: Stream connection from=10.16.3.52:40032
2019/04/09 18:52:17 [INFO] serf: EventMemberJoin: ezk-0760588ca5d7c54e6-a-wo 10.16.1.81
2019/04/09 18:52:17 [INFO] consul: Adding LAN server ezk-0760588ca5d7c54e6-a-wo (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)
2019/04/09 18:52:17 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.1.81:8302
2019/04/09 18:52:17 [INFO] serf: EventMemberJoin: ezk-0760588ca5d7c54e6-a-wo.us-west-2 10.16.1.81
2019/04/09 18:52:17 [DEBUG] consul: Successfully performed flood-join for "ezk-0760588ca5d7c54e6-a-wo" at 10.16.1.81:8302
2019/04/09 18:52:17 [INFO] consul: Handled member-join event for server "ezk-0760588ca5d7c54e6-a-wo.us-west-2" in area "wan"
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:17 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-03b2d0d795a41eb0a-c-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:18 [DEBUG] serf: messageJoinType: econ-00dfa4c5521dace34-a-wo.us-west-2
2019/04/09 18:52:19 [DEBUG] memberlist: Stream connection from=10.16.37.45:44672
2019/04/09 18:52:19 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo.us-west-2
2019/04/09 18:52:19 [DEBUG] serf: messageJoinType: ezk-0760588ca5d7c54e6-a-wo.us-west-2
2019/04/09 18:52:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.33.185:8302
2019/04/09 18:52:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.396631ms) from=127.0.0.1:58498
2019/04/09 18:52:30 [DEBUG] http: Request GET /v1/catalog/services (1.349508ms) from=127.0.0.1:58500
2019/04/09 18:52:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.27465ms) from=127.0.0.1:58502
2019/04/09 18:52:30 [DEBUG] http: Request GET /v1/agent/members (98.878µs) from=127.0.0.1:58504
2019/04/09 18:52:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.1.81:8301
2019/04/09 18:52:35 [DEBUG] memberlist: Stream connection from=10.16.3.52:40112
2019/04/09 18:53:00 [DEBUG] manager: Rebalanced 1 servers, next active server is econ-52485095260256623-c-gwo1.us-west1 (Addr: tcp/10.132.2.48:8300) (DC: us-west1)
2019/04/09 18:53:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:53:11 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-0d5059ea0ad72e4f2-b-ea.us-east-1 (Addr: tcp/10.16.37.45:8300) (DC: us-east-1)
2019/04/09 18:53:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.37.45:8302
2019/04/09 18:53:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.599572ms) from=127.0.0.1:58572
2019/04/09 18:53:30 [DEBUG] http: Request GET /v1/catalog/services (1.711998ms) from=127.0.0.1:58574
2019/04/09 18:53:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.321894ms) from=127.0.0.1:58576
2019/04/09 18:53:30 [DEBUG] http: Request GET /v1/agent/members (96.184µs) from=127.0.0.1:58578
2019/04/09 18:53:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:53:44 [DEBUG] memberlist: Stream connection from=10.16.1.81:60388
2019/04/09 18:53:56 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 18:53:56 [DEBUG] agent: Node info in sync
2019/04/09 18:54:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:54:10 [DEBUG] manager: Rebalanced 4 servers, next active server is ezk-0760588ca5d7c54e6-a-wo.us-west-2 (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)
2019/04/09 18:54:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.132.2.48:8302
2019/04/09 18:54:30 [DEBUG] memberlist: Failed ping: ezk-0760588ca5d7c54e6-a-wo (timeout reached)
2019/04/09 18:54:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.372775ms) from=127.0.0.1:58646
2019/04/09 18:54:30 [DEBUG] http: Request GET /v1/catalog/services (1.336123ms) from=127.0.0.1:58648
2019/04/09 18:54:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.281863ms) from=127.0.0.1:58650
2019/04/09 18:54:30 [DEBUG] http: Request GET /v1/agent/members (103.225µs) from=127.0.0.1:58652
2019/04/09 18:54:30 [INFO] memberlist: Suspect ezk-0760588ca5d7c54e6-a-wo has failed, no acks received
2019/04/09 18:54:33 [DEBUG] memberlist: Failed ping: ezk-0760588ca5d7c54e6-a-wo (timeout reached)
2019/04/09 18:54:33 [DEBUG] memberlist: Failed ping: ezk-0760588ca5d7c54e6-a-wo.us-west-2 (timeout reached)
2019/04/09 18:54:33 [INFO] memberlist: Suspect ezk-0760588ca5d7c54e6-a-wo has failed, no acks received
2019/04/09 18:54:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:54:34 [DEBUG] memberlist: Failed ping: ezk-0760588ca5d7c54e6-a-wo (timeout reached)
2019/04/09 18:54:34 [INFO] memberlist: Marking ezk-0760588ca5d7c54e6-a-wo as failed, suspect timeout reached (2 peer confirmations)
2019/04/09 18:54:34 [INFO] serf: EventMemberFailed: ezk-0760588ca5d7c54e6-a-wo 10.16.1.81
2019/04/09 18:54:34 [INFO] consul: Removing LAN server ezk-0760588ca5d7c54e6-a-wo (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)
2019/04/09 18:54:35 [INFO] memberlist: Suspect ezk-0760588ca5d7c54e6-a-wo.us-west-2 has failed, no acks received
2019/04/09 18:54:36 [INFO] memberlist: Suspect ezk-0760588ca5d7c54e6-a-wo has failed, no acks received
2019/04/09 18:54:37 [DEBUG] memberlist: Stream connection from=10.16.33.185:51406
2019/04/09 18:54:40 [INFO] serf: attempting reconnect to ezk-0760588ca5d7c54e6-a-wo 10.16.1.81:8301
2019/04/09 18:54:40 [DEBUG] memberlist: Failed to join 10.16.1.81: dial tcp 10.16.1.81:8301: connect: connection refused
2019/04/09 18:54:43 [DEBUG] serf: messageLeaveType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:54:43 [INFO] serf: EventMemberLeave (forced): ezk-0760588ca5d7c54e6-a-wo 10.16.1.81
2019/04/09 18:54:43 [INFO] consul: Removing LAN server ezk-0760588ca5d7c54e6-a-wo (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)
2019/04/09 18:54:43 [DEBUG] serf: messageLeaveType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:54:43 [DEBUG] serf: messageLeaveType: ezk-0760588ca5d7c54e6-a-wo
2019/04/09 18:55:03 [DEBUG] memberlist: Stream connection from=10.132.2.48:53596
2019/04/09 18:55:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:55:05 [INFO] serf: EventMemberFailed: ezk-0760588ca5d7c54e6-a-wo.us-west-2 10.16.1.81
2019/04/09 18:55:05 [DEBUG] manager: cycled away from server "ezk-0760588ca5d7c54e6-a-wo.us-west-2"
2019/04/09 18:55:05 [INFO] consul: Handled member-failed event for server "ezk-0760588ca5d7c54e6-a-wo.us-west-2" in area "wan"
2019/04/09 18:55:10 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:55:13 [DEBUG] manager: Rebalanced 3 servers, next active server is econ-0d5059ea0ad72e4f2-b-ea.us-east-1 (Addr: tcp/10.16.37.45:8300) (DC: us-east-1)
2019/04/09 18:55:16 [DEBUG] memberlist: Stream connection from=10.16.3.238:39850
2019/04/09 18:55:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.37.152:8302
2019/04/09 18:55:30 [DEBUG] memberlist: Stream connection from=10.16.3.52:34452
2019/04/09 18:55:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.465747ms) from=127.0.0.1:58726
2019/04/09 18:55:30 [DEBUG] http: Request GET /v1/catalog/services (1.313806ms) from=127.0.0.1:58728
2019/04/09 18:55:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.547925ms) from=127.0.0.1:58730
2019/04/09 18:55:30 [DEBUG] http: Request GET /v1/agent/members (103.065µs) from=127.0.0.1:58732
2019/04/09 18:55:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:55:34 [DEBUG] manager: Rebalanced 1 servers, next active server is econ-52485095260256623-c-gwo1.us-west1 (Addr: tcp/10.132.2.48:8300) (DC: us-west1)
2019/04/09 18:55:34 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 18:55:34 [DEBUG] agent: Node info in sync
2019/04/09 18:55:40 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:55:46 [DEBUG] memberlist: Stream connection from=10.16.3.238:39924
2019/04/09 18:56:03 [DEBUG] memberlist: Stream connection from=10.132.2.48:53634
2019/04/09 18:56:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:56:10 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:56:15 [DEBUG] manager: Rebalanced 4 servers, next active server is econ-03b2d0d795a41eb0a-c-wo.us-west-2 (Addr: tcp/10.16.4.17:8300) (DC: us-west-2)
2019/04/09 18:56:16 [DEBUG] memberlist: Stream connection from=10.16.3.238:39934
2019/04/09 18:56:23 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8302
2019/04/09 18:56:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.352894ms) from=127.0.0.1:58804
2019/04/09 18:56:30 [DEBUG] http: Request GET /v1/catalog/services (1.34292ms) from=127.0.0.1:58806
2019/04/09 18:56:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.283166ms) from=127.0.0.1:58808
2019/04/09 18:56:30 [DEBUG] http: Request GET /v1/agent/members (124.328µs) from=127.0.0.1:58810
2019/04/09 18:56:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:56:40 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:57:03 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.238:8301
2019/04/09 18:57:05 [DEBUG] memberlist: Stream connection from=10.16.3.52:40476
2019/04/09 18:57:10 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 18:57:16 [DEBUG] memberlist: Stream connection from=10.16.3.238:40004
2019/04/09 18:57:17 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 18:57:17 [DEBUG] agent: Node info in sync
2019/04/09 18:57:24 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.37.45:8302
2019/04/09 18:57:30 [DEBUG] http: Request GET /v1/kv/?recurse (1.470712ms) from=127.0.0.1:58874
2019/04/09 18:57:30 [DEBUG] http: Request GET /v1/catalog/services (1.329342ms) from=127.0.0.1:58876
2019/04/09 18:57:30 [DEBUG] http: Request GET /v1/health/node/econ-03b2d0d795a41eb0a-c-wo (1.34464ms) from=127.0.0.1:58878
2019/04/09 18:57:30 [DEBUG] http: Request GET /v1/agent/members (118.731µs) from=127.0.0.1:58880
2019/04/09 18:57:33 [DEBUG] memberlist: Initiating push/pull sync with: 10.16.3.52:8301
2019/04/09 18:57:40 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 19:06:39 [DEBUG] agent: Skipping remote check "serfHealth" since it is managed automatically
2019/04/09 19:06:39 [DEBUG] agent: Node info in sync
2019/04/09 19:06:40 [DEBUG] serf: forgoing reconnect for random throttling
2019/04/09 19:06:54 [DEBUG] manager: pinging server "ezk-0760588ca5d7c54e6-a-wo.us-west-2 (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)" failed: rpc error getting client: failed to get conn: dial tcp <nil>->10.16.1.81:8300: connect: connection refused
2019/04/09 19:06:54 [DEBUG] manager: Rebalanced 4 servers, next active server is econ-0aed93bbec0ebdbf5-b-wo.us-west-2 (Addr: tcp/10.16.3.238:8300) (DC: us-west-2)

From one of the server nodes, we called force-leave on the dead node (ezk-0760588ca5d7c54e6-a-wo):

$ consul members
Node                         Address           Status  Type    Build  Protocol  DC         Segment
econ-00dfa4c5521dace34-a-wo  10.16.3.52:8301   alive   server  1.2.3  2         us-west-2  <all>
econ-03b2d0d795a41eb0a-c-wo  10.16.4.17:8301   alive   server  1.2.3  2         us-west-2  <all>
econ-0aed93bbec0ebdbf5-b-wo  10.16.3.238:8301  alive   server  1.2.3  2         us-west-2  <all>
ezk-0760588ca5d7c54e6-a-wo   10.16.1.81:8301   left    server  1.2.3  2         us-west-2  <all>

ezk-0760588ca5d7c54e6-a-wo now joins as a client:

$ consul members
Node                         Address           Status  Type    Build  Protocol  DC         Segment
econ-00dfa4c5521dace34-a-wo  10.16.3.52:8301   alive   server  1.2.3  2         us-west-2  <all>
econ-03b2d0d795a41eb0a-c-wo  10.16.4.17:8301   alive   server  1.2.3  2         us-west-2  <all>
econ-0aed93bbec0ebdbf5-b-wo  10.16.3.238:8301  alive   server  1.2.3  2         us-west-2  <all>
ezk-0760588ca5d7c54e6-a-wo   10.16.1.81:8301   alive   client  1.2.3  2         us-west-2  <default> <---

Debug logs from server after force-leave dead server:

2019/04/09 19:45:42 [DEBUG] manager: pinging server "ezk-0760588ca5d7c54e6-a-wo.us-west-2 (Addr: tcp/10.16.1.81:8300) (DC: us-west-2)" failed: rpc error getting client: failed to get conn: dial tcp <nil>->10.16.1.81:8300: connect: connection refused
2019/04/09 19:45:42 [DEBUG] manager: Rebalanced 4 servers, next active server is econ-03b2d0d795a41eb0a-c-wo.us-west-2 (Addr: tcp/10.16.4.17:8300) (DC: us-west-2)

Autopilot config:

$ consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 24h0m0s
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
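
For reference, these values can be changed at runtime with set-config. A sketch (flag names assume a recent Consul CLI; verify against your version) that keeps dead-server cleanup on and moves the unusually high 24h last-contact threshold back toward Consul's 200ms default:

```shell
# Sketch: adjust autopilot settings (flag names assume a recent Consul CLI).
consul operator autopilot set-config \
  -cleanup-dead-servers=true \
  -last-contact-threshold=200ms
```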

Autopilot health:

{
  "Healthy": true,
  "FailureTolerance": 1,
  "Servers": [
    {
      "ID": "6eb86dd1-007c-4809-f9f5-c334de952aa8",
      "Name": "econ-00dfa4c5521dace34-a-wo",
      "Address": "10.16.3.52:8300",
      "SerfStatus": "alive",
      "Version": "1.2.3",
      "Leader": true,
      "LastContact": "0s",
      "LastTerm": 2,
      "LastIndex": 11721,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2019-04-09T18:09:36Z"
    },
    {
      "ID": "d066550e-4c3e-ed14-7de1-5073e11bede4",
      "Name": "econ-03b2d0d795a41eb0a-c-wo",
      "Address": "10.16.4.17:8300",
      "SerfStatus": "alive",
      "Version": "1.2.3",
      "Leader": false,
      "LastContact": "43.064614ms",
      "LastTerm": 2,
      "LastIndex": 11721,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2019-04-09T18:09:36Z"
    },
    {
      "ID": "3670437a-fe5e-8558-f198-0ad66a3a09a8",
      "Name": "econ-0aed93bbec0ebdbf5-b-wo",
      "Address": "10.16.3.238:8300",
      "SerfStatus": "alive",
      "Version": "1.2.3",
      "Leader": false,
      "LastContact": "10.867376ms",
      "LastTerm": 2,
      "LastIndex": 11721,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2019-04-09T18:09:36Z"
    }
  ]
}

After 24 hours I still see Consul trying to rebalance with 4 servers.

2019/04/10 18:07:33 [DEBUG] memberlist: Failed to join 10.16.1.81: dial tcp 10.16.1.81:8302: connect: connection refused
2019/04/10 18:07:35 [DEBUG] manager: Rebalanced 4 servers, next active server is econ-00dfa4c5521dace34-a-wo.us-west-2 (Addr: tcp/10.16.3.52:8300) (DC: us-west-2)

Consul info for both Client and Server

Server info:

$ consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 48d287ef
	version = 1.2.3
consul:
	bootstrap = false
	known_datacenters = 3
	leader = false
	leader_addr = 10.16.3.52:8300
	server = true
raft:
	applied_index = 24715
	commit_index = 24715
	fsm_pending = 0
	last_contact = 4.435646ms
	last_log_index = 24715
	last_log_term = 2
	last_snapshot_index = 16387
	last_snapshot_term = 2
	latest_configuration = [{Suffrage:Voter ID:6eb86dd1-007c-4809-f9f5-c334de952aa8 Address:10.16.3.52:8300} {Suffrage:Voter ID:d066550e-4c3e-ed14-7de1-5073e11bede4 Address:10.16.4.17:8300} {Suffrage:Voter ID:3670437a-fe5e-8558-f198-0ad66a3a09a8 Address:10.16.3.238:8300}]
	latest_configuration_index = 362
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 2
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 85
	max_procs = 4
	os = linux
	version = go1.10.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 12
	members = 4
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2721
	members = 8
	query_queue = 0
	query_time = 1

Operating system and Environment details

Amazon Linux AMI release 2018.03

Log Fragments

Included in reproduction steps

@mkeeler mkeeler added the type/bug Feature does not function as expected label Apr 11, 2019

mkeeler commented Apr 11, 2019

I believe the problem is in a few places:

case serf.EventMemberUpdate: // Ignore

case serf.EventMemberUpdate:

case serf.EventMemberUpdate:

Basically, when member-update events come in (or potentially member-join events), if the members are servers we should ensure they are tracked as such, and if they are not servers we should ensure they are removed.

@rammohanganap

Any update on this issue?

@phamtai97

Any update on this issue?

@rammohanganap

Is this issue fixed? any update?


mkeeler commented May 18, 2020

@rammohanganap Sorry for the extremely delayed response. Does this still occur for you in later releases, 1.4.3+ or 1.7.0?

1.4.3 added this fix: #5317

That mostly resolved the case where a server that had been in the failed/left state and was reaped never got fully removed from the RPC routing system.

1.7.0 added this fix: #6420

That one ensures that the routing infrastructure ignores left servers in the event that they might still be reachable somehow.

However, looking at the code again, it looks like your case might be a little different. Calling force-leave on a server and replacing it with a client might not hit any of the code paths that have since been fixed. I think this function might need updating to remove the "server" from the routing infrastructure when it is no longer a server but rather a client:

func (c *Client) nodeUpdate(me serf.MemberEvent) {
	for _, m := range me.Members {
		ok, parts := metadata.IsConsulServer(m)
		if !ok {
			continue
		}
		if parts.Datacenter != c.config.Datacenter {
			c.logger.Warn("server has joined the wrong cluster: wrong datacenter",
				"server", m.Name,
				"datacenter", parts.Datacenter,
			)
			continue
		}
		c.logger.Info("updating server", "server", parts.String())
		c.routers.AddServer(parts)
	}
}

@jsosulska jsosulska added the waiting-reply Waiting on response from Original Poster or another individual in the thread label May 18, 2020
@rammohanganap

@mkeeler, we don't see this issue anymore. We can close it.

@github-actions github-actions bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Sep 9, 2022