
Recreating server nodes causes client nodes to flap #845

Closed
discordianfish opened this issue Apr 7, 2015 · 2 comments

Comments

@discordianfish (Contributor)

Hi,

I'm not sure whether this is a design issue or a simple bug:

If you create a cluster (three server nodes in my case) with a bunch of client nodes, then remove and recreate all the server nodes — essentially starting a clean cluster — all the clients start flapping like this:

    2015/04/07 01:27:14 [INFO] serf: EventMemberJoin: ip-10-1-11-204 10.1.11.204
    2015/04/07 01:27:14 [INFO] serf: EventMemberJoin: ip-10-1-42-210 10.1.42.210
    2015/04/07 01:27:17 [INFO] memberlist: Suspect ip-10-1-31-116 has failed, no acks received
    2015/04/07 01:27:18 [INFO] memberlist: Marking ip-10-1-31-116 as failed, suspect timeout reached
    2015/04/07 01:27:18 [INFO] serf: EventMemberFailed: ip-10-1-31-116 10.1.31.116
    2015/04/07 01:27:19 [INFO] memberlist: Suspect ip-10-1-42-210 has failed, no acks received
    2015/04/07 01:27:20 [INFO] serf: EventMemberFailed: ip-10-1-42-210 10.1.42.210
    2015/04/07 01:27:21 [INFO] memberlist: Marking ip-10-1-11-204 as failed, suspect timeout reached
    2015/04/07 01:27:21 [INFO] serf: EventMemberFailed: ip-10-1-11-204 10.1.11.204
    2015/04/07 01:27:22 [INFO] memberlist: Suspect ip-10-1-11-204 has failed, no acks received
    2015/04/07 01:27:24 [INFO] serf: EventMemberJoin: ip-10-1-42-211 10.1.42.211
    2015/04/07 01:27:24 [INFO] serf: EventMemberJoin: ip-10-1-11-204 10.1.11.204
    2015/04/07 01:27:24 [INFO] serf: EventMemberJoin: ip-10-1-31-116 10.1.31.116
    2015/04/07 01:27:27 [INFO] serf: EventMemberFailed: ip-10-1-31-116 10.1.31.116
    2015/04/07 01:27:28 [INFO] serf: EventMemberJoin: ip-10-1-31-116 10.1.31.116
    2015/04/07 01:27:30 [INFO] memberlist: Suspect ip-10-1-11-204 has failed, no acks received
    2015/04/07 01:27:32 [INFO] memberlist: Marking ip-10-1-11-204 as failed, suspect timeout reached
    2015/04/07 01:27:32 [INFO] serf: EventMemberFailed: ip-10-1-11-204 10.1.11.204

I need to recreate the clients to make them reconnect properly. I'm not sure whether this is by design, but I assumed that client nodes are fairly dumb: they just connect to the server nodes and use their state, so there should be no way for the local state to conflict with the server state. If that isn't the case, is there more specific documentation around these operational concerns?
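For context, the clients here were started with a one-shot join against the server addresses. A minimal client configuration sketch that keeps retrying instead (the addresses, path, and interval below are hypothetical examples, and `retry_join`/`retry_interval` support depends on the Consul version in use) would look roughly like:

```json
{
  "data_dir": "/var/lib/consul",
  "retry_join": ["10.1.0.10", "10.1.0.11", "10.1.0.12"],
  "retry_interval": "30s"
}
```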

@ryanuber (Member) commented Apr 8, 2015

Consul clients do carry some state locally, which includes information about the cluster as well as the state of local services and checks. The logs you shared above are the gossip layer detecting failures on its peers. Without knowing which nodes were clients and which were servers, there's not much else I can derive from them, but if the failed nodes were the servers you stopped then this would be expected.

The clients should reconnect, though. Are there any differences between the new servers and the old ones (IPs, hostnames, firewalls, etc.)? You might be hitting #457, since the Raft layer currently does not gracefully handle IP address changes. Can you share your configuration file(s)?

You will run into #839 with the current 0.5 release, but 0.5.1 will fix this. Basically the clients will not re-sync their services and checks to the global catalog.

@discordianfish (Contributor, Author)

In my case the server IPs changed; I expected Consul to re-resolve the provided server address, but it didn't. #839 sounds like it will fix that, though I'm not sure whether I'd still need to clear the clients' local state about cluster members. Anyway, this can be considered a duplicate of #839, so I'll close it.
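For anyone hitting the same symptom: the "recreate the clients" workaround above amounts to wiping each client agent's local gossip state and rejoining the new servers. A rough sketch (the service name, data-dir path, and server address are examples only — adjust to your setup):

```shell
# Stop the client agent and clear its locally persisted serf state,
# then restart it so it rejoins the new servers cleanly.
sudo service consul stop
sudo rm -rf /var/lib/consul/serf
sudo service consul start

# Alternatively, with the agent still running, point it at a new server:
consul join 10.1.0.10
```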
