
Consul does not replicate WAN list across its members. #1656

Closed
takeda opened this issue Jan 27, 2016 · 14 comments
Labels
type/enhancement Proposed improvement or new feature

Comments

@takeda

takeda commented Jan 27, 2016

Consul doesn't replicate the WAN list across servers in the same group, for example:

[root@prod-consul-xv-01 ~]# consul members | grep server
prod-consul-xv-01         10.1.11.237:8301  alive   server  0.6.0  2         xv-prod
prod-consul-xv-02         10.1.66.242:8301  alive   server  0.6.0  2         xv-prod
prod-consul-xv-03         10.1.83.251:8301  alive   server  0.6.0  2         xv-prod
prod-consul-xv-04         10.1.43.229:8301  alive   server  0.6.0  2         xv-prod
prod-consul-xv-05         10.1.83.250:8301  alive   server  0.6.0  2         xv-prod
[root@prod-consul-xv-01 ~]# consul members -wan
Node                       Address           Status  Type    Build  Protocol  DC
prod-consul-ca-01.ca-prod  10.5.6.230:8302   alive   server  0.6.0  2         ca-prod
prod-consul-lc-01.lc-prod  10.2.34.249:8302  alive   server  0.6.0  2         lc-prod
prod-consul-xa-01.xa-prod  10.16.1.253:8302  alive   server  0.6.0  2         xa-prod
prod-consul-xf-03.xf-prod  10.33.5.244:8302  alive   server  0.6.0  2         xf-prod
prod-consul-xv-01.xv-prod  10.1.11.237:8302  alive   server  0.6.0  2         xv-prod

And on the second node:

[root@prod-consul-xv-02 ~]# consul members | grep server
prod-consul-xv-01         10.1.11.237:8301  alive   server  0.6.0  2         xv-prod
prod-consul-xv-02         10.1.66.242:8301  alive   server  0.6.0  2         xv-prod
prod-consul-xv-03         10.1.83.251:8301  alive   server  0.6.0  2         xv-prod
prod-consul-xv-04         10.1.43.229:8301  alive   server  0.6.0  2         xv-prod
prod-consul-xv-05         10.1.83.250:8301  alive   server  0.6.0  2         xv-prod
[root@prod-consul-xv-02 ~]# consul members -wan
Node                       Address           Status  Type    Build  Protocol  DC
prod-consul-xv-02.xv-prod  10.1.66.242:8302  alive   server  0.6.0  2         xv-prod

This may or may not be related to #1471, since depending on which node the client connects to, you may or may not see other datacenters.

@highlyunavailable
Contributor

You must join all servers into the WAN pool; it is not automatic. On prod-consul-xv-02, run consul join -wan 10.1.11.237 (where 10.1.11.237 is prod-consul-xv-01, which is already a member of the WAN pool). See https://www.consul.io/docs/agent/options.html#_join_wan, specifically:

By default, the agent won't -join-wan any nodes when it starts up.
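Before automatic WAN flooding existed, each server had to be WAN-joined like this, and to avoid depending on a single seed the join can be attempted against several known WAN members. A minimal sketch, using the remote WAN addresses from the consul members -wan output above; the commands are echoed rather than executed, so nothing here touches a live cluster:

```shell
#!/bin/sh
# Seed addresses taken from the WAN member list above (one per remote DC).
WAN_SEEDS="10.5.6.230 10.2.34.249 10.16.1.253"

for addr in $WAN_SEEDS; do
  # Echo instead of running, so the script is safe to inspect and test.
  echo "consul join -wan $addr"
done
```

In practice only one successful join per server is needed; gossip then exchanges the rest of the membership.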

@takeda
Author

takeda commented Jan 27, 2016

Could this be a feature request then?

With 5 nodes in each of 5 datacenters, that's already 25 nodes to modify, with a significant chance of making a mistake.

Consul's purpose (or rather that of serf and raft, which it relies on) is to make sure that data is replicated within its cluster, so it makes sense to replicate DC information as well.

@slackpad slackpad added the type/enhancement Proposed improvement or new feature label Feb 2, 2016
@atrbgithub

I've had this issue after rebooting nodes one at a time within a cluster. When they come back up, they don't have the WAN config, and it doesn't appear to replicate from the other nodes.

@wwalker

wwalker commented Jun 8, 2016

I believe that this is the root of the problem in #1471. I'd like joining one member of a server cluster to a member of another cluster to be persistent and to flow across the two clusters.

@takeda
Author

takeda commented Jun 15, 2016

@slackpad after taking a look at https://www.consul.io/docs/guides/datacenters.html, particularly this fragment:

The join command is used with the -wan flag to indicate we are attempting to join a server in the WAN gossip pool. As with LAN gossip, you only need to join a single existing member, and the gossip protocol will be used to exchange information about all known members. For the initial setup, however, each server will only know about itself and must be added to the cluster.

It appears to me that this is in fact a bug, and this issue should be relabeled accordingly.

@csghuser

csghuser commented Jul 7, 2016

Not sure if this is helpful to anyone, but I worked around this issue by adding the nodes of the opposing datacentre into the Consul configuration:

  "retry_join_wan":[
    "192.168.15.232",
    "192.168.15.208",
    "192.168.15.31"
  ],

After a reboot, each node is then able to rejoin the WAN.
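For context, that fragment sits inside a server agent's JSON config file. A fuller sketch of where it lives (the other fields here are illustrative placeholders, not taken from this thread; only retry_join_wan comes from the workaround above):

```json
{
  "server": true,
  "datacenter": "dc1",
  "data_dir": "/var/consul",
  "retry_join_wan": [
    "192.168.15.232",
    "192.168.15.208",
    "192.168.15.31"
  ]
}
```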

@sean-
Contributor

sean- commented Jul 7, 2016 via email

@sean-
Contributor

sean- commented Jul 7, 2016

@csghuser that is the correct way to configure Consul for servers connected to a
WAN. It's not necessary to have this be configured in a full-mesh, however
there is no harm in having symmetry between all Consul servers participating in
the WAN pool. The important part is that all members of the WAN eventually
converge to create a consistent pool.

@takeda
Author

takeda commented Jul 7, 2016

@sean- I consider this a bug, because when you connect two DCs together using the join command, you expect it to apply to the entire cluster; this is why @wwalker had connectivity problems.

Take a look at, for example, Riak's MDC setup (http://docs.basho.com/riak/kv/2.1.4/configuring/v3-multi-datacenter/quick-start/): you just issue a single connection between the clusters, and it even obtains the IPs of the other nodes to connect to. It doesn't even matter what happens to the node you issued the connection from; the clusters stay connected.

Consul has all the tools necessary to accomplish the same thing.

I suppose one can use the configuration file, and that should work, but it's essentially offloading the work to something else. If it's done by hand it's prone to mistakes; if it's done automatically you'll need service discovery for Consul itself, plus you have to deal with special cases like not listing a node's own IP there.

Anyway, as it is right now the join command line is totally useless. If you have 5 datacenters with 5 nodes each, you'd need to issue 80 joins (5 * 4 * 4) to ensure that all nodes are connected with all nodes, so you won't encounter @wwalker's issues, and on top of that 20 joins each time a node is replaced.

Edit: referenced wrong person

@slackpad
Contributor

slackpad commented Jul 7, 2016

Apologies because the docs are a bit confusing with regards to the "you only need to join a single existing member" part. What it is trying to say is that the LAN and the WAN join both work such that you only have to join with one other existing member of the cluster in order to join the entire cluster (that's why you don't need to do 80 joins in the example above). What's not super clear is that there's no connection between the WAN and LAN clusters, even though there could be.

What we intend to add with the enhancement is automatic WAN joining based on the LAN. You'd have to do at least one WAN join with a server in each datacenter, but after that Consul would recognize that there are servers on the LAN that aren't present in the WAN and would auto-join them. This should make it much harder to get into a situation where you've only WAN-joined a subset of your servers, which is fairly easy to do today.
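The auto-join behaviour slackpad describes is essentially a reconciliation loop: compare the servers visible on the LAN with the members of the WAN pool, and join any that are missing. A minimal shell sketch of that idea, with hard-coded example lists standing in for the output of consul members and consul members -wan (in Consul itself this runs as an internal routine, not a script):

```shell
#!/bin/sh
# Stand-ins for the LAN server list and the current WAN member list.
LAN_SERVERS="10.1.11.237 10.1.66.242 10.1.83.251"
WAN_MEMBERS="10.1.11.237"

for addr in $LAN_SERVERS; do
  case " $WAN_MEMBERS " in
    *" $addr "*) ;;                       # already in the WAN pool, skip
    *) echo "consul join -wan $addr" ;;   # missing: would be auto-joined
  esac
done
```

With these example lists the loop prints join commands for the two servers not yet in the WAN pool, which is exactly the "catch up the stragglers" behaviour the enhancement adds.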

@slackpad
Contributor

slackpad commented Jul 7, 2016

It's not really a bug, more of a missing feature :-)

@takeda
Author

takeda commented Jul 7, 2016

Sounds good. That addition would help a lot.

@csghuser

Yep, agreed.

All you should need to do is perform the WAN join once, and then within both DCs the WAN config should propagate to all server nodes.

Currently this doesn't seem to happen, and when the servers reboot they seem to lose all knowledge of the WAN they were part of, unless you manually add it to the config file as above.

sdinakar85 added a commit to sdinakar85/consul that referenced this issue Oct 7, 2016
Updated the slightly confusing documentation on how to join the clusters over WAN. Also the inputs from hashicorp#1656 is taken in account in this documentation update.
@slackpad
Contributor

WAN join flooding made it into Consul 0.8:

WAN Join Flooding: A new routine was added that looks for Consul servers in the LAN and makes sure that they are joined into the WAN as well. This catches up newly-added servers onto the WAN as soon as they join the LAN, keeping them in sync automatically. [GH-2801]

Closed in #2801.

7 participants