Multiple disparate consul clusters somehow discovering each other #1833

Closed
Amit-PivotalLabs opened this issue Mar 15, 2016 · 6 comments

@Amit-PivotalLabs

I have 3 deployments within a single VPC, and the three deployments shouldn't know about one another. Each deployment has 5 VMs: 3 consul servers plus 2 foobar servers with colocated consul agents. The first deployment uses "dc1" as its datacenter, so from nodes in the first deployment I can nslookup the following and get the expected results (a sample query is sketched after the list):

  • foobar-1.foobar.service.cf.internal (resolves to the IP of one of the foobar servers in dc1)
  • foobar-1.foobar.service.dc1.cf.internal (resolves to the IP of the same foobar server in dc1)
  • foobar-1.node.cf.internal (resolves to the IP of the same foobar server in dc1)
  • foobar-1.node.dc1.cf.internal (resolves to the IP of the same foobar server in dc1)
  • foobar-2.foobar.service.cf.internal (resolves to the IP of the other foobar server in dc1)
  • foobar-2.foobar.service.dc1.cf.internal (resolves to the IP of the other foobar server in dc1)
  • foobar-2.node.cf.internal (resolves to the IP of the other foobar server in dc1)
  • foobar-2.node.dc1.cf.internal (resolves to the IP of the other foobar server in dc1)
  • foobar.service.cf.internal (resolves to the two IPs of the foobar servers in dc1)
  • foobar.service.dc1.cf.internal (resolves to the two IPs of the foobar servers in dc1)
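
For concreteness, each of these is just a plain DNS lookup from the VM. I can also hit the local Consul agent's DNS interface directly, roughly like this (a sketch, assuming the agents are configured with the cf.internal domain and Consul's default DNS port 8600; in our setup the lookups normally go through the system resolver):

# ordinary lookup via the system resolver
nslookup foobar-1.foobar.service.dc1.cf.internal
# or query the local Consul agent's DNS endpoint directly
dig @127.0.0.1 -p 8600 foobar-1.foobar.service.dc1.cf.internal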

Similarly, from nodes in dc3, I can make the same queries with "dc3" replacing "dc1" in all the statements above.

None of the nodes in dc1 should know about the nodes in the other DCs. However, if I make the above queries from a node in dc1 with "dc3" replacing "dc1", I get successful results as though I were on a node in dc3. What's weirder is that this is not symmetric: from dc3, I can't query dc1, which is actually the behaviour I would expect.

It's also not entirely consistent. At one point I was able to query dc1 from dc3. I can include any start commands, info logs, nslookup queries, curl queries, etc. that would help explain what's going on.

Thanks,
Amit

@slackpad
Contributor

Hi @Amit-PivotalLabs, it sounds like they've been WAN joined at some point; see https://www.consul.io/docs/guides/datacenters.html.

Does consul members -wan show a mix of servers from the different DCs if you run it from some of the servers? If you haven't joined them all, and if your WAN links are a little flaky, they can sometimes be unable to reach each other, or network rules might prevent them from communicating in all directions. You'd need 8302/tcp, 8302/udp, and 8300/tcp open between all the WAN servers for them to communicate with each other, and you'd have to consul join -wan them at least once to form a cluster.
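
For example, from a server in dc1 you could spot-check the TCP side of those ports against a dc3 server and then do a one-time WAN join, roughly like this (the address is illustrative; nc only exercises TCP, so the UDP side of 8302 still needs to be open as well):

# from a dc1 server, check the Serf WAN and server RPC ports on a dc3 server
nc -zv 10.0.22.4 8302
nc -zv 10.0.22.4 8300
# then perform the one-time WAN join
consul join -wan 10.0.22.4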

@Amit-PivotalLabs
Author

Hi @slackpad

Thanks for the fast response. They may have been WAN joined at some point; I can double-check that. Just to confirm: this behaviour would only be seen if they had been WAN joined at some point, and otherwise it would be unexpected, yes?

Yes, consul members -wan shows a fairly eclectic mix.

DC1

Consul Server 1
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-0.dc1  10.0.20.4:8302  alive   server  0.5.2  2         dc1
consul-z1-0.dc2  10.0.21.4:8302  alive   server  0.5.2  2         dc2
consul-z1-1.dc1  10.0.20.5:8302  alive   server  0.5.2  2         dc1
consul-z1-0.dc3  10.0.22.4:8302  alive   server  0.5.2  2         dc3
Consul Server 2
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-1.dc1  10.0.20.5:8302  alive   server  0.5.2  2         dc1
consul-z1-0.dc3  10.0.22.4:8302  alive   server  0.5.2  2         dc3
consul-z1-0.dc1  10.0.20.4:8302  alive   server  0.5.2  2         dc1
consul-z1-0.dc2  10.0.21.4:8302  alive   server  0.5.2  2         dc2
Consul Server 3
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-2.dc1  10.0.20.6:8302  alive   server  0.5.2  2         dc1

DC2

Consul Server 1
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-0.dc2  10.0.21.4:8302  alive   server  0.5.2  2         dc2
consul-z1-0.dc1  10.0.20.4:8302  alive   server  0.5.2  2         dc1
consul-z1-0.dc3  10.0.22.4:8302  alive   server  0.5.2  2         dc3
consul-z1-1.dc1  10.0.20.5:8302  alive   server  0.5.2  2         dc1
Consul Server 2
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-1.dc2  10.0.21.5:8302  alive   server  0.5.2  2         dc2
Consul Server 3
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-2.dc2  10.0.21.6:8302  alive   server  0.5.2  2         dc2

DC3

Consul Server 1
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-0.dc3  10.0.22.4:8302  alive   server  0.5.2  2         dc3
consul-z1-0.dc1  10.0.20.4:8302  alive   server  0.5.2  2         dc1
consul-z1-1.dc1  10.0.20.5:8302  alive   server  0.5.2  2         dc1
consul-z1-0.dc2  10.0.21.4:8302  alive   server  0.5.2  2         dc2
Consul Server 2
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-1.dc3  10.0.22.5:8302  alive   server  0.5.2  2         dc3
Consul Server 3
# /var/vcap/packages/consul/bin/consul members -wan
Node             Address         Status  Type    Build  Protocol  DC
consul-z1-2.dc3  10.0.22.6:8302  alive   server  0.5.2  2         dc3

All the VMs are within the same VPC and part of the same security group, which allows all TCP and UDP traffic within the group. It looks like there were also some Network ACLs set up that may have been preventing traffic between clusters. I've deleted the ACLs and restarted the consul processes on all the consul servers, but consul members -wan still shows a mix.

I'm now also trying the following nslookups on all 15 VMs:

  • foobar.service.cf.internal (resolves on all machines, to the two IPs of the foobar nodes within the respective DC)
  • foobar.service.dc1.cf.internal (resolves perfectly on all DC1 nodes; on all other nodes I get NXDOMAIN roughly 50% of the time, seemingly independent of whether it's a DC2 or DC3 node, and of whether it's a foobar node or a consul server node, and retrying a couple of times keeps giving random results, different from the first try; I'm repeating the lookup as sketched below)
  • similarly for the other two DCs (dc2 works on all DC2 nodes, but has random success from nodes in DC1 or DC3)
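
To get a feel for the randomness I'm just repeating the same lookup several times in a row, e.g. from a DC2 node (a rough sketch of what I'm running):

# from a DC2 node, repeat the cross-DC lookup to see the intermittent NXDOMAIN
for i in 1 2 3 4 5; do nslookup foobar.service.dc1.cf.internal; done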

This is just a standard AWS VPC. What would need to be done to make this less flaky?

@slackpad
Contributor

Hi @Amit-PivotalLabs, correct: they have to be WAN joined, otherwise the different datacenters wouldn't know about each other. If you'd like this not to be flaky, the best thing would be to go to one of your Consul servers and consul join -wan each of the other servers. This will put them all into a single WAN cluster and give you the best chance of routing between the datacenters. You'll also want to arrange for newly-added servers to -retry-join-wan a list of servers to try at startup. If you can use the Atlas features of Consul, it can handle the local and WAN join bits for you.
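
For example, a one-time join of all the other servers, run from any one of them, would look something like this (addresses taken from your members output above):

# run once, from any one Consul server
consul join -wan 10.0.20.4 10.0.20.5 10.0.20.6 10.0.21.4 10.0.21.5 10.0.21.6 10.0.22.4 10.0.22.5 10.0.22.6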

@Amit-PivotalLabs
Author

Thanks @slackpad

Atlas won't work for us as this needs to work in airtight on-prem networks.

Could you clarify your advice about reducing flakiness? Does it suffice to have one server in one DC join -wan to one server in each of the other two DCs? Would it somehow make things more robust if every server join -wan'd every other server in every DC?

If some consul servers are configured to -retry-join-wan some IPs intended for a consul cluster in another DC, and that other DC doesn't exist yet, will the first DC come up and work fine? I.e., it will keep trying to join -wan the other DC, but until the other DC is up it will do its own thing fine, correct?

@slackpad
Contributor

Using -retry-join-wan is a good way to go: the servers will still start, but will keep attempting in the background to join at least one of the WAN servers they are given. In general it's not sufficient to have one server in one DC contact just one in another DC; you can end up with sets of servers that don't find each other, as you have now. If it's possible to enumerate them all, that would definitely be more robust and reduce the likelihood of "islands". Another approach is to have them all -retry-join-wan the same server, which acts as a hub to connect them all together.
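
As a sketch, enumerating them all in the agents' startup flags would look roughly like this (the config path is illustrative, and it assumes a Consul build that supports -retry-join-wan; your deployment would template these into the server start command):

# on each Consul server, at startup
consul agent -server -config-dir=/var/vcap/jobs/consul/config \
  -retry-join-wan=10.0.20.4 -retry-join-wan=10.0.20.5 -retry-join-wan=10.0.20.6 \
  -retry-join-wan=10.0.21.4 -retry-join-wan=10.0.21.5 -retry-join-wan=10.0.21.6 \
  -retry-join-wan=10.0.22.4 -retry-join-wan=10.0.22.5 -retry-join-wan=10.0.22.6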

@slackpad
Contributor

slackpad commented May 3, 2017

Closing this out as I don't think there's forward work left here. Consul 0.8.0 added WAN "join flooding":

WAN Join Flooding: A new routine was added that looks for Consul servers in the LAN and makes sure that they are joined into the WAN as well. This catches up newly-added servers onto the WAN as soon as they join the LAN, keeping them in sync automatically. [GH-2801]
