join_cluster doesn't seem to work in some cases #31

Closed
benschw opened this issue Oct 5, 2014 · 29 comments

Comments

@benschw
Contributor

benschw commented Oct 5, 2014

In the following example setting up a client agent, join_cluster doesn't work (but including start_join in the config_hash does).

note: using join_cluster works fine for creating my server cluster in this same environment

#doesn't work:
class { 'consul':
    join_cluster => hiera('join_addr'),
    config_hash => {
        'datacenter' => 'dc1',
        'data_dir'   => '/opt/consul',
        'log_level'  => 'INFO',
        'node_name'  => $::hostname,
        'bind_addr'  => $::ipaddress_eth1,
        'server'     => false,
    }
}

#works
class { 'consul':
    config_hash => {
        'datacenter' => 'dc1',
        'data_dir'   => '/opt/consul',
        'log_level'  => 'INFO',
        'node_name'  => $::hostname,
        'bind_addr'  => $::ipaddress_eth1,
        'server'     => false,
        'start_join' => [hiera('join_addr')],
    }
}

Is this a bug? is there some reason I should be joining the cluster differently as a client than as a server?

@solarkennedy
Contributor

Is it possible the config key changed between different versions of consul?

@benschw
Contributor Author

benschw commented Oct 6, 2014

I wouldn't think so -- I am installing consul the same way in both cases. join_cluster and start_join both work for the server agent; only start_join works for the client agent.

The only difference I can think of is that the client adds a service definition, which triggers a "join" after that definition. But I confirmed with a `consul members` CLI call that the node hasn't joined the cluster (it's not just that the reload isn't triggered).

@EvanKrall
Contributor

Can you clarify how it doesn't work? Do you get puppet errors? Does the consul agent fail to start? Does it start up but not join the cluster?

@benschw
Contributor Author

benschw commented Oct 6, 2014

it starts but doesn't join the cluster; no errors

@EvanKrall
Contributor

I'd suspect the issue is in run_service.pp. Is your version of puppet-consul from before or after #29 got merged? Do you see Exec['consul join ...'] running?

I'm also noticing the onlyif line in run_service.pp uses grep -P; I wonder if your version of grep supports -P (perl regex).
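
For what it's worth, if a grep without -P support does turn up, the same check can be written with POSIX character classes. This is just a sketch of an equivalent pattern, not what the module currently ships:

```shell
# Portable equivalent of `grep -P "num_peers\s*=\s*0"` for greps that
# lack Perl-regex support (e.g. BusyBox or some BSDs):
printf 'num_peers = 0\n' | grep -E 'num_peers[[:space:]]*=[[:space:]]*0' && echo matched
```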

@benschw
Contributor Author

benschw commented Oct 6, 2014

I have an up-to-date copy, and my grep supports -P. The consul join exec is running too.

@solarkennedy
Contributor

Can you run puppet with --debug and pastebin it?

Can you confirm if, with join_cluster, the right join command is being executed?

@benschw
Contributor Author

benschw commented Oct 7, 2014

http://pastebin.com/ud7hzavT (I missed part of the beginning because my buffer ran out, but I think I got everything pertinent. Let me know if you want me to try again.)

The applicable area seems to be around line 386

@EvanKrall
Contributor

Yeah, it appears that the onlyif command is preventing the exec from running. What does consul info | grep -P "num_peers\s*=\s*0" return for you?

@benschw
Contributor Author

benschw commented Oct 8, 2014

Empty. Full output:

$ consul info 
WARNING: It is highly recommended to set GOMAXPROCS higher than 1

agent:
    check_monitors = 1
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease = 
    revision = 
    version = 0.3.1
consul:
    known_servers = 0
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 34
    max_procs = 1
    os = linux
    version = go1.3rc1
serf_lan:
    event_queue = 0
    event_time = 1
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

@solarkennedy
Contributor

num_peers isn't in there, so the grep fails, so the exec never runs.

@EvanKrall what if we made the condition more strict the other way?

    exec { 'join consul cluster':
      cwd       => $consul::config_dir,
      path      => [$consul::bin_dir, '/bin', '/usr/bin'],
      command   => "consul join ${consul::join_cluster}",
      unless    => 'consul info | grep -P "num_peers\s*=\s*[1-9]"',
      subscribe => Service['consul'],
    }

? Seems kinda lame and error prone.

Also, on non-servers, I don't see num_peers at all? Maybe we need to do

unless => "consul members | grep '${consul::join_cluster}'"

But that isn't great, because the output shows short hostnames, maybe an IP was provided instead, etc.

I don't know what to do.

@EvanKrall
Contributor

Yeah, looking at the consul info output on some servers versus non-servers, there isn't a great way to tell whether or not you've already joined with the server in question.

I'm gonna go check the HTTP API docs to see whether anything looks promising.

@EvanKrall
Contributor

Looks like http://localhost:8500/v1/status/peers or http://localhost:8500/v1/catalog/service/consul will provide you with the IPs of the servers you're associated with currently; maybe we can do a hostname lookup on each member of ${consul::join_cluster} (it might be more than one) and then check whether that IP is in the list of peers.
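
To make that idea concrete, here is a minimal sketch; the helper name, hostnames, and IPs are all made up, and in practice the JSON would come from `curl -s http://localhost:8500/v1/status/peers`:

```shell
#!/bin/sh
# needs_join is a hypothetical helper: given a resolved join IP and the
# body of GET /v1/status/peers (e.g. ["10.0.0.1:8300"]), decide whether
# a `consul join` is still required.
needs_join() {
  join_ip="$1"
  peers_json="$2"
  # Peers are listed as "ip:port" strings, so anchor on the quote and colon.
  if printf '%s' "$peers_json" | grep -q "\"${join_ip}:"; then
    return 1  # already a peer; skip the join
  fi
  return 0    # not found; issue `consul join`
}

# Canned example data standing in for the live endpoint:
peers='["10.0.0.1:8300","10.0.0.2:8300"]'
needs_join "10.0.0.1" "$peers" || echo "10.0.0.1 already joined"
needs_join "10.0.0.9" "$peers" && echo "10.0.0.9 needs join"
```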

@tehranian

I also hit this issue playing with join_cluster. I believe we can use known_servers = 0 under the consul heading to determine whether the client has connected to any servers. Once the client joins the cluster, the value of known_servers seems to be non-zero.

I locally changed the Exec for "join consul cluster" to the following and it now works for me:

onlyif => 'consul info | grep "known_servers = 0"',

I was planning to send this patch as a pull request. Let me know what you think.

Thanks,
Dan
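
Against the `consul info` output pasted earlier in this thread, that condition behaves as expected; a quick sketch with canned data, trimmed to the relevant section:

```shell
# An un-joined client's `consul info` (from earlier in this thread)
# contains known_servers = 0, so the onlyif lets the join exec run:
info='consul:
    known_servers = 0
    server = false'
printf '%s\n' "$info" | grep -q 'known_servers = 0' && echo 'join needed'
```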

tehranian pushed a commit to tehranian/puppet-consul that referenced this issue Oct 28, 2014
@hopperd
Contributor

hopperd commented Nov 3, 2014

Wouldn't a more reliable option be to check the output of consul members and check to see if that hostname is listed?

consul members -wan

@tehranian

Hi @Split3,

Yea, that's kind of what I suggested in my pull request #42. It's not quite so straightforward though. Some edge cases:

  • The members listed by consul members are node names. Since Consul allows node names to be user-specified, a node's name might not necessarily be its hostname; hostname is just the default value.
  • Since the hostname passed in to join_cluster may in fact be a DNS alias (a CNAME), this would not work in that situation either.

As I brought up in #42, I think the DNS name needs to be resolved into an IP, and then we have to check consul members for that IP address. That's the most reliable way I can think of.
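
A rough sketch of that resolve-then-check flow; the helper names, addresses, and canned members output below are made up for illustration (`getent hosts` follows CNAMEs, so an alias resolves to the underlying address first):

```shell
#!/bin/sh
# Hypothetical helpers, not part of the module. resolve_join_ip turns a
# join address (possibly a CNAME) into an IP via getent; is_member then
# looks for that IP in `consul members` output, whose second column is
# the node address as ip:port.
resolve_join_ip() {
  getent hosts "$1" | awk '{ print $1; exit }'
}

is_member() {
  join_ip="$1"
  members_output="$2"
  printf '%s\n' "$members_output" | awk '{ print $2 }' | grep -q "^${join_ip}:"
}

# Canned `consul members` output standing in for a live agent; in practice
# the IP would come from resolve_join_ip on the configured join address.
members='node1  10.0.0.1:8301  alive  server  0.3.1
node2  10.0.0.2:8301  alive  client  0.3.1'
is_member "10.0.0.1" "$members" && echo "already joined"
is_member "10.0.0.9" "$members" || echo "needs join"
```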

@solarkennedy
Contributor

This is all pretty crazy.

Is there any other way to do this without embedding so much craziness in the puppet code?

What if we had a script that could wrap this until it is exposed via the API, and then hide that complexity in the script?

Are people really putting IPs into the node names and then requesting joins based on hostnames? Maybe we could just validate the inputs and request that people use FQDNs?

@EvanKrall
Contributor

Maybe we just do the simple consul members | grep for the hostname you're joining to. If someone is doing something strange (e.g. joining a hostname that isn't what that box is calling itself), then the worst that happens is we issue no-op consul join commands.

Later on, if someone is doing that, and cares about puppet convergence, then we can have them implement a better solution.

@solarkennedy
Contributor

Ok. In that case I would prefer it be a bit more discoverable as to why it is not converging. Like:

   if $consul::join_cluster {
     exec { 'join consul cluster':
       cwd       => $consul::config_dir,
       path      => [$consul::bin_dir, '/bin', '/usr/bin'],
       command   => "consul join ${consul::join_cluster}",
       unless    => "consul members -wan | grep '${consul::join_cluster}'",
       logoutput => true,
       subscribe => Service['consul'],
     } ~>
     exec { "/bin/echo WARNING: Consul not joined to the WAN cluster. Does ${consul::join_cluster} match the cluster member names?":
       unless      => "consul members -wan | grep '${consul::join_cluster}'",
       refreshonly => true,
     }
   }

@EvanKrall
Contributor

One nitpick about that suggestion: ${consul::join_cluster} may have multiple entries. We'll need to tailor the grep command to handle that.

@hopperd
Contributor

hopperd commented Nov 4, 2014

Would another option be to push this logic down to the init scripts instead, and pass the join address to the agent when consul starts up?

consul agent -join=<hostname>

@solarkennedy
Contributor

I have no hope that the init scripts would get this right :(
Also, puppet really shouldn't be in the business of doing init scripts if you ask me. I know in real life it must, but I wish it didn't.

@hopperd
Contributor

hopperd commented Nov 7, 2014

I say we forget the option altogether and instead have users use start_join in the config_hash, pushing the entire responsibility down to consul itself.

@tehranian

I kind of like the idea of pushing the responsibility down to Consul via start_join in config_hash.

@benschw
Contributor Author

benschw commented Nov 8, 2014

A nice side effect of not using start_join is that your server nodes can all be started by the same puppet code, without treating the first member as a special case.

@EvanKrall
Contributor

I just noticed retry_join/-retry-join in the docs:

-retry-join - Similar to -join, but allows retrying a join if the first attempt fails. This is useful for cases where we know the address will become available eventually.

This should fix the bootstrapping problem, hopefully without us having to do any sketchy onlyif clauses. It even appears to try to reconnect to servers in this list if it loses connectivity to them.
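
If retry_join is the way forward, a client manifest would presumably mirror the working start_join example from the top of this thread, just with the key swapped; a sketch assuming config_hash is rendered into the agent config verbatim:

```puppet
class { 'consul':
    config_hash => {
        'datacenter' => 'dc1',
        'data_dir'   => '/opt/consul',
        'log_level'  => 'INFO',
        'node_name'  => $::hostname,
        'bind_addr'  => $::ipaddress_eth1,
        'server'     => false,
        'retry_join' => [hiera('join_addr')],
    }
}
```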

@solarkennedy
Contributor

So... should I take out join_cluster altogether, let people use retry_join, and leave this as an exercise to the reader?

@tehranian

re: leaving join_cluster as an exercise to the reader -- fine by me. Probably worth updating the README docs to provide an example, though.

@solarkennedy
Contributor

I've removed the join_cluster functionality.


5 participants