join_cluster doesn't seem to work in some cases #31

Closed
benschw opened this issue Oct 5, 2014 · 29 comments

Comments

@benschw
Contributor

benschw commented Oct 5, 2014

In the following example setting up a client agent, join_cluster doesn't work (but including start_join in the config_hash does).

note: using join_cluster works fine for creating my server cluster in this same environment

#doesn't work:
class { 'consul':
    join_cluster => hiera('join_addr'),
    config_hash => {
        'datacenter' => 'dc1',
        'data_dir'   => '/opt/consul',
        'log_level'  => 'INFO',
        'node_name'  => $::hostname,
        'bind_addr'  => $::ipaddress_eth1,
        'server'     => false,
    }
}

#works
class { 'consul':
    config_hash => {
        'datacenter' => 'dc1',
        'data_dir'   => '/opt/consul',
        'log_level'  => 'INFO',
        'node_name'  => $::hostname,
        'bind_addr'  => $::ipaddress_eth1,
        'server'     => false,
        'start_join' => [hiera('join_addr')],
    }
}

Is this a bug? is there some reason I should be joining the cluster differently as a client than as a server?

@solarkennedy
Contributor

Is it possible the config key changed between different versions of consul?

@benschw
Contributor Author

benschw commented Oct 6, 2014

I wouldn't think so -- I am installing consul the same way in both cases. join_cluster and start_join both work for the server agent; only start_join works for the client agent.

The only difference I can think of is that the client adds a service definition, which triggers a "join" after that definition. But I confirmed with a `consul members` CLI call that the node hasn't joined the cluster (it's not just that the reload isn't triggered).

@EvanKrall
Contributor

Can you clarify how it doesn't work? Do you get puppet errors? Does the consul agent fail to start? Does it start up but not join the cluster?

@benschw
Contributor Author

benschw commented Oct 6, 2014

it starts but doesn't join the cluster; no errors

@EvanKrall
Contributor

I'd suspect the issue is in run_service.pp. Is your version of puppet-consul from before or after #29 got merged? Do you see Exec['consul join ...'] running?

I'm also noticing the onlyif line in run_service.pp uses grep -P; I wonder if your version of grep supports -P (perl regex).
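
For what it's worth, if a grep without -P support does turn up, the same check can be written with POSIX character classes. This is just a sketch of an equivalent pattern, not what the module currently ships:

```shell
# Portable equivalent of `grep -P "num_peers\s*=\s*0"` for greps that
# lack Perl-regex support (e.g. BusyBox or some BSDs):
printf 'num_peers = 0\n' | grep -E 'num_peers[[:space:]]*=[[:space:]]*0' && echo matched
```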

@benschw
Contributor Author

benschw commented Oct 6, 2014

I have an up-to-date copy, and my grep supports -P. The consul join exec is running too.

@solarkennedy
Contributor

Can you run puppet with --debug and pastebin it?

Can you confirm if, with join_cluster, the right join command is being executed?

@benschw
Contributor Author

benschw commented Oct 7, 2014

http://pastebin.com/ud7hzavT (I missed part of the beginning because my buffer ran out, but I think I got everything pertinent. Let me know if you want me to try again.)

The applicable area seems to be around line 386

@EvanKrall
Contributor

Yeah, it appears that the onlyif command is preventing the exec from running. What does consul info | grep -P "num_peers\s*=\s*0" return for you?

@benschw
Contributor Author

benschw commented Oct 8, 2014

Empty. Full output:

$ consul info 
WARNING: It is highly recommended to set GOMAXPROCS higher than 1

agent:
    check_monitors = 1
    check_ttls = 0
    checks = 1
    services = 1
build:
    prerelease = 
    revision = 
    version = 0.3.1
consul:
    known_servers = 0
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 34
    max_procs = 1
    os = linux
    version = go1.3rc1
serf_lan:
    event_queue = 0
    event_time = 1
    failed = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

@solarkennedy
Contributor

num_peers isn't in there, so the grep fails, so the exec never runs.

@EvanKrall what if we made the condition more strict the other way?

    exec { 'join consul cluster':
      cwd       => $consul::config_dir,
      path      => [$consul::bin_dir, '/bin', '/usr/bin'],
      command   => "consul join ${consul::join_cluster}",
      unless    => 'consul info | grep -P "num_peers\s*=\s*[1-9]"',
      subscribe => Service['consul'],
    }

? Seems kinda lame and error prone.

Also, on non-servers, I don't see num_peers at all? Maybe we need to do

unless => "consul members | grep '${consul::join_cluster}'"

But that isn't great, because the output shows short hostnames, maybe an IP was provided instead, etc.

I don't know what to do.

@EvanKrall
Contributor

Yeah, looking at the consul info output on some servers versus non-servers, there isn't a great way to tell whether or not you've already joined with the server in question.

I'm gonna go check the HTTP API docs to see whether anything looks promising.

@EvanKrall
Contributor

Looks like http://localhost:8500/v1/status/peers or http://localhost:8500/v1/catalog/service/consul will provide you with the IPs of the servers you're associated with currently; maybe we can do a hostname lookup on each member of ${consul::join_cluster} (it might be more than one) and then check whether that IP is in the list of peers.
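
To make that idea concrete, here is a minimal sketch; the helper name, hostnames, and IPs are all made up, and in practice the JSON would come from `curl -s http://localhost:8500/v1/status/peers`:

```shell
#!/bin/sh
# needs_join is a hypothetical helper: given a resolved join IP and the
# body of GET /v1/status/peers (e.g. ["10.0.0.1:8300"]), decide whether
# a `consul join` is still required.
needs_join() {
  join_ip="$1"
  peers_json="$2"
  # Peers are listed as "ip:port" strings, so anchor on the quote and colon.
  if printf '%s' "$peers_json" | grep -q "\"${join_ip}:"; then
    return 1  # already a peer; skip the join
  fi
  return 0    # not found; issue `consul join`
}

# Canned example data standing in for the live endpoint:
peers='["10.0.0.1:8300","10.0.0.2:8300"]'
needs_join "10.0.0.1" "$peers" || echo "10.0.0.1 already joined"
needs_join "10.0.0.9" "$peers" && echo "10.0.0.9 needs join"
```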

@tehranian

I also hit this issue playing with join_cluster. I believe we can use known_servers = 0 under the consul heading to determine whether the client has connected to any servers. Once the client joins the cluster, the value of known_servers seems to be non-zero.

I locally changed the Exec for "join consul cluster" to the following and it now works for me:

onlyif => 'consul info | grep "known_servers = 0"',

I was planning to send this patch as a pull request. Let me know what you think.

Thanks,
Dan
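
Against the `consul info` output pasted earlier in this thread, that condition behaves as expected; a quick sketch with canned data, trimmed to the relevant section:

```shell
# An un-joined client's `consul info` (from earlier in this thread)
# contains known_servers = 0, so the onlyif lets the join exec run:
info='consul:
    known_servers = 0
    server = false'
printf '%s\n' "$info" | grep -q 'known_servers = 0' && echo 'join needed'
```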

tehranian pushed a commit to tehranian/puppet-consul that referenced this issue Oct 28, 2014
@hopperd
Contributor

hopperd commented Nov 3, 2014

Wouldn't a more reliable option be to check the output of consul members and check to see if that hostname is listed?

consul members -wan

@tehranian

Hi @Split3,

Yea, that's kind of what I suggested in my pull request #42. It's not quite so straightforward though. Some edge cases:

  • The members listed by consul members are node names. Since Consul allows node names to be user-specified, a node's name might not necessarily be its hostname; hostname is just the default value.
  • Since the hostname passed in to join_cluster may in fact be a DNS alias (a CNAME), this would not work in that situation either.

As I brought up in #42, I think the DNS name needs to be resolved into an IP, and then we have to check consul members for that IP address. That's the most reliable way I can think of.
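
A rough sketch of that resolve-then-check flow; the helper names, addresses, and canned members output below are made up for illustration (`getent hosts` follows CNAMEs, so an alias resolves to the underlying address first):

```shell
#!/bin/sh
# Hypothetical helpers, not part of the module. resolve_join_ip turns a
# join address (possibly a CNAME) into an IP via getent; is_member then
# looks for that IP in `consul members` output, whose second column is
# the node address as ip:port.
resolve_join_ip() {
  getent hosts "$1" | awk '{ print $1; exit }'
}

is_member() {
  join_ip="$1"
  members_output="$2"
  printf '%s\n' "$members_output" | awk '{ print $2 }' | grep -q "^${join_ip}:"
}

# Canned `consul members` output standing in for a live agent; in practice
# the IP would come from resolve_join_ip on the configured join address.
members='node1  10.0.0.1:8301  alive  server  0.3.1
node2  10.0.0.2:8301  alive  client  0.3.1'
is_member "10.0.0.1" "$members" && echo "already joined"
is_member "10.0.0.9" "$members" || echo "needs join"
```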

@solarkennedy
Contributor

This is all pretty crazy.

Is there any other way to do this without embedding so much craziness in the puppet code?

What if we had a script that could wrap this until it is exposed via the API, and then hide that complexity in the script?

Are people really putting IPs into the node names and then requesting joins based on hostnames? Maybe we could just validate the inputs and request that people use FQDNs?

@EvanKrall
Contributor

Maybe we just do the simple consul members | grep for the hostname you're joining to. If someone is doing something strange (e.g. joining a hostname that isn't what that box is calling itself), then the worst that happens is we issue no-op consul join commands.

Later on, if someone is doing that, and cares about puppet convergence, then we can have them implement a better solution.

@solarkennedy
Contributor

Ok. In that case I would prefer it be a bit more discoverable as to why it is not converging. Like:

   if $consul::join_cluster {
     exec { 'join consul cluster':
       cwd       => $consul::config_dir,
       path      => [$consul::bin_dir, '/bin', '/usr/bin'],
       command   => "consul join ${consul::join_cluster}",
       unless    => "consul members -wan | grep '${consul::join_cluster}'",
       logoutput => true,
       subscribe => Service['consul'],
     } ~>
     exec { "/bin/echo WARNING: Consul not joined to the WAN cluster. Does ${consul::join_cluster} match the cluster member names?":
       unless      => "consul members -wan | grep '${consul::join_cluster}'",
       refreshonly => true,
     }
   }

@EvanKrall
Contributor

One nitpick about that suggestion: ${consul::join_cluster} may have multiple entries. We'll need to tailor the grep command to handle that.

@hopperd
Contributor

hopperd commented Nov 4, 2014

Would another option be to push this logic down to the init scripts instead, and pass the join address to the agent when consul starts up?

consul agent -join=<hostname>

@solarkennedy
Contributor

I have no hope that the init scripts would get this right :(
Also, puppet really shouldn't be in the business of doing init scripts if you ask me. I know in real life it must, but I wish it didn't.

@hopperd
Contributor

hopperd commented Nov 7, 2014

I say we forget the option altogether and instead have users use start_join in the config_hash, pushing the entire responsibility down to consul itself.

@tehranian

I kind of like the idea of pushing the responsibility down to Consul via start_join in config_hash.

@benschw
Contributor Author

benschw commented Nov 8, 2014

A nice side effect of not using start_join is that your server nodes can all be started by the same puppet code, without treating the first member as a special case.

@EvanKrall
Contributor

I just noticed retry_join/-retry-join in the docs:

-retry-join - Similar to -join, but allows retrying a join if the first attempt fails. This is useful for cases where we know the address will become available eventually.

This should fix the bootstrapping problem, hopefully without us having to do any sketchy onlyif clauses. It even appears to try to reconnect to servers in this list if it loses connectivity to them.
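
If retry_join is the way forward, a client manifest would presumably mirror the working start_join example from the top of this thread, just with the key swapped; a sketch assuming config_hash is rendered into the agent config verbatim:

```puppet
class { 'consul':
    config_hash => {
        'datacenter' => 'dc1',
        'data_dir'   => '/opt/consul',
        'log_level'  => 'INFO',
        'node_name'  => $::hostname,
        'bind_addr'  => $::ipaddress_eth1,
        'server'     => false,
        'retry_join' => [hiera('join_addr')],
    }
}
```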

@solarkennedy
Contributor

So... should I take out join_cluster altogether, let people use retry_join, and leave this as an exercise to the reader?

@tehranian

re: leaving join_cluster as an exercise to the reader -- fine by me. Probably worth updating the README docs to provide an example, though.

@solarkennedy
Contributor

I've removed the join_cluster functionality.


5 participants