join_cluster doesn't seem to work in some cases #31
Comments
Is it possible the config key changed between different versions of consul? |
I wouldn't think so -- I am installing consul in the same way. join_cluster and start_join work for the server agent, only start_join for the client agent. The only difference I can think of is that adding a service definition in the client triggers a "join" after that definition. But I confirmed that a "consul members" CLI call shows the node hasn't joined the cluster (not just that the reload isn't triggered.) |
Can you clarify how it doesn't work? Do you get puppet errors? Does the consul agent fail to start? Does it start up but not join the cluster? |
it starts but doesn't join the cluster; no errors |
I'd suspect the issue is in run_service.pp. Is your version of puppet-consul from before or after #29 got merged? Do you see Exec['consul join ...'] running? I'm also noticing the |
I have an up to date copy and support for |
Can you run puppet with --debug and pastebin it? Can you confirm if, with join_cluster, the right join command is being executed? |
http://pastebin.com/ud7hzavT (i missed part of the beginning because my buffer ran out, but I think I got everything pertinent. Let me know if you want me to try again) The applicable area seems to be around line 386 |
Yeah, it appears that the onlyif command is preventing the exec from running. What does |
empty. full output:
|
num_peers isn't in there, so the grep fails, so the exec never runs. @EvanKrall what if we made the condition more strict the other way?
exec { 'join consul cluster':
  cwd => $consul::config_dir,
  path => [$consul::bin_dir,'/bin','/usr/bin'],
  command => "consul join ${consul::join_cluster}",
  unless => 'consul info | grep -P "num_peers\s*=\s*[1-9]"',
  subscribe => Service['consul'],
}
Seems kinda lame and error prone. Also, on non-servers, I don't see num_peers at all? Maybe we need to do
unless => "consul members | grep '${consul::join_cluster}'"
But that isn't super because the output has short hostnames, and maybe an IP is provided, etc. I don't know what to do. |
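To make the num_peers problem above concrete, here is a rough shell sketch that separates the three cases a plain grep collapses into "no match": the key is missing (as reported on client agents), the key is present but zero, or there is at least one peer. The function name `raft_peer_state` is hypothetical, and the sketch assumes `consul info` prints the value as a `num_peers = N` line.

```shell
#!/bin/sh
# Hypothetical helper; not part of puppet-consul or Consul itself.

# raft_peer_state
# Reads `consul info` output on stdin and prints one of:
#   peers   num_peers is present and greater than zero
#   none    num_peers is present and zero
#   absent  no num_peers line at all (what a client agent was reported to show)
raft_peer_state() {
  awk '
    $1 == "num_peers" && $2 == "=" { seen = 1; if ($3 + 0 > 0) some = 1 }
    END { print (some ? "peers" : (seen ? "none" : "absent")) }
  '
}
```

Usage would be something like `consul info | raft_peer_state`. A guard built on num_peers alone cannot tell "absent" from "none", which is why it never converges on a client.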
Yeah, looking at the I'm gonna go check the HTTP API docs to see whether anything looks promising. |
Looks like http://localhost:8500/v1/status/peers or http://localhost:8500/v1/catalog/service/consul will provide you with the IPs of the servers you're associated with currently; maybe we can do a hostname lookup on each member of |
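A rough sketch of what the peers-based check could look like in shell. It assumes `/v1/status/peers` returns a JSON array of `"ip:port"` strings, that `curl` and `getent` are installed, and that the join target resolves to a single IPv4 address. The function names `peers_contains_ip` and `joined_via_peers` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; not part of puppet-consul or Consul itself.

# peers_contains_ip JSON IP
# Succeeds if IP appears as the host part of an "ip:port" entry in JSON.
# The leading quote and trailing colon prevent 10.1.10.1 matching 10.1.10.12.
peers_contains_ip() {
  printf '%s' "$1" | grep -qF "\"$2:"
}

# joined_via_peers TARGET [API_URL]
# Resolves TARGET to an IP (hostnames go through getent, IPv4 literals are
# used as-is), then checks whether that IP is among the peers reported by
# the local agent's HTTP API.
joined_via_peers() {
  case "$1" in
    *[!0-9.]*) target_ip=$(getent hosts "$1" | awk 'NR == 1 { print $1 }') ;;
    *)         target_ip=$1 ;;
  esac
  [ -n "$target_ip" ] || return 1
  peers=$(curl -fsS "${2:-http://localhost:8500}/v1/status/peers") || return 1
  peers_contains_ip "$peers" "$target_ip"
}
```

Something like `joined_via_peers consul1.example.com` (an example hostname) could then back an `unless` guard. It only helps when the join target is one of the servers listed by the API, and it does nothing for targets that resolve to several addresses.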
I also hit this issue in playing with I locally changed the
I was planning to send this patch as a pull request. Let me know what you think. Thanks, |
Wouldn't a more reliable option be to check the output of consul members and check to see if that hostname is listed?
|
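For reference, a minimal sketch of a stricter `consul members` check than a bare grep. It assumes the output has a header row, the node name in the first column, and an `ip:port` address in the second. The function name `members_has_node` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; not part of puppet-consul or Consul itself.

# members_has_node TARGET
# Reads `consul members` output on stdin and succeeds only if TARGET exactly
# matches a node name (column 1) or the IP part of an address (column 2).
members_has_node() {
  awk -v target="$1" '
    NR == 1 { next }                      # skip the header row
    {
      split($2, addr, ":")                # "10.0.0.5:8301" -> "10.0.0.5"
      if ($1 == target || addr[1] == target) found = 1
    }
    END { exit !found }
  '
}
```

Used as `consul members | members_has_node "$target"`. Exact matching avoids a substring hit such as `web1` matching `web10`, but it still fails when the join target is a DNS name that differs from the node name.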
Hi @Split3 , Yea, that's kind of what I suggested in my pull request #42 . It's not quite so straightforward though. Some edge cases:
As I brought up in #42 I think the DNS name needs to be resolved into an IP, and then we have to check |
This is all pretty crazy. Is there any other way to do this without embedding so much craziness in the puppet code? What if we had a script that could wrap this until this is exposed to the api, and then hide that complexity in the script? Are people really putting IPs into the node names and then requesting joins based on hostnames? |
Maybe we just do the simple Later on, if someone is doing that, and cares about puppet convergence, then we can have them implement a better solution. |
Ok. In that case I would prefer it be a bit more discoverable as to why it is not converging. Like:
if $consul::join_cluster {
exec { 'join consul cluster':
cwd => $consul::config_dir,
path => [$consul::bin_dir,'/bin','/usr/bin'],
command => "consul join ${consul::join_cluster}",
unless => "consul members -wan | grep ${consul::join_cluster}",
logoutput => true,
subscribe => Service['consul'],
} ~>
exec { "/bin/echo WARNING: Consul not joined to the WAN cluster. Does ${consul::join_cluster} match the cluster member names?":
unless => "consul members -wan | grep ${consul::join_cluster}",
refreshonly => true,
}
} |
One nitpick about that suggestion: |
Would another option be to push this logic down to the init scripts instead and pass it to the agent/server when consul starts up
|
I have no hope that the init scripts would get this right :( |
I say then forget the option altogether and instead have users use start_join in the config_hash instead. Thus pushing the entire responsibility down to consul itself. |
I kind of like the idea of pushing the responsibility down to Consul via |
A nice side effect of not using start_join is that your server nodes can all be started by the same puppet code and you don't have to treat the first member as a special case |
I just noticed
This should fix the bootstrapping problem, hopefully without us having to do any sketchy |
Soo.. should I take out the join_cluster altogether, let people use retry_join and leave this up as an exercise to the reader? |
re: leave |
I've removed the join_cluster functionality. |
In the following example to set up a client agent, join_cluster doesn't work (but including start_join does.)
note: using join_cluster works fine for creating my server cluster in this same environment
Is this a bug? Is there some reason I should be joining the cluster differently as a client than as a server?