issue with haproxy reloads #442
Hi @consultantRR, it looks like this might be a duplicate of #428. Are you running in a container, by chance?
@sethvargo No, I am not. I originally saw #428 and was hopeful for an answer, but since it seemed to be container-specific, I decided to open a separate issue. For clarity: I am running consul-template on Amazon Linux, and it is being invoked (as root) as follows:
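A sketch of this sort of invocation (the config path here is hypothetical; the actual flags were in the original post):

```shell
# Hypothetical invocation: run consul-template as root against a config file
consul-template -config /etc/consul-template/haproxy.hcl
```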
@consultantRR Can you change that to print to a logfile, run in debug mode, and paste the output here after an haproxy restart, please?
Okay @sethvargo, I have a debug log covering the time period in which 8 haproxy restarts took place. I had hoped to simplify this log example and show only 1 haproxy restart, but I was not able to reproduce the issue with only a single restart. Due to length, I will paste log lines covering 1 restart at the end of this post. (Looks like business as usual to me - I see no issues.) In this example, each restart took place about 6-7 seconds after the previous one. Each time, I invoked this restart by taking a node referenced in the template in or out of Consul maintenance mode. Prior to this example log, one haproxy process was running. After this example log (8 restarts), three haproxy processes were left running permanently. To be clear, this was the invocation of consul-template for this test:
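A sketch of that invocation with debug logging enabled, per the request above (paths hypothetical):

```shell
# Hypothetical reconstruction: same invocation, with debug output to a logfile
consul-template -config /etc/consul-template/haproxy.hcl \
  -log-level=debug > /var/log/consul-template.log 2>&1
```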
Here are the first few lines of the log, showing the config:
Maybe a red herring, but possibly of interest: during this 8-restart test, I had a separate haproxy node running with the exact same config (and template) as the node I have logged here. The only difference was that consul-template on that node was not logging. The invocation for consul-template on that node was the same, minus the debug logging and output redirection.
On this node, after the same 8-restart test, 8 haproxy processes were left running (as opposed to 3 haproxy processes on the logged node). In further tests, extra haproxy processes do seem to stack up much more quickly on this node than on the one that debug logging is now enabled on. I may try to craft a methodology for a simpler, more isolated, and controlled test which still shows this behavior. If so, I will post results here. For now, here is part of the debug log, covering 1 restart:
@sethvargo I just reverted this node to a former consul-template version and performed this same test (8 restarts) with consul-template v0.10.0. By comparison, at the end of this new test, this node had no extra haproxy processes, while the other haproxy node (running v0.11.0) had again stacked up extra processes.
Hi @consultantRR, can you share your actual Consul Template template, please?
The full diff between 0.10.0 and 0.11.0 is here: v0.10.0...v0.11.0. I'm going to try to get a reproduction together, but I have been unsuccessful thus far.
Here it is, with only two minor redactions:
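A minimal sketch of an haproxy template of this general shape (the service name, paths, and timeouts are placeholders, not the actual template):

```shell
# Hypothetical haproxy.cfg.ctmpl sketch; "app" is a placeholder service name
cat > /etc/haproxy/haproxy.cfg.ctmpl <<'EOF'
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend app

backend app{{ range service "app" }}
    server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}
EOF
```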
@consultantRR Just to be clear - you aren't using the Vault integration at all, right?
Not at this time. The two redactions in the template are unrelated to Vault.
Hi @consultantRR, I did some digging today, and I was able to reproduce this issue exactly once, and then it stopped reproducing. Are you able to reproduce this with something that isn't haproxy? What's interesting to me is that haproxy orphans itself (it's still running even after the command returns and Consul Template quits), but I wonder if there's a race condition there somehow.
Hi @sethvargo - I haven't reproduced this with something other than haproxy, but also, I can't say that I have tried. ;) Just to think out loud about what may be happening, here is the reload command again (run as root, as noted above):
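```shell
haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid \
    -sf $(cat /var/run/haproxy.pid)
```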
So anyway, here is my understanding of how that command works, from TFM:
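```shell
# Paraphrasing the haproxy manual (a summary, not the original post's text):
#   -f <file>  : load this configuration file
#   -q         : quiet mode; suppress startup messages
#   -p <file>  : write the pids of all started processes to this pid file
#   -sf <pids> : start the new process first, then tell the old pids to
#                "soft stop": stop listening, finish serving the connections
#                they already have, and then exit
# A reload therefore briefly overlaps old and new processes by design;
# the bug described here is that old processes never exit.
```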
What's interesting is that firing this command manually multiple times from a shell works exactly as expected every time. And what's even more interesting to me is that this behavior doesn't seem to show up with consul-template v0.10.0.
It's anecdotal, but I did observe it with two versions of consul-template.
There is almost definitely a discrepancy between v0.10.0 and v0.11.0.
@consultantRR okay - I spent a lot of time on this today, and I have a way to reproduce it. I am able to reproduce this 100% of the time when haproxy is not already running, so the pid file handed to -sf contains a stale PID:
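A sketch of that condition (paths hypothetical; a reconstruction, not the original repro script):

```shell
# Simulate "no haproxy running yet": the pid file exists but names a
# dead process, so -sf hands haproxy a stale pid list
echo 99999 > /var/run/haproxy.pid
haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid \
    -sf $(cat /var/run/haproxy.pid)
```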
I was able to reproduce this under CT master, 0.11.0, 0.10.0, and 0.9.0. Because of this, I think the version issue is actually a red herring. I think the reason it "works" on CT v0.10.0 is that you already have a running haproxy process on those nodes, and you're trying to use CT v0.11.0 on a node that doesn't have an haproxy instance already running. I could be totally wrong, but that's my only way to reproduce this issue at the moment: if haproxy isn't running, the PID is invalid, and haproxy does something really strange - it hangs onto the subprocess it spawns, but it doesn't hang the parent process, so CT thinks it has exited. Now, when the PID exists, they are both very happy. Here is CT v0.11.0:
and here is CT v0.10.0:
Obviously the PIDs are different because these are different VMs, but it's the exact same script and configuration on all the machines. Each time I change the key in Consul, the PID changes (meaning CT successfully reloads the process). I'm not really sure where to go from here. I'm out of debugging possibilities, and I'm fairly certain this is a problem with the way haproxy issues its reloads. Please let me know what you think.
Here is the script:
Random sanity check question - is it possible there are multiple consul-template instances running on any of these problem machines?
First of all, thank you @sethvargo, immensely, for the attention to this and the time spent thus far. It is much, much appreciated, and is a hallmark of HashiCorp, and of your work in particular, Seth. I am not sure what to think about this either. I can say that my tests today have included starting both CT 10 and 11 with haproxy already running. In all cases CT 11 has exhibited this behavior (with the template I posted above - I haven't tried other templates), and in no cases has CT 10 exhibited it. In fact, when switching back from CT 10 to CT 11, I see this behavior again on the same node, and vice versa (switching from CT 11 to CT 10 eliminates this behavior in all cases on the same node). One thing I have noticed: in all cases (again, always under CT 11 and never under CT 10) in which the node gets into a state where more than one haproxy process is erroneously running, exactly 2 manual invocations of the reload command are needed to get back down to a single process. Meaning, the "remedy" for the stacked processes is to run the reload command by hand, twice.
And btw, thanks @slackpad for the sanity check (sanity is always good!) - no, in all cases referenced above, I can confirm only a single instance of CT running.
Hi @consultantRR Could you try setting |
To be sure, it didn't seem to like
But it was OK with
My 2 cents: I have seen orphan haproxy instances frequently (and I'm not using consul-template with haproxy yet).
In other words, it would be interesting to see whether the "orphan" haproxy instances are really orphaned, or whether they are just waiting for some side to actually close the connection.
@zonzamas I'm not super familiar with haproxy - is there an easy way for @consultantRR to do that?
I use standard Linux tools. With PID being the actual PID of an orphan: netstat -apn | grep PID to get established connections, and strace -fp PID to see what the process is actually doing. From netstat -apn I get something like:
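A runnable sketch of those checks (the PID value is a placeholder):

```shell
# Substitute the pid of a suspected orphan haproxy process
PID=12345
netstat -apn | grep "$PID"   # established connections held by that pid
strace -fp "$PID"            # watch what the process is actually doing
```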
I will do some controlled testing on this, @zonzamas - but offhand, I don't believe this is the issue, as: (1) the processes stay running for hours and only increase in number, they don't decrease; and (2) this behavior shows up under v11 CT but not pre-v11 CT. Actually, I should clarify: when I say "pre-v11 CT", I specifically mean v10. I have not tested versions prior to v10 in the context of this issue.
I'm still at a loss for what might be causing this. I'm going to release CT 0.11.1 without a fix for this, because I do not want to delay the release further. Sorry!
Thanks @consultantRR and @sethvargo for the work on this.
Until there's an official release from HashiCorp, I've put together a build of 0.12.2 with Go 1.6. I'd have used an older build, but I'd already started relying on 0.11 syntax and features 😅
Hello, the same problem is resolved for me with consul-template 0.12.2 and Go 1.6rc1. Thank you @duggan for your Dockerfile!
Use a build of consul-template 0.12.2 compiled with Go 1.6rc1 while waiting for official builds to get a fix for hashicorp/consul-template#442. The build is from https://github.com/duggan/build-consul-template
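A rough sketch of what such a build entails with a Go 1.6 toolchain (hypothetical; the linked repository automates something similar, and a GOPATH layout is assumed):

```shell
# Hypothetical build of consul-template v0.12.2 with Go 1.6 (GOPATH-era)
export GOPATH="$HOME/go"
go get -d github.com/hashicorp/consul-template
cd "$GOPATH/src/github.com/hashicorp/consul-template"
git checkout v0.12.2
go build -o consul-template
```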
I don't have much to add to the above, but we upgraded from 0.10.x to 0.12.2 today and ended up with dozens of haproxy processes.
We tried 0.12.2 on CoreOS 835.12.0 using haproxy 1.6.3, without success. There are no errors in the consul-template logs or haproxy logs. After a few hours, our network service (Midokura) started to get overloaded with a large number of SYN requests. We rolled back to 0.10.2 and the network service recovered.
@brycechesternewman unless you used @duggan's custom build against the unreleased Go 1.6, the issue is still there.
Go 1.6 was released today: https://golang.org/dl/#go1.6. @sethvargo, can a new release of consul-template be cut, please?
0.13.0 is released and compiled with Go 1.6!
Our long national nightmare is over. :) |
Thank you @sethvargo!
I'm running the 0.13.0 release pulled from the releases page and still seeing occurrences of multiple HAProxy processes running after a service reload. Is anybody else still seeing this issue?
Yes, just today we found five extra haproxy processes with 0.13.0. Clearly this isn't fully fixed for us.
@sethvargo Should I open a new issue to investigate the persistence of this bug, or can we reopen this one? |
FWIW, 0.13.0 fixes the problem for me. An haproxy reload is supposed to leave the previous process behind (that is what the -sf flag does) until its existing connections finish, so one old process lingering briefly is expected.
Victor, it mostly fixed the problem for us as well. At least, until 0.13 had been running for a while.
We've added
Hi everyone, since this issue seems to be getting a lot of noise, I'm going to clarify that this has been fixed in Consul Template 0.13+. Please open any other concerns as separate issues. haproxy itself will leave behind previous processes for a short period of time to finish existing requests, so unless you see a multitude of haproxy processes running each time CT restarts the process, this issue does not apply to you 😄
Using consul-template (v0.11.0) with haproxy, I am seeing an issue in which multiple haproxy processes stack up over time as consul-template rewrites the haproxy.cfg file and fires off the reload command. In this scenario, consul-template is running as root.
Here is the config:
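A minimal sketch of a consul-template config of this shape (paths are placeholders; the command matches the reload command quoted below):

```shell
# Hypothetical /etc/consul-template/haproxy.hcl; real values were redacted
cat > /etc/consul-template/haproxy.hcl <<'EOF'
consul = "127.0.0.1:8500"

template {
  source      = "/etc/haproxy/haproxy.cfg.ctmpl"
  destination = "/etc/haproxy/haproxy.cfg"
  command     = "haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)"
}
EOF
```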
Manually running the reload command (haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)) works fine and does not stack up multiple haproxy processes. But, for example, after a few consul-template rewrites of haproxy.cfg, here is what I see:
Any thoughts on what might be happening here, and why the behavior differs between consul-template running this reload command and my running it manually (outside the purview of consul-template) from a shell?