issue with haproxy reloads #442

Closed

ProbablyRusty opened this issue Oct 21, 2015 · 94 comments

@ProbablyRusty

Using consul-template (v0.11.0) with haproxy, I am seeing an issue in which multiple haproxy processes stack up over time as consul-template rewrites the haproxy.cfg file and fires off the reload command.

In this scenario, consul-template is running as root.

Here is the config:

consul = "127.0.0.1:8500"

template {
  source = "/etc/haproxy/haproxy.template"
  destination = "/etc/haproxy/haproxy.cfg"
  command = "haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )"
}

Manually running the reload command (haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )) works fine and does not stack up multiple haproxy processes.

But, for example, after a few consul-template rewrites of haproxy.cfg, here is what I see:

10258 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10029
10262 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10258
10270 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10266
10369 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10365
10427 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10423
10483 ?        Ss     0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 10479

Any thoughts on what might be happening here, and why the behavior differs between consul-template running this reload command and my running the same command manually (outside the purview of consul-template) from a shell?

@sethvargo
Contributor

Hi @consultantRR

It looks like this might be a duplicate of #428. Are you running in a container by chance?

@ProbablyRusty
Author

@sethvargo No I am not. I originally saw #428 and was hopeful for an answer, but since it seemed to be container specific, I decided to open a separate issue.

For clarity:

I am running consul-template in Amazon Linux and it is being invoked (as root) as follows:

nohup consul-template -config /etc/haproxy/consul-template.hcl >/dev/null 2>&1 &

@sethvargo
Contributor

@consultantRR

Can you change that to print to a logfile and also run in debug mode and paste the output here after an haproxy restart please?

@ProbablyRusty
Author

Okay @sethvargo, I have a debug log covering the time period in which 8 haproxy restarts took place. I had hoped to simplify this log example and show only 1 haproxy restart, but I was not able to reproduce the issue with only a single restart. Due to length, I will paste log lines covering 1 restart at the end of this post. (Looks like business as usual to me - I see no issues.)

In this example, each restart took place about 6-7 seconds after the previous one. Each time, I invoked this restart by taking a node referenced in the template in or out of Consul maintenance mode.

Prior to this example log, one haproxy process was running. After this example log (8 restarts), three haproxy processes were left running permanently.

To be clear, this was the invocation of consul-template for this test:

nohup consul-template -log-level debug -config /etc/haproxy/consul-template.hcl >/var/log/consul-template.log 2>&1 &

Here are the first few lines of the log, showing the config:

nohup: ignoring input
2015/10/21 15:59:14 [DEBUG] (config) loading configs from "/etc/haproxy/consul-template.hcl"
2015/10/21 15:59:14 [DEBUG] (logging) enabling syslog on LOCAL0
2015/10/21 15:59:14 [INFO] consul-template v0.11.0
2015/10/21 15:59:14 [INFO] (runner) creating new runner (dry: false, once: false)
2015/10/21 15:59:14 [DEBUG] (runner) final config (tokens suppressed):

{
  "path": "/etc/haproxy/consul-template.hcl",
  "consul": "127.0.0.1:8500",
  "auth": {
    "enabled": false,
    "username": "",
    "password": ""
  },
  "vault": {
    "renew": true,
    "ssl": {
      "enabled": true,
      "verify": true
    }
  },
  "ssl": {
    "enabled": false,
    "verify": true
  },
  "syslog": {
    "enabled": true,
    "facility": "LOCAL0"
  },
  "max_stale": 1000000000,
  "templates": [
    {
      "source": "/etc/haproxy/haproxy.template",
      "destination": "/etc/haproxy/haproxy.cfg",
      "command": "haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )",
      "perms": 420
    }
  ],
  "retry": 5000000000,
  "wait": {
    "min": 0,
    "max": 0
  },
  "pid_file": "",
  "log_level": "debug"
}

2015/10/21 15:59:14 [INFO] (runner) creating consul/api client
2015/10/21 15:59:14 [DEBUG] (runner) setting consul address to 127.0.0.1:8500

Maybe a red herring, but possibly of interest:

During this 8-restart test, I had a separate haproxy node running with the exact same config (and template) as the node I have logged here. Only difference was that consul-template on that node was not logging. Invocation for consul-template on that node was:

nohup consul-template -config /etc/haproxy/consul-template.hcl >/dev/null 2>&1 &

On this node, after the same 8-restart test, 8 haproxy processes were left running (as opposed to 3 haproxy processes on the logged node). In further tests, extra haproxy processes do seem to stack up much more quickly on this node than on the one that now has debug logging enabled.

I may try to craft a methodology for a simpler, more isolated and controlled test which still shows this behavior. If so, I will post results here.

For now, here is part of the debug log, covering 1 restart:

2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 1 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 1 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [DEBUG] (runner) receiving dependency "service(prod.*redacted*)"
2015/10/21 16:17:55 [INFO] (runner) running
2015/10/21 16:17:55 [DEBUG] (runner) checking template /etc/haproxy/haproxy.template
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] (runner) checking ctemplate &{Source:/etc/haproxy/haproxy.template Destination:/etc/haproxy/haproxy.cfg Command:haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid ) Perms:-rw-r--r--}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 0 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 0 services after health check status filtering
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" no new data (contents were the same)
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") Consul returned 2 services
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") 2 services after health check status filtering
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [INFO] (view) "service(prod.*redacted*)" received data from consul
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (view) "service(prod.*redacted*)" starting fetch
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) wouldRender: true, didRender: true
2015/10/21 16:17:55 [DEBUG] (runner) appending command: haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )
2015/10/21 16:17:55 [INFO] (runner) diffing and updating dependencies
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "key(*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "services" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] ("service(prod.*redacted*)") querying Consul with &{Datacenter: AllowStale:true RequireConsistent:false WaitIndex:1447410 WaitTime:1m0s Token:}
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "key(*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) "service(prod.*redacted*)" is still needed
2015/10/21 16:17:55 [DEBUG] (runner) running command: `haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )`
2015/10/21 16:17:55 [INFO] (runner) watching 35 dependencies
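
When scanning a debug log like this, grepping for just the render and command events cuts it down to the interesting lines - a quick sketch, assuming the log path from the invocation above:

# pull only the render/command events out of the consul-template debug log
grep -E 'wouldRender|appending command|running command' /var/log/consul-template.log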

@ProbablyRusty
Author

@sethvargo I just reverted this node to a former consul-template version and performed this same test (8 restarts) with consul-template v0.10.0, and at the end of the test only 1 haproxy process was running.

By comparison, at the end of this new test the other haproxy node (running v0.11.0 with no logging) had 8 haproxy processes running.

@sethvargo
Contributor

Hi @consultantRR

Can you share your actual Consul Template template please?

@sethvargo
Contributor

The full diff between 0.10.0 and 0.11.0 is here: v0.10.0...v0.11.0. I'm going to try to get a reproduction together, but I have been unsuccessful thus far.

@ProbablyRusty
Author

Here it is, with only two minor redactions:

global
    daemon
    maxconn 4096
    log 127.0.0.1   local0
    daemon

defaults
    log global
    mode http
    timeout connect 5000ms
    timeout client 60000ms
    timeout server 60000ms
    option http-server-close

listen haproxyadmin
    bind *:8999
    stats enable
    stats auth haproxyadmin:{{key "*redacted*"}}

listen http_health_check 0.0.0.0:8080
    mode health
    option httpchk

frontend http_proxy
    bind *:8888
    acl non_ssl hdr(X-Forwarded-Proto) -i http
    redirect prefix {{key "*redacted*"}} code 301 if non_ssl
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
    acl {{$services.Name}} path_reg -i ^\/{{$services.Name}}(\/.*|\?.*)?${{end}}{{end}}{{end}}
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
    use_backend {{$services.Name}} if {{$services.Name}}{{end}}{{end}}{{end}}
{{range services}}{{$services := . }}{{range .Tags}}{{if eq . "microservice"}}
backend {{$services.Name}}{{$this_service := $services.Name | regexReplaceAll "(.+)" "prod.$1"}}
    balance roundrobin{{range service $this_service}}
    server {{.Name}} {{.Address}}:{{.Port}} maxconn 8192{{end}}{{end}}{{end}}
    {{end}}
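
To make the template's output concrete: for a hypothetical service registered in Consul as users with the microservice tag and a single healthy prod-tagged instance at 10.0.0.1:8000, the template above should render fragments roughly like the following (the service name, address, and port are purely illustrative):

    acl users path_reg -i ^\/users(\/.*|\?.*)?$

    use_backend users if users

backend users
    balance roundrobin
    server users 10.0.0.1:8000 maxconn 8192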

@sethvargo
Contributor

@consultantRR just to be clear - you aren't using the vault integration at all, right?

@ProbablyRusty
Author

Not at this time.

The two key references in the template above are Consul KV keys.

@sethvargo
Contributor

Hi @consultantRR

I did some digging today, and I was able to reproduce this issue exactly once, and then it stopped reproducing.

Are you able to reproduce this with something that isn't haproxy? What's interesting to me is that haproxy orphans itself (it's still running even after the command returns and consul template quits), but I wonder if there's a race condition there somehow.

@ProbablyRusty
Author

Hi @sethvargo - I haven't reproduced this with something other than haproxy, but also, I can't say that I have tried. ;)

Just to think out loud about what may be happening, here is the reload command again. (It should be noted that the reload command in the haproxy init.d script in Amazon Linux is basically the exact same thing as this - in fact, this is what I used for months, and I only switched to the explicit command below when I began troubleshooting this newfound problem.)

haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )

So anyway, here is my understanding of how that command works, from TFM:

The '-st' and '-sf' command line options are used to inform previously running
processes that a configuration is being reloaded. They will receive the SIGTTOU
signal to ask them to temporarily stop listening to the ports so that the new
process can grab them. If anything wrong happens, the new process will send
them a SIGTTIN to tell them to re-listen to the ports and continue their normal
work. Otherwise, it will either ask them to finish (-sf) their work then softly
exit, or immediately terminate (-st), breaking existing sessions.

What's interesting is that firing this command manually multiple times from a shell works exactly as expected every time. And what's even more interesting to me is that this behavior doesn't seem to show up with consul-template v0.10.0 - I don't currently have a good idea as to how/why that would work differently in v0.11.0.
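
For what it's worth, one way to watch that soft-stop happen by hand (a sketch using the same paths as the config above and only standard shell tools):

old_pid=$( cat /var/run/haproxy.pid )
haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf "$old_pid"

# kill -0 only checks that the PID still exists; the old process should
# disappear once its in-flight sessions have finished
while kill -0 "$old_pid" 2>/dev/null; do
    echo "old haproxy $old_pid still draining"
    sleep 1
done
echo "old haproxy $old_pid exited"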

@ProbablyRusty
Author

It's anecdotal, but I did observe two instances of v0.11.0 side by side, one with debug logging and one with output sent straight to /dev/null, and both exhibited this behavior - but the one with no logging pretty consistently orphaned more haproxies than the one with debug logging. (Before I switched the first one from no logs to debug logs, it was orphaning processes at pretty much the same rate as the second node.) Anyway, if this is a race condition, maybe the extra overhead of logging does actually affect the behavior.

@ProbablyRusty
Author

There is almost definitely a discrepancy between v0.10.0 and v0.11.0 though. I just checked back on two nodes in the same environment, same config, same consul dc, same template, one with v0.10.0 and one with v0.11.0, and after several hours, one has a single haproxy process running, and the other has 40.

@sethvargo
Contributor

@consultantRR okay - I spent a lot of time on this today, and I have a way to reproduce it.

I am able to reproduce this 100% of the time when /var/run/haproxy.pid:

  1. Does not exist
  2. Exists but is empty
  3. Exists with a PID that isn't valid

I was able to reproduce this under CT master, 0.11.0, 0.10.0, and 0.9.0.

Because of this, I think the version issue is actually a red herring. I think the reason it "works" on CT v0.10.0 is that you already have a running haproxy process on those nodes, and you're trying to use CT v0.11.0 on a node that doesn't have an haproxy instance already running. I could be totally wrong, but that's my only way to reproduce this issue at the moment: if haproxy isn't running, the PID is invalid, and haproxy does something really strange - it hangs onto the subprocess it spawns, but it doesn't hang the parent process, so CT thinks it has exited.
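
If the empty or stale PID file is indeed the trigger, a defensive reload wrapper along these lines would sidestep it by only passing -sf when the file names a live process (a sketch, not a tested fix - it assumes the paths used throughout this thread):

#!/bin/sh
# reload-haproxy.sh: only hand haproxy an -sf argument when the PID file
# actually points at a running process
PIDFILE=/var/run/haproxy.pid
CFG=/etc/haproxy/haproxy.cfg

OLD_PID=$( cat "$PIDFILE" 2>/dev/null )

if [ -n "$OLD_PID" ] && kill -0 "$OLD_PID" 2>/dev/null; then
    exec haproxy -f "$CFG" -q -p "$PIDFILE" -sf "$OLD_PID"
else
    # no live previous process: start fresh instead of passing a bogus PID
    exec haproxy -f "$CFG" -q -p "$PIDFILE"
fi

The template's command stanza would then point at this script instead of the inline haproxy invocation.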

Now, when the PID exists, they are both very happy:

Here is CT v0.11.0

root      1973  0.0  1.8  11440  7056 ?        Sl   04:43   0:00 consul-template -config /etc/haproxy/consul-template.hcl
root      2068  0.0  0.3  12300  1128 ?        Ss   04:44   0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 2046

and here is CT v0.10.0

root      1974  0.0  1.8   9468  6808 ?        Sl   04:42   0:00 consul-template -config /etc/haproxy/consul-template.hcl
root      2057  0.0  0.3  12300  1128 ?        Ss   04:44   0:00 haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf 2035

Obviously the PIDs are different because these are different VMs, but it's the same exact script and configuration on all the machines.

Each time I change the key in Consul, the PID changes (meaning CT successfully reloads the process).

I'm not really sure where to go from here. I'm out of debugging possibilities, and I'm fairly certain this is a problem with the way haproxy issues its reloads.

Please let me know what you think.

@sethvargo
Contributor

Here is the Vagrantfile I am using: https://gist.github.com/sethvargo/e1e81544a769f6df8686

@slackpad
Contributor

nohup consul-template -log-level debug -config /etc/haproxy/consul-template.hcl >/var/log/consul-template.log 2>&1 &

Random sanity check question - is it possible there are multiple consul-template instances running on any of these problem machines?

@ProbablyRusty
Author

First of all, thank you @sethvargo, immensely, for the attention to this and the time spent thus far. It is much, much appreciated, and is a hallmark of HashiCorp and of your work in particular, Seth.

I am not sure what to think about this either.

I can say that my tests today have included:

  1. Starting both CT 10 and 11 with haproxy already running.
  2. Stopping all haproxy processes and restarting haproxy while CT is running.

In all cases CT 11 has exhibited this behavior (with the template I posted above - haven't tried other templates) and in no cases has CT 10 exhibited this behavior.

In fact, when switching back from CT 10 to CT 11, I see this behavior again on the same node. And vice versa (switching from CT 11 to CT 10 eliminates this behavior in all cases on the same node).

One thing I have noticed:

In all cases (again this is always under CT 11 and never under CT 10) in which the node gets into a state in which >1 haproxy processes are erroneously running, exactly 2 invocations of service haproxy stop are needed to stop all haproxy processes, no matter how many are running.

Meaning, the "remedy" for n number of haproxies (as long as n>1) looks like this:

# ps ax | grep haproxy.cfg | grep -v grep | wc -l
6
# service haproxy stop
Stopping haproxy:                                          [  OK  ]
# service haproxy stop
Stopping haproxy:                                          [  OK  ]
# service haproxy stop
Stopping haproxy:                                          [FAILED]
# service haproxy start
Starting haproxy:                                            [  OK  ]
# ps ax | grep haproxy.cfg | grep -v grep | wc -l
1
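
As a blunter cleanup than repeated service haproxy stop calls, something along these lines also works (a sketch; it assumes there are no other haproxy instances on the box that you want to keep):

# kill every haproxy started against this config file, then start one cleanly
pkill -f 'haproxy -f /etc/haproxy/haproxy.cfg'
service haproxy start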

And btw, thanks @slackpad for the sanity check - (sanity is always good!) - no, in all cases referenced above, I can confirm only a single instance of CT running.

@sethvargo
Contributor

Hi @consultantRR

Could you try setting max_stale = 0 in your config and see if that makes any difference?

@ProbablyRusty
Author

To be sure, it didn't seem to like max_stale = 0, @sethvargo:

* error converting max_stale to string at "/etc/haproxy/consul-template.hcl"

But it was OK with max_stale = "0", and sadly, the behavior was the same. Wound up with 3 haproxies after about 12 reloads.

@sethvargo
Contributor

@consultantRR well, I'm officially out of ideas 😦 Maybe @armon or @ryanuber has some?

@zonzamas

My 2 cents:

I have seen orphan haproxy instances frequently (and I'm not using consul-template with haproxy yet). In my case I think it is related to websocket connections kept alive for a long time. If I run netstat -apn | grep PID, I see a dozen established connections.

@zonzamas

In other words, it would be interesting to see whether the 'orphan' haproxy instances are really orphaned, or whether they are just waiting for one side to actually close the connection.

@sethvargo
Contributor

@zonzamas I'm not super familiar with haproxy - is there an easy way for @consultantRR to do that?

@zonzamas

I use standard Linux tools. With PID being the actual PID of an orphan:

netstat -apn | grep PID to get established connections

strace -fp PID to see what the process is actually doing

from netstat -apn I get something like:

tcp        0      0 10.0.1.6:57222          10.0.34.216:5672        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.1.60:58411         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:39279          10.0.32.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.2.206:53716        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:35486          10.0.16.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 127.0.0.1:81            127.0.0.1:56015         ESTABLISHED 18749/haproxy
tcp        0      0 127.0.0.1:81            127.0.0.1:38882         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:57163          10.0.32.60:5290         ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:5672           10.0.2.206:53740        ESTABLISHED 18749/haproxy
tcp        0      0 10.0.1.6:36761          10.0.16.60:5290         ESTABLISHED 18749/haproxy
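
To check all of the leftover processes at once, a small loop over the orphan PIDs does the same thing (a sketch using the same standard tools; the pgrep pattern assumes the config path used in this thread):

# count established connections held by each leftover haproxy process
for pid in $( pgrep -f 'haproxy -f /etc/haproxy/haproxy.cfg' ); do
    count=$( netstat -apn 2>/dev/null | grep ESTABLISHED | grep -c " ${pid}/haproxy" )
    echo "haproxy pid ${pid}: ${count} established connections"
done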

@ProbablyRusty
Author

I will do some controlled testing on this @zonzamas - but offhand, I don't believe this is the issue as:

(1) the processes stay running for hours and only increase in number, they don't decrease
(2) this simply does not happen on any node (for me) running pre-v11 CT
(3) switching CT versions back and forth on the same node (respectively) begins or ends this behavior immediately after the version switch

Actually, I should clarify: when I say "pre-v11 CT", I specifically mean v10. I have not tested versions prior to v10 in the context of this issue.

@sethvargo
Contributor

I'm still at a loss for what might be causing this. I'm going to release CT 0.11.1 without a fix for this because I do not want to delay the release further. Sorry!

@pmbauer

pmbauer commented Oct 27, 2015

Thanks @consultantRR and @sethvargo for the work on this.
We have nothing to add other than that we observed the exact same behavior after upgrading to consul-template v0.11.1 from v0.10.0, after which we reverted back to v0.10.0. We are running haproxy in a container along with consul-template.

@duggan

duggan commented Jan 26, 2016

Until there's an official release from HashiCorp, I have a build of 0.12.2 with the Go 1.6 release (previously beta2/rc1/rc2) via Circle CI, with an included Dockerfile, etc., for those who would like to build it themselves. Some might find it useful.

I'd have used an older build, but I'd already started relying on 0.11 syntax and features 😅

@sboily

sboily commented Jan 29, 2016

Hello, the same problem is resolved for me with consul-template 0.12.2 and Go 1.6rc1. Thank you @duggan for your Dockerfile!

sboily added a commit to sboily/ctid-scale that referenced this issue Jan 29, 2016

ewr added a commit to scpr-cookbooks/scpr-consul-haproxy that referenced this issue Feb 1, 2016:

Use a build of consul-template 0.12.2 compiled with Go 1.6rc1 while waiting
for official builds to get a fix for hashicorp/consul-template#442

Build is from https://github.com/duggan/build-consul-template

@memelet

memelet commented Feb 3, 2016

I don't have much to add to the above, but we upgraded from 0.10.x to 0.12.2 today and ended up with dozens of haproxies.

@brycechesternewman

We tried 0.12.2 on CoreOS 835.12.0 using haproxy 1.6.3 without success. There are no errors in the consul-template logs or haproxy logs. After a few hours our network service (Midokura) started to get overloaded with a large number of SYN requests. We rolled back to 0.10.2 and the network service (Midokura) recovered.

Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
Average:           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth0 111390.07      0.00 124128.80      0.00      0.00      0.00      0.68
Average:         eth1  18888.36      0.00   4970.44      0.00      0.00      0.00      0.68
Average:         eth2   1880.14   7096.23    121.18   9651.90      0.00      0.00      0.00
22:23:30.261584 IP (tos 0x0, ttl 63, id 46818, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.51858 > 172.16.2.13.webcache: Flags [S], cksum 0x5c09 (incorrect -> 0x2383), seq 614922376, win 29200, options [mss 1460,sackOK,TS val 14044778 ecr 0,nop,wscale 7], length 0
22:23:30.261641 IP (tos 0x0, ttl 63, id 52396, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.33782 > 172.16.2.12.webcache: Flags [S], cksum 0x5c08 (incorrect -> 0x1013), seq 4101341894, win 29200, options [mss 1460,sackOK,TS val 14044778 ecr 0,nop,wscale 7], length 0
22:23:30.263228 IP (tos 0x0, ttl 63, id 22526, offset 0, flags [DF], proto TCP (6), length 60)
    172.16.1.173.41282 > 172.16.1.199.webcache: Flags [S], cksum 0x5bc3 (incorrect -> 0xe280), seq 1848711573, win 29200, options [mss 1460,sackOK,TS val 14044779 ecr 0,nop,wscale 7], length 0

@justinclayton

@brycechesternewman unless you used @duggan's custom build against the unreleased Go 1.6, the issue is still there.

@skippy

skippy commented Feb 17, 2016

go 1.6 was released today: https://golang.org/dl/#go1.6

@sethvargo can a new release of consul-template be cut please?

@sethvargo
Contributor

0.13.0 is released and compiled with Go 1.6!

@ProbablyRusty
Author

Our long national nightmare is over. :)

@skippy

skippy commented Feb 18, 2016

thank you @sethvargo !

@hobochili

I'm running the 0.13.0 release pulled from https://releases.hashicorp.com/consul-template/0.13.0/consul-template_0.13.0_linux_amd64.zip and I'm still seeing occurrences of multiple HAProxy processes running after a service reload. Is anybody else still seeing this issue?

@systeminsightsbuild

Yes, just today we found five extra haproxy processes with 0.13.0. Clearly this issue persists. And we have already upgraded production to 0.13.0, so I guess we need to scramble to roll back.


@hobochili

@sethvargo Should I open a new issue to investigate the persistence of this bug, or can we reopen this one?

@victortrac

FWIW, 0.13.0 fixes the problem for me. An haproxy reload is supposed to leave behind the previous process with the -sf [pid] flag to let it finish out its requests. On a busy system that reloads frequently, it's expected to see some number of previous haproxy processes still running. If you refresh the ps command, you should see the PIDs in the haproxy -sf [pid] commands rotate.
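
A quick way to watch that rotation (a sketch using standard tools; the bracketed grep pattern just keeps grep itself out of the output):

# refresh the haproxy process list every second; healthy reloads show the
# -sf PID rotating rather than processes accumulating
watch -n 1 'ps -eo pid,etime,args | grep "[h]aproxy -f /etc/haproxy/haproxy.cfg"'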

@systeminsightsbuild

Victor, it mostly fixed the problem for us as well, at least from when 0.13 was released until now. But it can't be just that, as the old haproxies are still handling requests - well, at least they are issuing 503s, since their view of the services they are routing to is now invalid. And we have never seen this problem prior to 0.12, so clearly something new is happening.


@bodiug

bodiug commented Mar 23, 2016

We've added wait = "1s" to /etc/consul-template/config.d/consul-template.conf to work around this problem for now.
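
For reference, that workaround is just the wait key added to the consul-template config shown earlier in this thread (a sketch; the value is the one reported above, not a recommendation):

consul = "127.0.0.1:8500"

# quiescence timer: let changes settle before re-rendering and running the command
wait = "1s"

template {
  source = "/etc/haproxy/haproxy.template"
  destination = "/etc/haproxy/haproxy.cfg"
  command = "haproxy -f /etc/haproxy/haproxy.cfg -q -p /var/run/haproxy.pid -sf $( cat /var/run/haproxy.pid )"
}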

@sethvargo
Contributor

Hi everyone,

Since this issue seems to be getting a lot of noise, I'm going to clarify that this has been fixed in Consul Template 0.13+. Please open any other concerns as separate issues. haproxy itself will leave behind previous processes for a short period of time to finish existing requests, so unless you see a multitude of haproxy processes running each time CT restarts the process, this issue does not apply to you 😄

hashicorp locked and limited conversation to collaborators Mar 23, 2016