Consul-ESM rewrites check interval/timeout to default values #38

Closed
angryp opened this issue Apr 26, 2019 · 1 comment

angryp commented Apr 26, 2019

Hello!

Versions in use:

consul --version
Consul v1.4.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

consul-esm --version
v0.3.3

Consul members:

Node      Address               Status  Type    Build  Protocol  DC    Segment
consul-1  10.10.10.1:8301       alive   server  1.4.4  2         main  <all>
consul-2  10.10.10.2:8301       alive   server  1.4.4  2         main  <all>
consul-3  10.10.10.3:8301       alive   server  1.4.4  2         main  <all>

Consul-1 configuration is as follows:

{
  "datacenter": "main",
  "data_dir": "/var/consul",
  "log_level": "INFO",
  "log_file": "/var/log/consul/consul.log",
  "node_name": "consul-1",
  "server": true,
  "bind_addr": "10.10.10.1",
  "advertise_addr": "10.10.10.1",
  "client_addr": "0.0.0.0",
  "enable_script_checks": true,
  "recursors": ["127.0.0.1"],
  "telemetry": {
     "disable_hostname": true,
     "prometheus_retention_time": "120s"
  }
}

The consul-2 and consul-3 nodes additionally set "start_join" and "retry_join" to consul-1's IP address so that the nodes form a cluster. The rest of their configuration is identical, so every node acts as a server.

In addition to Consul itself, each node runs the consul-esm service with the following configuration:

log_level = "INFO"
enable_syslog = false
syslog_facility = ""
consul_service = "consul-esm"
consul_service_tag = ""
consul_kv_path = "consul-esm/"
external_node_meta {
    "external-node" = "true"
}
node_reconnect_timeout = "72h"
node_probe_interval = "10s"
http_addr = "localhost:8500"
token = ""
datacenter = "main"
ca_file = ""
ca_path = ""
cert_file = ""
key_file = ""
tls_server_name = ""
ping_type = "udp"

The services are launched with these flags:

/usr/local/bin/consul agent -ui -config-dir=/etc/consul.d -config-file=/etc/consul.json
/usr/local/bin/consul-esm -config-dir=/etc/consul-esm.d -config-file=/etc/consul-esm.hcl

Here are the steps to reproduce the bug. First, register a new external node whose checks use a custom interval and timeout.

curl -X PUT -d '{"Datacenter":"main", "Node":"my.hardware.device", "Address":"my.hardware.device", "Service":{"ID":"my.hardware.device", "Service":"my.hardware.device"}, "NodeMeta":{"external-node":"true", "external-probe":"false", "type":"hardware", "class":"network", "serial":"xxxxx"}, "Checks":[{"Node":"my.hardware.device", "CheckID":"firstcheck", "Name":"firstcheck", "Notes":"", "Status":"warning", "Definition":{"HTTP":"http://consul.check.node:8081", "Interval":"60s", "Timeout":"10s", "Method":"GET", "Header":{"hostname":["my.hardware.device"]}}}, {"Node":"my.hardware.device", "CheckID":"secondcheck", "Name":"secondcheck", "Notes":"", "Status":"warning", "Definition":{"HTTP":"http://consul.check.node:8082", "Interval":"60s", "Timeout":"10s", "Method":"GET", "Header":{"hostname":["my.hardware.device"]}}}]}' http://consul-1:8500/v1/catalog/register
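The curl one-liner above is hard to read. As a sketch, the same payload can be built in Python so that the structure of the two HTTP checks (each with Interval "60s" and Timeout "10s") is easier to see. The helper names http_check, build_registration and register are hypothetical, not part of Consul or consul-esm, and register assumes the catalog endpoint at http://consul-1:8500 used by the curl command.

```python
import json
import urllib.request


def http_check(node, check_id, url, interval="60s", timeout="10s"):
    """Hypothetical helper: one HTTP check entry with the fields used in the curl command."""
    return {
        "Node": node,
        "CheckID": check_id,
        "Name": check_id,
        "Notes": "",
        "Status": "warning",
        "Definition": {
            "HTTP": url,
            "Interval": interval,
            "Timeout": timeout,
            "Method": "GET",
            "Header": {"hostname": [node]},
        },
    }


def build_registration(node):
    """Hypothetical helper: the full catalog registration payload for one external node."""
    return {
        "Datacenter": "main",
        "Node": node,
        "Address": node,
        "Service": {"ID": node, "Service": node},
        "NodeMeta": {
            "external-node": "true",
            "external-probe": "false",
            "type": "hardware",
            "class": "network",
            "serial": "xxxxx",
        },
        "Checks": [
            http_check(node, "firstcheck", "http://consul.check.node:8081"),
            http_check(node, "secondcheck", "http://consul.check.node:8082"),
        ],
    }


def register(payload, base_url="http://consul-1:8500"):
    """PUT the payload to /v1/catalog/register, as the curl command does (assumed address)."""
    req = urllib.request.Request(
        base_url + "/v1/catalog/register",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Print the payload instead of sending it; call register(...) against a live agent.
    print(json.dumps(build_registration("my.hardware.device"), indent=2))
```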

Second, verify that the check definitions were stored correctly. At this point both the interval (echoed back as "1m0s") and the timeout are still present.

curl http://consul-1:8500/v1/health/node/my.hardware.device

[{"Node":"my.hardware.device","CheckID":"firstcheck","Name":"firstcheck","Status":"warning","Notes":"","Output":"","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{"Interval":"1m0s","Timeout":"10s","HTTP":"http://consul.check.node:8081","Header":{"hostname":["my.hardware.device"]},"Method":"GET"},"CreateIndex":19510337,"ModifyIndex":19510337},{"Node":"my.hardware.device","CheckID":"secondcheck","Name":"secondcheck","Status":"warning","Notes":"","Output":"","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{"Interval":"1m0s","Timeout":"10s","HTTP":"http://consul.check.node:8082","Header":{"hostname":["my.hardware.device"]},"Method":"GET"},"CreateIndex":19510337,"ModifyIndex":19510337}]
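The interval comes back as "1m0s" rather than "60s". This looks like Go's standard formatting of a 60-second duration, so the stored value itself is unchanged. A small, simplified parser (the hypothetical go_duration_seconds, handling only the h, m, s and ms units) makes the equivalence explicit:

```python
import re

# Seconds per unit; a simplified subset of the units Go's time.ParseDuration accepts.
_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed before "m" so that "10ms" is not read as "10m" followed by "s".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def go_duration_seconds(text):
    """Convert a Go-style duration string such as "1m0s" or "60s" to seconds."""
    total = 0.0
    pos = 0
    for match in _TOKEN.finditer(text):
        # Reject any characters between tokens, e.g. "1m x0s".
        if match.start() != pos:
            raise ValueError(f"unparseable duration: {text!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    # Reject empty input and trailing garbage.
    if pos == 0 or pos != len(text):
        raise ValueError(f"unparseable duration: {text!r}")
    return total


print(go_duration_seconds("1m0s") == go_duration_seconds("60s"))  # True
```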

Finally, wait one minute and query the health checks again. The checks have now run (one is passing, the other critical), but the Interval and Timeout fields are missing from both definitions.

curl http://consul-1:8500/v1/health/node/my.hardware.device

[{"Node":"my.hardware.device","CheckID":"firstcheck","Name":"firstcheck","Status":"passing","Notes":"","Output":"HTTP GET http://consul.check.node:8081: 200 OK Output: There is a host","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{"HTTP":"http://consul.check.node:8081","Header":{"hostname":["my.hardware.device"]},"Method":"GET"},"CreateIndex":19510337,"ModifyIndex":19510342},{"Node":"my.hardware.device","CheckID":"secondcheck","Name":"secondcheck","Status":"critical","Notes":"","Output":"HTTP GET http://consul.check.node:8082: 404 Not Found Output: There is no host","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{"HTTP":"http://consul.check.node:8082","Header":{"hostname":["my.hardware.device"]},"Method":"GET"},"CreateIndex":19510337,"ModifyIndex":19510348}]
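To spot the regression without reading the JSON by eye, one can compare the Definition objects from the two responses. The sketch below uses a hypothetical dropped_fields helper on trimmed copies of the two responses above, keeping only CheckID and Definition:

```python
import json

# Trimmed copies of the responses before and after the first check run.
BEFORE = """[
  {"CheckID": "firstcheck", "Definition": {"Interval": "1m0s", "Timeout": "10s", "HTTP": "http://consul.check.node:8081"}},
  {"CheckID": "secondcheck", "Definition": {"Interval": "1m0s", "Timeout": "10s", "HTTP": "http://consul.check.node:8082"}}
]"""
AFTER = """[
  {"CheckID": "firstcheck", "Definition": {"HTTP": "http://consul.check.node:8081"}},
  {"CheckID": "secondcheck", "Definition": {"HTTP": "http://consul.check.node:8082"}}
]"""


def dropped_fields(before_json, after_json, fields=("Interval", "Timeout")):
    """Return {CheckID: [fields]} for checks whose Definition lost any of `fields`."""
    before = {c["CheckID"]: c.get("Definition", {}) for c in json.loads(before_json)}
    after = {c["CheckID"]: c.get("Definition", {}) for c in json.loads(after_json)}
    result = {}
    for check_id, old_def in before.items():
        new_def = after.get(check_id, {})
        missing = [f for f in fields if f in old_def and f not in new_def]
        if missing:
            result[check_id] = missing
    return result


print(dropped_fields(BEFORE, AFTER))
# {'firstcheck': ['Interval', 'Timeout'], 'secondcheck': ['Interval', 'Timeout']}
```

Fed the full responses from the two curl calls, it should report the same two checks.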

From then on, the checks run at the default interval instead of the configured 60 seconds, as the HTTP server log shows (requests arrive roughly every 30 seconds):

10.10.10.2 - - [26/Apr/2019:07:59:57 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:00:28 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:01:07 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:01:38 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:02:08 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:02:39 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
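The gaps between requests can be computed directly from the log timestamps. This sketch assumes the common log format shown above and parses the bracketed timestamp on each line:

```python
from datetime import datetime

LOG_LINES = [
    '10.10.10.2 - - [26/Apr/2019:07:59:57 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"',
    '10.10.10.2 - - [26/Apr/2019:08:00:28 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"',
    '10.10.10.2 - - [26/Apr/2019:08:01:07 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"',
    '10.10.10.2 - - [26/Apr/2019:08:01:38 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"',
    '10.10.10.2 - - [26/Apr/2019:08:02:08 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"',
    '10.10.10.2 - - [26/Apr/2019:08:02:39 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"',
]


def request_times(lines):
    """Parse the timestamp between '[' and ']' on each common-log-format line."""
    return [
        datetime.strptime(line.split("[", 1)[1].split("]", 1)[0], "%d/%b/%Y:%H:%M:%S %z")
        for line in lines
    ]


def gaps_seconds(times):
    """Whole seconds between consecutive requests."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = gaps_seconds(request_times(LOG_LINES))
print(gaps)                   # [31, 39, 31, 30, 31]
print(sum(gaps) / len(gaps))  # 32.4
```

The gaps average about 32 seconds, roughly half the configured 60-second interval, which is consistent with the checks having fallen back to a default interval.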

Let me know if you need any more information.

eikenb added the bug label on Oct 3, 2019
lornasong (Member) commented:

Hi @angryp, sincere apologies for the late reply. Thanks so much for the details of your issue.

I was able to reproduce the issue with Consul 1.4.4: the custom interval in the check definition disappears after a health check runs.

It looks like this issue is resolved in Consul 1.5.0 and later, specifically by pull request hashicorp/consul#5553.
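Since the fix shipped in Consul 1.5.0, a quick way to tell whether a cluster may be affected is to compare the server version against 1.5.0. A minimal sketch, assuming the "consul --version" output format shown at the top of this issue; the function names are hypothetical, and treating every release below 1.5.0 as affected is an assumption, since only 1.4.4 was confirmed here.

```python
import re

# Per the comment above, the fix landed in Consul 1.5.0.
FIXED_IN = (1, 5, 0)


def parse_consul_version(output):
    """Extract (major, minor, patch) from 'consul --version' output, e.g. 'Consul v1.4.4'."""
    match = re.search(r"Consul v(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no Consul version found in output")
    return tuple(int(part) for part in match.groups())


def potentially_affected(output):
    """Assumption: any release older than 1.5.0 may drop Interval/Timeout on status updates."""
    return parse_consul_version(output) < FIXED_IN


print(potentially_affected("Consul v1.4.4\nProtocol 2 spoken by default"))  # True
```

The tuple comparison handles multi-digit components correctly, so "1.10.1" compares as newer than "1.5.0".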

In your example, the checks are registered with the status warning. When a health check runs and its status changes, the resulting status update is sent to the transaction API, and that update was what erased the interval and timeout values. The issue behind the linked PR, hashicorp/consul#5477, describes a problem similar to yours.

Hope this helps. Thanks!
