Consul-ESM rewrites check interval/timeout to default values #38

angryp · 2019-04-26T08:07:57Z

Hello!

Versions in use:

consul --version
Consul v1.4.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

consul-esm --version
v0.3.3

Consul members:

Node      Address               Status  Type    Build  Protocol  DC    Segment
consul-1  10.10.10.1:8301       alive   server  1.4.4  2         main  <all>
consul-2  10.10.10.2:8301       alive   server  1.4.4  2         main  <all>
consul-3  10.10.10.3:8301       alive   server  1.4.4  2         main  <all>

Consul-1 configuration is as follows:

{
  "datacenter": "main",
  "data_dir": "/var/consul",
  "log_level": "INFO",
  "log_file": "/var/log/consul/consul.log",
  "node_name": "consul-1",
  "server": true,
  "bind_addr": "10.10.10.1",
  "advertise_addr": "10.10.10.1",
  "client_addr": "0.0.0.0",
  "enable_script_checks": true,
  "recursors": ["127.0.0.1"],
  "telemetry": {
     "disable_hostname": true,
     "prometheus_retention_time": "120s"
  }
}

The consul-2 and consul-3 nodes additionally set the "start_join" and "retry_join" directives to the first node's IP address, so that the Consul nodes can form a cluster. The rest of the configuration is the same, meaning every node acts as a server.

Besides Consul itself, each node runs the consul-esm service. This is the configuration in use on all nodes:

log_level = "INFO"
enable_syslog = false
syslog_facility = ""
consul_service = "consul-esm"
consul_service_tag = ""
consul_kv_path = "consul-esm/"
external_node_meta {
    "external-node" = "true"
}
node_reconnect_timeout = "72h"
node_probe_interval = "10s"
http_addr = "localhost:8500"
token = ""
datacenter = "main"
ca_file = ""
ca_path = ""
cert_file = ""
key_file = ""
tls_server_name = ""
ping_type = "udp"

Flags for launching services are:

/usr/local/bin/consul agent -ui -config-dir=/etc/consul.d -config-file=/etc/consul.json
/usr/local/bin/consul-esm -config-dir=/etc/consul-esm.d -config-file=/etc/consul-esm.hcl

With that said, here are the steps to reproduce the bug. First, register a new node with custom intervals.

curl -X PUT -d '{"Datacenter":"main", "Node":"my.hardware.device", "Address":"my.hardware.device", "Service":{"ID":"my.hardware.device", "Service":"my.hardware.device"}, "NodeMeta":{"external-node":"true", "external-probe":"false", "type":"hardware", "class":"network", "serial":"xxxxx"}, "Checks":[{"Node":"my.hardware.device", "CheckID":"firstcheck", "Name":"firstcheck", "Notes":"", "Status":"warning", "Definition":{"HTTP":"http://consul.check.node:8081", "Interval":"60s", "Timeout":"10s", "Method":"GET", "Header":{"hostname":["my.hardware.device"]}}}, {"Node":"my.hardware.device", "CheckID":"secondcheck", "Name":"secondcheck", "Notes":"", "Status":"warning", "Definition":{"HTTP":"http://consul.check.node:8082", "Interval":"60s", "Timeout":"10s", "Method":"GET", "Header":{"hostname":["my.hardware.device"]}}}]}' http://consul-1:8500/v1/catalog/register

Second, verify that the check configuration is correct. Note that the interval and timeout are still present at this point.

curl http://consul-1:8500/v1/health/node/my.hardware.device

[{"Node":"my.hardware.device","CheckID":"firstcheck","Name":"firstcheck","Status":"warning","Notes":"","Output":"","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{"Interval":"1m0s","Timeout":"10s","HTTP":"http://consul.check.node:8081","Header":{"hostname":["my.hardware.device"]},"Method":"GET"},"CreateIndex":19510337,"ModifyIndex":19510337},{"Node":"my.hardware.device","CheckID":"secondcheck","Name":"secondcheck","Status":"warning","Notes":"","Output":"","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{"Interval":"1m0s","Timeout":"10s","HTTP":"http://consul.check.node:8082","Header":{"hostname":["my.hardware.device"]},"Method":"GET"},"CreateIndex":19510337,"ModifyIndex":19510337}]

Finally, wait 1 minute and query the health checks once again. Note that the interval and timeout settings are now absent, even though the checks have run and reported results.

curl http://consul-1:8500/v1/health/node/my.hardware.device

[{"Node":"my.hardware.device","CheckID":"firstcheck","Name":"firstcheck","Status":"passing","Notes":"","Output":"HTTP GET http://consul.check.node:8081: 200 OK Output: There is a host","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{"HTTP":"http://consul.check.node:8081","Header":{"hostname":["my.hardware.device"]},"Method":"GET"},"CreateIndex":19510337,"ModifyIndex":19510342},{"Node":"my.hardware.device","CheckID":"secondcheck","Name":"secondcheck","Status":"critical","Notes":"","Output":"HTTP GET http://consul.check.node:8082: 404 Not Found Output: There is no host","ServiceID":"","ServiceName":"","ServiceTags":[],"Definition":{"HTTP":"http://consul.check.node:8082","Header":{"hostname":["my.hardware.device"]},"Method":"GET"},"CreateIndex":19510337,"ModifyIndex":19510348}]
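
The difference between the two responses above can be confirmed mechanically. The following is a minimal sketch, not part of the original report: the helper name `dropped_definition_fields` is hypothetical, and the sample data is trimmed to the relevant fields of "firstcheck" from the two outputs.

```python
import json


def dropped_definition_fields(before, after, fields=("Interval", "Timeout")):
    """Return {CheckID: [fields present in `before` but absent in `after`]}.

    `before` and `after` are parsed /v1/health/node/<node> responses.
    """
    after_by_id = {c["CheckID"]: c.get("Definition", {}) for c in after}
    dropped = {}
    for check in before:
        old = check.get("Definition", {})
        new = after_by_id.get(check["CheckID"], {})
        lost = [f for f in fields if f in old and f not in new]
        if lost:
            dropped[check["CheckID"]] = lost
    return dropped


# Trimmed copies of the two responses (only the fields that matter here).
before = json.loads(
    '[{"CheckID":"firstcheck","Definition":{"Interval":"1m0s","Timeout":"10s",'
    '"HTTP":"http://consul.check.node:8081","Method":"GET"}}]'
)
after = json.loads(
    '[{"CheckID":"firstcheck","Definition":'
    '{"HTTP":"http://consul.check.node:8081","Method":"GET"}}]'
)

print(dropped_definition_fields(before, after))
# {'firstcheck': ['Interval', 'Timeout']}
```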

In fact, the checks are now executed with the default interval, as seen in the HTTP server log:

10.10.10.2 - - [26/Apr/2019:07:59:57 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:00:28 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:01:07 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:01:38 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:02:08 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
10.10.10.2 - - [26/Apr/2019:08:02:39 +0000] "GET / HTTP/1.1" 404 47 "-" "Consul Health Check"
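
To make the spacing explicit, the gaps between consecutive requests can be computed from the timestamps. This is only a sketch over the six log lines above (shortened to the timestamp portion); the helper name `request_gaps` is hypothetical.

```python
import re
from datetime import datetime

# Timestamp portion of the six access-log lines quoted above.
LOG_LINES = [
    '10.10.10.2 - - [26/Apr/2019:07:59:57 +0000] "GET / HTTP/1.1" 404 47',
    '10.10.10.2 - - [26/Apr/2019:08:00:28 +0000] "GET / HTTP/1.1" 404 47',
    '10.10.10.2 - - [26/Apr/2019:08:01:07 +0000] "GET / HTTP/1.1" 404 47',
    '10.10.10.2 - - [26/Apr/2019:08:01:38 +0000] "GET / HTTP/1.1" 404 47',
    '10.10.10.2 - - [26/Apr/2019:08:02:08 +0000] "GET / HTTP/1.1" 404 47',
    '10.10.10.2 - - [26/Apr/2019:08:02:39 +0000] "GET / HTTP/1.1" 404 47',
]

TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def request_gaps(lines):
    """Return the number of seconds between consecutive log entries."""
    times = [
        datetime.strptime(TIMESTAMP.search(line).group(1), "%d/%b/%Y:%H:%M:%S %z")
        for line in lines
    ]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(request_gaps(LOG_LINES))
# [31, 39, 31, 30, 31]
```

The requests arrive roughly every 30 seconds rather than at the 60s interval that was registered.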

Let me know if you need any more information.

lornasong · 2019-12-12T22:46:06Z

Hi @angryp, sincere apologies for the late reply. Thanks so much for the details of your issue.

I was able to reproduce your issue where the custom check definition interval disappears after a health check when using Consul version 1.4.4.

It looks like this issue has been resolved in versions 1.5.0 and onwards. More specifically it looks like it was resolved by this pull request: hashicorp/consul#5553.

In your example, the entities are registered with checks whose status is warning. When a health check runs and the status changes, the status update is sent to the transaction API, which was what erased the interval and timeout values. The issue behind the PR linked above, hashicorp/consul#5477, describes a problem similar to yours.
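
For anyone who wants to check whether an agent already includes the fix, here is a rough sketch that parses the first line of `consul --version` output, as shown in the report, and compares it against 1.5.0. The function name `consul_has_fix` is hypothetical, and the parsing assumes the "Consul vX.Y.Z" format seen above.

```python
import re


def consul_has_fix(version_output, fixed_in=(1, 5, 0)):
    """Return True if the reported Consul version is at or above `fixed_in`.

    `version_output` is the text printed by `consul --version`.
    """
    match = re.search(r"Consul v(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a 'Consul vX.Y.Z' version string")
    version = tuple(int(part) for part in match.groups())
    return version >= fixed_in


print(consul_has_fix("Consul v1.4.4"))  # False
print(consul_has_fix("Consul v1.5.0"))  # True
```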

Hope this helps. Thanks!

eikenb added the bug label Oct 3, 2019

lornasong closed this as completed Dec 12, 2019
