
Service checks revert to failed state from server #818

Closed
alpduhuez opened this issue Mar 26, 2015 · 3 comments
Comments

@alpduhuez

I am seeing a case where a script service check runs, gets stuck in the error state, and will not go back to healthy. If I run the check directly it is fine: I can start and stop the service and watch the check script report the right value. But each time it is run by Consul, it reports an error. The only way to fix it is to restart Consul on that agent; the check then passes again for a little while but starts failing again.

I have 3 servers and just 6 agents at the moment, running Consul 0.5.

Poking around in the data.mdb file on one of the servers, I can see the old error state stored there as the reason the service is failed. That stale error is what the UI reports as the reason the service is failing.

Observed:
The service check returns 0 on the agent, but after a crash, a failed state cached on the servers overwrites the passing check on the agent. The agent has to be restarted to correct the service check.
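
As a side note for anyone debugging the same divergence: one way to see the agent-local view and the servers' catalog view side by side is to query both HTTP endpoints and compare the check status. This is only a minimal sketch; the address and port are taken from the agent config below, and jq is assumed to be available purely for readability.

# Local view: checks as this agent currently sees them
curl -s http://192.168.51.9:8500/v1/agent/checks | jq .

# Catalog view: the check status the servers are reporting for the service
curl -s http://192.168.51.9:8500/v1/health/checks/manifest-service-redis | jq .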

Agent Config

{
  "data_dir": "/app/consul.d/data",
  "retry_join": [
    "testconsul01",
    "testconsul02",
    "testconsul03"
  ],
  "ui_dir": "/app/consul.d/data/ui",
  "client_addr": "192.168.51.9",
  "datacenter": "test",
  "log_level": "DEBUG",
  "ports": {
    "dns": 8600,
    "http": 8500,
    "rpc": 8400,
    "serf_lan": 8301,
    "serf_wan": 8302,
    "server": 8300
  },
  "enable_debug": true
}

Service Config

{
  "service": {
    "name": "manifest-service-redis",
    "id": "manifest-service-redis",
    "port": 6379,
    "tags": [
      "redis-cache",
      "agent",
      "consul_service_name"
    ],
    "check": {
      "id": "redis-alive",
      "name": "Redis Alive",
      "interval": "2s",
      "script": "/app/consul.d/checks/redis-check.rb"
    }
  }
}
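
For context on the check itself: Consul script checks map the script's exit code to a status, with 0 meaning passing, 1 meaning warning, and any other exit code meaning critical. The "is now warning" lines in the logs below suggest the script is exiting with status 1 when Consul runs it, even though it exits 0 when run by hand. The contents of redis-check.rb are not shown in this issue, so the following is only a hypothetical shell equivalent illustrating the exit-code convention; it assumes redis-cli is on the agent's PATH.

#!/bin/sh
# Hypothetical stand-in for /app/consul.d/checks/redis-check.rb:
# exit 0 = passing, exit 1 = warning, anything else = critical.
if redis-cli -p 6379 ping 2>/dev/null | grep -q PONG; then
  exit 0   # Redis answered PING, so the check passes
else
  exit 2   # Redis did not respond, so the check goes critical
fi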

Server Config

{
  "data_dir": "/app/consul.d/data",
  "server": true,
  "bootstrap_expect": 3,
  "retry_join": [
    "testconsul01",
    "testconsul02",
    "testconsul03"
  ],
  "ui_dir": "/app/consul.d/data/ui",
  "client_addr": "192.168.50.231",
  "datacenter": "test",
  "log_level": "DEBUG",
  "ports": {
    "dns": 8600,
    "http": 8500,
    "rpc": 8400,
    "serf_lan": 8301,
    "serf_wan": 8302,
    "server": 8300
  },
  "enable_debug": true
}

Agent Monitor Log
You can see that the check was failing. From another SSH session, all I did was restart Consul, and the local check took precedence.

2015/03/26 11:54:44 [INFO] agent.rpc: Accepted client: 192.168.51.9:46997
2015/03/26 11:54:44 [WARN] agent: Check 'service:manifest-service-redis' is now warning
2015/03/26 11:54:46 [WARN] agent: Check 'service:manifest-service-redis' is now warning
^C[brandon.man@testmanrs02 ~]sudo /etc/init.d/consul stop
Stopping consul....
consul stopped
[brandon.man@testmanrs02 ~]$ sudo /etc/init.d/consul start
Starting consul
[brandon.man@testmanrs02 ~]$ /app/consul.d/consul monitor -rpc-addr 192.168.51.9:8400
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testmanrs02.corp.company.com 192.168.51.9
2015/03/26 11:55:02 [WARN] Service tag "consul_service_name" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2015/03/26 11:55:02 [INFO] serf: Attempting re-join to previously known node: testconsul02.corp.company.com: 192.168.50.231:8301
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testconsul01.corp.company.com 192.168.50.238
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testmanifest04.corp.company.com 192.168.50.202
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testconsul03.corp.company.com 192.168.50.232
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testconsul02.corp.company.com 192.168.50.231
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testjake03.corp.company.com 192.168.50.239
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testmanrs01.corp.company.com 192.168.51.7
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testmanifest03.corp.company.com 192.168.50.201
2015/03/26 11:55:02 [INFO] serf: Re-joined to previously known node: testconsul02.corp.company.com: 192.168.50.231:8301
2015/03/26 11:55:02 [INFO] consul: adding server testconsul01.corp.company.com (Addr: 192.168.50.238:8300) (DC: test)
2015/03/26 11:55:02 [INFO] consul: adding server testconsul03.corp.company.com (Addr: 192.168.50.232:8300) (DC: test)
2015/03/26 11:55:02 [INFO] consul: adding server testconsul02.corp.company.com (Addr: 192.168.50.231:8300) (DC: test)
2015/03/26 11:55:02 [INFO] agent: Joining cluster...
2015/03/26 11:55:02 [INFO] agent: (LAN) joining: [testconsul01 testconsul02 testconsul03]
2015/03/26 11:55:02 [INFO] agent: (LAN) joined: 3 Err: <nil>
2015/03/26 11:55:02 [INFO] agent: Join completed. Synced with 3 initial agents
2015/03/26 11:55:02 [INFO] agent: Synced service 'manifest-service-redis'
2015/03/26 11:55:03 [INFO] agent: Synced check 'service:manifest-service-redis'
2015/03/26 11:55:04 [INFO] agent.rpc: Accepted client: 192.168.51.9:47015

Crashing Agent Monitor Log
You can see in this log what looks like a crash. After that, the service check is in error. This seems to be a case where the stale server data takes precedence; only restarting the agent's Consul process will correct it.

^C[brandon.man@testmanrs01 ~]$ /app/consul.d/consul monitor -rpc-addr 192.168.51.7:8400
2015/03/26 11:44:36 [INFO] serf: EventMemberLeave: testjake03.corp.company.com 192.168.50.239
2015/03/26 11:44:37 [INFO] serf: EventMemberJoin: testjake03.corp.company.com 192.168.50.239
2015/03/26 11:45:17 [INFO] serf: EventMemberLeave: testconsul03.corp.company.com 192.168.50.232
2015/03/26 11:45:17 [INFO] consul: removing server testconsul03.corp.company.com (Addr: 192.168.50.232:8300) (DC: test)
2015/03/26 11:45:18 [INFO] serf: EventMemberJoin: testconsul03.corp.company.com 192.168.50.232
2015/03/26 11:45:18 [INFO] consul: adding server testconsul03.corp.company.com (Addr: 192.168.50.232:8300) (DC: test)
2015/03/26 11:45:55 [INFO] serf: EventMemberLeave: testmanrs02.corp.company.com 192.168.51.9
2015/03/26 11:45:56 [INFO] serf: EventMemberJoin: testmanrs02.corp.company.com 192.168.51.9
2015/03/26 11:48:31 [INFO] agent.rpc: Accepted client: 192.168.51.7:41387
2015/03/26 11:54:56 [INFO] serf: EventMemberLeave: testmanrs02.corp.company.com 192.168.51.9
2015/03/26 11:55:02 [INFO] serf: EventMemberJoin: testmanrs02.corp.company.com 192.168.51.9
2015/03/26 11:59:33 [INFO] consul: client starting leave
2015/03/26 11:59:34 [INFO] serf: EventMemberLeave: testmanrs01.corp.company.com 192.168.51.7

Remote side ended the monitor! This usually means that the
remote side has exited or crashed.
[brandon.man@testmanrs01 ~]$ /app/consul.d/consul monitor -rpc-addr 192.168.51.7:8400
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: Ignoring previous leave in snapshot
2015/03/26 11:59:35 [INFO] serf: EventMemberJoin: testmanrs01.corp.company.com 192.168.51.7
2015/03/26 11:59:35 [WARN] Service tag "consul_service_name" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2015/03/26 11:59:35 [INFO] serf: Attempting re-join to previously known node: testmanifest04.corp.company.com: 192.168.50.202:8301
2015/03/26 11:59:35 [INFO] serf: EventMemberJoin: testconsul02.corp.company.com 192.168.50.231
2015/03/26 11:59:35 [INFO] serf: EventMemberJoin: testconsul01.corp.company.com 192.168.50.238
2015/03/26 11:59:35 [INFO] serf: EventMemberJoin: testmanifest04.corp.company.com 192.168.50.202
2015/03/26 11:59:35 [INFO] serf: EventMemberJoin: testjake03.corp.company.com 192.168.50.239
2015/03/26 11:59:35 [INFO] serf: EventMemberJoin: testmanifest03.corp.company.com 192.168.50.201
2015/03/26 11:59:35 [WARN] memberlist: Refuting a suspect message (from: testmanrs01.corp.company.com)
2015/03/26 11:59:35 [INFO] serf: EventMemberJoin: testconsul03.corp.company.com 192.168.50.232
2015/03/26 11:59:35 [INFO] serf: EventMemberJoin: testmanrs02.corp.company.com 192.168.51.9
2015/03/26 11:59:35 [INFO] serf: Re-joined to previously known node: testmanifest04.corp.company.com: 192.168.50.202:8301
2015/03/26 11:59:35 [INFO] consul: adding server testconsul02.corp.company.com (Addr: 192.168.50.231:8300) (DC: test)
2015/03/26 11:59:35 [INFO] consul: adding server testconsul01.corp.company.com (Addr: 192.168.50.238:8300) (DC: test)
2015/03/26 11:59:35 [INFO] consul: adding server testconsul03.corp.company.com (Addr: 192.168.50.232:8300) (DC: test)
2015/03/26 11:59:35 [INFO] agent: Joining cluster...
2015/03/26 11:59:35 [INFO] agent: (LAN) joining: [testconsul01 testconsul02 testconsul03]
2015/03/26 11:59:35 [INFO] agent: Synced service 'manifest-service-redis'
2015/03/26 11:59:35 [INFO] agent: (LAN) joined: 3 Err: <nil>
2015/03/26 11:59:35 [INFO] agent: Join completed. Synced with 3 initial agents
2015/03/26 11:59:35 [WARN] Service tag "consul_service_name" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2015/03/26 11:59:35 [INFO] agent: Synced service 'manifest-service-redis'
2015/03/26 11:59:35 [WARN] agent: Check 'service:manifest-service-redis' is now warning
2015/03/26 11:59:35 [INFO] agent: Synced check 'service:manifest-service-redis'
2015/03/26 11:59:37 [WARN] agent: Check 'service:manifest-service-redis' is now warning
2015/03/26 11:59:39 [WARN] agent: Check 'service:manifest-service-redis' is now warning
2015/03/26 11:59:41 [WARN] agent: Check 'service:manifest-service-redis' is now warning
2015/03/26 11:59:43 [WARN] agent: Check 'service:manifest-service-redis' is now warning
2015/03/26 11:59:45 [WARN] agent: Check 'service:manifest-service-redis' is now warning
2015/03/26 11:59:47 [WARN] agent: Check 'service:manifest-service-redis' is now warning
2015/03/26 11:59:49 [WARN] agent: Check 'service:manifest-service-redis' is now warning
@alpduhuez (Author)

I just got back from vacation and thought I would check in for thoughts on what I am seeing. I restarted all the nodes before I left and everything was in good order, but I found the cluster back in the same state today, with the service check failed even though Redis is running fine. Here is the monitor output from one of the three servers.

This output is from testconsul02. It has testconsul01 and testconsul03 in its member list, but the servers are unable to elect a leader.

Should I just clear all the nodes' data directories?
Should I use a different config for the servers?
It looks like Consul is restarted on each Chef run. If all the servers are restarted at close to the same time, could that cause this?

Any thoughts? Am I missing something simple?

Thanks!

testconsul02 server output

2015/04/07 12:39:25 [WARN] memberlist: Refuting a suspect message (from: testconsul02.corp.company.com)
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: EventMemberJoin: testmanrs01.corp.company.com 192.168.51.7
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: EventMemberJoin: testmanrs02.corp.company.com 192.168.51.9
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: EventMemberJoin: testconsul01.corp.company.com 192.168.50.238
2015/04/07 12:39:25 [INFO] serf: EventMemberJoin: testconsul03.corp.company.com 192.168.50.232
2015/04/07 12:39:25 [INFO] consul: adding server testconsul01.corp.company.com (Addr: 192.168.50.238:8300) (DC: test)
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: Re-joined to previously known node: testmanifest01: 192.168.48.154:8301
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] consul: adding server testconsul03.corp.company.com (Addr: 192.168.50.232:8300) (DC: test)
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: Ignoring previous leave in snapshot
2015/04/07 12:39:25 [INFO] serf: EventMemberJoin: testconsul02.corp.company.com.test 192.168.50.231
2015/04/07 12:39:25 [WARN] serf: Failed to re-join any previously known node
2015/04/07 12:39:25 [INFO] consul: adding server testconsul02.corp.company.com.test (Addr: 192.168.50.231:8300) (DC: test)
2015/04/07 12:39:25 [ERR] agent: failed to sync remote state: No cluster leader
2015/04/07 12:39:25 [INFO] agent: Joining cluster...
2015/04/07 12:39:25 [INFO] agent: (LAN) joining: [testconsul01 testconsul02 testconsul03]
2015/04/07 12:39:25 [INFO] agent: (LAN) joined: 3 Err: <nil>
2015/04/07 12:39:25 [INFO] agent: Join completed. Synced with 3 initial agents
2015/04/07 12:39:27 [WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
2015/04/07 12:39:51 [ERR] agent: failed to sync remote state: No cluster leader
2015/04/07 12:39:53 [INFO] agent.rpc: Accepted client: 192.168.50.231:59725
2015/04/07 12:40:17 [ERR] agent: failed to sync remote state: No cluster leader
2015/04/07 12:40:46 [ERR] agent: failed to sync remote state: No cluster leader
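
The "EnableSingleNode disabled, and no known peers. Aborting election" and "No cluster leader" lines indicate the servers have lost their Raft peer set, which is the situation Consul's outage recovery guide covers. The following is only a rough sketch of the recovery procedure documented for Consul of this era, using the data directory, server addresses, and Raft port (8300) from the configs and logs above; verify the exact steps against the official guide before running anything.

# 1. Stop Consul on all three servers.
sudo /etc/init.d/consul stop

# 2. On each server, write the full peer set into raft/peers.json
#    (a JSON array of "ip:port" entries pointing at every server's Raft port).
cat > /app/consul.d/data/raft/peers.json <<'EOF'
["192.168.50.238:8300","192.168.50.231:8300","192.168.50.232:8300"]
EOF

# 3. Start the servers again and wait for a leader to be elected.
sudo /etc/init.d/consul start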

@ryanuber (Member) commented May 6, 2015

Hey @alpduhuez,

Sorry for the delay. I'm not sure exactly what's going on here. Deleting the data directories could at least help us understand whether this is some corruption issue; however, it's a fairly destructive action and should only be done if the cluster state and data are unimportant. Were you able to get this sorted out? There are some further improvements to Consul and Raft in the master branch, which should be released soon as 0.5.1, so it might also be worthwhile to give that a try if you can.

@alpduhuez (Author)

@ryanuber thank you for looking. Deleting the data is in fact what I did to get back to square one. I eventually found that the state was being caused by Chef runs: every time the nodes converged, the Consul recipe scheduled a restart. Over time, the convergence of all 3 servers would eventually line up and cause a down state. If I did the manual recovery steps, the cluster would come back.

As for the health checks, it was also a Chef recipe bug that would sometimes mangle the service check.

I've fixed our recipes now so they only install Consul and don't modify it afterwards, and fixed those bugs as well. Since I've done that, the cluster has been happy.
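
For anyone hitting the same pattern, one simple mitigation is to stagger the restarts so a configuration-management run never bounces the whole server quorum at once. This is only a hypothetical sketch of that idea; the delay bound is arbitrary and the init script path matches the one used earlier in this issue.

#!/bin/bash
# Add a random splay before restarting Consul so that Chef converges
# on different servers do not restart them at the same moment.
sleep $(( RANDOM % 300 ))   # wait up to 5 minutes
sudo /etc/init.d/consul restart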

duckhan pushed a commit to duckhan/consul that referenced this issue Oct 24, 2021
* Acceptance: wait for previous installation to exit

In some cases the previous installation is listed as uninstalled by Helm
but its pods are still terminating. This can cause issues for the newly
running tests when they query Kubernetes to get information about the
new installation, e.g. getting the server pod IPs might return the IPs
from the old installation as well.

* Make static-client/server exit faster