service health check not immediately available for session #231

abursavich · 2014-06-26T05:25:48Z

This isn't a big problem for me because my retry logic will quickly recover from it, but I figured I'd file an issue. I have a service that comes up, registers itself with a health check, and then tries to acquire a lock using the serfHealth check along with its service check for the session. I've noticed that sometimes the first attempt(s) to create a session fail with a "missing check registration" error. Anecdotally from the logs, the error seems to happen when the session creation occurs before the agent (service/health) sync. After the agent sync everything works smoothly.

Sample logs:

==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'localhost.localdomain'
        Datacenter: 'dc1'
            Server: true (bootstrap: true)
       Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.0.2.15 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2014/06/25 22:03:15 [INFO] serf: EventMemberJoin: localhost.localdomain 10.0.2.15
    2014/06/25 22:03:15 [INFO] serf: EventMemberJoin: localhost.localdomain.dc1 10.0.2.15
    2014/06/25 22:03:15 [INFO] raft: Node at 10.0.2.15:8300 [Follower] entering Follower state
    2014/06/25 22:03:15 [INFO] consul: adding server localhost.localdomain (Addr: 10.0.2.15:8300) (DC: dc1)
    2014/06/25 22:03:15 [INFO] consul: adding server localhost.localdomain.dc1 (Addr: 10.0.2.15:8300) (DC: dc1)
    2014/06/25 22:03:15 [DEBUG] dns: request for {_test.consul. 255 1} (94.705us)
    2014/06/25 22:03:15 [DEBUG] dns: request for {_test.consul. 255 1} (141.358us)
    2014/06/25 22:03:15 [ERR] agent: failed to sync remote state: No cluster leader
    2014/06/25 22:03:17 [WARN] raft: Heartbeat timeout reached, starting election
    2014/06/25 22:03:17 [INFO] raft: Node at 10.0.2.15:8300 [Candidate] entering Candidate state
    2014/06/25 22:03:17 [DEBUG] raft: Votes needed: 1
    2014/06/25 22:03:17 [DEBUG] raft: Vote granted. Tally: 1
    2014/06/25 22:03:17 [INFO] raft: Election won. Tally: 1
    2014/06/25 22:03:17 [INFO] raft: Node at 10.0.2.15:8300 [Leader] entering Leader state
    2014/06/25 22:03:17 [INFO] consul: cluster leadership acquired
    2014/06/25 22:03:17 [INFO] consul: New leader elected: localhost.localdomain
    2014/06/25 22:03:17 [INFO] consul: member 'localhost.localdomain' joined, marking health alive
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/agent/self (237.181us)
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/agent/service/register (127.111us)
    2014/06/25 22:03:18 [DEBUG] Check 'service:redacted-1' status is now passing
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/agent/check/pass/service:redacted-1 (108.952us)
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/catalog/service/redacted (202.819us)
    2014/06/25 22:03:18 [ERR] http: Request /v1/session/create, error: Missing check 'service:redacted-1' registration
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/session/create (1.520863ms)
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/catalog/service/redacted?index=3 (32.946089ms)
    2014/06/25 22:03:18 [INFO] agent: Synced service 'redacted-1'
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/catalog/service/redacted?index=5 (1.052369ms)
    2014/06/25 22:03:18 [INFO] agent: Synced check 'service:redacted-1'
    2014/06/25 22:03:19 [DEBUG] http: Request /v1/session/create (2.277945ms)
    2014/06/25 22:03:19 [DEBUG] http: Request /v1/kv/service/redacted/leader?acquire=433db3cc-d442-6ef9-536f-08dc1882f148 (1.204903ms)
    2014/06/25 22:03:19 [DEBUG] http: Request /v1/kv/service/redacted/leader (103.923us)

The text was updated successfully, but these errors were encountered:

armon · 2014-06-27T04:59:24Z

Thanks for reporting this, I've run into it myself. I think the issue we have is that the /v1/agent endpoint's simply update the local state, and kick off the anti-entropy run. However since we respond to the client before this is finished, there is a race. I think we need to change the /v1/agent endpoints to block until the sync is done (or at least attempted, since it may fail).

armon added the type/bug Feature does not function as expected label Oct 14, 2014

ryanuber mentioned this issue Feb 21, 2015

Block until service/check sync is attempted from /v1/agent API's #723

Merged

ryanuber closed this as completed in #723 Feb 23, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

service health check not immediately available for session #231

service health check not immediately available for session #231

abursavich commented Jun 26, 2014

armon commented Jun 27, 2014

service health check not immediately available for session #231

service health check not immediately available for session #231

Comments

abursavich commented Jun 26, 2014

armon commented Jun 27, 2014