Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

service health check not immediately available for session #231

Closed
abursavich opened this issue Jun 26, 2014 · 1 comment · Fixed by #723
Closed

service health check not immediately available for session #231

abursavich opened this issue Jun 26, 2014 · 1 comment · Fixed by #723
Labels
type/bug Feature does not function as expected

Comments

@abursavich
Copy link
Contributor

This isn't a big problem for me because my retry logic will quickly recover from it, but I figured I'd file an issue. I have a service that comes up, registers itself with a health check, and then tries to acquire a lock using the serfHealth check along with its service check for the session. I've noticed that sometimes the first attempt(s) to create a session fail with a "missing check registration" error. Anecdotally from the logs, the error seems to happen when the session creation occurs before the agent (service/health) sync. After the agent sync everything works smoothly.

Sample logs:

==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> WARNING: It is highly recommended to set GOMAXPROCS higher than 1
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'localhost.localdomain'
        Datacenter: 'dc1'
            Server: true (bootstrap: true)
       Client Addr: 127.0.0.1 (HTTP: 8500, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.0.2.15 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2014/06/25 22:03:15 [INFO] serf: EventMemberJoin: localhost.localdomain 10.0.2.15
    2014/06/25 22:03:15 [INFO] serf: EventMemberJoin: localhost.localdomain.dc1 10.0.2.15
    2014/06/25 22:03:15 [INFO] raft: Node at 10.0.2.15:8300 [Follower] entering Follower state
    2014/06/25 22:03:15 [INFO] consul: adding server localhost.localdomain (Addr: 10.0.2.15:8300) (DC: dc1)
    2014/06/25 22:03:15 [INFO] consul: adding server localhost.localdomain.dc1 (Addr: 10.0.2.15:8300) (DC: dc1)
    2014/06/25 22:03:15 [DEBUG] dns: request for {_test.consul. 255 1} (94.705us)
    2014/06/25 22:03:15 [DEBUG] dns: request for {_test.consul. 255 1} (141.358us)
    2014/06/25 22:03:15 [ERR] agent: failed to sync remote state: No cluster leader
    2014/06/25 22:03:17 [WARN] raft: Heartbeat timeout reached, starting election
    2014/06/25 22:03:17 [INFO] raft: Node at 10.0.2.15:8300 [Candidate] entering Candidate state
    2014/06/25 22:03:17 [DEBUG] raft: Votes needed: 1
    2014/06/25 22:03:17 [DEBUG] raft: Vote granted. Tally: 1
    2014/06/25 22:03:17 [INFO] raft: Election won. Tally: 1
    2014/06/25 22:03:17 [INFO] raft: Node at 10.0.2.15:8300 [Leader] entering Leader state
    2014/06/25 22:03:17 [INFO] consul: cluster leadership acquired
    2014/06/25 22:03:17 [INFO] consul: New leader elected: localhost.localdomain
    2014/06/25 22:03:17 [INFO] consul: member 'localhost.localdomain' joined, marking health alive
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/agent/self (237.181us)
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/agent/service/register (127.111us)
    2014/06/25 22:03:18 [DEBUG] Check 'service:redacted-1' status is now passing
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/agent/check/pass/service:redacted-1 (108.952us)
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/catalog/service/redacted (202.819us)
    2014/06/25 22:03:18 [ERR] http: Request /v1/session/create, error: Missing check 'service:redacted-1' registration
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/session/create (1.520863ms)
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/catalog/service/redacted?index=3 (32.946089ms)
    2014/06/25 22:03:18 [INFO] agent: Synced service 'redacted-1'
    2014/06/25 22:03:18 [DEBUG] http: Request /v1/catalog/service/redacted?index=5 (1.052369ms)
    2014/06/25 22:03:18 [INFO] agent: Synced check 'service:redacted-1'
    2014/06/25 22:03:19 [DEBUG] http: Request /v1/session/create (2.277945ms)
    2014/06/25 22:03:19 [DEBUG] http: Request /v1/kv/service/redacted/leader?acquire=433db3cc-d442-6ef9-536f-08dc1882f148 (1.204903ms)
    2014/06/25 22:03:19 [DEBUG] http: Request /v1/kv/service/redacted/leader (103.923us)
@armon
Copy link
Member

armon commented Jun 27, 2014

Thanks for reporting this, I've run into it myself. I think the issue we have is that the /v1/agent endpoint's simply update the local state, and kick off the anti-entropy run. However since we respond to the client before this is finished, there is a race. I think we need to change the /v1/agent endpoints to block until the sync is done (or at least attempted, since it may fail).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants