Consul unable to deregister service due to ACL, showing empty accessorID #8078

Closed
tommyalatalo opened this issue Jun 10, 2020 · 5 comments
Labels
  • theme/consul-nomad: Consul & Nomad shared usability
  • theme/internals: Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
  • type/bug: Feature does not function as expected

Comments

tommyalatalo commented Jun 10, 2020

Overview of the Issue

Sometimes, when stopping a Nomad job that uses Vault's Consul secrets engine to obtain an ACL token, the service fails to deregister from Consul with the error message below. Note the empty accessorID.

2020-05-18T13:13:40.600Z [WARN]  agent: Service deregistration blocked by ACLs: service=api-e827a3f70da4-9997 accessorID=
2020-05-18T13:13:40.601Z [WARN]  agent: Check deregistration blocked by ACLs: check=service:api-e827a3f70da4-9997:2 accessorID=
2020-05-18T13:13:40.602Z [WARN]  agent: Check deregistration blocked by ACLs: check=api-e827a3f70da4-9997-ttl accessorID=

In my case the job is the fabio proxy, which registers its TTL health check itself (i.e. the check is not registered by Nomad, as far as I know).

Reproduction Steps

  1. Start a Nomad job which fetches Consul credentials using Vault's Consul secrets engine.
  2. Stop the job with "nomad job stop -purge".
  3. The job stops, but some services and health checks may remain and cannot be deregistered, either with consul services deregister or via hashi-ui, which uses the same REST API. (The zombies seem to be specifically the "TTL" checks registered by both fabio and rabbitmq.)

The only way to get rid of the zombie service and the error message in the Consul logs is to run consul leave on each of our server nodes and then restart Consul.
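
For reference, the manual deregistration attempted in step 3 presumably goes through the agent's service deregistration endpoint. Below is a minimal Python sketch of that request. It assumes a local agent at the default address 127.0.0.1:8500; the token value is a placeholder and build_deregister_request is a hypothetical helper name. The request is only built, not sent.

```python
import urllib.request


def build_deregister_request(service_id, token, agent_addr="http://127.0.0.1:8500"):
    """Build (without sending) a PUT to the agent's service deregister endpoint."""
    url = f"{agent_addr}/v1/agent/service/deregister/{service_id}"
    req = urllib.request.Request(url, method="PUT")
    # The ACL token travels in the X-Consul-Token header.
    req.add_header("X-Consul-Token", token)
    return req


# Service ID taken from the log lines above; the token is a placeholder.
req = build_deregister_request("api-e827a3f70da4-9997", "PLACEHOLDER-TOKEN")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

If the token attached to the request no longer exists, the agent has nothing valid to authorize the deregistration with, which matches the blocked deregistrations in the log.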

Consul info for both Client and Server

Running versions:

  • Consul 1.7.3
  • Nomad 0.11.1
  • Vault 1.4.0
  • Ubuntu 18.04 LTS

Other info

This issue is possibly linked to #7669

Member

mkeeler commented Jun 10, 2020

Having an accessorID being blank indicates that either the request was made with the anonymous token (changed in 1.8.0 to output the accessor id of the anonymous token) or that the token used for the registration has been deleted.

My guess here is that Nomad deleted the Consul token at about the same time as deregistering the service with the Consul agent. Then when the Consul agent went to fully remove the service from the Catalog the token was no longer valid for use.

Thinking out loud a bit here, but I am wondering if Consul should unconditionally perform deregistrations during anti-entropy with the agent token instead of the token used to register/deregister the service with the local agent?
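
The suspected ordering problem can be illustrated with a toy model. This is a sketch under stated assumptions, not Consul's implementation: TokenTable and sync_deregistration are hypothetical names, and the ACL check is reduced to "the token still exists".

```python
class TokenTable:
    """Toy stand-in for the servers' ACL token table (secret -> accessor ID)."""

    def __init__(self):
        self._accessors = {}

    def create(self, secret, accessor_id):
        self._accessors[secret] = accessor_id

    def delete(self, secret):
        self._accessors.pop(secret, None)

    def resolve(self, secret):
        # A deleted (or unknown) token resolves to an empty accessor ID.
        return self._accessors.get(secret, "")


def sync_deregistration(tokens, service_id, secret):
    """Model the anti-entropy step that removes a service from the catalog.

    Returns (allowed, accessor_id). The real ACL check is richer; here a
    token is allowed simply if it still exists.
    """
    accessor_id = tokens.resolve(secret)
    return (accessor_id != "", accessor_id)


tokens = TokenTable()
tokens.create("job-secret", "accessor-1234")

# Order 1: the catalog sync runs while the token still exists.
in_time = sync_deregistration(tokens, "api-e827a3f70da4-9997", "job-secret")

# Order 2: the token is deleted before the catalog sync runs.
tokens.delete("job-secret")
too_late = sync_deregistration(tokens, "api-e827a3f70da4-9997", "job-secret")

print(in_time)   # (True, 'accessor-1234')
print(too_late)  # (False, '')
```

In the second ordering the accessor ID comes back empty and the deregistration is refused, which is the shape of the log lines in the report.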

mkeeler added the type/bug, theme/consul-nomad, and theme/internals labels on Jun 10, 2020

Author

tommyalatalo commented Jun 10, 2020

> Having an accessorID being blank indicates that either the request was made with the anonymous token (changed in 1.8.0 to output the accessor id of the anonymous token) or that the token used for the registration has been deleted.
>
> My guess here is that Nomad deleted the Consul token at about the same time as deregistering the service with the Consul agent. Then when the Consul agent went to fully remove the service from the Catalog the token was no longer valid for use.
>
> Thinking out loud a bit here, but I am wondering if Consul should unconditionally perform deregistrations during anti-entropy with the agent token instead of the token used to register/deregister the service with the local agent?

Your suggestion that Nomad is clearing out the ACL token before Consul has had the chance to use it to deregister the service seems quite plausible. I assume the anonymous token isn't used at all in this case, since Nomad fetches a Consul token with specific policies from Vault when running the job.

On initial thought, using the agent token to force removal of services during anti-entropy seems like a good idea. I don't know if there would be any drawbacks to that approach, except that using the agent token in this situation gives Consul higher permissions to deregister any service.

For us this is becoming a pretty big problem because it occurs a lot, several times a day when we're trying things out in the cluster (starting and stopping jobs, testing node shutdowns, etc.). Having these zombie services remain is really annoying, and the workaround of restarting the servers one by one to clear out the service and its health checks is iffy at best.

@pierresouchay
Contributor

We had this in the past and sometimes still have this issue (although very infrequently since we took measures to avoid it).

Our main original issue was an agent being re-installed and cleaned of all its services, hence the patch we made: #5217

With this PR, the server now accepts a deregistration if it is performed with the agent's own token (since we can consider the agent authoritative regarding the services it is not hosting; for registration, it is different). So, if the agent accepts a deregistration but gets denied access during anti-entropy, it should probably retry with its node agent token to effectively remove the service. From a security perspective, I think this makes sense, as this token is enough to make the agent leave and thus remove all its services at once.

We had this issue yesterday for a very long-running service whose ACL had been removed (but the case of temporary ACLs definitely makes sense, of course).
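
The retry idea can be sketched as follows. This is a hypothetical illustration in Python, not Consul's code: PermissionDenied, make_catalog, and deregister_with_fallback are made-up names, and the toy catalog only checks whether a token is in a set of valid tokens.

```python
class PermissionDenied(Exception):
    """Raised by the toy catalog when a token is not accepted."""


def make_catalog(valid_tokens, services):
    """Return a deregister(service_id, token) function over toy state."""

    def deregister(service_id, token):
        if token not in valid_tokens:
            raise PermissionDenied(service_id)
        services.discard(service_id)

    return deregister


def deregister_with_fallback(deregister, service_id, service_token, agent_token):
    """Try the service's own token first; on denial, retry with the agent token."""
    try:
        deregister(service_id, service_token)
        return "service token"
    except PermissionDenied:
        deregister(service_id, agent_token)
        return "agent token"


services = {"api-e827a3f70da4-9997"}
# The service's token has already been deleted; only the agent token is valid.
deregister = make_catalog({"agent-secret"}, services)

used = deregister_with_fallback(
    deregister, "api-e827a3f70da4-9997", "deleted-secret", "agent-secret"
)
print(used, services)  # agent token set()
```

Trying the service token first keeps the normal path unchanged, and the agent token is used only when the original token is refused. If the agent token is refused as well, the exception propagates to the caller.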

v-byte-cpu commented Sep 5, 2020

@mkeeler I also have this issue. From the documentation at https://www.consul.io/docs/agent/checks:

Checks may also contain a token field to provide an ACL token. This token is used for any interaction with the catalog for the check, including anti-entropy syncs and deregistration.

So the Consul agent caches the Consul token used for check registration. I periodically rotate Consul tokens for Nomad clients, and hence a very long-running service whose ACL token has been removed can't be deregistered from the catalog. I agree with @pierresouchay that it makes sense to retry with the node agent token.
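
To make the token field from the quoted documentation concrete, here is a small Python sketch that builds a TTL check definition and prints it as JSON. The check ID, name, TTL value, and token are placeholders chosen for illustration, not values from this report.

```python
import json

# A TTL check definition carrying its own ACL token. Per the documentation
# quoted above, the agent uses this token for catalog interactions for the
# check, including anti-entropy syncs and deregistration.
check_definition = {
    "check": {
        "id": "api-ttl",               # hypothetical check ID
        "name": "API TTL check",       # hypothetical display name
        "ttl": "30s",                  # placeholder TTL
        "token": "PLACEHOLDER-TOKEN",  # placeholder ACL token
    }
}

print(json.dumps(check_definition, indent=2))
```

If that token is later rotated away, the agent is still holding the old value, which is the situation described above.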

Contributor

dnephin commented Feb 1, 2021

Thank you for the bug reports! We now have at least 3 issues about this problem, so I'm going to close this issue and track the fix in #9577. There's a summary of the problem and some possible solutions in this comment: #9577 (comment).
