Wrong ACLs lead to desynchronization of Consul agent with Cluster #3676
slackpad added the labels type/bug (feature does not function as expected) and theme/acls (ACL and token generation) on Dec 20, 2017
Is there any progress regarding this issue?
This one (#4405) would also help to diagnose those issues.
mkeeler pushed a commit that referenced this issue on Jan 4, 2019:
…= false (#4771)

Fixes: #3676

This fixes a bug where registering a service with a non-existent ACL token can prevent other services registered with a good token from being synced to the server when using `acl_enforce_version_8 = false`.

## Background

When `acl_enforce_version_8` is off, the agent does not check the ACL token's validity before storing the service in its state. When syncing a service registered with a missing ACL token, we fall into the default error handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255) and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082) without setting its Synced property to true, as is done in the permission-denied case. This means the sync will always stop at the faulty service(s).

The order in which the services are synced is random since we iterate over a map, so eventually all services with good ACL tokens will be synced. This can however take some time and is influenced by the cluster size: the bigger the cluster, the slower the sync, because retries are less frequent. Having a service in this state also prevents any further sync of checks, as they are done after the services.

## Changes

This change modifies the sync process to continue even if there is an error. This fixes the issue described above and also makes the sync more error tolerant: if the server repeatedly refuses a service (the ACL token could have been deleted by the time the service is synced, the servers could have been upgraded to a newer version with stricter checks on the service definition, ...), then all services and checks that can be synced will be, and those that cannot will be marked as errors in the logs instead of blocking the whole process.
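To illustrate the shape of the fix described above, here is a minimal, self-contained sketch. It is not the actual code from agent/local/state.go; the types and function names (`localService`, `syncOne`, `syncServices`) are invented for illustration. The point is that the loop now records per-service errors and keeps going instead of returning on the first failure:

```go
package main

import (
	"errors"
	"fmt"
)

// localService is a hypothetical local view of a registered service;
// field names do not match Consul internals.
type localService struct {
	Token  string
	InSync bool
}

// syncOne stands in for the RPC that pushes one service to the servers.
func syncOne(id string, svc *localService) error {
	if svc.Token == "bad-token" {
		return errors.New("rpc error: ACL not found")
	}
	return nil
}

// syncServices collects errors and keeps iterating, so one service with a
// bad token no longer blocks the sync of every other service and check.
func syncServices(services map[string]*localService) error {
	var errs []error
	for id, svc := range services {
		if svc.InSync {
			continue
		}
		if err := syncOne(id, svc); err != nil {
			errs = append(errs, fmt.Errorf("service %q: %v", id, err))
			continue // before the fix, the sync stopped at this point
		}
		svc.InSync = true
	}
	if len(errs) > 0 {
		return fmt.Errorf("%d service(s) failed to sync: %v", len(errs), errs)
	}
	return nil
}

func main() {
	services := map[string]*localService{
		"good-service": {Token: "valid"},
		"bad-service":  {Token: "bad-token"},
	}
	// good-service ends up in sync even though bad-service keeps failing.
	fmt.Println(syncServices(services))
}
```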
@mkeeler thank you
Hello
Consul Version: 0.9.3 (client & server)
We are using ACLs to register services and health checks in an on-demand cluster, so people may end up using wrong tokens for the services they want to register.
We found out that when several services have wrong ACLs, the state in the local Consul agent is not the same as the one seen in the cluster.
This seems legitimate for services registered with a wrong token; however, it sometimes seems that legitimate services cannot sync their state either.
We are registering lots of services using the localhost:8500/v1/agent/service/register endpoint, and the issue is especially visible for health check statuses that are not properly updated. The problem seems linked to having several services with wrong tokens (those services may include several health checks). Forbidding applications with wrong tokens seems to avoid the issue, but the behavior remains problematic since it desynchronizes the agent from the state reported by the cluster.
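For context, a minimal sketch of how such a registration looks with the official Go API client (github.com/hashicorp/consul/api); the service name, port, health endpoint, and token value here are made up for illustration:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// The client talks to the local agent on localhost:8500 by default.
	cfg := api.DefaultConfig()
	cfg.Token = "11111111-2222-3333-4444-555555555555" // per-service ACL token (made up)

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent to PUT /v1/agent/service/register against the local agent.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		ID:   "mysuper-service-31519", // hypothetical, matching the example below
		Name: "mysuper-service",
		Port: 31519,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://localhost:31519/health", // made-up health endpoint
			Interval: "10s",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

If the token on such a registration is invalid, the agent still accepts and stores the service locally (with `acl_enforce_version_8 = false`), which is the situation that triggers the sync problem described in this issue.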
Example of messages as seen in the Consul Agent:
Here, mysuper-service-31519 clearly has a wrong token and its registration fails.
But after a while, having several services doing a similar thing completely breaks agent synchronization, so the state in the cluster is no longer updated.
A restart of the agent solves the issue most of the time (though we had cases where we also had to clean up the repository of checks to solve it). It is especially strange that the service registered with its token is then properly updated (including its health checks), so I suspect that after a restart Consul loses the token context and does not use the ACL anymore.
We are currently trying to put together a clear reproduction of the bug, but it seems complicated (it requires several clients doing the same thing), and the problem only seems to appear after a while.
Do you have an idea where it could come from?