Wrong ACLs lead to desynchronization of Consul agent with Cluster #3676
slackpad added the labels type/bug (feature does not function as expected) and theme/acls (ACL and token generation) on Dec 20, 2017
Is there any progress regarding this issue?
This one (#4405) would also help to diagnose those issues.
mkeeler pushed a commit that referenced this issue on Jan 4, 2019:
…= false (#4771)

Fixes: #3676

This fixes a bug where registering a service with a non-existent ACL token can prevent other services registered with a good token from being synced to the server when using `acl_enforce_version_8 = false`.

## Background

When `acl_enforce_version_8` is off, the agent does not check the ACL token's validity before storing the service in its state. When syncing a service registered with a missing ACL token, we fall into the default error handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255) and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082) without setting its Synced property to true, as is done in the permission-denied case. This means the sync will always stop at the faulty service(s).

The order in which the services are synced is random since we iterate over a map, so eventually all services with good ACL tokens will be synced. This can however take some time and is influenced by the cluster size: the bigger the cluster, the slower the sync, because retries are less frequent. Having a service in this state also prevents any further sync of checks, as they are done after the services.

## Changes

This change modifies the sync process to continue even if there is an error. This fixes the issue described above and also makes the sync more error tolerant: if the server repeatedly refuses a service (the ACL token could have been deleted by the time the service is synced, the servers could have been upgraded to a newer version with stricter checks on the service definition, ...), then all services and checks that can be synced will be, and those that cannot will be marked as errors in the logs instead of blocking the whole process.
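To illustrate the shape of the fix described above, here is a minimal, self-contained sketch. It is not the actual code from agent/local/state.go; the types and function names (`localService`, `syncOne`, `syncServices`) are invented for illustration. The point is that the loop now records per-service errors and keeps going instead of returning on the first failure:

```go
package main

import (
	"errors"
	"fmt"
)

// localService is a hypothetical local view of a registered service;
// field names do not match Consul internals.
type localService struct {
	Token  string
	InSync bool
}

// syncOne stands in for the RPC that pushes one service to the servers.
func syncOne(id string, svc *localService) error {
	if svc.Token == "bad-token" {
		return errors.New("rpc error: ACL not found")
	}
	return nil
}

// syncServices collects errors and keeps iterating, so one service with a
// bad token no longer blocks the sync of every other service and check.
func syncServices(services map[string]*localService) error {
	var errs []error
	for id, svc := range services {
		if svc.InSync {
			continue
		}
		if err := syncOne(id, svc); err != nil {
			errs = append(errs, fmt.Errorf("service %q: %v", id, err))
			continue // before the fix, the sync stopped at this point
		}
		svc.InSync = true
	}
	if len(errs) > 0 {
		return fmt.Errorf("%d service(s) failed to sync: %v", len(errs), errs)
	}
	return nil
}

func main() {
	services := map[string]*localService{
		"good-service": {Token: "valid"},
		"bad-service":  {Token: "bad-token"},
	}
	// good-service ends up in sync even though bad-service keeps failing.
	fmt.Println(syncServices(services))
}
```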
@mkeeler thank you
Hello
Consul Version: 0.9.3 (client & server)
We are using ACLs to register services and health checks in an on-demand cluster, so people may end up using wrong tokens for the services they want to register.
We found out that when several services have wrong ACLs, the state in the local Consul agent is not the same as the one seen in the cluster.
This seems legitimate for services registered with a wrong token; however, it sometimes seems that legitimate services cannot sync their state either.
We are registering lots of services using the localhost:8500/v1/agent/service/register endpoint, and the issue is especially visible for health check statuses that are not properly updated. The problem seems linked to having several services with wrong tokens (those services may include several health checks). Forbidding applications with wrong tokens seems to avoid the issue, but the behavior remains problematic since it desynchronizes the agent from the state reported by the cluster.
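For context, a minimal sketch of how such a registration looks with the official Go API client (github.com/hashicorp/consul/api); the service name, port, health endpoint, and token value here are made up for illustration:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// The client talks to the local agent on localhost:8500 by default.
	cfg := api.DefaultConfig()
	cfg.Token = "11111111-2222-3333-4444-555555555555" // per-service ACL token (made up)

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Equivalent to PUT /v1/agent/service/register against the local agent.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		ID:   "mysuper-service-31519", // hypothetical, matching the example below
		Name: "mysuper-service",
		Port: 31519,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://localhost:31519/health", // made-up health endpoint
			Interval: "10s",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

If the token on such a registration is invalid, the agent still accepts and stores the service locally (with `acl_enforce_version_8 = false`), which is the situation that triggers the sync problem described in this issue.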
Example of messages as seen in the Consul Agent:
Here, mysuper-service-31519 clearly has a wrong token and its registration fails.
But after a while, having several services doing a similar thing completely breaks agent synchronization, so the state in the cluster is no longer updated.
A restart of the agent solves the issue most of the time (though we had cases where we also had to clean up the repository of checks to solve it). It is especially strange that the service registered with its token is then properly updated (including its health checks), so I suspect that after a restart Consul loses the token context and does not use the ACL anymore.
We are currently trying to put together a clear reproduction of the bug, but it seems complicated (it requires several clients doing the same thing), and the problem only seems to appear after a while.
Do you have an idea where it could come from?