[Fix] Services sometimes not being synced with acl_enforce_version_8 = false #4771

Merged: 1 commit, on Jan 4, 2019

Conversation

ShimmerGlass
Contributor

@ShimmerGlass ShimmerGlass commented Oct 9, 2018

Fixes: #3676

This fixes a bug where registering a service with a non-existent ACL token can prevent other
services registered with a good token from being synced to the server when using
acl_enforce_version_8 = false.

Background

When acl_enforce_version_8 is off the agent does not check the ACL token validity before
storing the service in its state.
When syncing a service registered with a missing ACL token we fall into the default error
handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255)
and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082)
without setting its Synced property to true like in the permission denied case.
This means that the sync will always stop at the faulty service(s).
The order in which the services are synced is random since we iterate over a map, so eventually
all services with good ACL tokens will be synced. This can however take some time and depends on
the cluster size: the bigger the cluster, the slower, because retries are less frequent.
Having a service in this state also prevents all further syncing of checks, as they are synced after
the services.

Changes

This change modifies the sync process to continue even if there is an error.
This fixes the issue described above and makes the sync more error tolerant when the server repeatedly refuses
a service (the ACL token could have been deleted by the time the service is synced, the servers
could have been upgraded to a newer version with stricter checks on the service definition...).
All services and checks that can be synced will be, and those that can't will be logged as errors
instead of blocking the whole process.
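
As a rough illustration of the continue-on-error approach proposed here, the following sketch logs a failure and moves on to the next entry. The `entry` type, `push` function and token values are hypothetical; only the log message format is borrowed from the diff in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errRejected stands in for any error returned by the server (assumption).
var errRejected = errors.New("rejected by server")

// entry is a simplified, hypothetical view of a local service or check.
type entry struct {
	Token  string
	Synced bool
}

// push is a hypothetical stand-in for the registration RPC.
func push(e *entry) error {
	if e.Token != "good" {
		return errRejected
	}
	return nil
}

// syncAll tries every pending entry. A failure is logged and recorded, and
// the loop carries on instead of aborting the whole pass.
func syncAll(entries map[string]*entry) (failed []string) {
	for id, e := range entries {
		if e.Synced {
			continue
		}
		if err := push(e); err != nil {
			log.Printf("[WARN] agent: Failed to register service %q: %s", id, err)
			failed = append(failed, id)
			continue
		}
		e.Synced = true
	}
	return failed
}

func main() {
	entries := map[string]*entry{
		"web": {Token: "good"},
		"api": {Token: "good"},
		"bad": {Token: "deleted"},
	}
	failed := syncAll(entries)
	fmt.Println("failed entries:", failed)
}
```

With this shape a single pass syncs every entry that the server accepts, regardless of where the rejected entries fall in the iteration order.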

@ShimmerGlass ShimmerGlass force-pushed the fix-bad-acl-register-lock branch from df1cb76 to 4d74088 Compare October 9, 2018 12:14
Contributor

@pierresouchay pierresouchay left a comment

LGTM, very happy we finally found the reason for #3676

@pierresouchay
Contributor

@mkeeler @banks Can someone have a look at this?

While I understand this bug only affects the old ACL system, we spent so much time fighting with it :-) -> #3676

@banks
Member

banks commented Nov 9, 2018

So are you still running with acl_enforce_version_8 = false in production? Have you looked into what it would take to update?

1.4.0 is introducing an even bigger change to ACL (that will remain backward compatible for now but eventually require a migration) and we'd be interested to know what is going to stop people migrating to that. What was the blocker for going from <0.8 ACL to 0.8?

I'll add this to the list to discuss later, once we've had some more time to think about whether this has any problems associated with it, especially since it will be released along with or after an entirely new ACL system...

@banks banks added the needs-discussion label Nov 9, 2018
@banks banks added this to the 1.4.1 milestone Nov 9, 2018
@@ -1079,7 +1079,7 @@ func (l *State) SyncChanges() error {
 			l.logger.Printf("[DEBUG] agent: Service %q in sync", id)
 		}
 		if err != nil {
-			return err
+			l.logger.Printf("[WARN] agent: Failed to register service %q: %s", id, err)
Member

It would be preferable if we could turn the error into a nil inside each of the sync*, delete* methods in the same way we do for v8 ACL permission failures.

That would mean this won't have any other unexpected changes of behaviour for totally unrelated issues that currently error here. I can't immediately think of a reason it would be dangerous not to stop here on an error, although it might be pointless in many cases to keep trying RPCs while there is no connection to the server or something. But the fact that we already special-case some error conditions like ACL denied in those methods, rather than globally catching everything here, increases the chance of breaking an assumption with this change.

Can you clarify what the error looks like in the old-ACL case where the token doesn't exist any more? Is there a reliable way to sniff for it?

Contributor Author

The idea behind this was to prevent other cases when an error stops the whole process in the future. But I understand your point of view.
The condition catches ErrDisabled but not ErrNotFound, which is causing the issue.
I will change this to catch both errors and revert the continue on error in the loop.

@pierresouchay
Contributor

@banks yes, we are preparing it, but we will only apply it after Black Friday since the impact might be huge.

Still, as long as option acl_enforce_version_8=false is present the bug is there ;) (And it took us more than 1 year to figure out the issue and create this patch)

@banks
Member

banks commented Nov 9, 2018 via email

@pierresouchay
Contributor

@banks To be fair, the problem only arises in a few configurations. It requires the following conditions:

  • Having a cluster large enough that there is a significant time between anti-entropy checks (on our clusters, around 7 minutes)
  • Having a large number of checks that do not flap too often on a service with a wrong ACL
  • Having highly dynamic services (many services changing regularly); in our case, a service is added or removed around every minute

When registering a service with a wrong ACL, what happens is:

  • for each check, try to register the check, but if it fails, stop there
  • then, when the check flaps, do the same
  • on anti-entropy check, do the same

When syncing checks, each element of the map containing the checks is visited. In Go, the iteration order differs each time, but if you have enough checks with wrong ACLs, the probability of hitting one of them before the pass completes is quite high (we observe this on machines with more than 30 services with several health checks each). So basically, the full map of checks is never completely synced -> those servers never sync their services. It happens ONLY if your checks are stable enough (not flapping too often); otherwise, syncs of services are re-triggered between anti-entropy checks.
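
To get a feel for the odds described above, here is a back-of-the-envelope sketch. It assumes a uniformly random visiting order, which Go does not guarantee for map iteration, and it assumes the pass stops at the first entry with a wrong ACL. Under those assumptions, the chance that a single pass reaches all `good` pending entries before any of the `bad` ones is 1 / C(good+bad, bad). The function names are made up for this example.

```go
package main

import "fmt"

// probAllGoodFirst returns the probability that, in a uniformly random order,
// every one of the `good` entries comes before any of the `bad` entries.
// This equals 1 / C(good+bad, bad), computed as a running product.
func probAllGoodFirst(good, bad int) float64 {
	p := 1.0
	for i := 1; i <= bad; i++ {
		p *= float64(i) / float64(good+i)
	}
	return p
}

// probOneGoodFirst returns the probability that one particular good entry is
// visited before any of the `bad` entries in a uniformly random order.
func probOneGoodFirst(bad int) float64 {
	return 1.0 / float64(bad+1)
}

func main() {
	// Hypothetical numbers: 30 pending good entries and 3 with a wrong ACL.
	fmt.Printf("all 30 good entries synced in one pass: %.6f\n", probAllGoodFirst(30, 3))
	fmt.Printf("one given good entry synced in one pass: %.4f\n", probOneGoodFirst(3))
}
```

With these made-up numbers a complete pass happens about once in 5456 attempts, which is consistent with the observation that such nodes practically never sync fully when passes are minutes apart.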

In our case, the probability is so high that those nodes never sync fully. We have had a workaround running for several months: when we register a service locally, a script checks that the service is properly published in the catalog (so, not blocked by ACL); if it is not after a grace period, the script kills the service and de-registers it. It works as a workaround, and we plan to switch to acl_enforce_version_8 = true to avoid this, but this is a real issue we had for months in production. Eventually, none of the nodes of our clusters were properly synced due to this error. Note that we had this exact same behaviour, which caused an outage on our side, before the fix we did in Connect ( #4620 ) -> since most Connect services had wrong ACLs, all nodes were desynced -> Consul cluster not working at all (all checks in Critical state, since all of our checks have initial status := critical) and all services disappearing since all were critical.

@banks
Member

banks commented Nov 9, 2018

Thanks for the additional info.

I think the question inline is whether we can do the same thing with a more targeted change that doesn't potentially affect other types of errors too. I can't think of a reason that any other error would cause problems with this change, but it would take me a lot of thinking to convince myself it's safe in any possible error case to just keep going with the loop. If we can scope it down to only affecting the case you hit, then it would be a no-brainer to merge instead of a some-brainer 😄.

Continue on error in the case of ACL not found when syncing the agent
state. This fixes a bug where a service registered with a missing ACL
token could prevent the agent from syncing its state entirely with
acl_enforce_version_8 = false.
@ShimmerGlass ShimmerGlass force-pushed the fix-bad-acl-register-lock branch from 4d74088 to 24a45b4 Compare November 9, 2018 13:36
@ShimmerGlass
Contributor Author

@banks I reworked the patch as you suggested; it now only adds an acl.IsErrNotFound check (see my response above).

Member

@banks banks left a comment

Fantastic thanks. We'll talk about when to merge in next release discussion.

@ShimmerGlass
Contributor Author

Awesome, thanks :)

@pierresouchay
Contributor

@banks Thank you

@banks banks removed the needs-discussion label Nov 28, 2018
@pierresouchay
Contributor

A merge, maybe?
Note that a quite similar behavior can be triggered by a race condition when registering services using the HTTP API, see #4998 and its fix #5012

Member

@mkeeler mkeeler left a comment

Looks good to me. It seems like maybe we could use a more generic ACL-related error to encompass both permission denials when the token is valid and cases where the given token ID is invalid. Either way, this PR fixes the immediate problem.

@mkeeler mkeeler merged commit 5960974 into hashicorp:master Jan 4, 2019
@ShimmerGlass ShimmerGlass deleted the fix-bad-acl-register-lock branch January 4, 2019 15:03
Successfully merging this pull request may close these issues.

Wrong ACLs lead to desynchronization of Consul agent with Cluster
4 participants