[Fix] Services sometimes not being synced with acl_enforce_version_8 = false #4771

ShimmerGlass · 2018-10-09T11:48:08Z

This fixes a bug were registering an agent with a non-existent ACL token can prevent other
services registered with a good token from being synced to the server when using
acl_enforce_version_8 = false.

Background

When acl_enforce_version_8 is off the agent does not check the ACL token validity before
storing the service in its state.
When syncing a service registered with a missing ACL token we fall into the default error
handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255)
and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082)
without setting its Synced property to true like in the permission denied case.
This means that the sync will always stop at the faulty service(s).
The order in which the services are synced is random since we iterate on a map. So eventually
all services with good ACL tokens will be synced, this can however take some time and is influenced
by the cluster size, the bigger the slower because retries are less frequent.
Having a service in this state also prevent all further sync of checks as they are done after
the services.

Changes

This change modify the sync process to continue even if there is an error.
This fixes the issue described above as well as making the sync more error tolerant: if the server repeatedly refuses
a service (the ACL token could have been deleted by the time the service is synced, the servers
were upgraded to a newer version that has more strict checks on the service definition...).
Then all services and check that can be synced will, and those that don't will be marked as errors in
the logs instead of blocking the whole process.

pierresouchay

LGTM, very happy we finally found the reason for #3676

pierresouchay · 2018-11-09T08:21:20Z

@mkeeler @banks Can someone have a look to this?

While I understand this bug is only with old ACL system, we spent so much time fighting with this bug :-) -> #3676

banks · 2018-11-09T11:40:32Z

So are you still running with acl_enforce_version_8 = false in production? Have you looked into what it would take to update?

1.4.0 is introducing an even bigger change to ACL (that will remain backward compatible for now but eventually require a migration) and we'd be interested to know what is going to stop people migrating to that. What was the blocker for going from <0.8 ACL to 0.8?

I'll add this to list to discuss later when we've had some more time to think about if this has any problems associated with it especially since it will be released along with or after an entirely new ACL system...

banks · 2018-11-09T11:52:50Z

agent/local/state.go

@@ -1079,7 +1079,7 @@ func (l *State) SyncChanges() error {
 			l.logger.Printf("[DEBUG] agent: Service %q in sync", id)
 		}
 		if err != nil {
-			return err
+			l.logger.Printf("[WARN] agent: Failed to register service %q: %s", id, err)


It would be preferable if we could turn the error into a nil inside each of the sync*, delete* methods in the same way we do for v8 ACL permission failures.

That would mean this won't have any other unexpected changes of behaviour for totally unrelated issues that currently error here. I can't immediately think of a reason it would be dangerous not to stop here on an error although it might be pointless in many cases to keep trying RPCs while there is no connection to server or something. But the fact that we already special case some error conditions like ACL denied in those methods rant her than globally catching everything here increases the chance of breaking an assumption with this change.

Can you clarify what the error does look like in the old ACL that doesn't exist any more case? Is there a reliable way to sniff for it?

The idea behind this was to prevent other cases when an error stops the whole process in the future. But I understand your point of view.
The condition catches ErrDisabled but not ErrNotFound, which is causing the issue.
I will change this to catch both errors and revert the continue on error in the loop.

pierresouchay · 2018-11-09T11:55:15Z

@banks yes, we are preparing it, but we will apply this after black Friday only since the impact might be huge.

Still, as long as option acl_enforce_version_8=false is present the bug is there ;) (And it took us more than 1 year to figure out the issue and create this patch)

banks · 2018-11-09T11:59:48Z

Yep not saying we don't want this fix, just in the middle of figuring out ACL migration paths and this is interesting extra data 😄.

…

On Fri, Nov 9, 2018 at 11:55 AM Pierre Souchay ***@***.***> wrote: @banks <https://github.com/banks> yes, we are preparing it, but we will apply this after black Friday only since the impact might be huge. Still, as long as option acl_enforce_version_8=false is present the bug is there ;) (And it took us more than 1 year to figure out the issue and create this patch) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#4771 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAHYUyk3H5iY89e1xcXb5kBTlUAwDxn8ks5utW0kgaJpZM4XTDMe> .

pierresouchay · 2018-11-09T12:45:38Z

@banks The problem arises on few configurations to be fair, it requires the following conditions:

Having a large enough cluster for having a significant time between anti-entropy checks (on our clusters, around 7 minutes)
Having a large numbers of checks not flapping too regularly on a service having a wrong ACL
Having highly dynamic services (many services changed regularly), in our case, a new service is added/removed around every minute

When registering a service with wrong ACL, what happens is:

for each check, try register the check, but if it fails, stop here
then, when the check flaps, do the same
on anti-entropy check, do the same

When syncing the check, the each element of the map containing the check is run. With golang, the order is each time different, but if you have enough checks with wrong ACLs, the probability of having a check with wrong ACL is quite high (we observe this on machines having more than 30 services with several health checks each), so basically, the full map of checks is never completely synced -> those servers never sync their services. It happens ONLY if your checks are stable enough (not flapping too often), otherwise, sync of services are re-triggered between anti-entropy checks.

In our case, the probability is such high that those nodes never sync fully. We have a workaround for several months running: when we register a service locally, a script does check that the service is properly published in the catalog (so, not blocked by ACL), if not after a grave period, kill the service and de-register it. It works as a workaround, and we plan to switch acl_enforce_version_8 = true to avoid this, but this is a real issue we had for months in productions. Eventually, all the nodes of our clusters where not properly synced due to this error. Note that we had this exact same behaviour beforce the fix we did in Connect that caused an outage on our side ( #4620 ) -> since most Connect services had wrong ACL, all nodes where desync -> Consul Cluster not working at all (all checks having their state in Critical since all of our Checks have initial status := critical) and all services disappearing since all critical

banks · 2018-11-09T13:08:31Z

Thanks for the additional info.

I think the question inline is can we do the same thing with a more targeted change that doesn't potentially affect other types of errors too. I can't think of a reason that any other error would cause problems with this change, but it would take me a lot of thinking to convince myself it's safe in any possible error case to just keep going with the loop. If we can scope it down to only affecting the case you hit then it would be a no-brainer to merge instead of a some-brainer 😄.

Continue on error in the case of ACL not found when syncing the agent state. This fixes a bug where a service registered with a missing ACL token could prevent the agent from syncing its state entirely with enforce_acl_version_8 = false.

ShimmerGlass · 2018-11-09T13:38:48Z

@banks reworked the patch as you suggested, it now only adds a acl.IsErrNotFound check (see my response above).

banks

Fantastic thanks. We'll talk about when to merge in next release discussion.

ShimmerGlass · 2018-11-09T13:42:45Z

Awesome, thanks :)

pierresouchay · 2018-11-09T13:55:52Z

@banks Thank you

pierresouchay · 2018-12-07T23:57:23Z

A Merge maybe ?
Note that a quite similar behavior can be triggered by a race condition when registering services using the HTTP API, see #4998 and its fix #5012

mkeeler

Looks good to me. It seems like maybe we could use a more generic ACL related error to encompass permission denials when the token is valid and for when the token id given is invalid. Either way this PR fixes the immediate problem.

ShimmerGlass force-pushed the fix-bad-acl-register-lock branch from df1cb76 to 4d74088 Compare October 9, 2018 12:14

pierresouchay approved these changes Oct 9, 2018

View reviewed changes

banks added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Nov 9, 2018

banks added this to the 1.4.1 milestone Nov 9, 2018

banks reviewed Nov 9, 2018

View reviewed changes

ShimmerGlass force-pushed the fix-bad-acl-register-lock branch from 4d74088 to 24a45b4 Compare November 9, 2018 13:36

banks approved these changes Nov 9, 2018

View reviewed changes

banks mentioned this pull request Nov 28, 2018

Service Registration/De-registration slows down due to existing services with expired token. #3363

Closed

banks removed the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Nov 28, 2018

pierresouchay mentioned this pull request Dec 7, 2018

Race Condition while registering services on local agent #4998

Closed

mkeeler approved these changes Jan 4, 2019

View reviewed changes

mkeeler merged commit 5960974 into hashicorp:master Jan 4, 2019

ShimmerGlass deleted the fix-bad-acl-register-lock branch January 4, 2019 15:03

ShimmerGlass mentioned this pull request Jul 12, 2019

Add a sync status endpoint / metrics to agents #6119

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Fix] Services sometimes not being synced with acl_enforce_version_8 = false #4771

[Fix] Services sometimes not being synced with acl_enforce_version_8 = false #4771

ShimmerGlass commented Oct 9, 2018 •

edited

Loading

pierresouchay left a comment

pierresouchay commented Nov 9, 2018

banks commented Nov 9, 2018

banks Nov 9, 2018

ShimmerGlass Nov 9, 2018

pierresouchay commented Nov 9, 2018

banks commented Nov 9, 2018 via email

pierresouchay commented Nov 9, 2018

banks commented Nov 9, 2018

ShimmerGlass commented Nov 9, 2018

banks left a comment

ShimmerGlass commented Nov 9, 2018

pierresouchay commented Nov 9, 2018

pierresouchay commented Dec 7, 2018

mkeeler left a comment

[Fix] Services sometimes not being synced with acl_enforce_version_8 = false #4771

[Fix] Services sometimes not being synced with acl_enforce_version_8 = false #4771

Conversation

ShimmerGlass commented Oct 9, 2018 • edited Loading

Background

Changes

pierresouchay left a comment

Choose a reason for hiding this comment

pierresouchay commented Nov 9, 2018

banks commented Nov 9, 2018

banks Nov 9, 2018

Choose a reason for hiding this comment

ShimmerGlass Nov 9, 2018

Choose a reason for hiding this comment

pierresouchay commented Nov 9, 2018

banks commented Nov 9, 2018 via email

pierresouchay commented Nov 9, 2018

banks commented Nov 9, 2018

ShimmerGlass commented Nov 9, 2018

banks left a comment

Choose a reason for hiding this comment

ShimmerGlass commented Nov 9, 2018

pierresouchay commented Nov 9, 2018

pierresouchay commented Dec 7, 2018

mkeeler left a comment

Choose a reason for hiding this comment

ShimmerGlass commented Oct 9, 2018 •

edited

Loading