[Fix] Services sometimes not being synced with acl_enforce_version_8 = false #4771
Conversation
Force-pushed from df1cb76 to 4d74088
LGTM, very happy we finally found the reason for #3676
So are you still running with acl_enforce_version_8 = false? 1.4.0 is introducing an even bigger change to ACLs (one that will remain backward compatible for now but will eventually require a migration), and we'd be interested to know what is going to stop people migrating to that. What was the blocker for going from <0.8 ACLs to 0.8? I'll add this to the list to discuss later, when we've had some more time to think about whether this has any problems associated with it, especially since it will be released along with or after an entirely new ACL system...
agent/local/state.go (outdated)

```diff
@@ -1079,7 +1079,7 @@ func (l *State) SyncChanges() error {
 			l.logger.Printf("[DEBUG] agent: Service %q in sync", id)
 		}
 		if err != nil {
-			return err
+			l.logger.Printf("[WARN] agent: Failed to register service %q: %s", id, err)
```
It would be preferable if we could turn the error into a nil inside each of the sync*/delete* methods, in the same way we do for v8 ACL permission failures.

That would mean this won't have any other unexpected changes of behaviour for totally unrelated issues that currently error here. I can't immediately think of a reason it would be dangerous not to stop here on an error, although it might be pointless in many cases to keep trying RPCs while there is no connection to the servers. But the fact that we already special-case some error conditions like ACL denied in those methods, rather than globally catching everything here, increases the chance of breaking an assumption with this change.

Can you clarify what the error looks like in the case where an old ACL token no longer exists? Is there a reliable way to sniff for it?
The idea behind this was to prevent future cases where an error stops the whole process. But I understand your point of view.

The condition catches ErrDisabled but not ErrNotFound, which is what causes the issue. I will change this to catch both errors and revert the continue-on-error in the loop.
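For context, here is a minimal, runnable sketch of the narrower handling being agreed on here. The names (errPermissionDenied, errACLNotFound, syncService, the register callback) are hypothetical stand-ins, not Consul's actual identifiers; the real acl package exposes its own error values and helpers:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the sentinel errors discussed above.
var (
	errPermissionDenied = errors.New("Permission denied")
	errACLNotFound      = errors.New("ACL not found")
)

type service struct{ inSync bool }

// syncService sketches the narrower fix: an ACL-denied or ACL-not-found
// response marks the service as in sync (so it stops blocking anti-entropy),
// while any other error still aborts the caller's loop.
func syncService(id string, svc *service, register func() error) error {
	err := register()
	switch {
	case err == nil:
		svc.inSync = true
		fmt.Printf("[INFO] agent: Synced service %q\n", id)
		return nil
	case err == errPermissionDenied, err == errACLNotFound:
		// Don't keep retrying a token the servers will never accept.
		svc.inSync = true
		fmt.Printf("[WARN] agent: Service %q registration blocked by ACLs\n", id)
		return nil
	default:
		return err
	}
}

func main() {
	svc := &service{}
	_ = syncService("web", svc, func() error { return errACLNotFound })
	fmt.Println("in sync:", svc.inSync) // true: the sync loop can move on
}
```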
@banks yes, we are preparing it, but we will only apply it after Black Friday since the impact might be huge. Still, as long as the option acl_enforce_version_8=false is present, the bug is there ;) (And it took us more than 1 year to figure out the issue and create this patch)
Yep, not saying we don't want this fix, just in the middle of figuring out ACL migration paths and this is interesting extra data 😄.
@banks To be fair, the problem arises only in a few configurations; it requires the conditions described below.

When registering a service with a wrong ACL token, what happens is this: when syncing the checks, each element of the map containing the checks is visited. With Go, the iteration order is different each time, but if you have enough checks with wrong ACLs, the probability of hitting a check with a wrong ACL is quite high (we observe this on machines running more than 30 services with several health checks each), so basically the full map of checks is never completely synced -> those servers never sync their services. It happens ONLY if your checks are stable enough (not flapping too often); otherwise the sync of services is re-triggered between anti-entropy runs. In our case the probability is so high that those nodes never fully sync.

We have had a workaround running for several months: when we register a service locally, a script checks that the service is properly published in the catalog (i.e., not blocked by ACLs); if it is not after a grace period, it kills the service and de-registers it. It works as a workaround, and we plan to switch…
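As a standalone illustration of the Go behaviour relied on above (not Consul code): map iteration order is randomized, so a service with a bad token lands at a different position in the sync loop on each pass.

```go
package main

import "fmt"

func main() {
	// Three registered services; pretend "db" has a bad ACL token.
	services := map[string]string{"web": "ok", "db": "bad-token", "cache": "ok"}

	// Go randomizes map iteration order, so "db" shows up at different
	// positions across iterations and runs; any sync loop that aborts on
	// the first error therefore skips a random subset of the other entries.
	for pass := 1; pass <= 3; pass++ {
		fmt.Printf("pass %d:", pass)
		for id := range services {
			fmt.Printf(" %s", id)
		}
		fmt.Println()
	}
}
```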
Thanks for the additional info. I think the question inline is whether we can do the same thing with a more targeted change that doesn't potentially affect other types of errors too. I can't think of a reason that any other error would cause problems with this change, but it would take me a lot of thinking to convince myself it's safe in every possible error case to just keep going with the loop. If we can scope it down to only affect the case you hit, then it would be a no-brainer to merge instead of a some-brainer 😄.
Continue on error in the case of ACL not found when syncing the agent state. This fixes a bug where a service registered with a missing ACL token could prevent the agent from syncing its state entirely with acl_enforce_version_8 = false.
Force-pushed from 4d74088 to 24a45b4
@banks reworked the patch as you suggested, it now only adds a…
Fantastic, thanks. We'll talk about when to merge in the next release discussion.
Awesome, thanks :)
@banks Thank you
Looks good to me. It seems like maybe we could use a more generic ACL-related error to encompass both permission denials when the token is valid and the case where the given token ID is invalid. Either way, this PR fixes the immediate problem.
Fixes: #3676
This fixes a bug where registering a service with a non-existent ACL token can prevent other services registered with a good token from being synced to the server when using acl_enforce_version_8 = false.

Background

When acl_enforce_version_8 is off, the agent does not check the ACL token's validity before storing the service in its state.
When syncing a service registered with a missing ACL token, we fall into the default error-handling case (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1255) and stop the sync (https://github.com/hashicorp/consul/blob/master/agent/local/state.go#L1082) without setting its Synced property to true, as is done in the permission-denied case. This means that the sync will always stop at the faulty service(s).
The order in which the services are synced is random, since we iterate over a map. So eventually all services with good ACL tokens will be synced; this can however take some time and is influenced by the cluster size (the bigger the cluster, the slower the recovery, because retries are less frequent).

Having a service in this state also prevents all further syncing of checks, as they are synced after the services.
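To make the failure mode concrete, here is a minimal, self-contained sketch of the pre-fix behaviour described above. The names (syncOne, errACLNotFound) are hypothetical stand-ins, not the actual identifiers in agent/local/state.go:

```go
package main

import (
	"errors"
	"fmt"
)

var errACLNotFound = errors.New("rpc error: ACL not found") // stand-in

// syncOne pretends to register one service; "db" has a missing token.
func syncOne(id string) error {
	if id == "db" {
		return errACLNotFound
	}
	fmt.Printf("[INFO] agent: Synced service %q\n", id)
	return nil
}

func main() {
	services := map[string]bool{"web": false, "db": false, "cache": false}

	// Pre-fix behaviour: the first error aborts the whole pass, so any
	// service that happens to iterate after "db" (and every check, since
	// checks are synced after services) stays unsynced until the next try.
	for id := range services {
		if err := syncOne(id); err != nil {
			fmt.Printf("[ERR] agent: failed to sync remote state: %s\n", err)
			return
		}
		services[id] = true
	}
}
```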
Changes
This change modifies the sync process to continue even if there is an error. This fixes the issue described above, as well as making the sync more error-tolerant: if the server repeatedly refuses a service (the ACL token could have been deleted by the time the service is synced, the servers could have been upgraded to a newer version with stricter checks on the service definition...), then all services and checks that can be synced will be, and those that cannot will be marked as errors in the logs instead of blocking the whole process.
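A minimal sketch of the tolerant loop this description proposes, reusing the hypothetical helpers from the previous sketch (note that, per the review discussion above, the merged patch ultimately scoped this tolerance to the ACL-not-found case):

```go
// Post-change behaviour: log each failure and keep going, so one faulty
// service can no longer starve the rest of the sync.
for id := range services {
	if err := syncOne(id); err != nil {
		fmt.Printf("[WARN] agent: failed to sync service %q: %s\n", id, err)
		continue // remaining services (and checks) still get synced
	}
	services[id] = true
}
```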