403 Permission denied error / 404 not found when submitting concurrent kv GET, PUT or DELETE / kv GET recurse with the same token/different tokens with the same policy from a non-primary datacenter #5219
Comments
Please note that we have had impactful disasters due to this issue. The Consul UI triggers a delete?recurse when it receives a 404. We got unexpected 404 responses because requests associated with the same policy were, intermittently, processed concurrently. |
@viniciusartur Were the requests being made through a client agent or directly to a consul server? |
We’ve had unexpected 403 on both scenarios: through a client agent and directly to a consul server. |
We noticed that configuring the replication token on the non primary datacenter stops reproducing the issue. |
@viniciusartur This is good information. Somehow I managed to miss the "from non primary datacenter" bit in the bug title. I assume that in your non-primary DC you were seeing some error logs about not being able to replicate policies (prior to setting up the replication token). In Consul 1.4.0+ a replication_token must be set in non-primary datacenters. This token needs at least [...] I assume then that the real bug lies somewhere in the remote policy resolution happening on the servers. This also brings up a bigger issue, which is that there is currently no guide for setting up ACLs with multiple datacenters, which would probably have helped. |
@viniciusartur Without the replication token set do those queries ever work? They shouldn't as you have a default deny policy and the down policy is extend-cache. Since it could have never populated the cache of remote policies it should deny access always. |
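For readers following the thread, here is a minimal sketch of the ACL settings discussed above for a server in the non-primary datacenter. It assumes the Consul 1.4 configuration keys `primary_datacenter` and `acl.tokens.replication` (taken here to be what the comment calls replication_token), and the datacenter names dc1/dc2 from the reproduction. The helper name `secondary_acl_config` and the token placeholder are hypothetical; this is not the reporter's actual config.json.

```python
import json


def secondary_acl_config(replication_token, primary_dc="dc1", local_dc="dc2"):
    """Build an illustrative server config for a non-primary datacenter.

    The key names are assumed to match the Consul 1.4 ACL configuration;
    the values mirror the thread (default deny, down policy extend-cache).
    """
    return {
        "datacenter": local_dc,
        "primary_datacenter": primary_dc,
        "acl": {
            "enabled": True,
            "default_policy": "deny",
            "down_policy": "extend-cache",
            # Lets the secondary datacenter's leader replicate ACL data
            # from the primary datacenter.
            "tokens": {"replication": replication_token},
        },
    }


if __name__ == "__main__":
    # Placeholder secret; substitute a real token created in the primary DC.
    config = secondary_acl_config("REPLACE-WITH-REPLICATION-TOKEN")
    print(json.dumps(config, indent=2))
```

The printed JSON could be saved into the directory passed to -config-dir on the dc2 servers.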
@mkeeler Couldn’t you reproduce the issue the way I posted? |
@viniciusartur I was able to reproduce yesterday. Not with vagrant but with a little terraform + docker + a python script to bootstrap acls and do the kv gets/puts. I will have to write up some internals docs on this but there are a few things to note:
I put in a feature request for me to remember that at some point those secondary servers need to determine how stale their replication is and enable the fallback procedure in that case as well: #4842 One other note is that replication is done by the leader in each secondary datacenter. When I spun up my test cluster I wasn't seeing it at first either, but then I remembered I needed to find the leader and view its logs. I have yet to determine why concurrent requests cause bad behavior and result in permission denials. Now that I have reproduced it I am going to figure that out. |
Thanks for clarifying it! |
Overview of the Issue
Given we have 2 consul clusters with ACL enabled and ACL replication
When we submit concurrent kv GET, PUT or DELETE with the same token or with different tokens with the same policy from the non primary datacenter
Then we get "Unexpected response code: 403 (Permission denied)"
Or
Then we have a 404 when we submit a kv GET recurse
It seems that when Consul resolves the policies of this token, it gets lost in some way.
The function resolvePoliciesForIdentity in agent/consul/acl.go follows a different flow when there are concurrent requests with the same token.
Reproduction Steps
➜ reproduce_it cat Vagrantfile
command: consul agent -server -config-dir=config.json
config.json:
Create a policy with key_prefix "" { policy = "write" }
Create a token or create two different tokens using the same policy
Start 2 loops from dc2 writing keys using that token:
Eventually the 403 permission denied errors will come
or
Eventually the 404 will come
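The loop commands themselves are not preserved above. As a rough sketch of what "2 loops writing keys" could look like, the script below runs concurrent writers against the KV HTTP API (PUT /v1/kv/<key> with the X-Consul-Token header) and tallies the status codes. The address, the token placeholder, the key names, and the helper names `kv_put` and `run_loops` are all assumptions for illustration, not the original reproduction.

```python
import threading
import urllib.error
import urllib.request
from collections import Counter

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumption: HTTP address of a dc2 server
TOKEN = "REPLACE-WITH-TOKEN-SECRET"    # assumption: token bound to the policy above


def kv_put(key, value, token=TOKEN, addr=CONSUL_ADDR):
    """PUT one key through the KV HTTP API and return the HTTP status code."""
    req = urllib.request.Request(
        f"{addr}/v1/kv/{key}",
        data=value.encode(),
        method="PUT",
        headers={"X-Consul-Token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 403 and 404 responses surface here; report them as plain codes.
        return err.code


def run_loops(put=kv_put, loops=2, iterations=100):
    """Run `loops` concurrent writers and tally the status codes they see."""
    counts = Counter()
    lock = threading.Lock()

    def worker(n):
        for i in range(iterations):
            status = put(f"repro/loop{n}/key{i}", "x")
            with lock:
                counts[status] += 1

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(loops)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts


if __name__ == "__main__":
    print(dict(run_loops()))
```

Any 403 entries in the printed tally would correspond to the permission denied errors described here. The `put` parameter is injectable so the loop logic can be exercised without a running cluster.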
Please note that in this POC I didn't test through an agent connected to DC2, but we had the issue on a production cluster even when connecting through a local agent instead of communicating directly with the server.
You can reproduce the error with PUT, GET or DELETE.
As I mentioned before,
It seems that when Consul resolves the policies of this token, it gets lost in some way.
The function resolvePoliciesForIdentity in agent/consul/acl.go follows a different flow when there are concurrent requests with the same token or with different tokens that share the same policy.