ACLs are refreshed on request critical path #3524
Comments
@slackpad would you have input on this issue?
Hi @kamaradclimber, with the WAN soft fail, if the ACL DC is totally offline (attempts to request from each of the servers have failed), we short-circuit and fail right away. You are right, though, that if things are flappy there is a middle ground where things can get very slow. I need to think on this a little more, but since ACLs sit in these critical places, we may want to put some special controls on them, possibly up to using a context with a configurable deadline here, so you can set a kind of SLA on how long you want to spend trying to enforce ACLs before falling back to your down policy.
Thanks @slackpad for the answer.
Hello @slackpad,
@slackpad any thoughts on the previous suggestion? Capping the maximum time spent fetching ACLs while keeping the fetch on the critical path will still hurt read performance (even though the delay is capped at a configurable maximum).
I'd be glad to have feedback on the suggested solution (maybe by @kyhavlov ?)
Hello @banks,
No specific feedback for now other than we are going to be looking at ACLs in a few ways soon and this is good to have in mind.
It will allow the following:
* When connectivity is limited (saturated links between DCs), only a single request to refresh an ACL is sent to the ACL master DC instead of stacking ACL refresh queries.
* When extend-cache is used for ACLs, do not wait for the result but refresh the ACL asynchronously, so the delay does not impact the slave DC.
* When extend-cache is not used, keep the existing blocking mechanism, but only send a single refresh request.

This will fix hashicorp#3524
consul version for both Client and Server
Client: 0.9.3
Server: 0.9.3
consul info for both Client and Server
Server:
Operating system and Environment details
Description of the Issue (and unexpected/desired result)
When resolving a token from a non-authoritative server (a non-leader in the ACL DC or any server outside of the ACL DC), tokens are refreshed on the critical path.
If the connection to the ACL datacenter is normally slow (high latency) or unexpectedly slow (WAN issue), this dramatically increases the 99th-percentile latency of reading Consul data.
An example of metrics we have during an outage between va1 (remote datacenter) and the acl datacenter.
https://snapshot.raintank.io/dashboard/snapshot/bt6hf74mgL8o7176bIgMK6TaREFX14yy?refresh=15m&orgId=2
(it first happens at 22:36 UTC and then a second time a few hours later)
https://github.com/hashicorp/consul/blob/v0.9.3/agent/consul/acl.go#L168
It looks somewhat connected to #3111 and possibly #2604.