
ACLs are refreshed on request critical path #3524

Closed
kamaradclimber opened this issue Oct 3, 2017 · 8 comments · Fixed by #4303
Labels
theme/federation-usability Anything related to Federation type/enhancement Proposed improvement or new feature
Milestone

Comments

@kamaradclimber
Contributor

consul version for both Client and Server

Client: 0.9.3
Server: 0.9.3

consul info for both Client and Server

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 3
build:
	prerelease = 
	revision = 112c060
	version = 0.9.3
consul:
	bootstrap = false
	known_datacenters = 7
	leader = true
	leader_addr = 10.71.0.187:8300
	server = true
raft:
	applied_index = 286708491
	commit_index = 286708491
	fsm_pending = 0
	last_contact = 0
	last_log_index = 286708491
	last_log_term = 462759
	last_snapshot_index = 286703820
	last_snapshot_term = 462759
	latest_configuration = [{Suffrage:Voter ID:10.71.0.173:8300 Address:10.71.0.173:8300} {Suffrage:Voter ID:10.71.0.171:8300 Address:10.71.0.171:8300} {Suffrage:Voter ID:10.71.0.187:8300 Address:10.71.0.187:8300}]
	latest_configuration_index = 1
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 462759
runtime:
	arch = amd64
	cpu_count = 24
	goroutines = 21482
	max_procs = 23
	os = linux
	version = go1.9
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 3017
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 4
	member_time = 1093124
	members = 4449
	query_queue = 0
	query_time = 1939
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 66379
	members = 21
	query_queue = 0
	query_time = 1

Operating system and Environment details

  • CentOS 7.3.1611

Description of the Issue (and unexpected/desired result)

When resolving a token on a non-authoritative server (a non-leader in the ACL DC, or any server outside the ACL DC), tokens are refreshed on the critical path.

If the connection to the ACL datacenter is routinely slow (high latency) or unexpectedly slow (a WAN issue), this dramatically increases the 99th-percentile latency of Consul reads.
Here is an example of the metrics we collected during an outage between va1 (a remote datacenter) and the ACL datacenter:
https://snapshot.raintank.io/dashboard/snapshot/bt6hf74mgL8o7176bIgMK6TaREFX14yy?refresh=15m&orgId=2
(the outage first occurred at 22:36 UTC and then a second time a few hours later)

https://github.com/hashicorp/consul/blob/v0.9.3/agent/consul/acl.go#L168

It looks somewhat connected to #3111 and possibly #2604.

@kamaradclimber
Contributor Author

@slackpad would you have input on this issue?

@slackpad slackpad added type/enhancement Proposed improvement or new feature theme/federation-usability Anything related to Federation labels Oct 17, 2017
@slackpad slackpad added this to the Unplanned milestone Oct 17, 2017
@slackpad
Contributor

Hi @kamaradclimber, with the WAN soft fail, if the ACL DC is totally offline (attempts to request from each of the servers have failed) then we will short-circuit and fail right away, but you are right that if things are flappy there's a middle ground where things can get very slow.

Need to think on this a little more, but since ACLs are in these critical places, we may want to put some special controls on them, possibly up to using a context with a configurable deadline here, so you can set a kind of SLA on how long you want to spend trying to enforce ACLs before falling back to your down policy.

@kamaradclimber
Contributor Author

Thanks @slackpad for the answer.

@kamaradclimber
Contributor Author

Hello @slackpad,
would it make sense to take the ACL refresh off the critical path and move it into a background job?
This would solve this issue and might also improve performance.

@kamaradclimber
Contributor Author

@slackpad any thought on the previous suggestion?

Having a control on the maximum time to fetch ACLs while keeping the fetch on the critical path will still hurt read performance (even though it will be capped at a configurable maximum).
Moving the refresh out of the critical path would ensure a constant read time, independent of the latency to the ACL datacenter.

@kamaradclimber
Contributor Author

I'd be glad to have feedback on the suggested solution (maybe by @kyhavlov ?)

@kamaradclimber
Contributor Author

Hello @banks,
would you have feedback on that issue? It impacts us every time we see high latency between our datacenters lasting more than x minutes.

@banks
Member

banks commented Jun 14, 2018

No specific feedback for now other than we are going to be looking at ACLs in a few ways soon and this is good to have in mind.

pierresouchay added a commit to pierresouchay/consul that referenced this issue Jul 1, 2018
It will allow the following:

 * when connectivity is limited (saturated links between DCs), only a
   single request to refresh an ACL will be sent to the ACL master DC
   instead of stacking ACL refresh queries
 * when extend-cache is used for ACLs, do not wait for the result, but
   refresh the ACL asynchronously, so the delay does not impact the
   slave DC
 * when extend-cache is not used, keep the existing blocking mechanism,
   but only send a single refresh request.

This will fix hashicorp#3524