
ACLs are refreshed on request critical path #3524

Closed
kamaradclimber opened this issue Oct 3, 2017 · 8 comments · Fixed by #4303
Labels
theme/federation-usability Anything related to Federation type/enhancement Proposed improvement or new feature
Milestone

Comments

@kamaradclimber
Contributor

consul version for both Client and Server

Client: 0.9.3
Server: 0.9.3

consul info for both Client and Server

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 3
build:
	prerelease = 
	revision = 112c060
	version = 0.9.3
consul:
	bootstrap = false
	known_datacenters = 7
	leader = true
	leader_addr = 10.71.0.187:8300
	server = true
raft:
	applied_index = 286708491
	commit_index = 286708491
	fsm_pending = 0
	last_contact = 0
	last_log_index = 286708491
	last_log_term = 462759
	last_snapshot_index = 286703820
	last_snapshot_term = 462759
	latest_configuration = [{Suffrage:Voter ID:10.71.0.173:8300 Address:10.71.0.173:8300} {Suffrage:Voter ID:10.71.0.171:8300 Address:10.71.0.171:8300} {Suffrage:Voter ID:10.71.0.187:8300 Address:10.71.0.187:8300}]
	latest_configuration_index = 1
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 462759
runtime:
	arch = amd64
	cpu_count = 24
	goroutines = 21482
	max_procs = 23
	os = linux
	version = go1.9
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 3017
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 4
	member_time = 1093124
	members = 4449
	query_queue = 0
	query_time = 1939
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 66379
	members = 21
	query_queue = 0
	query_time = 1

Operating system and Environment details

  • CentOS 7.3.1611

Description of the Issue (and unexpected/desired result)

When resolving a token on a non-authoritative server (a non-leader in the ACL DC, or any server outside the ACL DC), tokens are refreshed on the critical path.

If the connection to the ACL datacenter is routinely slow (high latency) or unexpectedly slow (a WAN issue), this dramatically increases the 99th-percentile latency of Consul reads.
Here is an example of the metrics we collected during an outage between va1 (a remote datacenter) and the ACL datacenter:
https://snapshot.raintank.io/dashboard/snapshot/bt6hf74mgL8o7176bIgMK6TaREFX14yy?refresh=15m&orgId=2
(the outage first occurred at 22:36 UTC and then a second time a few hours later)

https://github.com/hashicorp/consul/blob/v0.9.3/agent/consul/acl.go#L168

It looks somewhat connected to #3111 and possibly #2604.

@kamaradclimber
Contributor Author

@slackpad would you have input on this issue?

@slackpad slackpad added type/enhancement Proposed improvement or new feature theme/federation-usability Anything related to Federation labels Oct 17, 2017
@slackpad slackpad added this to the Unplanned milestone Oct 17, 2017
@slackpad
Contributor

Hi @kamaradclimber, with the WAN soft fail, if the ACL DC is totally offline (attempts to request from each of the servers have failed) then we will short-circuit and fail right away, but you are right that if things are flappy there's a middle ground where things can get very slow.

Need to think on this a little more, but since ACLs are in these critical places, we may want to put some special controls on them, possibly up to using a context with a configurable deadline here, so you can set a kind of SLA on how long you want to spend trying to enforce ACLs before falling back to your down policy.

@kamaradclimber
Contributor Author

Thanks @slackpad for the answer.

@kamaradclimber
Contributor Author

Hello @slackpad,
would it make sense to take the ACL refresh off the critical path and move it into a background job?
This would solve this issue and might also improve performance.

@kamaradclimber
Contributor Author

@slackpad any thought on the previous suggestion?

Having a control on the maximum time to fetch ACLs while keeping the fetch on the critical path will still hurt read performance (even though it will be capped at a configurable maximum).
Moving the refresh out of the critical path would ensure a constant read time, independent of the latency to the ACL datacenter.

@kamaradclimber
Contributor Author

I'd be glad to have feedback on the suggested solution (maybe by @kyhavlov ?)

@kamaradclimber
Contributor Author

Hello @banks,
would you have feedback on that issue? It impacts us every time we see high latency between our datacenters lasting more than x minutes.

@banks
Member

banks commented Jun 14, 2018

No specific feedback for now other than we are going to be looking at ACLs in a few ways soon and this is good to have in mind.

pierresouchay added a commit to pierresouchay/consul that referenced this issue Jul 1, 2018
It will allow the following:

 * when connectivity is limited (saturated links between DCs), only a
   single request to refresh an ACL will be sent to the ACL master DC
   instead of stacking ACL refresh queries
 * when extend-cache is used for ACLs, do not wait for the result, but
   refresh the ACL asynchronously, so the delay does not impact the
   slave DC
 * when extend-cache is not used, keep the existing blocking mechanism,
   but only send a single refresh request.

This will fix hashicorp#3524