
[WIP] New ACLs #4791

Merged (85 commits, Oct 19, 2018)
Conversation

@mkeeler (Member) commented Oct 13, 2018

This PR is almost a complete rewrite of the ACL system within Consul. It brings the features more in line with other HashiCorp products. Obviously there is quite a bit left to do here, but most of it is related to docs, testing, and finishing the last few commands in the CLI. I will update the PR description and check off the TODOs as I finish them over the next few days/week.

Description

At a high level, this PR mainly splits ACL tokens from policies and separates the concept of authorization from that of identity. Much of the PR simply supports CRUD operations on ACLTokens and ACLPolicies, which in and of themselves are not particularly interesting. The bigger conceptual changes are in how tokens get resolved, how backwards compatibility is handled, and the separation of policy from identity, which could pave the way for alternative identity providers.

On the surface, and with a new cluster, the ACL system will look very similar to Nomad's. Both have tokens and policies. Both have local tokens. The ACL management APIs for both are very similar. I even ripped off Nomad's ACL bootstrap reset procedure. There are a few key differences though.

  • Nomad requires both token and policy replication, whereas Consul only requires policy replication; token replication is opt-in. Note that in Consul, local tokens only work when token replication is enabled.
  • All policies in Nomad are globally applicable. In Consul, all policies are stored and replicated globally but can be scoped to a subset of the datacenters. This allows for more granular access management.
  • Unlike Nomad, Consul has legacy baggage in the form of the original ACL system. The ramifications of this are:
    • A server running the new system must still support other clients using the legacy system.
    • A client running the new system must be able to use the legacy RPCs when the servers in its datacenter are running the legacy system.
    • The primary ACL DC's servers running in legacy mode need to act as a gate that keeps everything else in the entire multi-DC cluster running in legacy mode.

So this PR not only implements the new ACL system but also has a legacy mode built in for when the cluster isn't ready for new ACLs. Detecting that new ACLs can be used is automatic and requires no configuration on the part of administrators. This process is detailed in the "Transitioning from Legacy to New ACL Mode" section below.

Replication

There are 3 supported replication modes:

Legacy Replication
Policy-Only Replication
Policy+Token Replication

All 3 replication modes now use the Limiter struct from the golang.org/x/time/rate package to rate limit how quickly rounds of replication are performed. Only a couple of RPCs are performed each round, and then, when applying replicated data to the local raft store, there is further rate limiting in place to keep from overwhelming the system.
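As a rough illustration, the pacing works like the following (a minimal sketch with invented names, not the actual replication loop; the one-round-per-second limit is made up):

```go
package replication

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// runRounds paces a replication routine with a rate.Limiter: each round
// must acquire a token from the limiter first, which caps how quickly
// rounds can be performed.
func runRounds(ctx context.Context, round func(context.Context) error) {
	// One round per second with no burst headroom; the real interval is
	// configuration-dependent and this value is illustrative only.
	limiter := rate.NewLimiter(rate.Every(time.Second), 1)
	for {
		if err := limiter.Wait(ctx); err != nil {
			return // context cancelled, e.g. leadership revoked
		}
		if err := round(ctx); err != nil {
			continue // a real implementation would log and back off here
		}
	}
}
```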

Legacy Replication

The actual procedure for replicating legacy tokens is mostly untouched. A few modifications were made to make it context aware and to make the replication routine a leader-only activity, similar to other leader-only activities like CA root pruning. The routine is started when a node gains leadership and stopped when that leadership is revoked.
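A minimal sketch of that lifecycle (invented names; the real wiring lives in the leader code):

```go
package consul

import (
	"context"
	"sync"
)

// leaderRoutine sketches the start/stop lifecycle described above: the
// routine is launched when leadership is gained and its context is
// cancelled when leadership is revoked.
type leaderRoutine struct {
	mu     sync.Mutex
	cancel context.CancelFunc
}

func (l *leaderRoutine) start(run func(context.Context)) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cancel != nil {
		return // already running
	}
	ctx, cancel := context.WithCancel(context.Background())
	l.cancel = cancel
	go run(ctx)
}

func (l *leaderRoutine) stop() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cancel != nil {
		l.cancel()
		l.cancel = nil
	}
}
```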

Policy Replication

The basic procedure here is to get a policy listing from the remote end, diff it against the policies in the local store, fetch any policies that have changed, and then apply the diff in 1+ raft transactions. Deletions are batched based on the number of policy IDs that can fit into the desired log size. Upserts are batched based on the policies' estimated size. In the normal case it is not expected to take more than one batch, except when doing the initial replication in a cluster that is a heavy ACL user.

So overall it takes 2 remote RPCs and 1+ raftApply calls to do one round of policy replication. The RPCs are ACL.PolicyList and ACL.PolicyBatchRead.
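In sketch form, one round looks roughly like this (the helper names are invented and the real RPC and raft plumbing is elided; the real code also splits the apply into 1+ batches as described above):

```go
package consul

import "context"

// Sketched stand-ins for the real types; only what the diff needs.
type PolicyStub struct {
	ID          string
	ModifyIndex uint64
}

type Policy struct {
	ID string
	// rules, datacenters, hash, etc. elided
}

// replicatePoliciesOnce sketches one round: list remote policies (RPC 1),
// diff against local state, batch-fetch anything changed (RPC 2), then
// apply deletions and upserts via raft.
func replicatePoliciesOnce(
	ctx context.Context,
	listRemote func(context.Context) ([]PolicyStub, error), // ACL.PolicyList
	listLocal func() ([]PolicyStub, error),
	fetchRemote func(context.Context, []string) ([]*Policy, error), // ACL.PolicyBatchRead
	raftApply func(deletions []string, upserts []*Policy) error,
) error {
	remote, err := listRemote(ctx)
	if err != nil {
		return err
	}
	local, err := listLocal()
	if err != nil {
		return err
	}

	remoteIdx := make(map[string]uint64, len(remote))
	for _, p := range remote {
		remoteIdx[p.ID] = p.ModifyIndex
	}

	// Local-only policies get deleted; missing or stale ones get fetched.
	var deletions, changed []string
	localIdx := make(map[string]uint64, len(local))
	for _, p := range local {
		localIdx[p.ID] = p.ModifyIndex
		if _, ok := remoteIdx[p.ID]; !ok {
			deletions = append(deletions, p.ID)
		}
	}
	for _, p := range remote {
		if idx, ok := localIdx[p.ID]; !ok || idx < p.ModifyIndex {
			changed = append(changed, p.ID)
		}
	}

	upserts, err := fetchRemote(ctx, changed)
	if err != nil {
		return err
	}
	return raftApply(deletions, upserts)
}
```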

Token Replication

Token replication happens the same way as policy replication. One RPC is made to get the token list. The tokens are diffed against the non-local tokens in the local store, and any missing/out-of-date ones are fetched with a second RPC. Similar to policies, tokens are deleted in batches sized by the number of token IDs being deleted, and upserts are batched based on the estimated size of the resulting log.

Token Resolution

Prior to this PR, Consul would always try to make a cross-DC request to the ACL datacenter to resolve a token to an effective policy. Only when that failed, and the cache was stale, would it use its locally replicated data to provide the value.

This PR changes that and brings it in line with Nomad's ACL replication. Policies are only ever resolved remotely before policy replication has started and replicated the first set of policies. Tokens are only ever resolved remotely when token replication is off (in which case we cache them similarly to legacy mode) or before token replication has replicated anything.

When resolving remotely (always for clients, or under the conditions specified above for servers), the ACL down policy is still respected, including the more recent addition of the async-cache mode.

Another big change is that servers no longer have a non-authoritative cache. There will not be any server-to-server RPCs for token resolution within a single DC. This is how Nomad does it, and it greatly simplifies the overall architecture.
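The resulting decision, sketched as a predicate (names invented):

```go
package consul

// resolveLocally sketches the decision described above: servers prefer
// their locally replicated data; remote resolution only happens before
// replication has produced anything, or when token replication is
// disabled entirely.
func resolveLocally(isServer, tokenReplicationEnabled, haveReplicatedTokens bool) bool {
	if !isServer {
		return false // clients always resolve through their servers
	}
	if !tokenReplicationEnabled {
		return false // resolve remotely and cache, much like legacy mode
	}
	return haveReplicatedTokens // remote only until the first round lands
}
```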

ACL Package

The top-level acl package contains generic runtime ACL code. It does not deal with caching or persistence of ACL-related things. Its only use is to provide a way to parse rules into policies and to compile multiple policies into a single Authorizer.

The big differences in this package are:

  1. The cache was removed. This package knows nothing about ACL tokens (only policy rules) and therefore was no longer suitable for caching, as it doesn't have enough internal knowledge.
  2. A policy will now parse both the old and the new syntax into the policy structure.
  3. Creating an Authorizer (formerly an ACL) now takes a list of policies instead of a single Policy. These are merged taking rule precedence into account.
  4. Both exact and prefix matches are stored in a single radix tree for each enforcement area. This means that the lookup procedure to determine the effective access for any given segment is a little more complicated. We can no longer call LongestPrefix on the radix tree and get back a policy. Instead we have to walk the tree from the root down to the segment, all the while tracking the longest matching prefix while also looking for an exact match. The exact match then takes precedence over any prefix match.
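A minimal sketch of that lookup, assuming the pre-generics hashicorp/go-immutable-radix API (which Consul already pulls in via go-memdb); the value type and rules here are invented:

```go
package main

import (
	"fmt"

	iradix "github.com/hashicorp/go-immutable-radix"
)

// aclEntry is a hypothetical value type recording whether a rule at this
// key was written as an exact match, a prefix match, or both.
type aclEntry struct {
	exactAllow  *bool // nil if no exact rule at this key
	prefixAllow *bool // nil if no prefix rule at this key
}

// allowed walks the tree from the root down to `segment`, tracking the
// longest matching prefix rule, and lets an exact match on the full
// segment override any prefix rule.
func allowed(tree *iradix.Tree, segment string) bool {
	var prefixDecision, exactDecision *bool

	tree.Root().WalkPath([]byte(segment), func(k []byte, v interface{}) bool {
		entry := v.(*aclEntry)
		if entry.prefixAllow != nil {
			// Deeper nodes on the path are visited later, so this always
			// ends up holding the longest matching prefix rule.
			prefixDecision = entry.prefixAllow
		}
		if string(k) == segment && entry.exactAllow != nil {
			exactDecision = entry.exactAllow
		}
		return false // keep walking to the end of the path
	})

	if exactDecision != nil {
		return *exactDecision // exact match takes precedence
	}
	if prefixDecision != nil {
		return *prefixDecision
	}
	return false // fall back to the default policy (deny here)
}

func main() {
	allow, deny := true, false
	t := iradix.New()
	t, _, _ = t.Insert([]byte("foo"), &aclEntry{prefixAllow: &allow})
	t, _, _ = t.Insert([]byte("foobar"), &aclEntry{exactAllow: &deny})

	fmt.Println(allowed(t, "foobar")) // false: exact deny beats prefix allow
	fmt.Println(allowed(t, "foobaz")) // true: longest prefix "foo" allows
}
```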

CLI

I am not 100% sold on the UX of the CLI after implementing and using it. In particular, updating tokens and policies via the CLI could use a little more thought: in many scenarios it's difficult to unset a field, e.g. to remove all datacenter restrictions from a policy. Also, some things didn't work out how I expected regarding positional arguments. For example, this is not possible:

consul acl policy create my-agent-policy -rules @rules.hcl -datacenter dc1

The golang flag parser will not parse flags after positional arguments, so we can't do that. One thing I have done is to try and avoid positional arguments altogether.
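A quick, self-contained demonstration of the stdlib behavior (not Consul code):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Mirrors `consul acl policy create my-agent-policy -rules @rules.hcl`:
	// the stdlib flag package stops parsing at the first non-flag argument,
	// so everything after "my-agent-policy" is left as a positional arg.
	fs := flag.NewFlagSet("policy create", flag.ContinueOnError)
	rules := fs.String("rules", "", "policy rules")

	_ = fs.Parse([]string{"my-agent-policy", "-rules", "@rules.hcl"})

	fmt.Printf("rules=%q args=%v\n", *rules, fs.Args())
	// Prints: rules="" args=[my-agent-policy -rules @rules.hcl]
}
```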

Transitioning from Legacy to New ACL Mode

One important piece of all of this is that it was designed so that systems with new ACLs can coexist with legacy systems. One ramification is that when bootstrapping a brand new cluster there is a short window where all systems have to figure out that new ACLs can be used (we don't want to be issuing RPCs to systems that don't support those endpoints, or applying things in Raft that other servers will error on). This is accomplished via the "acls" serf tag. When systems come up with ACLs enabled, this tag is set to a value indicating that the system is running in legacy mode.

The leader in the ACL/Primary DC is the first one that can start the transition, and it will do so once all the servers within the ACL/Primary DC are advertising legacy ACL support (legacy systems don't advertise anything, so when we encounter one of those we prevent the transition). Once the leader transitions to the new mode, the other members in the primary DC will pick up this event and transition themselves. Once all servers in the primary DC have transitioned, the leaders in the other DCs will see this and transition too (if all servers within their DC are capable).
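A sketch of the gating check (illustrative only; the real tag values and member filtering differ):

```go
package consul

import "github.com/hashicorp/serf/serf"

// allServersSupportNewACLs sketches the gate described above: legacy
// servers don't advertise the "acls" serf tag at all, so any server
// member missing the tag keeps the datacenter in legacy mode.
func allServersSupportNewACLs(members []serf.Member) bool {
	for _, m := range members {
		if m.Tags["role"] != "consul" {
			continue // clients don't gate the server transition
		}
		if _, ok := m.Tags["acls"]; !ok {
			return false // legacy server: hold everything in legacy mode
		}
	}
	return true
}
```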

Legacy Token Auto-Upgrade

All new tokens have an accessor ID, and the concept of a token type is gone. So when acquiring leadership, in addition to starting ACL replication, the node will also start a token auto-upgrade procedure. This will (in a rate-limited fashion) assign accessor IDs to all tokens and will assign the global-management policy to all legacy management tokens.

Legacy client tokens cannot be fully auto-upgraded without user intervention, as we do not want to create one policy per token. Upgrading an individual token is simply a matter of assigning one or more policies to it and discarding the rules. This can be done via the API or CLI, but it is a manual task, as Consul cannot know which policies should apply to a particular token.
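For example, upgrading one legacy token might look like the following (a hypothetical invocation; the exact flag names were still being finalized in this PR):

consul acl token update -id <accessor-id> -policy-name <existing-policy>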

Changes

  • Revendored go-memdb updates. Most of the changes were docs, but it did pull in the FilterIterator, which is what I wanted to use.
  • Revendored hcl to pull in support for outputting HCL in text form. This was needed for the rule translations.
  • What used to be called an ACL in the old acl package is now an Authorizer. This is much more descriptive of its purpose and helps reduce name overloading.
  • Previously we had an agent ACL cache (non-authoritative) and a server ACL cache that was authoritative. All the caching functionality and token/policy resolution capabilities have been moved into agent/consul/acl.go, mainly into the ACLResolver struct.
  • Integrated ACL replication into the leader code. It now gets started/stopped like the various other leader-specific routines.
  • Moved all ACL-related configuration (except acl_datacenter) into an acl stanza within the config (see the sketch after this list). If those items are not present, Consul will still fall back to the legacy config items and port them to the proper place within the runtime config.
    TODO - many more things have changed.
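For reference, the new stanza shape looks roughly like this (a sketch; the field names may not match the final docs exactly):

```hcl
acl {
  enabled                  = true
  default_policy           = "deny"
  down_policy              = "extend-cache"
  enable_token_replication = true

  tokens {
    agent       = "<agent token>"
    replication = "<replication token>"
  }
}
```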

TODOS

  • Rebase to master
  • Pull in primary datacenter code and make sure it works with the ACL configuration.
  • Finish the CLI
    • Implement consul acl token create
    • Implement consul acl token list
    • Implement consul acl token read
    • Implement consul acl token update
    • Implement consul acl token delete
  • Review all my TODOs in the code, most of which are related to verifying that the values of some consts are sane.
  • Documentation
    • API documentation for the new HTTP endpoints
    • CLI command documentation for all the new commands
    • Config documentation for the new/deprecated entries.
    • Rewrite of our ACL guide (just focusing on new ACLs)
    • ACL upgrade guide (how to transition from the old -> new system)
    • Telemetry docs (replication entries and various state store/raft ones too)
  • Unit Tests
    • Top level ACL package
    • HTTP ACL API
      • Legacy Endpoint Test Fixes
      • Policy CRUD
      • Token CRUD
    • API package
    • RPC ACL API
      • General CRUD Operations
      • Cannot delete anon token
      • Can update anon token policies
      • Cannot delete global management policy
      • Cannot change global management policy rules
      • Can rename global-management policy
    • ACL State (including Snapshotting)
    • ACL FSM (including Snapshotting)
    • Bootstrap resetting
    • ACL CLI
    • ACLResolver within the agent/consul package.
      • Local Policies
      • Local Tokens
      • Remote Policies
      • Remote Tokens
      • Legacy Mode Resolution
      • Default policies
      • Down Policies
      • Cache Expiration
      • Token Secret Redaction
    • ACL Replication
      • Policy Replication
      • Token Replication
      • Legacy Replication
    • ACL structures and supporting functions within the agent/structs package
      • New Legacy Tests
      • ACLToken tests
      • ACLPolicy tests
      • ACLCaches tests
    • Leader ACL initialization
    • Leader legacy token auto-upgrade (this will be tough to "unit" test)
  • Integration Tests
    • Legacy Vault Integration
    • Legacy UI/Agent with new servers
    • Single Datacenter Tests
      • New Cluster - all nodes running new code
      • Upgraded Servers - Servers w/ new ACLs, clients running 1.3.0
      • Upgraded Clients - Clients w/ new ACLs, servers running 1.3.0
      • All Upgraded - ACLs were previously bootstrapped on 1.3.0
      • Partial upgrade, new leader - some servers running new ACLs, including the leader, alongside other legacy ACL servers (should use legacy ACLs all around)
      • Partial upgrade, old leader - some servers running new ACLs, with others, including the leader, using legacy ACLs
    • Multi Datacenter ACLs - all tests should be run with new ACLs in secondary datacenters with policy-only and policy+token replication.
      • Primary DC Legacy + Secondary DC Upgraded
        • Verify legacy ACL replication within new system still works.
      • Primary DC Upgrade + Secondary DC Legacy
      • Primary DC Upgrade + Secondary DC Upgraded + Secondary DC legacy clients
      • Local Tokens within a secondary DC do not show up in the primary.
    • Backwards Compatibility
      • Legacy tokens still usable
      • Legacy tokens visible via the new APIs

Follow On Work

Any of these that don't get done will be turned into GitHub issues after this is merged.

  • Make ACL cache limits configurable
  • Make ACL mode check min/max intervals configurable
  • Maybe optimize the ACL mode transitioning to listen for serf member update events instead of periodically iterating through the entire member set.
  • Check the last contact time for ACL replication RPCs to ensure the data isn't too stale, and re-issue the request as non-stale or to a different server if it is.

@mkeeler mkeeler requested review from banks and kyhavlov October 13, 2018 20:57
@mkeeler mkeeler changed the base branch from master to release/1.4-staging October 15, 2018 13:15
@mkeeler mkeeler requested a review from a team October 15, 2018 13:15
@banks (Member) left a comment

Partial review of the new acl and policy stuff so far - looking great.

A couple of questions/minor things inline. Will look again soon at the rest.

Resolved review threads: acl/acl.go (×3). Outdated thread on acl/policy.go:

```go
func multiPolicyID(policies []*Policy) uint64 {
	var id uint64
	hasher := fnv.New64a()
	// ... (excerpt truncated)
```
Member:

FNV is not a crypto hash function. I assume this is OK but maybe we should add a comment here to state that it's never expected that the ID returned be used as a secret?

If the individual hash IDs are random with a crypto source then this might be OK in practice, but we should still use a crypto hash if we ever expect the output hash to be used as a secret/unguessable/unforgeable value.

Member Author:

Hmm, originally I used fnv because I wasn't going to use it for crypto purposes. But I just realized that it is being exported as a field for read-only token operations that don't currently expose the secret, and therefore it should probably be a crypto hash.

This individual case is not a problem but there are others that could be.

@pearkes pearkes added this to the 1.4.0 milestone Oct 15, 2018
@banks (Member) left a comment

A bunch of assorted stuff in here. Some minor nits, some discussion points, but we can punt on them. Some I think are worth changing now, but I don't want to block you if I don't get another chance to go deep.

I certainly glossed over some of it - particularly the new CLI commands - so if someone else can look deeply at those, that would be great.

Resolved review threads: agent/structs/acl.go, agent/acl_endpoint.go, agent/acl_endpoint_legacy.go; outdated: agent/config/config.go (×2).
```
@@ -346,6 +353,26 @@ func (s *snapshot) persistIntentions(sink raft.SnapshotSink,
	return nil
}

func (s *snapshot) persistIndex(sink raft.SnapshotSink, encoder *codec.Encoder) error {
```
Member:

Did we really not persist these before? I guess most of them are re-populated when we restore from a snapshot anyway, by the last transaction to write to a table. So it only matters in the special cases where we use the index table as primary storage for something.

Member Author:
Yeah, we really didn't. I changed how we mark whether ACL bootstrapping has been done. Now (like Nomad) it stores the raft index of when it was done in the index table.

And yes, most of it is repopulated.

Member Author:

We could store/persist a whole other table just to hold the single value of the bootstrap index, which is kind of what we used to do. It seemed like the index table is the right place to store index information, though.

Other review threads: agent/structs/acl.go (outdated), lib/uuid.go (outdated), lib/stop_context.go (resolved), lib/atomic_bool.go (outdated).
@pearkes pearkes mentioned this pull request Oct 18, 2018
mkeeler pushed a commit that referenced this pull request Oct 18, 2018
This just fills in test coverage for CRUD operations against the new ACL RPC endpoints.

1 test is failing, `TestACLEndpoint_PolicyResolve`, but I'm not sure whether it's an underlying issue in token assignment.

Still some TODOs in testing the various cases outlined in #4791, but I figured I'd put this up now.
mkeeler pushed a commit that referenced this pull request Oct 18, 2018
@pearkes pearkes force-pushed the release/1.4-staging branch from b7976c1 to 7d89e51 Compare October 19, 2018 15:46
Rebecca Zanzig and others added 11 commits October 19, 2018 12:02
* Update docs to include multiple tag support

* Sort tags before using them in metrics

This addresses the potential proliferation of metrics if a query of
"?tag=foo&tag=bar" is treated differently than "?tag=bar&tag=foo".
Now, tags are always sorted before being recorded, making these two
emit the same metric.

* Add caveat about multiple tags returned by the metrics endpoint
Things that are not present yet:

- API package updates for the new HTTP endpoints
- Backwards compatibility.
- CLI
- Local Tokens

New ACL System

Things that are not present yet:

- API package updates for the new HTTP endpoints
- Backwards compatibility.
- CLI
- Local Tokens
The new ACL system can be used when

1. All servers in the ACL datacenter are advertising support for the new ACLs.
2. All servers in the local datacenter are on version 1.4.0+

In each DC, the leader of the cluster has to start advertising new ACL support before the other servers will transition.

All servers and clients start up in legacy mode until we know that new ACLs are supported.
The RPC layer will use the new token functions and then do result conversion.
Additionally, a few other bugs have been worked out.
mkeeler and others added 24 commits October 19, 2018 12:03
Still need to implement the legacy mode tests.
Also implement acl.Policy rule conversion for when we are returning/retrieving a policy via the legacy ACL.GetPolicy RPC endpoint.
Make it able to manage intentions
This just fills in test coverage for CRUD operations against the new ACL RPC endpoints.

1 test is failing, `TestACLEndpoint_PolicyResolve`, but I'm not sure whether it's an underlying issue in token assignment.

Still some TODOs in testing the various cases outlined in #4791, but I figured I'd put this up now.
* command/policy: simple create tests

* command/policy:  simple update tests

* command/policy: simple list tests

* command/policy: remove extra NoError

* command/policy: simple read tests

* command/policy: simple delete tests

* command/token: simple read tests

* command/token: simple list tests

* command/token: simple create tests

* command/token: fix some comments

* command/token: simple delete tests

* command/token: add simple update tests (failing)

* command/translate: add simple tests

* command/bootstrap: simple tests

* command/agenttokens: simple tests