
[WIP] New ACLs #4791

Merged (85 commits, Oct 19, 2018)
Conversation

@mkeeler (Member) commented Oct 13, 2018

This PR is almost a complete rewrite of the ACL system within Consul. It brings the features more in line with other HashiCorp products. Obviously there is quite a bit left to do here, but most of it is related to docs, testing, and finishing the last few commands in the CLI. I will update the PR description and check off the TODOs as I finish them over the next few days/week.

Description

At a high level, this PR mainly splits ACL tokens from policies and separates the concept of authorization from that of identity. Much of the PR simply supports CRUD operations on ACLTokens and ACLPolicies, which in and of themselves are not particularly interesting. The bigger conceptual changes are in how tokens get resolved, how backwards compatibility is handled, and the separation of policy from identity, which could pave the way for alternative identity providers.

On the surface, and with a new cluster, the ACL system will look very similar to Nomad's. Both have tokens and policies. Both have local tokens. The ACL management APIs for both are very similar. I even ripped off Nomad's ACL bootstrap reset procedure. There are a few key differences though.

  • Nomad requires both token and policy replication, whereas Consul only requires policy replication; token replication is opt-in. Note that in Consul, local tokens only work when token replication is enabled.
  • All policies in Nomad are globally applicable. In Consul, all policies are stored and replicated globally but can be scoped to a subset of the datacenters. This allows for more granular access management.
  • Unlike Nomad, Consul has legacy baggage in the form of the original ACL system. The ramifications of this are:
    • A server running the new system must still support other clients using the legacy system.
    • A client running the new system must be able to use the legacy RPCs when the servers in its datacenter are running the legacy system.
    • The primary ACL DC's servers running in legacy mode need to act as a gate that keeps everything else in the entire multi-DC cluster running in legacy mode.

So this PR not only implements the new ACL system but also has a legacy mode built in for when the cluster isn't ready for new ACLs. Detecting that new ACLs can be used is automatic and requires no configuration on the part of administrators. This process is detailed in the "Transitioning from Legacy to New ACL Mode" section below.

Replication

There are 3 supported replication modes:

Legacy Replication
Policy-Only Replication
Policy+Token Replication

All 3 replication modes now use the Limiter struct from the golang.org/x/time/rate package to rate limit how quickly rounds of replication are performed. Only a couple of RPCs are performed each round, and then, when applying replicated data to the local raft store, there is further rate limiting in place to keep from overwhelming the system.
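As a rough illustration, the pacing works like the following (a minimal sketch with invented names, not the actual replication loop; the one-round-per-second limit is made up):

```go
package replication

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

// runRounds paces a replication routine with a rate.Limiter: each round
// must acquire a token from the limiter first, which caps how quickly
// rounds can be performed.
func runRounds(ctx context.Context, round func(context.Context) error) {
	// One round per second with no burst headroom; the real interval is
	// configuration-dependent and this value is illustrative only.
	limiter := rate.NewLimiter(rate.Every(time.Second), 1)
	for {
		if err := limiter.Wait(ctx); err != nil {
			return // context cancelled, e.g. leadership revoked
		}
		if err := round(ctx); err != nil {
			continue // a real implementation would log and back off here
		}
	}
}
```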

Legacy Replication

The actual procedure for replicating legacy tokens is mostly untouched. A few modifications were made to make it context aware and to make the replication routine a leader-only activity, similar to other leader-only activities like CA root pruning. The routine is started when a node gains leadership and stopped when that leadership is revoked.
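A minimal sketch of that lifecycle (invented names; the real wiring lives in the leader code):

```go
package consul

import (
	"context"
	"sync"
)

// leaderRoutine sketches the start/stop lifecycle described above: the
// routine is launched when leadership is gained and its context is
// cancelled when leadership is revoked.
type leaderRoutine struct {
	mu     sync.Mutex
	cancel context.CancelFunc
}

func (l *leaderRoutine) start(run func(context.Context)) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cancel != nil {
		return // already running
	}
	ctx, cancel := context.WithCancel(context.Background())
	l.cancel = cancel
	go run(ctx)
}

func (l *leaderRoutine) stop() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.cancel != nil {
		l.cancel()
		l.cancel = nil
	}
}
```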

Policy Replication

The basic procedure here is to get a policy listing from the remote end, diff it against the policies in the local store, fetch any policies that have changed, and then apply the diff in 1+ raft transactions. Deletions are batched based on the number of policy IDs that can fit into the desired log size. Upserts are batched based on the policies' estimated size. In the normal case it is not expected to take more than one batch, except when doing the initial replication in a cluster that is a heavy ACL user.

So overall it takes 2 remote RPCs and 1+ raftApply calls to do one round of policy replication. The RPCs are ACL.PolicyList and ACL.PolicyBatchRead.
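In sketch form, one round looks roughly like this (the helper names are invented and the real RPC and raft plumbing is elided; the real code also splits the apply into 1+ batches as described above):

```go
package consul

import "context"

// Sketched stand-ins for the real types; only what the diff needs.
type PolicyStub struct {
	ID          string
	ModifyIndex uint64
}

type Policy struct {
	ID string
	// rules, datacenters, hash, etc. elided
}

// replicatePoliciesOnce sketches one round: list remote policies (RPC 1),
// diff against local state, batch-fetch anything changed (RPC 2), then
// apply deletions and upserts via raft.
func replicatePoliciesOnce(
	ctx context.Context,
	listRemote func(context.Context) ([]PolicyStub, error), // ACL.PolicyList
	listLocal func() ([]PolicyStub, error),
	fetchRemote func(context.Context, []string) ([]*Policy, error), // ACL.PolicyBatchRead
	raftApply func(deletions []string, upserts []*Policy) error,
) error {
	remote, err := listRemote(ctx)
	if err != nil {
		return err
	}
	local, err := listLocal()
	if err != nil {
		return err
	}

	remoteIdx := make(map[string]uint64, len(remote))
	for _, p := range remote {
		remoteIdx[p.ID] = p.ModifyIndex
	}

	// Local-only policies get deleted; missing or stale ones get fetched.
	var deletions, changed []string
	localIdx := make(map[string]uint64, len(local))
	for _, p := range local {
		localIdx[p.ID] = p.ModifyIndex
		if _, ok := remoteIdx[p.ID]; !ok {
			deletions = append(deletions, p.ID)
		}
	}
	for _, p := range remote {
		if idx, ok := localIdx[p.ID]; !ok || idx < p.ModifyIndex {
			changed = append(changed, p.ID)
		}
	}

	upserts, err := fetchRemote(ctx, changed)
	if err != nil {
		return err
	}
	return raftApply(deletions, upserts)
}
```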

Token Replication

Token replication happens the same way as policy replication. One RPC is made to get the token list. The tokens are diffed against the non-local tokens in the local store, and any missing/out-of-date ones are fetched with a second RPC. Similar to policies, tokens are deleted in batches sized by the number of token IDs being deleted, and upserts are batched based on the estimated size of the resulting log.

Token Resolution

Prior to this PR, Consul would always try to make a cross-DC request to the ACL datacenter to resolve a token to an effective policy. Only when that failed, and the cache was stale, would it use its locally replicated data to provide the value.

This PR changes that and brings it in line with Nomad's ACL replication. Policies are only ever resolved remotely before policy replication has started and replicated the first set of policies. Tokens are only ever resolved remotely when token replication is off (in which case we cache them similarly to legacy mode) or before token replication has replicated anything.

When resolving remotely (always for clients, or under the conditions specified above for servers), the ACL down policy is still respected, including the more recent addition of the async-cache mode.

Another big change is that servers no longer have a non-authoritative cache. There will not be any server-to-server RPCs for token resolution within a single DC. This is how Nomad does it, and it greatly simplifies the overall architecture.
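The resulting decision, sketched as a predicate (names invented):

```go
package consul

// resolveLocally sketches the decision described above: servers prefer
// their locally replicated data; remote resolution only happens before
// replication has produced anything, or when token replication is
// disabled entirely.
func resolveLocally(isServer, tokenReplicationEnabled, haveReplicatedTokens bool) bool {
	if !isServer {
		return false // clients always resolve through their servers
	}
	if !tokenReplicationEnabled {
		return false // resolve remotely and cache, much like legacy mode
	}
	return haveReplicatedTokens // remote only until the first round lands
}
```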

ACL Package

The top-level acl package contains generic runtime ACL code. It does not deal with caching or persistence of ACL-related things. Its only use is to provide a way to parse rules into policies and to compile multiple policies into a single Authorizer.

The big differences in this package are:

  1. The cache was removed. This package knows nothing about ACL tokens (only policy rules) and therefore was no longer suitable for caching, as it doesn't have enough internal knowledge.
  2. A policy will now parse both the old and the new syntax into the policy structure.
  3. Creating an Authorizer (formerly an ACL) now takes a list of policies instead of a single Policy. These are merged taking rule precedence into account.
  4. Both exact and prefix matches are stored in a single radix tree for each enforcement area. This means that the lookup procedure to determine the effective access for any given segment is a little more complicated. We can no longer call LongestPrefix on the radix tree and get back a policy. Instead we have to walk the tree from the root down to the segment, all the while tracking the longest matching prefix while also looking for an exact match. The exact match then takes precedence over any prefix match.
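A minimal sketch of that lookup, assuming the pre-generics hashicorp/go-immutable-radix API (which Consul already pulls in via go-memdb); the value type and rules here are invented:

```go
package main

import (
	"fmt"

	iradix "github.com/hashicorp/go-immutable-radix"
)

// aclEntry is a hypothetical value type recording whether a rule at this
// key was written as an exact match, a prefix match, or both.
type aclEntry struct {
	exactAllow  *bool // nil if no exact rule at this key
	prefixAllow *bool // nil if no prefix rule at this key
}

// allowed walks the tree from the root down to `segment`, tracking the
// longest matching prefix rule, and lets an exact match on the full
// segment override any prefix rule.
func allowed(tree *iradix.Tree, segment string) bool {
	var prefixDecision, exactDecision *bool

	tree.Root().WalkPath([]byte(segment), func(k []byte, v interface{}) bool {
		entry := v.(*aclEntry)
		if entry.prefixAllow != nil {
			// Deeper nodes on the path are visited later, so this always
			// ends up holding the longest matching prefix rule.
			prefixDecision = entry.prefixAllow
		}
		if string(k) == segment && entry.exactAllow != nil {
			exactDecision = entry.exactAllow
		}
		return false // keep walking to the end of the path
	})

	if exactDecision != nil {
		return *exactDecision // exact match takes precedence
	}
	if prefixDecision != nil {
		return *prefixDecision
	}
	return false // fall back to the default policy (deny here)
}

func main() {
	allow, deny := true, false
	t := iradix.New()
	t, _, _ = t.Insert([]byte("foo"), &aclEntry{prefixAllow: &allow})
	t, _, _ = t.Insert([]byte("foobar"), &aclEntry{exactAllow: &deny})

	fmt.Println(allowed(t, "foobar")) // false: exact deny beats prefix allow
	fmt.Println(allowed(t, "foobaz")) // true: longest prefix "foo" allows
}
```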

CLI

I am not 100% sold on the UX of the CLI after implementing and using it. In particular, updating tokens and policies via the CLI could use a little more thought: in many scenarios it's difficult to unset a field, e.g. to remove all datacenter restrictions from a policy. Also, some things didn't work out how I expected regarding positional arguments. For example, this is not possible:

consul acl policy create my-agent-policy -rules @rules.hcl -datacenter dc1

The golang flag parser will not parse flags after positional arguments, so we can't do that. One thing I have done is to try and avoid positional arguments altogether.
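A quick, self-contained demonstration of the stdlib behavior (not Consul code):

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Mirrors `consul acl policy create my-agent-policy -rules @rules.hcl`:
	// the stdlib flag package stops parsing at the first non-flag argument,
	// so everything after "my-agent-policy" is left as a positional arg.
	fs := flag.NewFlagSet("policy create", flag.ContinueOnError)
	rules := fs.String("rules", "", "policy rules")

	_ = fs.Parse([]string{"my-agent-policy", "-rules", "@rules.hcl"})

	fmt.Printf("rules=%q args=%v\n", *rules, fs.Args())
	// Prints: rules="" args=[my-agent-policy -rules @rules.hcl]
}
```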

Transitioning from Legacy to New ACL Mode

One important piece of all of this is that it was designed so that systems with new ACLs can coexist with legacy systems. One ramification is that when bootstrapping a brand new cluster there is a short window where all systems have to figure out that new ACLs can be used (we don't want to be issuing RPCs to systems that don't support those endpoints, or applying things in Raft that other servers will error on). This is accomplished via the "acls" serf tag. When systems come up with ACLs enabled, this tag is set to a value indicating that the system is running in legacy mode.

The leader in the ACL/Primary DC is the first one that can start the transition, and it will do so once all the servers within the ACL/Primary DC are advertising legacy ACL support (legacy systems don't advertise anything, so when we encounter one of those we prevent the transition). Once the leader transitions to the new mode, the other members in the primary DC will pick up this event and transition themselves. Once all servers in the primary DC have transitioned, the leaders in the other DCs will see this and transition too (if all servers within their DC are capable).
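A sketch of the gating check (illustrative only; the real tag values and member filtering differ):

```go
package consul

import "github.com/hashicorp/serf/serf"

// allServersSupportNewACLs sketches the gate described above: legacy
// servers don't advertise the "acls" serf tag at all, so any server
// member missing the tag keeps the datacenter in legacy mode.
func allServersSupportNewACLs(members []serf.Member) bool {
	for _, m := range members {
		if m.Tags["role"] != "consul" {
			continue // clients don't gate the server transition
		}
		if _, ok := m.Tags["acls"]; !ok {
			return false // legacy server: hold everything in legacy mode
		}
	}
	return true
}
```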

Legacy Token Auto-Upgrade

All new tokens have an accessor ID, and the concept of a token type is gone. So when acquiring leadership, in addition to starting ACL replication, the node will also start a token auto-upgrade procedure. This will (in a rate-limited fashion) assign accessor IDs to all tokens and will assign the global-management policy to all legacy management tokens.

Legacy client tokens cannot be fully auto-upgraded without user intervention, as we do not want to create one policy per token. Upgrading an individual token is simply a matter of assigning one or more policies to it and discarding the rules. This can be done via the API or CLI, but it is a manual task, as Consul cannot know which policies should apply to a particular token.
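For example, upgrading one legacy token might look like the following (a hypothetical invocation; the exact flag names were still being finalized in this PR):

consul acl token update -id <accessor-id> -policy-name <existing-policy>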

Changes

  • Revendored go-memdb updates. Most of the changes were docs, but it did pull in the FilterIterator, which is what I wanted to use.
  • Revendored hcl to pull in support for outputting HCL in text form. This was needed for the rule translations.
  • What used to be called an ACL in the old acl package is now an Authorizer. This is much more descriptive of its purpose and helps reduce name overloading.
  • Previously we had an agent ACL cache (non-authoritative) and a server ACL cache that was authoritative. All the caching functionality and token/policy resolution capabilities have been moved into agent/consul/acl.go, mainly into the ACLResolver struct.
  • Integrated ACL replication into the leader code. It now gets started/stopped like the various other leader-specific routines.
  • Moved all ACL-related configuration (except acl_datacenter) into an acl stanza within the config (see the sketch after this list). If those items are not present, Consul will still fall back to the legacy config items and port them to the proper place within the runtime config.
    TODO - many more things have changed.
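For reference, the new stanza shape looks roughly like this (a sketch; the field names may not match the final docs exactly):

```hcl
acl {
  enabled                  = true
  default_policy           = "deny"
  down_policy              = "extend-cache"
  enable_token_replication = true

  tokens {
    agent       = "<agent token>"
    replication = "<replication token>"
  }
}
```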

TODOS

  • Rebase to master
  • Pull in primary datacenter code and make sure it works with the ACL configuration.
  • Finish the CLI
    • Implement consul acl token create
    • Implement consul acl token list
    • Implement consul acl token read
    • Implement consul acl token update
    • Implement consul acl token delete
  • Review all my TODOs in the code, most of which are related to verifying that the values of some consts are sane.
  • Documentation
    • API documentation for the new HTTP endpoints
    • CLI command documentation for all the new commands
    • Config documentation for the new/deprecated entries.
    • Rewrite of our ACL guide (just focusing on new ACLs)
    • ACL upgrade guide (how to transition from the old -> new system)
    • Telemetry docs (replication entries and various state store/raft ones too)
  • Unit Tests
    • Top level ACL package
    • HTTP ACL API
      • Legacy Endpoint Test Fixes
      • Policy CRUD
      • Token CRUD
    • API package
    • RPC ACL API
      • General CRUD Operations
      • Cannot delete anon token
      • Can update anon token policies
      • Cannot delete global management policy
      • Cannot change global management policy rules
      • Can rename global-management policy
    • ACL State (including Snapshotting)
    • ACL FSM (including Snapshotting)
    • Bootstrap resetting
    • ACL CLI
    • ACLResolver within the agent/consul package.
      • Local Policies
      • Local Tokens
      • Remote Policies
      • Remote Tokens
      • Legacy Mode Resolution
      • Default policies
      • Down Policies
      • Cache Expiration
      • Token Secret Redaction
    • ACL Replication
      • Policy Replication
      • Token Replication
      • Legacy Replication
    • ACL structures and supporting functions within the agent/structs package
      • New Legacy Tests
      • ACLToken tests
      • ACLPolicy tests
      • ACLCaches tests
    • Leader ACL initialization
    • Leader legacy token auto-upgrade (this will be tough to "unit" test)
  • Integration Tests
    • Legacy Vault Integration
    • Legacy UI/Agent with new servers
    • Single Datacenter Tests
      • New Cluster - all nodes running new code
      • Upgraded Servers - Servers w/ new ACLs, clients running 1.3.0
      • Upgraded Clients - Clients w/ new ACLs, servers running 1.3.0
      • All Upgraded - ACLs were previously bootstrapped on 1.3.0
      • Partial upgrade, new leader - some servers running new ACLs, including the leader, alongside other legacy ACL servers (should use legacy ACLs all around)
      • Partial upgrade, old leader - some servers running new ACLs, with others, including the leader, using legacy ACLs
    • Multi Datacenter ACLs - all tests should be run with new ACLs in secondary datacenters with policy-only and policy+token replication.
      • Primary DC Legacy + Secondary DC Upgraded
        • Verify legacy ACL replication within new system still works.
      • Primary DC Upgrade + Secondary DC Legacy
      • Primary DC Upgrade + Secondary DC Upgraded + Secondary DC legacy clients
      • Local Tokens within a secondary DC do not show up in the primary.
    • Backwards Compatibility
      • Legacy tokens still usable
      • Legacy tokens visible via the new APIs

Follow On Work

Any of these that don't get done will be turned into GitHub issues after this is merged.

  • Make ACL cache limits configurable
  • Make ACL mode check min/max intervals configurable
  • Maybe optimize the ACL mode transitioning to listen for serf member update events instead of periodically iterating through the entire member set.
  • Check the last contact time for ACL replication RPCs to ensure the data isn't too stale, and re-issue the request as non-stale or to a different server if it is.

@mkeeler mkeeler requested review from banks and kyhavlov October 13, 2018 20:57
@mkeeler mkeeler changed the base branch from master to release/1.4-staging October 15, 2018 13:15
@mkeeler mkeeler requested a review from a team October 15, 2018 13:15
@banks (Member) left a comment

Partial review of the new acl and policy stuff so far - looking great.

A couple of questions/minor things inline. Will look again soon at the rest.

Resolved review threads: acl/acl.go (×3). Outdated thread on acl/policy.go:

```go
func multiPolicyID(policies []*Policy) uint64 {
	var id uint64
	hasher := fnv.New64a()
	// ... (excerpt truncated)
```
Member:

FNV is not a crypto hash function. I assume this is OK but maybe we should add a comment here to state that it's never expected that the ID returned be used as a secret?

If the individual hash IDs are random with a crypto source then this might be OK in practice, but we should still use a crypto hash if we ever expect the output hash to be used as a secret/unguessable/unforgeable value.

Member Author:

Hmm, originally I used fnv because I wasn't going to use it for crypto purposes. But I just realized that it is being exported as a field for read-only token operations that don't currently expose the secret, and therefore it should probably be a crypto hash.

This individual case is not a problem but there are others that could be.

@pearkes pearkes added this to the 1.4.0 milestone Oct 15, 2018
@banks (Member) left a comment

A bunch of assorted stuff in here. Some minor nits, some discussion points, but we can punt on them. Some I think are worth changing now, but I don't want to block you if I don't get another chance to go deep.

I certainly glossed over some of it - particularly the new CLI commands - so if someone else can look deeply at those, that would be great.

Resolved review threads: agent/structs/acl.go, agent/acl_endpoint.go, agent/acl_endpoint_legacy.go; outdated: agent/config/config.go (×2).
```
@@ -346,6 +353,26 @@ func (s *snapshot) persistIntentions(sink raft.SnapshotSink,
	return nil
}

func (s *snapshot) persistIndex(sink raft.SnapshotSink, encoder *codec.Encoder) error {
```
Member:

Did we really not persist these before? I guess most of them are re-populated when we restore from a snapshot anyway, by the last transaction to write to a table. So it only matters in the special cases where we use the index table as primary storage for something.

Member Author:
Yeah, we really didn't. I changed how we mark whether ACL bootstrapping has been done. Now (like Nomad) it stores the raft index of when it was done in the index table.

And yes, most of it is repopulated.

Member Author:

We could store/persist a whole other table just to hold the single value of the bootstrap index, which is kind of what we used to do. It seemed like the index table is the right place to store index information, though.

Other review threads: agent/structs/acl.go (outdated), lib/uuid.go (outdated), lib/stop_context.go (resolved), lib/atomic_bool.go (outdated).
@pearkes pearkes mentioned this pull request Oct 18, 2018
mkeeler pushed a commit that referenced this pull request Oct 18, 2018
This just fills in test coverage for CRUD operations against the new ACL RPC endpoints.

1 test is failing, `TestACLEndpoint_PolicyResolve`, but I'm not sure whether it's an underlying issue in token assignment.

Still some TODOs in testing the various cases outlined in #4791, but I figured I'd put this up now.
mkeeler pushed a commit that referenced this pull request Oct 18, 2018
@pearkes pearkes force-pushed the release/1.4-staging branch from b7976c1 to 7d89e51 Compare October 19, 2018 15:46
Rebecca Zanzig and others added 11 commits October 19, 2018 12:02
* Update docs to include multiple tag support

* Sort tags before using them in metrics

This addresses the potential proliferation of metrics if a query of
"?tag=foo&tag=bar" is treated differently than "?tag=bar&tag=foo".
Now, tags are always sorted before being recorded, making these two
emit the same metric.

* Add caveat about multiple tags returned by the metrics endpoint
Things that are not present yet:

- API package updates for the new HTTP endpoints
- Backwards compatibility.
- CLI
- Local Tokens

New ACL System

Things that are not present yet:

- API package updates for the new HTTP endpoints
- Backwards compatibility.
- CLI
- Local Tokens
The new ACL system can be used when

1. All servers in the ACL datacenter are advertising support for the new ACLs.
2. All servers in the local datacenter are on version 1.4.0+

In each DC, the leader of the cluster has to start advertising new ACL support before the other servers will transition.

All servers and clients start up in legacy mode until we know that new ACLs are supported.
The RPC layer will use the new token functions and then do result conversion.
Additionally, a few other bugs have been worked out.
mkeeler and others added 24 commits October 19, 2018 12:03
Still need to implement the legacy mode tests.
Also implement acl.Policy rule conversion for when we are returning/retrieving a policy via the legacy ACL.GetPolicy RPC endpoint.
Make it able to manage intentions
This just fills in test coverage for CRUD operations against the new ACL RPC endpoints.

1 test is failing, `TestACLEndpoint_PolicyResolve`, but I'm not sure whether it's an underlying issue in token assignment.

Still some TODOs in testing the various cases outlined in #4791, but I figured I'd put this up now.
* command/policy: simple create tests

* command/policy:  simple update tests

* command/policy: simple list tests

* command/policy: remove extra NoError

* command/policy: simple read tests

* command/policy: simple delete tests

* command/token: simple read tests

* command/token: simple list tests

* command/token: simple create tests

* command/token: fix some comments

* command/token: simple delete tests

* command/token: add simple update tests (failing)

* command/translate: add simple tests

* command/bootstrap: simple tests

* command/agenttokens: simple tests