[WIP] New ACLs #4791
Conversation
Force-pushed a01570d to 65a0e52
Partial review of the new acl and policy stuff so far - looking great.
Couple of questions/minor things inline. Will look again soon at the rest.
acl/policy.go (Outdated)

```go
func multiPolicyID(policies []*Policy) uint64 {
	var id uint64
	hasher := fnv.New64a()
```
FNV is not a crypto hash function. I assume this is OK but maybe we should add a comment here to state that it's never expected that the ID returned be used as a secret?
If the individual hash IDs are random with a crypto source then this might be OK in practice, but we should still use a crypto hash if we ever expect the output hash to be used as a secret/unguessable/unforgeable value.
Hmm, originally I was using fnv because I wasn't going to use it for crypto purposes. Except I just realized that it is being exported as a field for read-only token operations that don't currently expose the secret, and therefore it should probably be a crypto hash.
This individual case is not a problem, but there are others that could be.
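For illustration only, a crypto-hash variant might look roughly like this; the SHA-256 choice, the string return type, and the simplified Policy type are assumptions for the sketch, not the PR's actual code:

```go
// Illustrative sketch only: a multiPolicyID variant built on SHA-256 instead
// of FNV, so the combined ID is safe to treat as unguessable. The Policy type
// here is a simplified stand-in for acl.Policy.
package acl

import (
	"crypto/sha256"
	"encoding/hex"
)

type Policy struct {
	ID string
}

func multiPolicyID(policies []*Policy) string {
	hasher := sha256.New()
	for _, p := range policies {
		hasher.Write([]byte(p.ID))
	}
	return hex.EncodeToString(hasher.Sum(nil))
}
```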
Force-pushed b7d4e77 to f057ef5
Bunch of assorted stuff in here. Some minor nits, some discussion points, but we can punt them. Some I think are worth changing now, but I don't want to block you if I don't get another chance to go deep.
I certainly glossed over some of it - particularly the new CLI commands - so if someone else can look deeply at those, that would be great.
```diff
@@ -346,6 +353,26 @@ func (s *snapshot) persistIntentions(sink raft.SnapshotSink,
 	return nil
 }

 func (s *snapshot) persistIndex(sink raft.SnapshotSink, encoder *codec.Encoder) error {
```
Did we really not persist these before? I guess most of them are re-populated when we restore from a snapshot anyway, by the last transaction to write to a table. So it only matters for the special cases where we use the index table as the primary storage of something.
Yeah, we really didn't. I changed how we mark whether ACL bootstrapping has been done. Now (like Nomad) it stores the raft index of when it was done in the index table.
And yes, most of it is repopulated.
So we could store/persist a whole other table just to store the single value of the bootstrap index, which is kind of what we used to do. It seemed like the index table is the right place to store index information though.
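As a rough sketch of the idea (not the PR's code), storing and checking a bootstrap index in an index table with hashicorp/go-memdb could look like this; the schema and the "acl-token-bootstrap" key name are illustrative assumptions:

```go
// Illustrative only: an index table that maps a name to a raft index, used
// here to record when ACL bootstrap happened.
package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// IndexEntry is a stand-in for the state store's index table rows.
type IndexEntry struct {
	Key   string
	Value uint64
}

func main() {
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"index": {
				Name: "index",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "Key"},
					},
				},
			},
		},
	}
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Record that ACL bootstrap happened at raft index 42.
	txn := db.Txn(true)
	if err := txn.Insert("index", &IndexEntry{Key: "acl-token-bootstrap", Value: 42}); err != nil {
		panic(err)
	}
	txn.Commit()

	// Later: check whether bootstrap has already been done.
	txn = db.Txn(false)
	raw, _ := txn.First("index", "id", "acl-token-bootstrap")
	txn.Abort()
	if raw != nil {
		fmt.Println("bootstrapped at index", raw.(*IndexEntry).Value)
	}
}
```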
Force-pushed b7976c1 to 7d89e51
Force-pushed 40ba96b to e6482dd
* Update docs to include multiple tag support
* Sort tags before using them in metrics
  This addresses the potential proliferation of metrics if a query of "?tag=foo&tag=bar" is treated differently than "?tag=bar&tag=foo". Now, tags are always sorted before being recorded, making these two emit the same metric.
* Add caveat about multiple tags returned by the metrics endpoint
New ACL System

Things that are not present yet:
- API package updates for the new HTTP endpoints
- Backwards compatibility
- CLI
- Local Tokens
The new ACL system can be used when:
1. All servers in the ACL datacenter are advertising support for the new ACLs.
2. All servers in the local datacenter are on version 1.4.0+.

In each DC the leader of the cluster has to start advertising new ACL support before the other servers will transition. All servers and clients start up in legacy mode until we know that new ACLs are supported.
The RPC layer will use the new token functions and then do result conversion.
Additionally a few other bugs have been worked out.
… be if it existed.
Still need to implement the legacy mode tests.
Also implement acl.Policy rule conversion for when we are returning/retrieving a policy via the legacy ACL.GetPolicy RPC endpoint.
Make it able to manage intentions
This just fills in test coverage for CRUD operations against the new ACL RPC endpoints. One test is failing, `TestACLEndpoint_PolicyResolve`, but I'm not sure whether it's an underlying issue in token assignment. There are still some TODOs for testing the various cases outlined in #4791, but I figured I'd put this up now.
Additional tests for #4791.
* command/policy: simple create tests
* command/policy: simple update tests
* command/policy: simple list tests
* command/policy: remove extra NoError
* command/policy: simple read tests
* command/policy: simple delete tests
* command/token: simple read tests
* command/token: simple list tests
* command/token: simple create tests
* command/token: fix some comments
* command/token: simple delete tests
* command/token: add simple update tests (failing)
* command/translate: add simple tests
* command/bootstrap: simple tests
* command/agenttokens: simple tests
Force-pushed e6482dd to 755005d
This PR is almost a complete rewrite of the ACL system within Consul. It brings the features more in line with other HashiCorp products. Obviously there is quite a bit left to do here, but most of it is related to docs, testing, and finishing the last few commands in the CLI. I will update the PR description and check off the todos as I finish them over the next few days/week.
Description
At a high level this PR is mainly to split ACL tokens from policies and to split the concept of authorization from identity. A lot of this PR just supports CRUD operations on ACLTokens and ACLPolicies, which in and of themselves are not particularly interesting. The bigger conceptual changes are in how tokens get resolved, how backwards compatibility is handled, and the separation of policy from identity, which could lead the way to allowing alternative identity providers.
On the surface, and with a new cluster, the ACL system will look very similar to Nomad's. Both have tokens and policies. Both have local tokens. The ACL management APIs for both are very similar. I even ripped off Nomad's ACL bootstrap resetting procedure. There are a few key differences though.
So not only does this PR implement the new ACL system, it also has a legacy mode built in for when the cluster isn't ready for new ACLs. Also, detecting that new ACLs can be used is automatic and requires no configuration on the part of administrators. This process is detailed more in the "Transitioning from Legacy to New ACL Mode" section below.
Replication
There are 3 replication modes supported: legacy token replication, policy replication, and token replication.
All 3 replication modes now use the golang.org/x/time/rate package's Limiter struct to rate limit how quickly the rounds of replication are performed. Only a couple of RPCs will be performed each round, and when applying replicated data to the local raft store there is additional rate limiting in place to keep from overwhelming the system.
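As a sketch of the pattern only (the one-round-per-second limit, burst, and function names are made up, not the PR's values):

```go
// Illustrative only: pacing replication rounds with golang.org/x/time/rate.
package replication

import (
	"context"
	"log"

	"golang.org/x/time/rate"
)

func runReplication(ctx context.Context, replicateOnce func(context.Context) error) {
	limiter := rate.NewLimiter(rate.Limit(1), 5)
	for {
		// Wait blocks until the limiter allows another round, or returns an
		// error when ctx is cancelled (e.g. on loss of leadership).
		if err := limiter.Wait(ctx); err != nil {
			return
		}
		if err := replicateOnce(ctx); err != nil {
			log.Printf("replication round failed: %v", err)
		}
	}
}
```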
Legacy Replication
The actual procedure for replicating legacy tokens is mostly untouched. A few modifications were made to make it context-aware and to make running the replication routine a leader-only activity, similar to other leader-only activities like CA root pruning. The routine will be started and stopped when a node gains leadership and when that leadership is revoked.
Policy Replication
The basic procedure here is to get a policy listing from the remote end, diff it against the policies in the local store, fetch any policies that have changed, and then apply the diff in 1+ raft transactions. The deletions are batched based on the number of policy IDs that can fit into the desired log size. The upserts are batched based on the policies' estimated size. In the normal case it is not expected to take more than 1 batch, except when doing the initial replication in a cluster that is a heavy ACL user.
So overall it takes 2 remote RPCs and 1+ raftApply calls to do 1 round of policy replication. The RPCs are ACL.PolicyList and ACL.PolicyBatchRead.
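A rough sketch of the diff step, with simplified stand-in types rather than Consul's real policy stubs; the real procedure then batches the resulting deletes and upserts into raft transactions as described above:

```go
// Illustrative sketch of the list/diff step in policy replication.
package replication

type PolicyStub struct {
	ID          string
	ModifyIndex uint64
}

// diffPolicies returns the local policy IDs to delete and the remote policy
// IDs whose full definition needs to be fetched with a second RPC.
func diffPolicies(local, remote []PolicyStub) (deletions, fetch []string) {
	remoteByID := make(map[string]PolicyStub, len(remote))
	for _, p := range remote {
		remoteByID[p.ID] = p
	}
	localByID := make(map[string]PolicyStub, len(local))
	for _, p := range local {
		localByID[p.ID] = p
		if _, ok := remoteByID[p.ID]; !ok {
			// Present locally but deleted remotely.
			deletions = append(deletions, p.ID)
		}
	}
	for _, p := range remote {
		if l, ok := localByID[p.ID]; !ok || l.ModifyIndex < p.ModifyIndex {
			// Missing or stale locally: fetch the full policy.
			fetch = append(fetch, p.ID)
		}
	}
	return deletions, fetch
}
```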
Token Replication
Token replication happens the same way as policy replication. One RPC is made to get the token list. The tokens are diffed against the non-local tokens in the local store and any missing/out-of-date ones are fetched with a second RPC. Similar to policies, tokens are deleted in batches sized by the number of token IDs being deleted, and upserts are batched based on the estimated size of the resulting log.
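For illustration only, size-based batching of the IDs to delete could look something like this (the per-ID byte estimate is an assumption, not the PR's constant):

```go
// Illustrative sketch of size-based batching: split IDs so that each raft
// log stays under maxBatchSize bytes.
package replication

func batchIDs(ids []string, maxBatchSize int) [][]string {
	const bytesPerID = 16 // rough, assumed estimate of one serialized ID
	perBatch := maxBatchSize / bytesPerID
	if perBatch < 1 {
		perBatch = 1
	}
	var batches [][]string
	for len(ids) > 0 {
		n := perBatch
		if n > len(ids) {
			n = len(ids)
		}
		batches = append(batches, ids[:n])
		ids = ids[n:]
	}
	return batches
}
```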
Token Resolution
Prior to this PR, Consul would always try to make a cross-DC request to the ACL datacenter to resolve a token to an effective policy. Only when that failed and the cache was stale would it use its locally replicated data to provide the value.
This PR changes that and brings it in line with Nomad's ACL replication. Policies are only ever resolved remotely before policy replication gets started and replicates the first set of policies. Tokens are only ever resolved remotely when token replication is off (then we cache it similarly to the legacy mode) or before token replication has replicated anything.
When resolving remotely (always for clients, or with the conditions specified above for servers), the acl down policy is still respected including the more recent addition of the async-cache mode.
Another big change is that servers no longer have a non-authoritative cache. There will not be any server to server RPCs for doing token resolution within a single DC. This is how Nomad does it and it greatly simplifies the overall architecture.
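To illustrate the down policy handling mentioned above, here is a hedged sketch; the mode strings match the documented acl_down_policy values, but the Authorizer type and helpers are simplified stand-ins for the real acl package:

```go
// Illustrative sketch of applying the down policy when remote token
// resolution fails.
package acl

type Authorizer struct{ defaultAllow bool }

func allowAll() *Authorizer { return &Authorizer{defaultAllow: true} }
func denyAll() *Authorizer  { return &Authorizer{defaultAllow: false} }

// applyDownPolicy decides what to do when the remote resolution RPC failed:
// fail open, fail closed, or fall back to the cached authorizer.
func applyDownPolicy(downPolicy string, cached *Authorizer) *Authorizer {
	switch downPolicy {
	case "allow":
		return allowAll()
	case "deny":
		return denyAll()
	case "extend-cache", "async-cache":
		// Serve the last cached authorizer if one exists; async-cache also
		// kicks off a background refresh (not shown here).
		if cached != nil {
			return cached
		}
		return denyAll()
	default:
		return denyAll()
	}
}
```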
ACL Package
The top level acl package is generic runtime ACL related code. It does not deal with caching nor persistence of ACL related things. Its only use is to provide a way to parse rules into policies and compile multiple policies into a single Authorizer. The big differences in this package are:
CLI
I am not 100% sold on the UX of the CLI after implementing and using it. In particular, updating tokens and policies via the CLI could maybe use a little more thought. What I am not in love with is that in many scenarios it's difficult to unset a field. An example of this is removing all datacenter restrictions from a policy. Also, some things didn't work out how I expected regarding positional arguments. For example, this is not possible:
The golang flag parser will not parse flags after positional arguments, so we can't do that. So one thing I have done is to try and avoid positional arguments altogether.
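A quick demonstration of that stdlib behavior (the simulated command line below is hypothetical):

```go
// Demonstrates the stdlib flag behavior referenced above: parsing stops at
// the first positional argument, so flags placed after it are never parsed.
package main

import (
	"flag"
	"fmt"
)

func main() {
	fs := flag.NewFlagSet("token read", flag.ContinueOnError)
	id := fs.String("id", "", "token accessor ID")

	// Simulates something like `consul acl token read <positional> -id=abc`
	// (hypothetical invocation).
	_ = fs.Parse([]string{"some-positional-arg", "-id=abc"})

	fmt.Printf("id=%q args=%v\n", *id, fs.Args())
	// Output: id="" args=[some-positional-arg -id=abc]
}
```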
Transitioning from Legacy to New ACL Mode
One important piece to all of this is that it was designed so that systems with new ACLs can coexist with legacy systems. One ramification of this is that when bootstrapping a brand new cluster there is a short window where all systems will have to figure out that new ACLs can be used (we don't want to be issuing RPCs to systems that don't support those endpoints, or applying things in Raft that other servers will error on). This is accomplished via the setting of the "acls" serf tag. When systems come up, if ACLs are enabled this gets set to a value indicating that the system is running in legacy mode.

The leader in the ACL/Primary DC is the first one which can start the transition, and it will do so once all the servers within the ACL/Primary DC are advertising legacy ACL support (legacy systems don't advertise anything, so when we encounter one of these we prevent the transition). Once the leader transitions to the new mode the other members in the primary DC will pick up this event and transition themselves. Once all servers in the primary DC have transitioned, the leaders in the other DCs will see this and transition (if all servers within their DC are capable).
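A rough sketch of that leader-side check (the tag values are illustrative stand-ins; only the "acls" tag name comes from the description above):

```go
// Illustrative sketch of the transition check: the leader only flips to
// new-ACL mode once every server member carries the "acls" serf tag
// (legacy servers advertise nothing).
package leader

type Member struct {
	Name string
	Tags map[string]string
}

const (
	aclModeLegacy  = "legacy"  // new code, still operating in legacy mode
	aclModeEnabled = "enabled" // new code, new ACLs active
)

func canUpgradeACLs(servers []Member) bool {
	for _, m := range servers {
		mode, ok := m.Tags["acls"]
		if !ok {
			// Pre-1.4 server: it advertises nothing, so block the transition.
			return false
		}
		if mode != aclModeLegacy && mode != aclModeEnabled {
			return false
		}
	}
	return true
}
```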
Legacy Token Auto-Upgrade
All new tokens have an accessor ID and the concept of a token type is gone. So when acquiring leadership, in addition to starting ACL replication, the node will also start a token auto-upgrade procedure. This will (in a rate limited fashion) assign accessor IDs to all tokens and will assign the global-management policy to all legacy management tokens.
Legacy client tokens cannot be fully auto-upgraded without user intervention as we do not want to create 1 policy per token. The process of upgrading individual tokens is simply assigning one or more policies to those tokens and discarding the rules. This can be done via the API or CLI but is a manual task as Consul cannot know which policies should apply for a particular token.
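As a hedged sketch of the auto-upgrade pass (types, field names, and the policy ID constant are simplified stand-ins; the real routine is also rate limited and applied through raft):

```go
// Illustrative sketch: assign an accessor ID to any token that lacks one and
// attach the global-management policy to legacy management tokens.
package leader

import "github.com/google/uuid"

type Token struct {
	AccessorID string
	Type       string   // legacy token type: "client" or "management"
	Policies   []string // IDs of policies attached to the token
}

const globalManagementPolicyID = "global-management" // stand-in, not the real ID

func upgradeLegacyToken(t *Token) {
	if t.AccessorID == "" {
		t.AccessorID = uuid.New().String()
	}
	if t.Type == "management" {
		t.Policies = append(t.Policies, globalManagementPolicyID)
	}
	// Legacy client tokens are left alone here: operators must attach
	// policies to them manually, as described above.
}
```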
Changes
acl stanza within the config. If those items are not present it will still fall back to the legacy config items and port them to the proper place within the runtime config.

TODO - many more things have changed.
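For illustration, that legacy fallback could be pictured like this; the field names are simplified stand-ins for the real runtime config:

```go
// Illustrative sketch of porting legacy config items into place when the new
// acl stanza fields are unset.
package config

type RuntimeConfig struct {
	// New-style fields (acl stanza).
	ACLDefaultPolicy string
	ACLDownPolicy    string

	// Legacy top-level fields.
	LegacyACLDefaultPolicy string
	LegacyACLDownPolicy    string
}

func applyLegacyACLFallback(c *RuntimeConfig) {
	if c.ACLDefaultPolicy == "" {
		c.ACLDefaultPolicy = c.LegacyACLDefaultPolicy
	}
	if c.ACLDownPolicy == "" {
		c.ACLDownPolicy = c.LegacyACLDownPolicy
	}
}
```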
TODOS
consul acl token create
consul acl token list
consul acl token read
consul acl token update
consul acl token delete
Follow On Work
Any of these that don't get done will get turned into GitHub issues after this is merged.