[RFC] Forced Rotation and Revocation #1934
Comments
Thanks, Evan. The presented scope feels like the right first step - it accomplishes the goals without undue complication or burden. Also, exercising "existing pathways that must be working" along with the proactive nature of the proposed operations feels preferable. Revocation lists feel more likely to increase polling activities which may lead to scaling and stability concerns. Keeping key operations separate from bundle operations feels safer than combining them. It may be worth noting that the proposed agent status update preserves the security posture of no agent network listener - an important element in our overall security design. I am a little concerned about the ambiguous nature of the timing of completeness in steps 2 and 4. I believe strong instrumentation and reporting are critical to successful implementation of this solution, along with perhaps some configurable threshold defaults for the percentage of agent completion (through status updates). This may also help with future automation efforts for large scale rotation. What positive acknowledgment mechanisms do/can we have for federated trust domains?
All of this can result in a lot of work for the server (we'll force rotation of all SVIDs for agents/workloads). Maybe we can add a rate limit or prevents to rotates only
Yes I agree - the agent status api described in this proposal is the main avenue by which we can gain visibility into the safety of step 2 and 4. I should note that steps 1, 2, and 4 are all operations that occur in SPIRE today - as such, we're very familiar with the behavior... the difference is that today, safety is provided by delaying these operations for long enough that we can be reasonably certain that everything has picked up the changes. In this proposal, we want to (safely) speed that up. Having an operator present makes it easier.
Unfortunately, none. SPIFFE does not require bundle endpoint clients to authenticate, nor does it require servers to have knowledge of their consumers. One thing we can do is warn based on the SPIFFE bundle refresh hint... though this does assume that bundle endpoint clients are respecting the hint. This does bring up an additional point - we may want the ability to roll back step 2.
I think so?
Yes this is a good point. One thing we could do is add some jitter to SVID TTLs when we sign them. I think this has been brought up before as a generally needed feature.
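To make the jitter idea concrete, here is a minimal sketch. It is not SPIRE code: the function name, the `jitter_fraction` parameter, and the choice to only ever shorten the TTL (so that a signed SVID never outlives the configured value) are all assumptions made for illustration.

```python
import random
from typing import Optional


def jittered_ttl(
    base_ttl_seconds: float,
    jitter_fraction: float = 0.1,
    rng: Optional[random.Random] = None,
) -> float:
    """Return base_ttl_seconds shortened by a random amount.

    The reduction is drawn uniformly from [0, jitter_fraction] of the base
    TTL, so SVIDs signed in the same burst do not all expire (and renew)
    at the same instant. Hypothetical helper, not an existing SPIRE API.
    """
    if not 0.0 <= jitter_fraction < 1.0:
        raise ValueError("jitter_fraction must be in [0, 1)")
    rng = rng or random.Random()
    return base_ttl_seconds * (1.0 - rng.uniform(0.0, jitter_fraction))


if __name__ == "__main__":
    # A one hour TTL with up to 10% jitter lands between 54 and 60 minutes.
    print(jittered_ttl(3600.0, 0.1))
```

Shortening rather than lengthening keeps the configured TTL as a hard upper bound, which seems like the safer default; a symmetric jitter would be an equally simple change.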
My comment about the key manager API is because that one is like a red button, and some people may complain about who can press it, and whether it is OK for it to be open over HTTP or allowed only for a local service. Another question about the key manager API: we'll have key managers (we allow creating plugins for that), and based on a quick look, those key managers create new keys on demand, so it is not possible for a key manager to update a key by itself and notify us. But what happens if that communication fails (no key could be generated on demand)? In that case we'll just log and notify the admin to try again, displaying the error from the key manager, right?
Hmm yea. If we have to, I suppose we could limit calls to local clients... but I'd really hate to start splitting things up like that. In the end, I think the real problem is lack of granular permissions for admin SVIDs (#716). I was told yesterday that we should see a renewed proposal coming through for that soon.
Yes, I think so. The operations in this procedure all benefit from tight feedback loops, so I'd expect the call to the rotate API to be blocking, and bubble up any errors that we get from the keymanager during the process.
This issue is stale because it has been open for 365 days with no activity.
This work is active and being tracked as part of GH project https://github.com/orgs/spiffe/projects/21
Introduction
Forced rotation and revocation has been a roadmap item for a while, and the time has come to scope it and put forth a proposal. The goal of this roadmap item is to provide a rapid, reliable, and automated mechanism for recovering from key compromise. This represents a very significant improvement over the current situation, which involves manual surgery on the datastore, as well as deep knowledge of SPIRE internals.
As our first steps into this territory, I'd like to propose the following scope, which I believe is the smallest reasonable scope that can still accomplish our goal:
The following items would be out-of-scope:
The major tradeoff considered in the scope of this proposal is operator action over AllowList/DenyList maintenance and distribution. In the event of a compromise, an operator must "rotate away" from the compromised keys, as opposed to distributing a list of distrusted keys to all consumers. This tradeoff is consistent with existing SPIFFE philosophies. It is also more reliable, as it exercises existing pathways that must be working.
A Four Step Operation
To safely rotate away from a compromised key, four distinct operations take place (from a user's perspective). First, we must prepare a new signing key. This creates the key to be rotated to, adding it to the bundle to begin its propagation. Second, we must activate the new signing key. This step shifts active signing operations off of the compromised key and onto the new key. The amount of time it takes to move from step one to step two must be as small as is safely possible.
In the third step, we must declare that the compromised key is about to be revoked. This declaration causes agents to proactively rotate any SVIDs they are managing that are associated with the to-be-revoked key. Finally, in the fourth step, we actively revoke the key. This removes the key from the bundle, an update that propagates to all agents and workloads. The amount of time it takes to move from step three to step four must be as small as is safely possible.
There may be an opportunity to combine step two and step three if the implementation and experience can easily allow for it.
Step 1: Prepare a New Signing Key
This step involves generating a new signing key and injecting it into the bundle for distribution. The "next" signing key may already be prepared; however, an operator may want to step past that already-prepared key and prepare a new one (if, for example, the prepared key is also compromised). Therefore, the prepare operation should allow an operator to provide the equivalent of a -f (force) flag, which prepares a new key that overwrites the previously prepared one.
Step 2: Activate a New Signing Key
This step involves shifting signing operations away from the old or compromised key, and towards the newly generated key. To do this safely, we need to have some level of understanding of how many agents have picked up the new key (as it was generated in step 1). It then becomes an operator decision on what level of propagation is "good enough".
Another factor to consider is how likely it is that other (federated) trust domains have picked up the new key. At the very least, this can be implied by the amount of time elapsed since step 1, compared to the SPIFFE bundle refresh hint that is currently set. Considering this period, as well as the relative number of agents that have picked up the new key, warnings should be reported/logged if necessary (e.g. "Are you sure you want to do this?")
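As a rough illustration of the step 2 safety check, the sketch below combines the two signals described above: the share of agents that report the new bundle, and the time elapsed since step 1 relative to the refresh hint. The function name, its parameters, and the 95% default threshold are assumptions, not part of the proposal.

```python
from typing import Dict, List


def activation_warnings(
    agent_bundle_seq: Dict[str, int],
    new_key_seq: int,
    seconds_since_prepare: float,
    refresh_hint_seconds: float,
    min_agent_fraction: float = 0.95,  # assumed default, operator-tunable
) -> List[str]:
    """Return warnings to show an operator before activating the new key.

    agent_bundle_seq maps an agent ID to the bundle sequence number it last
    reported. An empty result means neither heuristic raised a concern.
    """
    warnings: List[str] = []

    total = len(agent_bundle_seq)
    caught_up = sum(1 for seq in agent_bundle_seq.values() if seq >= new_key_seq)
    # With no known agents there is nothing to wait for.
    fraction = caught_up / total if total else 1.0
    if fraction < min_agent_fraction:
        warnings.append(
            f"only {caught_up}/{total} agents report bundle sequence >= {new_key_seq}"
        )

    # Federated consumers cannot acknowledge, so elapsed time is the only signal.
    if seconds_since_prepare < refresh_hint_seconds:
        warnings.append(
            "less than one bundle refresh interval has elapsed since the key "
            "was prepared; federated trust domains may not have picked it up"
        )
    return warnings


if __name__ == "__main__":
    reported = {"agent-a": 5, "agent-b": 5, "agent-c": 4}
    for line in activation_warnings(reported, 5, 60.0, 300.0):
        print("WARNING:", line)
```

The check only warns and never blocks, which matches the idea that the final call on "good enough" stays with the operator.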
Step 3: Signal an Impending Key Revocation
This step involves distributing knowledge of an impending key revocation. This is important because we need agents and workloads to actively (and rapidly) move off of key validation paths that include the compromised key. To do this, agents will need to be able to recognize this condition, audit their caches to see which SVIDs need renewal, and proactively request new SVIDs. The assumption is that by this point, signing operations on the key-to-be-revoked are no longer active (thus avoiding the chance that a renewed SVID is signed by the old key).
This operation may cause a flood of renewals. The load will need to be managed appropriately.
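The agent-side audit could look something like the sketch below. It assumes each cached SVID records an identifier for the key that signed it (the `authority_key_id` field is hypothetical), and it uses fixed-size batches as one simple way to spread the renewal load; neither detail comes from the proposal itself.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class CachedSVID:
    spiffe_id: str
    authority_key_id: str  # hypothetical: identifies the signing key


def svids_needing_renewal(
    cache: Sequence[CachedSVID], revoking_key_id: str
) -> List[CachedSVID]:
    """Return the cached SVIDs that chain to the key about to be revoked."""
    return [svid for svid in cache if svid.authority_key_id == revoking_key_id]


def renewal_batches(
    svids: Sequence[CachedSVID], batch_size: int
) -> Iterator[List[CachedSVID]]:
    """Yield renewals in fixed-size batches so they are not sent all at once."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(svids), batch_size):
        yield list(svids[start:start + batch_size])


if __name__ == "__main__":
    cache = [
        CachedSVID("spiffe://example.org/web", "old-key"),
        CachedSVID("spiffe://example.org/db", "new-key"),
        CachedSVID("spiffe://example.org/api", "old-key"),
    ]
    stale = svids_needing_renewal(cache, "old-key")
    for batch in renewal_batches(stale, batch_size=1):
        print([svid.spiffe_id for svid in batch])
```

Batching on the agent only smooths load from a single agent; server-side rate limiting or TTL jitter would still be needed to handle many agents reacting at once.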
Step 4: Revoke the Compromised Key
This step involves completely removing the old key from the bundle. This bundle update will propagate downwards, and be pushed into workloads attached to the workload API. At this time, we want to be as certain as is reasonably possible that agents have completed their rotation away from the affected validation path. Ideally, there is a feedback mechanism to understand which agents have completed this task and which have not.
Some operators may prefer availability over the continued use of a potentially-compromised key. To accommodate this case, undoing the action should be straightforward and trivial.
New API Things
It is necessary to introduce some new APIs in order to achieve the process described above. In this section, I propose some APIs for this purpose.
It should be noted that the APIs proposed herein are geared towards generality, as opposed to the specific task at hand. I imagine that these APIs will grow over time, and will likely result in convenient avenues on which further (unrelated) features may be built.
Agent Status
In order to accomplish step two, and likely step four, it is necessary to have some insight into agent state. This insight is best achieved at the server level, for multiple reasons. One reason is that there may be different teams or organizations that are managing agents vs servers. Another reason is that this key manipulation process takes place at the server level - having to interrogate every agent directly will only slow things down. In the end, an operator performing these steps may not be in a position to collect information from every agent.
To solve this, I propose that we introduce a new RPC on the existing Agent API. This new RPC will allow agents to periodically post their status to the server, which in turn stores the latest reported state. For the purpose of this RFC, all we really need is the sequence number of the bundle that the agent has loaded, but it is easy to imagine further use of this update (e.g. agent version number).
The introduction of this API will not only help us accomplish the task at hand, but will also light the path for a much better agent management experience.
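A minimal in-memory sketch of the server side of this idea follows. The class, method, and field names are hypothetical and do not reflect the actual Agent API; the point is only that the server keeps the latest report per agent and can answer "which agents are still behind?".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AgentStatus:
    agent_id: str
    bundle_sequence: int  # sequence number of the bundle the agent has loaded
    reported_at: float    # unix timestamp of the report


class AgentStatusStore:
    """Keeps only the most recent status reported by each agent."""

    def __init__(self) -> None:
        self._latest: Dict[str, AgentStatus] = {}

    def post_status(self, status: AgentStatus) -> None:
        # Ignore reports that arrive out of order.
        current = self._latest.get(status.agent_id)
        if current is None or status.reported_at >= current.reported_at:
            self._latest[status.agent_id] = status

    def agents_behind(self, required_sequence: int) -> List[str]:
        """Return IDs of agents that have not loaded the required bundle yet."""
        return sorted(
            agent_id
            for agent_id, status in self._latest.items()
            if status.bundle_sequence < required_sequence
        )


if __name__ == "__main__":
    store = AgentStatusStore()
    store.post_status(AgentStatus("agent-a", 4, reported_at=100.0))
    store.post_status(AgentStatus("agent-b", 5, reported_at=101.0))
    store.post_status(AgentStatus("agent-a", 5, reported_at=110.0))
    print(store.agents_behind(5))
```

Because agents push their status over the existing agent-to-server connection, this shape keeps the property noted in the comments that agents do not need a network listener.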
Key Management API
SPIRE currently has an API for bundle management, which covers CRUD-like operations on bundle resources stored by SPIRE. It is important to note that this management extends beyond just the local trust domain's bundle - it also applies to the management of federated bundles.
What is missing in this API is the ability to manipulate keys. Bundles are strictly "public"... and while one of those bundles (the local bundle) has ties to locally-managed keys, there is no way to manage the keys or the bundle that represents them, in lockstep.
Rather than overload the bundle API with key management operations, I propose the introduction of a new API. This API would be responsible for handling both step one and step two actions described in this proposal - the early preparation and activation of locally managed keys. It can also expose an interface for listing keys etc, which is not available today.
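To give a feel for the shape such an API might take, here is a toy model of the prepare, activate, and list operations. Everything in it is an assumption for illustration: the class and method names are invented, key IDs are simple counters rather than real key material, and a real implementation would delegate key generation to the key manager and bubble up its errors.

```python
import itertools
from typing import Dict, Optional


class KeyService:
    """Toy model of prepare/activate/list for locally managed signing keys."""

    def __init__(self, active_key_id: str) -> None:
        self.active: str = active_key_id
        self.prepared: Optional[str] = None
        self.old: Optional[str] = None
        self._ids = itertools.count(1)  # placeholder for real key generation

    def prepare(self, force: bool = False) -> str:
        """Step 1: return the prepared key, generating one if needed.

        With force=True an already-prepared key is discarded and replaced,
        for example because it is suspected to be compromised as well.
        """
        if self.prepared is None or force:
            self.prepared = f"key-{next(self._ids)}"
        return self.prepared

    def activate(self) -> str:
        """Step 2: start signing with the prepared key."""
        if self.prepared is None:
            raise RuntimeError("no prepared key to activate")
        self.old, self.active, self.prepared = self.active, self.prepared, None
        return self.active

    def list_keys(self) -> Dict[str, Optional[str]]:
        return {"active": self.active, "prepared": self.prepared, "old": self.old}


if __name__ == "__main__":
    keys = KeyService("key-0")
    keys.prepare()            # prepares key-1
    keys.prepare(force=True)  # steps past key-1 and prepares key-2
    keys.activate()
    print(keys.list_keys())
```

Raising on failure rather than returning a status matches the expectation voiced in the comments that these calls block and surface errors directly to the operator.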
Bundle Revocation and Signaling
In order to support steps three and four, we need an interface by which we can both signal for removal and actually remove an authority from the local bundle. I propose that this manifest as an RPC addition(s) to the existing Bundle API.
Due to the way that bundle information is stored in SPIRE, this task may be more involved than it seems. The requirement put forth by this proposal is that an operator completing step three of the process must be able to easily complete step four with the information that they already have on hand (e.g. the operator uses the SKID of the CA certificate to be removed in both step three and four).
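The requirement that one identifier carries the operator through both steps can be sketched as follows. This is a hypothetical model, not the Bundle API: the class and method names are invented, the identifier is treated as an opaque string standing in for something like the SKID, and the `restore` method reflects the earlier point that undoing a revocation should be straightforward.

```python
from typing import Iterable, Set


class LocalBundle:
    """Toy model of signaling and revoking an authority by one identifier."""

    def __init__(self, authority_ids: Iterable[str]) -> None:
        self.authorities: Set[str] = set(authority_ids)
        self.pending_revocation: Set[str] = set()
        self.revoked: Set[str] = set()  # remembered so a revocation can be undone

    def signal_revocation(self, key_id: str) -> None:
        """Step 3: announce that key_id is about to be revoked."""
        if key_id not in self.authorities:
            raise KeyError(f"unknown authority: {key_id}")
        self.pending_revocation.add(key_id)

    def revoke(self, key_id: str) -> None:
        """Step 4: remove key_id from the bundle, using the same identifier."""
        if key_id not in self.pending_revocation:
            raise ValueError("revocation must be signaled before revoking")
        self.pending_revocation.discard(key_id)
        self.authorities.discard(key_id)
        self.revoked.add(key_id)

    def restore(self, key_id: str) -> None:
        """Undo step 4 for operators who prefer availability."""
        if key_id not in self.revoked:
            raise KeyError(f"authority was not revoked: {key_id}")
        self.revoked.discard(key_id)
        self.authorities.add(key_id)


if __name__ == "__main__":
    bundle = LocalBundle({"skid-old", "skid-new"})
    bundle.signal_revocation("skid-old")
    bundle.revoke("skid-old")
    print(sorted(bundle.authorities))
```

Refusing to revoke an authority that was never signaled is a design choice in this sketch; it enforces the step ordering but could be relaxed if steps are ever combined.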
The Bigger Picture
I feel that the absence of available interfaces for accomplishing this task is indicative of a larger problem. There are aspects of day-to-day SPIRE management that are not covered by the existing APIs, which (to me, at least) illustrates that we have overlooked some areas of functionality when designing them. As a result, I've made my best effort to model this proposal around generalized APIs that will be useful for future endeavors.
All of this is to say that, regardless of whether this proposal is accepted or not, the ultimate solution to the problem at hand should result in generalized interfaces and functionality rather than specialized.
Request for Comments
Any/all comments are appreciated!