Adding enhancement #98 for SPIRE integration #100

227 changes: 227 additions & 0 deletions 98_spire_integration.md
@@ -0,0 +1,227 @@
<!--
**Note:** When your enhancement is complete, all of these comment blocks should be removed.

To get started with this template:

- [ ] **Fill out this file as best you can.**
At minimum, you should fill in the "Summary", and "Motivation" sections.
These should be easy if you've preflighted the idea of the enhancement with the
appropriate SIG(s).
- [ ] **Merge early and iterate.**
Avoid getting hung up on specific details and instead aim to get the goals of
the enhancement clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.
-->
# enhancement-98: SPIRE Integration

<!--
A table of contents is helpful for quickly jumping to sections of an enhancement and for
highlighting any additional information provided beyond the standard enhancement
template.
-->

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
  - [Notes/Constraints/Caveats (optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

<!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [keylime/enhancements] referencing this enhancement and targeting a release.

For enhancements that make changes to code or processes/procedures in core
Keylime i.e., [keylime/keylime], we require the following Release
Signoff checklist to be completed.

Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->

- [ ] Enhancement issue in release milestone, which links to pull request in [keylime/enhancements]
- [ ] Core members have approved the issue with the label `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place
- [ ] User-facing documentation has been created in [keylime/keylime-docs]

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->

## Summary

SPIFFE/SPIRE is an elegant solution to workload identity that is pluggable
in its node and workload attestors. Keylime would be a strong candidate
for node attestation if it had a few extra APIs that would allow the
SPIRE agent and SPIRE server to verify the following:

* Is the node where the SPIRE agent runs being attested by Keylime?
* Is the attestation passing?

We propose new APIs on the agent and verifier to allow SPIRE to verify
these things. Because SPIRE will also need plugins (agent and server)
for node attestation, we have some flexibility in how these APIs
appear. If we keep them generic enough, they could theoretically be used
by any system that wants to independently verify the state of a given node
in Keylime.

## Motivation

* To expand Keylime's usefulness and reach in the cloud-native landscape.
* To have a better hardware root-of-trust for software identity.
* To have a more complete Zero Trust solution.

### Goals

When complete, this proposal will allow SPIRE plugins to be written that
target Keylime as an attestor and expose useful Keylime properties
as selectors in SPIRE. This will allow a user to craft authentication
and authorization policy that takes into account a machine's boot and
file integrity attestation state.

### Non-Goals

Although these APIs will be generic, no direct effort will be made to
support other non-SPIRE entities.

## Proposal


### User Stories (optional)

#### Story 1

A developer will be able to develop SPIRE agent and server plugins
that communicate with the Keylime agent and verifier to
independently prove that the agent in question is on the same node as the
SPIRE agent and also that the agent is passing its attestation policies
in Keylime.

This integration will also pull in various properties of the Keylime setup
(agent configuration, policy, etc.) to use as selectors for SPIRE.

### Notes/Constraints/Caveats (optional)

None


### Risks and Mitigations

Care will need to be taken so that we don't leak any sensitive data in
these APIs and that our verification/signing process is secure and leads
to the guarantees we are making (that the SPIRE and Keylime agents are
on the same node).

The security of the information flows has been reviewed by several
members of the Keylime development team as well as SPIRE participants. The
implementation will need thorough review as well.


## Design Details

The following flow is anticipated for the full Keylime SPIRE plugins:

```
┌───────────────────────────────────────────────┐        ┌───────────────┐
│                                               │        │               │
│  Node                   #3                    │        │     SPIRE     │
│         ┌─────────────────────────────────────┼────────►    SERVER     │
│  ┌──────┴──────┐                              │        │               │
│  │    SPIRE    │    #1                        │        │               │
│  │    Agent    ◄───────────────┐              │        └───────┬───────┘
│  │             │               │              │                │
│  └──────┬──────┘               │              │                │ #4
│         │ #2              ┌────▼─────┐        │                │
│     ┌───▼─────┐           │ Keylime  │        │        ┌───────▼───────┐
│     │   TPM   │           │  Agent   │        │        │               │
│     │         │           │          │        │        │    Keylime    │
│     └─────────┘           └──────────┘        │        │   Verifier    │
│                                               │        │               │
└───────────────────────────────────────────────┘        └───────────────┘
```

This flow has the following steps:

1. The SPIRE Agent queries the node-local `/info` API on the Keylime agent to get information such as the Keylime UUID.
2. The SPIRE Agent creates a nonce and sends it to the TPM to be signed with the Keylime-created AK.
3. The SPIRE Agent sends this information to the SPIRE Server.
4. The SPIRE Server queries the Keylime Verifier about the agent: Does it exist? Is it passing attestation? If so, can the verifier verify the signature on this nonce? If all are true, SPIRE attestation passes and an identity is issued.
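
As a rough illustration of these four steps, the sketch below models the flow in Python. Everything in it is an assumption made for illustration: `FakeTPM`, `FakeVerifier`, `AttestationPayload`, `spire_agent_collect`, and `spire_server_attest` are hypothetical names, not part of Keylime or SPIRE, and an HMAC over a shared secret stands in for the TPM's AK signature so the example runs with only the standard library. A real plugin would use the TPM's asymmetric AK, and the verifier would hold only its public part.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


class FakeTPM:
    """Hypothetical stand-in for the node's TPM and its Keylime-created AK.

    ASSUMPTION: an HMAC key replaces the asymmetric AK for this sketch.
    """

    def __init__(self, ak_key: Optional[bytes] = None) -> None:
        self.ak_key = ak_key if ak_key is not None else secrets.token_bytes(32)

    def sign(self, nonce: bytes) -> bytes:
        return hmac.new(self.ak_key, nonce, hashlib.sha256).digest()


@dataclass(frozen=True)
class AttestationPayload:
    agent_uuid: str
    nonce: bytes
    signature: bytes


class FakeVerifier:
    """Hypothetical stand-in for the Keylime verifier's view of its agents."""

    def __init__(self) -> None:
        # agent UUID -> (key used to check AK signatures, attestation passing?)
        self._agents: Dict[str, Tuple[bytes, bool]] = {}

    def register(self, agent_uuid: str, ak_key: bytes, passing: bool) -> None:
        self._agents[agent_uuid] = (ak_key, passing)

    def check(self, payload: AttestationPayload) -> bool:
        entry = self._agents.get(payload.agent_uuid)
        if entry is None:
            return False  # the agent does not exist
        ak_key, passing = entry
        if not passing:
            return False  # the agent is failing attestation
        expected = hmac.new(ak_key, payload.nonce, hashlib.sha256).digest()
        return hmac.compare_digest(expected, payload.signature)


def spire_agent_collect(agent_info: dict, tpm: FakeTPM) -> AttestationPayload:
    """Steps 1-3: take the /info data, sign a fresh nonce, build the payload."""
    nonce = secrets.token_bytes(16)
    return AttestationPayload(agent_info["agent_uuid"], nonce, tpm.sign(nonce))


def spire_server_attest(payload: AttestationPayload, verifier: FakeVerifier) -> bool:
    """Step 4: the SPIRE server asks the verifier whether to issue an identity."""
    return verifier.check(payload)
```

In this sketch a payload signed by a different TPM, an unknown UUID, or an agent that is failing attestation all lead to the same outcome: no identity is issued.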
Copy link
Contributor


Is SPIRE attestation as 'periodic' as Keylime attestation?

Copy link
Member Author


No, especially for node attestation, I believe it only happens once at SPIRE agent startup.

Copy link
Contributor


Whatever SPIRE states about the node should only have a short-term validity period since Keylime attestation may detect that the node has gone out-of-policy shortly after ...



To accommodate this flow, this enhancement will consist of the following:

1. A new node-local, non-TLS API on the Keylime agent responding to the `/info` path. It will return information about the Keylime agent that will be used not only to identify the agent, but also to perform a signature verification. A third party can use the credential created by the agent in the TPM to sign a nonce, which can then be verified by the verifier. The new API will return the following information:
Copy link


Particularly under the light of #60 it would be nice to see a separation between node-local APIs, and APIs that are being called by the verifier.

Also, IMHO it would be nice to consider the following things for node-local APIs:

  • the server that serves verifier APIs and the server that serves node-local APIs should be separate
  • the server should be listening on a unix socket
  • very much optional, but in the spirit of being a cloud native project: use grpc or better ttrpc for these APIs like other cloud native services do

Copy link
Member Author


I'm curious why we think they should be separate servers? A single binary which starts multiple processes? Multiple binaries? The latter would be a much bigger lift for packagers, etc.

> very much optional, but in the spirit of being a cloud native project: use grpc or better ttrpc for these APIs like other cloud native services do

This is definitely worth doing. I don't know if the initial APIs will have them, but I'll make them versioned so we can add them later.

Copy link


> I'm curious why we think they should be separate servers? A single binary which starts multiple processes? Multiple binaries? The latter would be a much bigger lift for packagers, etc.

sorry, that might have not been clear: logical separation in the code (definitely not multiple binaries or processes), so that the server listening on the unix socket serves all the node-local APIs, and the one listening on TCP for the verifier serves all the existing APIs.

Copy link


I have been reading this carefully, and I'd like to dig a little into Mike's assumptions. During the initial review of the proposal I was not at peace with the spire agent having to talk to both the agent and the TPM device.

I think we can drop the requirement that the spire agent talk to the keylime agent -- without affecting security. My apologies for the extremely long writeup, and if I have a reasoning error in here please do point it out. I will then eat crow for having wasted your times :)

(A) we are positing a situation in which the keylime verifier is attesting the target node.
(A.1) i.e., the verifier has already established the correspondence between a node's UUID and its AK
(A.2) all attestation info, including the UUID and the AK, are available through the tenant API
(B) the SPIRE server's goal is to establish the target node's trustworthiness.
(B.1) first, the SPIRE server talks to the keylime verifier to download information about the target node.
(B.2) next the SPIRE server has to prove that the SPIRE agent on the node is co-located with the keylime agent.
(B.3) The simplest way to do this is to establish that both agents (keylime, spire) are talking to the same TPM.
(C) the SPIRE agent has access to all node information from the SPIRE server, as downloaded from the verifier.
(C.1) all the SPIRE agent has to do is to mount a challenge against the TPM device:
(C.2) have the TPM decrypt a challenge encrypted by the pubAK.
(C.3) at this point the SPIRE agent has [dis?]proven the fact that it's talking to the same TPM as the keylime agent. Interaction with the keylime agent was not necessary.
(D) once the SPIRE agent reports back to the server, the SPIRE certificates can be downloaded etc.
(D.1) a simple swizzle on certifying SPIRE agents would be for the SPIRE server to encrypt its certs with each TPM's AK, and have the agent use the actual SPIRE cert as the challenge.

Copy link
Contributor


@galmasi @mpeters. Are we assuming the SPIRE agent is, for all intents and purposes, the equivalent of a keylime_tenant? This would require the shipping of TLS certificates to all nodes (which will include client-private.pem), which strikes me as a problem, both in terms of security and maintenance (TLS certificate renewal).

Maybe I am missing something, but it looks like we either do exactly what I describe here or we implement a new HTTP api for the verifier (which would have its own security side-effects). Am I missing something?

Copy link


@galmasi I was reading your reply, and I have some questions/comments. I'm also not entirely happy with the SPIRE agent needing to talk to both the agent and the TPM, but for opposite reasons than you.

> I have been reading this carefully, and I'd like to dig a little into Mike's assumptions. During the initial review of the proposal I was not at peace with the spire agent having to talk to both the agent and the TPM device.

From Mike's diagram the part that I don't particularly like is that (2) is going from the SPIRE agent to the TPM. It should be the keylime agent which is always querying the TPM.

> I think we can drop the requirement that the spire agent talk to the keylime agent -- without affecting security.

My point of view is the exact opposite on this: we should drop the requirement that the SPIRE agent is talking to the TPM.

Here is my reasoning behind this: SPIRE already has a TPM integration as of today, and in order to promote keylime and make it more valuable to other use cases even apart from SPIRE (and I am actually working on one right now), the barrier for attestation needs to be lowered, and more value provided on top of this. In this case, the keylime agent is the one which interacts with the trusted hardware module, and it happens to use the TPM at this point in time. It's the node-level abstraction of how to do these types of actions on the host. Furthermore, keylime does more than what SPIRE is currently doing with its TPM integration, and this is where the particular value add (IMHO) lies.

So I think for your (C.1) I would do the challenge through a node-level API through the keylime agent. It's extremely important though that this API is a host local API (obviously). Admittedly though, your approach can theoretically be considered more secure: as both components independently are talking to the same TPM which is the source of truth after all. However, for all practical purposes a host local API basically does the same thing (and one can control and restrict further access to this socket with additional methods as well). In a nutshell that would provide the following components:

  • the keylime agent helping with identity verification for other components on a host (like SPIRE), and making the reference with its UUID to other keylime components
  • the keylime verifier providing the information that this host is not only authenticated but passes additional integrity checks (MB policies and IMA)
  • the keylime tenant being able to help with identity verification on the server sides because it has the AIK which is needed for verifying the challenges

> My apologies for the extremely long writeup, and if I have a reasoning error in here please do point it out. I will then eat crow for having wasted your times :)

It's a good discussion :)

Copy link
Contributor


@mpeters so, accessing the verifier APIs will require at least `cacert.crt`, `ca-public.pem`, `client-cert.crt`, `client-private.pem`, `client-public.pem`, `server-cert.crt`, and `server-public.pem`. While I do agree it does not represent a security problem per se, the need to redistribute the TLS certificates over a (potentially) large number of nodes might represent an (operational) problem. I do wonder if SPIRE agents do trust the SPIRE server as a boundary condition, and in case you answer yes, delegating the communication with the keylime verifier to the server

Copy link
Member Author


> I think we can drop the requirement that the spire agent talk to the keylime agent -- without affecting security.

> My point of view is the exact opposite on this: we should drop the requirement that the SPIRE agent is talking to the TPM.

This was my first design @mheese but after starting it and talking it over with others I noticed it was flawed. The purpose of SPIRE attestation via keylime is twofold:

  1. Provide proof of which node the SPIRE agent is running on.
  2. Provide proof that the node from #1 has passed Keylime attestation

If the SPIRE agent just talks to the Keylime agent then it can't really prove #1. A compromised keylime agent on node A could accept requests and forward them to some other process on node B, which could get its answers either from a Keylime agent on node B or the TPM on node B. And the SPIRE agent on node A wouldn't know the difference as long as node A was registered with Keylime.

So by talking to the TPM directly we can independently prove the identity of the node and then prove #2 by talking to the Keylime agent and verifier.

Copy link
Member Author


> I do wonder if SPIRE agents do trust the SPIRE server as a boundary condition, and in case you answer yes, delegating the communication with the keylime verifier to the server

@maugustosilva Yes, I guess it's not clear from my proposal that the SPIRE server is the one talking to the keylime verifier, so it's only the SPIRE server that needs to be able to communicate with it over TLS.

Copy link


as I mentioned to @galmasi already, that's why it is important that this API is node-local and cannot be reached through any other means. Sure, directly going to the TPM fully eliminates all these concerns, but it also makes it so much more impractical (and keeping it node-local is the practical approach of guaranteeing the same things). So as long as this API is node-local, you can prove (1).

That's what would keep this approach generic and being easily adoptable by other products for which talking directly to the TPM is a barrier which is just too high to achieve and which is why they would like to integrate with keylime to begin with.

That all said, it seems like you and others feel strongly about this approach. Yet again, while I agree that this is the theoretically safer approach, I disagree that it makes a practical difference.


* agent_uuid
* tpm_hash_alg
* tpm_encryption_alg
* tpm_signing_alg
* ek_handle
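
To make the shape of this response concrete, here is a minimal sketch of how a SPIRE agent plugin might model and parse it. The proposal only names the five fields; the JSON encoding, the `AgentInfo` class, and its helper methods are assumptions made for illustration, and any values passed in are placeholders rather than real Keylime output.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class AgentInfo:
    """Hypothetical model of the proposed /info response."""

    agent_uuid: str
    tpm_hash_alg: str
    tpm_encryption_alg: str
    tpm_signing_alg: str
    ek_handle: str

    def to_json(self) -> str:
        # ASSUMPTION: the API returns a flat JSON object with these keys.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "AgentInfo":
        data = json.loads(raw)
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in data]
        if missing:
            raise ValueError(f"/info response is missing fields: {missing}")
        # Unknown keys are ignored so newer agents can add fields later.
        return cls(**{name: str(data[name]) for name in names})
```

Rejecting a response with missing fields early keeps a plugin from building an attestation payload around an incomplete identity.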

2. A new API on the verifier that takes a payload signed by a TPM and, given an agent's UUID, verifies that it came from the TPM associated with that agent. This will be used to independently verify that the Keylime agent resides on a node with that TPM.

3. An expansion of the existing `/agents` GET API on the verifier to return enough information for use as selectors in SPIRE.
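
As one possible reading of item 3, the sketch below turns an agent record from the verifier into SPIRE selector strings of the form `type:value`. The record keys (`agent_uuid`, `passed_policies`), the `keylime` selector type, and the `keylime_selectors` helper are all hypothetical; the proposal does not yet fix which fields the expanded `/agents` response will carry.

```python
from typing import Dict, List


def keylime_selectors(agent_record: Dict, selector_type: str = "keylime") -> List[str]:
    """Build SPIRE-style selector strings from a (hypothetical) agent record."""
    selectors = [f"{selector_type}:agent_uuid:{agent_record['agent_uuid']}"]
    # ASSUMPTION: the verifier lists the names of the policies the node passed.
    for name in sorted(agent_record.get("passed_policies", [])):
        selectors.append(f"{selector_type}:policy:{name}")
    return selectors
```

Sorting the policy names keeps the selector list stable between calls, which makes it easier to compare against registration entries.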
Copy link


what information is still necessary/needed for it to be enough for SPIRE?

Copy link
Member Author


Right now I was thinking of adding the name(s) of the Keylime policies passed by the node. Right now the only one with a name is the file integrity policy (IMA), but we can look at adding names to the measured boot policy and others in the future.

What other keylime data would you like to see as a selector?

Copy link


no idea, that's kind of why I was asking :)



### Test Plan

The individual new APIs will have tests written for them. The new
SPIRE plugins written to use those APIs will also have their own CI/CD
tests/pipelines to test against those APIs, targeting specific versions
of the Keylime agent and verifier.

### Upgrade / Downgrade Strategy

These will be net-new APIs and will require a minor bump in the Keylime
API version number. It's not believed that they will require database
schema changes, nor any upgrade migrations. As such, there doesn't need
to be a downgrade strategy.
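
Since the new APIs arrive with a minor version bump, a plugin could check the API version that Keylime reports before calling them. The sketch below is an assumption about how such a check might look: the `parse_api_version` and `supports_api` helpers and the version strings are hypothetical, and it assumes a `major.minor` scheme in which minor bumps within a major version are backward compatible.

```python
from typing import Tuple


def parse_api_version(version: str) -> Tuple[int, int]:
    """Parse a 'major.minor' string, with or without a leading 'v'."""
    major, minor = version.lstrip("v").split(".")
    return int(major), int(minor)


def supports_api(server_version: str, required_version: str) -> bool:
    """True if the server is on the same major version and at least the required minor."""
    server = parse_api_version(server_version)
    required = parse_api_version(required_version)
    return server[0] == required[0] and server[1] >= required[1]
```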

### Dependency requirements

It is not believed that we will require any new dependencies for these
APIs as they will just re-use existing libraries for any cryptographic
signing or verification of those signatures.

## Drawbacks

It's possible that these APIs won't be useful outside of the SPIRE
integration, but it's our belief they will be generic enough to be evolved
for any third party that wants to do deep verification of a node's status
in Keylime.

## Alternatives
Copy link


It would also be great to compare a keylime integration against the "tpm_devid" plugin here, and what the advantages for a keylime integration would be.

Copy link
Member Author


Good idea, I'll add it.


It's already possible to create an integration of Keylime with SPIRE
by using the x509pop plugin, but there are several limitations with
this approach:

* You need a key management solution for those certs.
* It's very manual and requires a lot of setup/configuration for the users.
* It relies on the payload delivery mechanism of the Keylime tenant, which some users turn off for security.
* It doesn't propagate any information about the Keylime setup into SPIRE properties for use in auth policy.
* There's no way to revoke the certificate if attestation fails in the future.

This enhancement should allow for full Keylime/SPIRE plugins to fix all
of those problems and make it really easy and convenient for users.

## Infrastructure Needed (optional)

This enhancement shouldn't need any additional infrastructure.