Make Keylime easily deployable on Kubernetes/Openshift #1

maugustosilva · 2023-06-02T16:57:21Z

After several discussions with @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse, we collectively decided that the time to have Keylime easily deployed on Kubernetes/Openshift has come. I propose we use this issue to concentrate all the relevant discussion on this topic.

I will start by listing some common relevant points, and I do thank Marcus Hesse for starting the discussion on the keylime-operator on CNCF's Slack. I believe I have addressed most of your questions in this write-up.

The main goal is to end with an "Attestation Operator", which can not only automatically add nodes (i.e., agents) to specific verifiers but can also properly react to administrative activities such as node reboots or cordoning off.

I am not a Kubernetes/OpenShift expert by any means, so my proposal here is bound to be incomplete/incorrect, and additions/corrections are welcome. That being said, I see the following set of intermediate steps, in increasing order of complexity, as a good way to achieve our goal.

Ensure that all keylime components can be fully executed in a containerized manner. For this, the following requirements should be satisfied.
a. Unmodified public images. I suggest we expand https://quay.io/organization/keylime (under Red Hat's control), which already offers the "latest" verifier, registrar and tenant, to also include the rust agent image (@ansasaki is pursuing this).
b. Carefully determine the minimum set of (container) privileges required to run the agent.
c. Provide some tool to perform containerized keylime deployments (@maugustosilva and @galmasi have a tool, which is about to be released into open-source, to perform this task).

Create a simple Kubernetes application for keylime. At this point, we should be able to start by writing progressively more yaml files.

a. The idea is to start with a very simple deployment with the following objects:
* A StatefulSet (initially of 1) for the Registrar
* A StatefulSet (initially of 1) for the Verifier
* A DaemonSet for the Agents
* The Registrar and Verifier both exposed as a Service (type=NodePort)
* mTLS certificates stored as Secrets
* Given that keylime can be fully configured via environment variables, we shall use environment-dependent variables in our yaml.

b. Initially, I propose we make the following simplifying boundary conditions:
* Given the use of sqlite, we could start without any DB deployment
* mTLS certificates are pre-generated (with keylime_ca commands) and added to the Kubernetes cluster
* Environment variables will also be set and maintained by some external tool
* The tenant will NOT be part of the initial deployment.
* Make use of the "Node Feature Discovery" to mark all the nodes with tpm devices (and make it part of the DaemonSet node selector)
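
A minimal sketch of what the agent `DaemonSet` from points (a) and (b) might look like, built as a plain Python dict and printed as JSON (which `kubectl apply -f` also accepts). Everything specific here is an assumption for illustration: the node label `example.com/tpm`, the image name, the Secret name, and the `/certs` mount path are placeholders, and the privileges needed to reach the TPM device are deliberately left out because they are still an open question.

```python
import json

# Hypothetical label that a Node Feature Discovery rule might put on nodes
# with a TPM; the real label depends on how NFD is configured.
TPM_NODE_LABEL = "example.com/tpm"


def agent_daemonset(image, cert_secret):
    """Return a DaemonSet manifest (as a dict) for the keylime agent."""
    labels = {"app": "keylime-agent"}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "keylime-agent", "labels": labels},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Only schedule agents on nodes labelled as having a TPM.
                    "nodeSelector": {TPM_NODE_LABEL: "true"},
                    "containers": [
                        {
                            "name": "agent",
                            "image": image,
                            # Pre-generated mTLS material, mounted read-only.
                            "volumeMounts": [
                                {
                                    "name": "certs",
                                    "mountPath": "/certs",  # assumed path
                                    "readOnly": True,
                                }
                            ],
                        }
                    ],
                    "volumes": [
                        {"name": "certs", "secret": {"secretName": cert_secret}}
                    ],
                },
            },
        },
    }


# Placeholder image and Secret names, not published artifacts.
manifest = agent_daemonset("quay.io/keylime/keylime_agent:latest", "keylime-certs")
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code rather than hand-writing yaml is only one option; the same structure maps one-to-one onto a yaml file or a Helm template.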

c. From this point we should expand to a "scale-out" deployment.
* Multiple Registrars and Verifiers
* A pre-packaged helm deployment of some SQL database server will be used.
* A Service (type=LoadBalancer)

d. At this point, the following technical considerations should be made.
* I am hoping we can "get away" with a pre-packaged n-way replicated SQL DB server.
* Verifiers are identified by a "verifier ID", which I assume can be taken from the "persistent identifier within a StatefulSet"
* The load balancing algorithm will have to use the URI (which contains the agent UUID) for the selection of the backend (i.e., we cannot use round-robin or source IP, given that presently a single tenant will add all the agents to the set of verifiers)
* Tenant is still considered as a component outside of the whole deployment
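
The two considerations above (a verifier ID derived from the StatefulSet identity, and backend selection keyed on the agent UUID in the URI) can be sketched as follows. This is an illustration under stated assumptions, not how keylime or any particular load balancer does it: the example URI layout is assumed, the agent ID is assumed to be a standard UUID, and plain hash-modulo is a stand-in for whatever algorithm the load balancer would really offer.

```python
import hashlib
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID anywhere in a string.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def verifier_id_from_pod_name(pod_name):
    """StatefulSet pods are named <statefulset-name>-<ordinal>; return the ordinal."""
    _, sep, ordinal = pod_name.rpartition("-")
    if not sep or not ordinal.isdigit():
        raise ValueError(f"not a StatefulSet pod name: {pod_name!r}")
    return int(ordinal)


def pick_verifier(uri, replicas):
    """Map the agent UUID found in the request URI to a verifier ordinal."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    match = UUID_RE.search(uri)
    if match is None:
        raise ValueError(f"no agent UUID found in URI: {uri!r}")
    # Hash the normalized UUID so the same agent always lands on the same backend.
    digest = hashlib.sha256(match.group(0).lower().encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % replicas


# Hypothetical request path and pod name, for illustration only.
uri = "/v2.1/agents/d432fbb3-d2f1-4a97-9ef7-75bd81c00000"
print("agent goes to verifier", pick_verifier(uri, replicas=3))
print("this pod is verifier", verifier_id_from_pod_name("keylime-verifier-2"))
```

Note that with plain modulo hashing, changing the number of replicas remaps most agents to a different verifier, which is one reason the "rebalancing" question for scale-out operations is not trivial.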
Create an Operator for keylime. My experience writing operators is fairly limited, but I will point out some of the desirable characteristics:
- Ability to automatically generate all pertinent certificates
- Ability to deal with environment variables
- Ability to automatically add agents to verifiers
- Ability to react to administrative tasks on nodes, such as reboot, draining, or cordoning off.
Make the Operator more "production-ready"
- How to deal with (measured boot and runtime/IMA) policies?
- How to deal with "scale-out" operations (i.e., if the number of verifier pods increases, should we perform "rebalancing")?
- How to integrate "durable attestation" in this scenario?
The majority of the aforementioned stakeholders (@maugustosilva @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse) voted for having this work developed in a new repository within the keylime project. I will create such a repository.

mheese · 2023-06-05T18:29:10Z

> The main goal is to end with an "Attestation Operator", which can not only automatically add nodes (i.e., agents) to specific verifiers but can also properly react to administrative activities such as node reboots or cordoning off.

Listing the goal/purpose of the operator is a great idea. We should place this in the README for everyone to see immediately.

> 1. Ensure that all `keylime` components can be fully executed in a containerized manner. For this, the following requirements should be satisfied.
>    a. Unmodified public images. I suggest we expand https://quay.io/organization/keylime (under Red Hat's control), which already offers the "latest" `verifier`, `registrar` and `tenant`, to also include the rust `agent` image (@ansasaki is pursuing this).
>    b. Carefully determine the minimum set of (container) privileges required to run the `agent`.

@ansasaki are you actively working on this? if not, this is a good task for me to take on.

>    c. Provide some tool to perform containerized `keylime` deployments (@maugustosilva and @galmasi have a tool, which is about to be released into open-source, to perform this task).

@maugustosilva I assume this is for containerized deployments outside of Kubernetes?

> 2. Create a simple Kubernetes application for `keylime`. At this point, we should be able to start by writing progressively more `yaml` files.
>    a. The idea is to start with a very simple deployment with the following objects:
>    * A `StatefulSet` (initially of 1) for the `Registrar`
>    * A `StatefulSet` (initially of 1) for the `Verifier`
>    * A `DaemonSet` for the `Agents`
>    * The `Registrar` and `Verifier` both exposed as a `Service` (`type=NodePort`)
>    * mTLS certificates stored as `Secrets`
>    * Given that `keylime` can be fully configured via environment variables, we shall use environment-dependent variables in our yaml.
>    b. Initially, I propose we make the following simplifying boundary conditions:
>    * Given the use of `sqlite`, we could start without any DB deployment
>    * mTLS certificates are pre-generated (with `keylime_ca` commands) and added to the Kubernetes cluster
>    * Environment variables will also be set and maintained by some external tool
>    * The `tenant` will NOT be part of the initial deployment.
>    * Make use of the "Node Feature Discovery" to mark all the nodes with `tpm` devices (and make it part of the `DaemonSet` node selector)

I like the idea of the initial boundary conditions; it will make it a lot easier to make progress. Here are some questions/comments I have:

- we should provide the deployment as a Helm chart
- we could easily make the helm chart accessible as ORAS artifacts like the container images over the quay.io registry
- does the registrar really need to be a StatefulSet as well? If yes, why? I thought its design is "stateless", and a Deployment could be enough
- the Verifier as a StatefulSet is unfortunately probably required when it is being scaled, because the specific verifier "owns" agents. As we are generally redesigning some things, this is IMHO something we should pay attention to, so that we can avoid this design. Maybe instead of verifiers owning agents it could be a job distribution system? That would turn verifiers into "verifier workers" that take on jobs; it makes them stateless and way easier to scale in general
- as mentioned before and I think we all agreed, the agent deployment should be optional, but activated by default
- with regards to certificates there are two things we should do: (a) to begin with we document commands on how to generate the certificates and create Kubernetes secrets from them, (b) we can have a "cert-manager" integration, as this is the most popular tool to manage certificates on Kubernetes
- sqlite is probably a good start as long as it is possible for the registrar and the verifier to have their own sqlite database
- registrar and verifier deployments/statefulsets must have a hard-coded replica count of 1 for now in the Helm chart
- love the idea around the node feature discovery for discovering TPM devices, but I think this could also come in a second step
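
For point (a) on certificates, a rough sketch of turning pre-generated files into a Kubernetes `Secret` manifest. It relies only on the fact that values under a Secret's `data` field are base64-encoded; the Secret name and the file names below are hypothetical placeholders, and the byte strings stand in for real certificate contents.

```python
import base64
import json


def cert_secret(name, files):
    """Build an Opaque Secret manifest from a {filename: bytes} mapping."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        # Secret "data" values must be base64-encoded strings.
        "data": {
            filename: base64.b64encode(content).decode("ascii")
            for filename, content in files.items()
        },
    }


# Placeholder file names and contents; in practice these would be read
# from wherever the certificates were generated.
secret = cert_secret(
    "keylime-certs",
    {
        "cacert.crt": b"-----BEGIN CERTIFICATE-----\n...placeholder...\n",
        "server-cert.crt": b"-----BEGIN CERTIFICATE-----\n...placeholder...\n",
    },
)
print(json.dumps(secret, indent=2))
```

The same result can be had from the command line with `kubectl create secret generic <name> --from-file=<path>`, which is probably what the documented commands would use.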

>    c. From this point we should expand to a "scale-out" deployment.
>    * Multiple `Registrars` and `Verifiers`
>    * A pre-packaged `helm` deployment of some SQL database server will be used.
>    * A `Service` (`type=LoadBalancer`)
>    d. At this point, the following technical considerations should be made.
>    * I am hoping we can "get away" with a pre-packaged n-way replicated SQL DB server.
>    * `Verifiers` are identified by a "verifier ID", which I assume can be taken from the "persistent identifier within a StatefulSet"
>    * The load balancing algorithm will have to use the URI (which contains the `agent` UUID) for the selection of the backend (i.e., we cannot use round-robin or source IP, given that presently a single `tenant` will add all the `agents` to the set of `verifiers`)
>    * Tenant is still considered as a component outside of the whole deployment

- this is exactly the right next step, but I feel like this is a long way in the future unfortunately
- we should be able to use a helm chart dependency to pull in a SQL database deployment
- I think you are getting to the heart of the problem: the tenant interaction is what is actually performing the load-balancing, so to speak
- the way I thought about it is that any "tenant" interaction is essentially part of the "operator". The tenant essentially becomes an operator.

> 3. Create an `Operator` for `keylime`. My experience writing operators is fairly limited, but I will point out some of the desirable characteristics:
>
>    * Ability to automatically generate all pertinent certificates
>    * Ability to deal with environment variables
>    * Ability to automatically add `agents` to `verifiers`
>    * Ability to react to administrative tasks on nodes, such as reboot, draining, or cordoning off.

- what do you mean by ability to deal with environment variables?
- agree on certs
- agree on automatically adding agents
- love the idea on reacting to reboots, etc. although not all events might be easy to detect
- the language of choice should be golang for the operator (as that ecosystem is basically all golang)
- most of the goals should be doable with CRDs and their respective Kubernetes controllers
- however, there might be a need to create a Kubernetes resource in the registrar to kickstart the process to make it "automatic" (otherwise the creation of a resource would be the tenant CLI equivalent)

> 4. Make the `Operator` more "production-ready"
>
>    * How to deal with (`measured boot` and `runtime/IMA`) policies?
>    * How to deal with "scale-out" operations (i.e., if the number of `verifier` pods increases, should we perform "rebalancing")?
>    * How to integrate "durable attestation" in this scenario?

These are the $100 questions :)

> 5. The majority of the aforementioned stakeholders (@maugustosilva @mpeters @ansasaki Lukas Vrabec @galmasi and Marcus Hesse) voted for having this work developed in a new repository within the `keylime` project. I will create such a repository.

@maugustosilva if you don't mind, I would start to create issues for at least some of the work that you are proposing here, so that I can get started to work on them?

maugustosilva · 2023-06-06T15:21:28Z

Hey @mheese, trying to answer a few of the questions here, but will most definitely start to fold it out into multiple issues:

> we should provide the deployment as a Helm chart

100% agree

> we could easily make the helm chart accessible as ORAS artifacts like the container images over the quay.io registry

sure, cannot think of a reason why not

> does the registrar really need to be a StatefulSet as well? If yes, why? I thought its design is "stateless", and a Deployment could be enough

absolutely right, the Registrar does not have to be a StatefulSet

> the Verifier as a StatefulSet is unfortunately probably required when it is being scaled, because the specific verifier "owns" agents. As we are generally redesigning some things, this is IMHO something we should pay attention to, so that we can avoid this design. Maybe instead of verifiers owning agents it could be a job distribution system? That would turn verifiers into "verifier workers" that take on jobs; it makes them stateless and way easier to scale in general

agree, but it is a tall order, will require significant changes in keylime

> as mentioned before and I think we all agreed, the agent deployment should be optional, but activated by default

ah yes, yes... I have been actually playing around with some NFD script to label nodes with TPMs

> with regards to certificates there are two things we should do: (a) to begin with we document commands on how to generate the certificates and create Kubernetes secrets from them, (b) we can have a "cert-manager" integration, as this is the most popular tool to manage certificates on Kubernetes

on it (item a)

> sqlite is probably a good start as long as it is possible for the registrar and the verifier to have their own sqlite database

> registrar and verifier deployments/statefulsets must have a hard-coded replica count of 1 for now in the Helm chart

+1

> love the idea around the node feature discovery for discovering TPM devices, but I think this could also come in a second step

sure, not crucial, will just leave as an open issue

> this is exactly the right next step, but I feel like this is a long way in the future unfortunately

I see... maybe I am underestimating the complexities of it

> we should be able to use a helm chart dependency to pull in a SQL database deployment

I am counting on your help and expertise on that one, I am certainly not too familiar with any "good and simple" SQL helm charts

> I think you are getting to the heart of the problem: the tenant interaction is what is actually performing the load-balancing, so to speak

An unfortunate problem, which is not gonna go away any time soon (waaaay too many changes in keylime proper)

> the way I thought about it is that any "tenant" interaction is essentially part of the "operator". The tenant essentially becomes an operator.

right, but even in this case a keylime admin might want to stop/remove/update a particular agent at a given time

> what do you mean by ability to deal with environment variables?

how do we propagate env vars back to Pods? maybe it is just a matter of envFrom with a configMapRef

> agree on certs

+1

> agree on automatically adding agents

will generate an issue on keylime proper

> love the idea on reacting to reboots, etc. although not all events might be easy to detect

+1

> the language of choice should be golang for the operator (as that ecosystem is basically all golang)

I see.

> most of the goals should be doable with CRDs and their respective Kubernetes controllers

I thought so, but still do not have the full picture in my head

> however, there might be a need to create a Kubernetes resource in the registrar to kickstart the process to make it "automatic" (otherwise the creation of a resource would be the tenant CLI equivalent)

Hmmm, interesting
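
On the environment variable question, the `envFrom` plus `configMapRef` idea mentioned above could look roughly like the sketch below, again as Python dicts printed as JSON. The ConfigMap name, container name, image, and the variable names and values are all hypothetical placeholders, not confirmed keylime settings.

```python
import json


def env_configmap(name, env):
    """Build a ConfigMap whose keys become environment variable names."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        # ConfigMap data values must be strings.
        "data": {key: str(value) for key, value in env.items()},
    }


def container_with_env(name, image, configmap_name):
    """Build a container spec that imports every key of the ConfigMap as an env var."""
    return {
        "name": name,
        "image": image,
        "envFrom": [{"configMapRef": {"name": configmap_name}}],
    }


# Hypothetical variable names and values, for illustration only.
cm = env_configmap("keylime-env", {"EXAMPLE_VERIFIER_PORT": 8881, "EXAMPLE_LOG_LEVEL": "INFO"})
container = container_with_env("verifier", "example.invalid/keylime-verifier:latest", "keylime-env")
print(json.dumps({"configMap": cm, "container": container}, indent=2))
```

One caveat with this approach: environment variables are read when the container starts, so editing the ConfigMap does not change running Pods until they are restarted. Whoever maintains the variables (external tool or operator) would also have to trigger that restart.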

maugustosilva mentioned this issue Jun 2, 2023

Make Keylime easily deployable on Kubernetes/Openshift keylime/keylime#1378

Closed

galmasi pushed a commit to galmasi/attestation-operator that referenced this issue Jan 25, 2024

controller helm fix keylime#1: add a version number to the controller…

6955d72

… dependency in the main keylime chart. Signed-off-by: George Almasi <gheorghe@us.ibm.com>
