Merge pull request #2 from lukehinds/presist-agent
Add persist agent enhancement
Showing 1 changed file with 316 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,316 @@ | ||
<!-- | ||
**Note:** When your enhancement is complete, all of these comment blocks should be removed. | ||
To get started with this template: | ||
- [ ] **Create an issue in keylime/enhancements** | ||
When filing an enhancement tracking issue, please ensure to complete all | ||
fields in that template. One of the fields asks for a link to the enhancement. You | ||
can leave that blank until this enhancement is made a pull request, and then | ||
go back to the enhancement and add the link. | ||
- [ ] **Make a copy of this template.** | ||
name it `NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no | ||
leading-zero padding) assigned to your enhancement above. | ||
- [ ] **Fill out this file as best you can.** | ||
At minimum, you should fill in the "Summary", and "Motivation" sections. | ||
These should be easy if you've preflighted the idea of the enhancement with the | ||
appropriate SIG(s). | ||
- [ ] **Merge early and iterate.** | ||
Avoid getting hung up on specific details and instead aim to get the goals of | ||
the enhancement clarified and merged quickly. The best way to do this is to just | ||
start with the high-level sections and fill out details incrementally in | ||
subsequent PRs. | ||
--> | ||
# Persist verifier monitoring after agent restarts | ||
|
||
<!-- | ||
This is the title of your enhancement. Keep it short, simple, and descriptive. A good | ||
title can help communicate what the enhancement is and should be considered as part of | ||
any review. | ||
--> | ||
|
||
<!-- | ||
A table of contents is helpful for quickly jumping to sections of an enhancement and for | ||
highlighting any additional information provided beyond the standard enhancement | ||
template. | ||
--> | ||
|
||
<!-- toc --> | ||
- [Release Signoff Checklist](#release-signoff-checklist) | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
- [Goals](#goals) | ||
- [Non-Goals](#non-goals) | ||
- [Proposal](#proposal) | ||
- [User Stories (optional)](#user-stories-optional) | ||
- [Story 1](#story-1) | ||
- [Story 2](#story-2) | ||
- [Notes/Constraints/Caveats (optional)](#notesconstraintscaveats-optional) | ||
- [Risks and Mitigations](#risks-and-mitigations) | ||
- [Design Details](#design-details) | ||
- [Test Plan](#test-plan) | ||
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) | ||
- [Drawbacks](#drawbacks) | ||
- [Alternatives](#alternatives) | ||
- [Infrastructure Needed (optional)](#infrastructure-needed-optional) | ||
<!-- /toc --> | ||
|
||
## Release Signoff Checklist | ||
|
||
<!-- | ||
**ACTION REQUIRED:** In order to merge code into a release, there must be an | ||
issue in [keylime/enhancements] referencing this enhancement and targeting a release. | ||
For enhancements that make changes to code or processes/procedures in core | ||
Keylime i.e., [keylime/keylime], we require the following Release | ||
Signoff checklist to be completed. | ||
Check these off as they are completed for the Release Team to track. These | ||
checklist items _must_ be updated for the enhancement to be released. | ||
--> | ||
|
||
- [ ] Enhancement issue in release milestone, which links to pull request in [keylime/enhancements] | ||
- [ ] Core members have approved the issue with the label `implementable` | ||
- [ ] Design details are appropriately documented | ||
- [ ] Test plan is in place | ||
- [ ] User-facing documentation has been created in [keylime/keylime-docs] | ||
|
||
<!-- | ||
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone. | ||
--> | ||
|
||
## Summary | ||
|
||
<!-- | ||
This section is incredibly important for producing high quality user-focused | ||
documentation such as release notes or a development roadmap. It should be | ||
possible to collect this information before implementation begins in order to | ||
avoid requiring implementers to split their attention between writing release | ||
notes and implementing the feature itself. Reviewers | ||
should help to ensure that the tone and content of the `Summary` section is | ||
useful for a wide audience. | ||
A good summary is probably at least a paragraph in length. | ||
--> | ||
|
||
If someone restarts a server that runs an agent, or forces an agent offline, the | ||
verifier stops monitoring that agent. When the agent starts again, it only | ||
registers with the registrar, and IMA monitoring does not resume. | ||
|
||
This behavior was originally discussed on the [keylime mailing list](https://keylime.groups.io/g/main/topic/q_is_an_agent_s_policy/72856684?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,72856684). | ||
|
||
## Motivation | ||
|
||
<!-- | ||
This section is for explicitly listing the motivation, goals and non-goals of | ||
this enhancement. Describe why the change is important and the benefits to users. | ||
--> | ||
|
||
It is acceptable that someone may want to manually restart a server (or that the | ||
server restarts as part of an automated workflow) while retaining the | ||
configuration set up during the initial "adding" of the agent to the verifier | ||
(`whitelist`, `tpm_policy`). They should not have to add (or update) the agent | ||
on the verifier again each time when there is no change in configuration or trust | ||
mapping (e.g. software CA). | ||
|
||
### Goals | ||
|
||
<!-- | ||
List the specific goals of the enhancement. What is it trying to achieve? How will we | ||
know that this has succeeded? | ||
--> | ||
|
||
A user restarts the agent on a target node. When the agent becomes active | ||
again, the verifier resumes monitoring against the measurements delegated | ||
when the target agent was first added to the verifier and registrar. | ||
|
||
### Non-Goals | ||
|
||
<!-- | ||
What is out of scope for this enhancement? Listing non-goals helps to focus discussion | ||
and make progress. | ||
--> | ||
|
||
Any sort of migration or fault redundancy (although both areas benefit from this | ||
change). | ||
|
||
## Proposal | ||
|
||
<!-- | ||
This is where we get down to the specifics of what the proposal actually is. | ||
This should have enough detail that reviewers can understand exactly what | ||
you're proposing, but should not include things like API designs or | ||
implementation. The "Design Details" section below is for the real | ||
nitty-gritty. | ||
--> | ||
|
||
A target machine is rebooted with no change in state (measured properties). This | ||
machine should not need “re-adding” with the Keylime tenant. | ||
|
||
Once the target node / agent returns to an online / reachable state, the | ||
verifier should resume runtime monitoring. | ||
|
||
A new tornado web handler will be created within the verifier to listen for | ||
requests that an agent will emit when it (re)starts. | ||
|
||
Code will be introduced within the agent that will perform a `POST` request to | ||
inform the verifier that an agent has been (re)started. This in turn will cause the | ||
verifier to perform an `operational_state` query for the `UUID` of that agent | ||
and then resume runtime integrity monitoring. | ||
|
||
### User Stories (optional) | ||
|
||
<!-- | ||
Detail the things that people will be able to do if this enhancement is implemented. | ||
Include as much detail as possible so that people can understand the "how" of | ||
the system. The goal here is to make this feel real for users without getting | ||
bogged down. | ||
--> | ||
|
||
My server reboots for some reason. Keylime handles this event and provides | ||
trust monitoring once the server and agent are back online and can be reached | ||
by the verifier. | ||
|
||
Should the machine's state have been tampered with during the offline period, | ||
Keylime will immediately fail the target node (or, if nothing changed, show that the | ||
machine is still in the expected trust state according to the delegated | ||
measurements). | ||
|
||
If I want to change measurements, I use the existing `update` command available | ||
in the Keylime Tenant CLI. | ||
|
||
### Risks and Mitigations | ||
|
||
<!-- | ||
What are the risks of this proposal and how do we mitigate. Think broadly. | ||
For example, consider both security and how this will impact the larger | ||
enhancement ecosystem. | ||
How will security be reviewed and by whom? | ||
--> | ||
|
||
We should be sure we do not introduce security risks and be mindful of future | ||
enhancements such as multi-tenancy, auth and migration. | ||
|
||
## Design Details | ||
|
||
<!-- | ||
This section should contain enough information that the specifics of your | ||
change are understandable. This may include API specs (though not always | ||
required) or even code snippets. If there's any ambiguity about HOW your | ||
proposal will be implemented, this is the place to discuss them. | ||
--> | ||
|
||
Verifier Changes | ||
---------------- | ||
|
||
A new tornado web handler is created within the verifier to listen for requests | ||
that an agent will emit when it starts. We will call this `/nudge` for now; | ||
a more suitable name can be agreed during this review. | ||
|
||
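A minimal sketch of the request handling behind such an endpoint, kept framework-free so it can be exercised without a running server. The JSON body shape (`{"agent_id": ...}`), the helper name `handle_nudge`, the status codes, and the assumption that agent identifiers parse as standard UUIDs are all illustrative and not part of this proposal. In the verifier, logic like this would presumably be called from the `post` method of a tornado `RequestHandler`.

```python
import json
import uuid


def handle_nudge(raw_body: bytes, known_agents: set) -> tuple:
    """Validate a nudge request and return (http_status, response_dict).

    Hypothetical helper: the body format and status codes are assumptions.
    """
    try:
        payload = json.loads(raw_body)
        # Normalise the identifier; assumes agent IDs are standard UUIDs.
        agent_id = str(uuid.UUID(payload["agent_id"]))
    except (ValueError, KeyError, TypeError, AttributeError):
        return 400, {"status": "malformed nudge request"}

    if agent_id not in known_agents:
        # An agent that was never added must not trigger monitoring.
        return 404, {"status": "unknown agent"}

    return 200, {"status": "nudge accepted", "agent_id": agent_id}
```

Rejecting unknown identifiers up front keeps the endpoint from being used to start monitoring for an agent the tenant never added.
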
A new `operational_state` named `OFFLINE` will be created for when a machine | ||
becomes unreachable during a `GET_QUOTE` `operational_state`. This state will be | ||
set once the agent fails to respond during its retry query period set within | ||
the `keylime.conf` configuration file. | ||
|
||
A new database row will need to be introduced for the `OFFLINE` | ||
`operational_state`. | ||
|
||
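The transition into `OFFLINE` described above could look roughly like the sketch below. Only the state names `GET_QUOTE` and `OFFLINE` come from this proposal. The enum class, `AgentRecord`, `record_quote_failure`, and the `max_retries` parameter (standing in for the retry setting read from `keylime.conf`) are hypothetical, and the enum values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum, auto


class OperationalState(Enum):
    # Placeholder values; the verifier's real state codes are not assumed here.
    GET_QUOTE = auto()
    OFFLINE = auto()


@dataclass
class AgentRecord:
    agent_id: str
    operational_state: OperationalState = OperationalState.GET_QUOTE
    failed_attempts: int = 0


def record_quote_failure(agent: AgentRecord, max_retries: int) -> OperationalState:
    """Count an unanswered quote request; go OFFLINE once retries run out."""
    if agent.operational_state is not OperationalState.GET_QUOTE:
        # Only agents that are being polled for quotes can time out.
        return agent.operational_state
    agent.failed_attempts += 1
    if agent.failed_attempts > max_retries:
        agent.operational_state = OperationalState.OFFLINE
    return agent.operational_state
```

With `max_retries=2`, the initial attempt plus two retries may fail before the record is marked `OFFLINE`. The stored measurements are not touched by this transition.
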
Agent Changes | ||
------------- | ||
|
||
Code will be introduced to the agent that will perform a `POST` request to | ||
`/nudge` to inform the verifier that an agent has been (re)started. This in turn will | ||
instruct the verifier to perform an `operational_state` query for the `UUID` of | ||
the agent concerned. Should the `operational_state` be `OFFLINE`, the verifier will | ||
change the `operational_state` to `GET_QUOTE` and (re)start continuous | ||
monitoring of the node with the previously set measurements (`whitelist`, | ||
`tpm_policy`). | ||
|
||
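Both halves of this exchange can be sketched with the standard library alone. The function names, the JSON body, and the in-memory `agent_states` mapping are assumptions made for illustration; the request is only built, not sent, and transport security and authentication are left out entirely.

```python
import json
import urllib.request

GET_QUOTE = "GET_QUOTE"
OFFLINE = "OFFLINE"


def build_nudge_request(verifier_url: str, agent_id: str) -> urllib.request.Request:
    """Agent side: prepare the POST announcing a (re)start (not sent here)."""
    body = json.dumps({"agent_id": agent_id}).encode("utf-8")
    return urllib.request.Request(
        url=verifier_url.rstrip("/") + "/nudge",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def on_nudge(agent_states: dict, agent_id: str) -> bool:
    """Verifier side: resume monitoring only for an agent marked OFFLINE.

    Returns True when the state changed to GET_QUOTE. The stored whitelist
    and tpm_policy are untouched, so the earlier measurements apply again.
    """
    if agent_states.get(agent_id) != OFFLINE:
        return False
    agent_states[agent_id] = GET_QUOTE
    return True
```

Ignoring nudges for agents that are not `OFFLINE` means a repeated or spurious request cannot reset an agent that is already being monitored or has failed.
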
Registrar Changes | ||
------------------ | ||
|
||
No immediate changes come to mind, but we should be mindful of this as the | ||
design evolves. | ||
|
||
Keylime TPM Communication Changes | ||
--------------------------------- | ||
|
||
We will need to assess changes required within our TPM communications. For | ||
example, the agent calls `tpm_startup -c` and takes ownership of the TPM | ||
every time it starts. The AK handle is also flushed. | ||
|
||
We may need to consider having some sort of flag that the agent queries to establish | ||
that it is already associated with a verifier. | ||
|
||
Rather than bootstrapping itself as a fresh agent, the agent would retain its TPM | ||
setup and just instantiate its web service to allow REST API | ||
interactions with the verifier again. These interactions will be a continuation | ||
of the previous quote `GET` requests from the verifier, while retaining the | ||
existing root of trust already set up by the registrar (EKpub and AKpub). | ||
|
||
### Test Plan | ||
|
||
<!-- | ||
**Note:** *Not required until targeted at a release.* | ||
Consider the following in developing a test plan for this enhancement: | ||
- Will there be e2e and integration tests, in addition to unit tests? | ||
- How will it be tested in isolation vs with other components? | ||
No need to outline all of the test cases, just the general strategy. Anything | ||
that would count as tricky in the implementation and anything particularly | ||
challenging to test should be called out. | ||
All code is expected to have adequate tests (eventually with coverage | ||
expectations). | ||
--> | ||
|
||
Functional tests will be needed to play out the use case of restarting an | ||
agent, persisting state and reestablishing measurements upon its restart. | ||
|
||
Unit tests will be needed to test the new `nudge` API functionality. | ||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
<!-- | ||
If applicable, how will the component be upgraded and downgraded? Make sure | ||
this is in the test plan. | ||
Consider the following in developing an upgrade/downgrade strategy for this enhancement | ||
--> | ||
|
||
We may need to consider the impact of upgrading while an agent is offline, with the new | ||
TPM code changes then interacting with the TPM setup from the previous release. | ||
|
||
## Drawbacks | ||
|
||
<!-- | ||
Why should this enhancement _not_ be implemented? | ||
--> | ||
|
||
TBD | ||
|
||
## Alternatives | ||
|
||
<!-- | ||
What other approaches did you consider and why did you rule them out? These do | ||
not need to be as detailed as the proposal, but should include enough | ||
information to express the idea and why it was not acceptable. | ||
--> | ||
|
||
We could evolve the retry handler in the verifier to wait for indefinite periods | ||
instead of having a wake-up API. This is hazardous, as we risk bottlenecks | ||
and would need to manage more state (for example, a node that goes offline and | ||
never returns). | ||
|
||
## Infrastructure Needed (optional) | ||
|
||
<!-- | ||
Use this section if you need things infrastructure related specific to your enhancement. Examples include a | ||
new subproject, repos requested, github webhook, changes to CI (travis). | ||
--> | ||
|
||
Some changes to Travis CI may be needed, but none are expected currently. | ||
|
||
No new repos required. |