Merge pull request #2 from lukehinds/presist-agent
Add persist agent enhancement
lukehinds authored Aug 27, 2020
2 parents 27b978b + 29be793 commit 9fa6c4d
Showing 1 changed file with 316 additions and 0 deletions.
316 changes: 316 additions & 0 deletions 1_persist_agent_restart.md
@@ -0,0 +1,316 @@
<!--
**Note:** When your enhancement is complete, all of these comment blocks should be removed.
To get started with this template:
- [ ] **Create an issue in keylime/enhancements**
When filing an enhancement tracking issue, please ensure to complete all
fields in that template. One of the fields asks for a link to the enhancement. You
can leave that blank until this enhancement is made a pull request, and then
go back to the enhancement and add the link.
- [ ] **Make a copy of this template.**
name it `NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
leading-zero padding) assigned to your enhancement above.
- [ ] **Fill out this file as best you can.**
At minimum, you should fill in the "Summary", and "Motivation" sections.
These should be easy if you've preflighted the idea of the enhancement with the
appropriate SIG(s).
- [ ] **Merge early and iterate.**
Avoid getting hung up on specific details and instead aim to get the goals of
the enhancement clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.
-->
# Persist verifier monitoring after agent restarts

<!--
This is the title of your enhancement. Keep it short, simple, and descriptive. A good
title can help communicate what the enhancement is and should be considered as part of
any review.
-->

<!--
A table of contents is helpful for quickly jumping to sections of an enhancement and for
highlighting any additional information provided beyond the standard enhancement
template.
-->

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories (optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Notes/Constraints/Caveats (optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

<!--
**ACTION REQUIRED:** In order to merge code into a release, there must be an
issue in [keylime/enhancements] referencing this enhancement and targeting a release.
For enhancements that make changes to code or processes/procedures in core
Keylime i.e., [keylime/keylime], we require the following Release
Signoff checklist to be completed.
Check these off as they are completed for the Release Team to track. These
checklist items _must_ be updated for the enhancement to be released.
-->

- [ ] Enhancement issue in release milestone, which links to pull request in [keylime/enhancements]
- [ ] Core members have approved the issue with the label `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place
- [ ] User-facing documentation has been created in [keylime/keylime-docs]

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
-->

## Summary

<!--
This section is incredibly important for producing high quality user-focused
documentation such as release notes or a development roadmap. It should be
possible to collect this information before implementation begins in order to
avoid requiring implementers to split their attention between writing release
notes and implementing the feature itself. Reviewers
should help to ensure that the tone and content of the `Summary` section is
useful for a wide audience.
A good summary is probably at least a paragraph in length.
-->

Should someone restart an agent-based server or force an agent offline, the
agent will no longer be monitored by the verifier. When the agent starts again,
it simply registers with the registrar, and IMA monitoring does not resume.

This behavior was originally discussed on the [keylime mailing list](https://keylime.groups.io/g/main/topic/q_is_an_agent_s_policy/72856684?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,72856684).

## Motivation

<!--
This section is for explicitly listing the motivation, goals and non-goals of
this enhancement. Describe why the change is important and the benefits to users.
-->

It's acceptable that someone may want to manually restart a server (or that the
server restarts as part of an automated workflow) while retaining the
configuration set up during the initial "adding" of the agent to the verifier
(`whitelist`, `tpm_policy`). They should not have to add (or update) the agent
on the verifier again every time if there is no change in configuration or
trust mapping (e.g. software CA).

### Goals

<!--
List the specific goals of the enhancement. What is it trying to achieve? How will we
know that this has succeeded?
-->

A user restarts the agent on a target node. When the agent becomes active
again, the verifier resumes monitoring against the delegated measurements
supplied when the target agent was first added to the verifier and registrar.

### Non-Goals

<!--
What is out of scope for this enhancement? Listing non-goals helps to focus discussion
and make progress.
-->

Any sort of migration or fault redundancy (although both areas benefit from this
change).

## Proposal

<!--
This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. The "Design Details" section below is for the real
nitty-gritty.
-->

A target machine is rebooted with no change in state (measured properties). This
machine should not require “re-adding” with the Keylime tenant.

Once the target node / agent returns to an online / reachable state, the
verifier should proceed to recommence run time monitoring.

A new tornado web handler will be created within the verifier to listen for
requests that an agent will emit when it (re)starts.

Code will be introduced within the agent that will perform a `POST` request to
inform the verifier that an agent has been (re)started. This in turn will cause
the verifier to perform an `operational_state` query for the `UUID` of that
agent and then proceed to perform run time integrity monitoring again.

### User Stories (optional)

<!--
Detail the things that people will be able to do if this enhancement is implemented.
Include as much detail as possible so that people can understand the "how" of
the system. The goal here is to make this feel real for users without getting
bogged down.
-->

My server reboots for any given reason. Keylime handles this event and provides
trust monitoring once the server and agent are back online and can be reached
by the verifier.

Should the machine's state have been tampered with during the offline period,
Keylime will immediately fail the target node accordingly (or otherwise show
that the machine is still in the expected trust state according to the
delegated measurements).

If I want to change measurements, I use the existing `update` command available
in the Keylime Tenant CLI.

### Risks and Mitigations

<!--
What are the risks of this proposal and how do we mitigate. Think broadly.
For example, consider both security and how this will impact the larger
enhancement ecosystem.
How will security be reviewed and by whom?
-->

We should be sure we do not introduce security risks, and be mindful of future
enhancements such as multi-tenancy, auth and migration.

## Design Details

<!--
This section should contain enough information that the specifics of your
change are understandable. This may include API specs (though not always
required) or even code snippets. If there's any ambiguity about HOW your
proposal will be implemented, this is the place to discuss them.
-->

Verifier Changes
----------------

A new tornado web handler is created within the verifier to listen for requests
that an agent will emit when it starts. We will call this `/nudge` for now, with
a more suitable name to be agreed during this review.

A new `operational_state` named `OFFLINE` will be created for when a machine
becomes unreachable during a `GET_QUOTE` `operational_state`. This state will be
set once the agent fails to respond during its retry query period set within
the `keylime.conf` configuration file.
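
The transition into `OFFLINE` can be sketched as below. This is a minimal illustration, not Keylime's implementation: the enum, the function name `state_after_failed_quote`, and the `max_retries` parameter are hypothetical, and only the state names `GET_QUOTE` and `OFFLINE` come from this document.

```python
from enum import Enum


class OperationalState(Enum):
    # Hypothetical subset of states. Only the names GET_QUOTE and OFFLINE
    # come from this proposal; the values are illustrative.
    GET_QUOTE = "GET_QUOTE"
    OFFLINE = "OFFLINE"


def state_after_failed_quote(state, failed_attempts, max_retries):
    """Return the next state after a quote request goes unanswered.

    max_retries stands in for the retry limit read from keylime.conf.
    """
    if state is OperationalState.GET_QUOTE and failed_attempts >= max_retries:
        # The agent stayed unreachable for the whole retry period.
        return OperationalState.OFFLINE
    return state
```

Keeping the decision in a small pure function like this would make the retry exhaustion rule easy to unit test without a running verifier.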

A new database row will need to be introduced for the `OFFLINE`
`operational_state`.

Agent Changes
-------------

Code will be introduced to the agent that will perform a `POST` request to
`/nudge` to inform the verifier that an agent has been (re)started. This in turn
will instruct the verifier to perform an `operational_state` query for the
`UUID` of the agent concerned. Should the `operational_state` be `OFFLINE`, the
verifier will change the `operational_state` to `GET_QUOTE` and proceed to
(re)start continuous monitoring of the node with the previously set
measurements (`whitelist`, `tpm_policy`).
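
Both halves of this exchange can be sketched with the standard library alone. Everything here is an assumption made for illustration: the JSON body with an `agent_id` field, the returned status codes, the example verifier URL, and the in-memory `agents` dictionary standing in for the verifier database. In the verifier, logic like `handle_nudge` would sit inside the new tornado handler.

```python
import json
import urllib.request

# State names taken from this proposal.
GET_QUOTE = "GET_QUOTE"
OFFLINE = "OFFLINE"


def build_nudge_request(verifier_url, agent_uuid):
    """Agent side: build (but do not send) the POST announcing a (re)start."""
    # The JSON payload shape is hypothetical; the proposal does not define it.
    body = json.dumps({"agent_id": agent_uuid}).encode("utf-8")
    return urllib.request.Request(
        url=verifier_url.rstrip("/") + "/nudge",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def handle_nudge(agents, agent_uuid):
    """Verifier side: resume monitoring if the agent was marked OFFLINE.

    Returns an HTTP-style status code (the codes are an assumption).
    """
    agent = agents.get(agent_uuid)
    if agent is None:
        return 404  # unknown agent: it was never added via the tenant
    if agent["operational_state"] == OFFLINE:
        # The stored whitelist and tpm_policy are reused; only the state changes.
        agent["operational_state"] = GET_QUOTE
    return 200


# Example round trip with a hypothetical verifier address and agent UUID.
agents = {"agent-1": {"operational_state": OFFLINE, "tpm_policy": "{}"}}
request = build_nudge_request("https://verifier.example:8881", "agent-1")
status = handle_nudge(agents, json.loads(request.data)["agent_id"])
```

A nudge for an agent that is not `OFFLINE` leaves its state untouched, so a repeated or spurious request does not disturb monitoring that is already running.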

Registrar Changes
------------------

No immediate changes come to mind, but we should be mindful of this as the
design evolves.

Keylime TPM comms changes
------------------------

We will need to assess changes required within our TPM communications. For
example, the agent calls `tpm_startup -c` and takes ownership of the TPM
every time it starts. The AK handle is also flushed.

We may need to consider having some sort of flag that the agent queries to
establish whether it is already associated with a verifier.

Rather than bootstrapping itself as a fresh agent, it would retain its TPM
setup and just instantiate its web service to allow REST API interactions with
the verifier again. These interactions will be a continuation of the previous
quote `GET` requests from the verifier, while retaining the existing root of
trust already set up by the registrar (EKpub and AKPub).
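
One possible shape for such a flag is a small state file written when the agent is first added and read on every start. This is only a sketch under stated assumptions: the file format, the `verifier` key, and the names `mark_associated`, `is_associated` and `startup_mode` are all hypothetical, and the proposal does not say where or how the flag would be stored.

```python
import json
import tempfile
from pathlib import Path


def mark_associated(state_file, verifier_address):
    """Record that this agent has been added to a verifier."""
    Path(state_file).write_text(json.dumps({"verifier": verifier_address}))


def is_associated(state_file):
    """Return True if a previous run recorded a verifier association."""
    path = Path(state_file)
    if not path.is_file():
        return False
    try:
        data = json.loads(path.read_text())
    except ValueError:
        # Corrupt or partial file: fall back to a fresh bootstrap.
        return False
    return isinstance(data, dict) and "verifier" in data


def startup_mode(state_file):
    """Choose between a fresh bootstrap and resuming the existing TPM setup."""
    return "resume" if is_associated(state_file) else "bootstrap"


# Example: the first start bootstraps, a later start resumes.
with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "agent_state.json"
    first_start = startup_mode(flag)
    mark_associated(flag, "verifier.example:8881")
    second_start = startup_mode(flag)
```

The flag only selects the startup path. Whether the node is still trustworthy would continue to be decided by the verifier's quote checks.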

### Test Plan

<!--
**Note:** *Not required until targeted at a release.*
Consider the following in developing a test plan for this enhancement:
- Will there be e2e and integration tests, in addition to unit tests?
- How will it be tested in isolation vs with other components?
No need to outline all of the test cases, just the general strategy. Anything
that would count as tricky in the implementation and anything particularly
challenging to test should be called out.
All code is expected to have adequate tests (eventually with coverage
expectations).
-->

Functional tests will be needed to play out the use case of restarting an
agent, persisting state and reestablishing measurements upon its restart.

Unit tests will be needed to test the new `nudge` API functionality.

### Upgrade / Downgrade Strategy

<!--
If applicable, how will the component be upgraded and downgraded? Make sure
this is in the test plan.
Consider the following in developing an upgrade/downgrade strategy for this enhancement
-->

We may need to consider the impact of upgrading while an agent is offline, and
of the new TPM code then interacting with the TPM setup from the previous
release.

## Drawbacks

<!--
Why should this enhancement _not_ be implemented?
-->

TBD

## Alternatives

<!--
What other approaches did you consider and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->

We could evolve the retry handler in the verifier to wait for indefinite periods
instead of having a wake-up API. This is hazardous, as we risk bottlenecks
and would need to manage more state (for example, a node goes offline and
never returns).

## Infrastructure Needed (optional)

<!--
Use this section if you need things infrastructure related specific to your enhancement. Examples include a
new subproject, repos requested, github webhook, changes to CI (travis).
-->

Some changes may be needed to Travis CI, but none are expected currently.

No new repos required.
