diff --git a/1_persist_agent_restart.md b/1_persist_agent_restart.md new file mode 100644 index 0000000..8d84e4f --- /dev/null +++ b/1_persist_agent_restart.md @@ -0,0 +1,316 @@ + +# Persist verifier monitoring after agent restarts + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +- [ ] Enhancement issue in release milestone, which links to pull request in [keylime/enhancements] +- [ ] Core members have approved the issue with the label `implementable` +- [ ] Design details are appropriately documented +- [ ] Test plan is in place +- [ ] User-facing documentation has been created in [keylime/keylime-docs] + + + +## Summary + + + +Should someone restart an agent-based server or force an agent offline, the +agent will no longer be monitored by the verifier. Upon starting, the agent +just registers with the registrar, and IMA monitoring does not resume. 
+ +This behavior was originally discussed on the [keylime mailing list](https://keylime.groups.io/g/main/topic/q_is_an_agent_s_policy/72856684?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,72856684) + +## Motivation + + + +It is reasonable that someone may want to manually restart a server (or that the server +restarts as part of an automated workflow) while retaining the configuration +set up during the initial "adding" of the agent to the verifier (`whitelist`, +`tpm_policy`). They should not have to re-add the agent to (or update) the verifier +every time if there is no change in configuration or trust mapping (e.g. software +CA). + +### Goals + + + +A user restarts the agent on a target node. When the agent becomes active +again, the verifier resumes monitoring using the measurements delegated +when the target agent was first added to the verifier and registrar. + +### Non-Goals + + + +Any sort of migration or fault redundancy (although both areas benefit from this +change). + +## Proposal + + + +A target machine is rebooted with no change in state (measured properties). This +machine should not require “re-adding” with the Keylime tenant. + +Once the target node / agent returns to an online / reachable state, the +verifier should resume runtime monitoring. + +A new Tornado web handler will be created within the verifier to listen for +requests that an agent will emit when it (re)starts. + +Code will be introduced within the agent that will perform a `POST` request to +inform the verifier that an agent has been (re)started. This in turn will cause the +verifier to perform an `operational_state` query for the `UUID` of that agent +and then resume runtime integrity monitoring. + +### User Stories (optional) + + + +My server reboots for any reason. Keylime handles this event and provides +trust monitoring once the server and agent are back online and can be reached +by the verifier. 
+ +Should the machine's state have been tampered with during the offline period, +Keylime will immediately fail the target node accordingly (or otherwise confirm that the +machine is still in the expected trust state according to the delegated +measurements). + +If I want to change measurements, I use the existing `update` command available +in the Keylime Tenant CLI. + +### Risks and Mitigations + + + +We should be sure we do not introduce security risks and be mindful of future +enhancements such as multi-tenancy, auth and migration. + +## Design Details + + + +Verifier Changes +---------------- + +A new Tornado web handler is created within the verifier to listen for requests +that an agent will emit when it starts. We will call this `/nudge` for now; +a more suitable name can be agreed during this review. + +A new `operational_state` named `OFFLINE` will be created for when a machine +becomes unreachable during a `GET_QUOTE` `operational_state`. This state will be +set once the agent fails to respond during its retry query period set within +the `keylime.conf` configuration file. + +A new database row will need to be introduced for the `OFFLINE` +`operational_state`. + +Agent Changes +------------- + +Code will be introduced to the agent that will perform a `POST` request to +`/nudge` to inform the verifier that an agent has been (re)started. This in turn will +instruct the verifier to perform an `operational_state` query for the `UUID` of +the concerned agent. Should the `operational_state` be `OFFLINE`, the verifier will +change the `operational_state` to `GET_QUOTE` and (re)start continuous +monitoring of the node with the previously set measurements (`whitelist`, +`tpm_policy`). + +Registrar Changes +------------------ + +No immediate changes come to mind, but we should be mindful of this as the +design evolves. + +Keylime TPM communication changes +--------------------------------- + +We will need to assess changes required within our TPM communications. 
For +example, the agent calls `tpm_startup -c` and takes ownership of the TPM +every time it starts. The AK handle is also flushed. + +We may need to consider having some sort of flag the agent queries to establish +whether it is already associated with a verifier. + +Rather than bootstrapping itself as a fresh agent, it retains its TPM +setup and just instantiates its web service to allow REST API +interactions with the verifier again. These interactions will be a continuation +of the previous quote `GET` requests from the verifier, while retaining the +existing root of trust already set up by the registrar (EKpub and AKpub). + +### Test Plan + + + +Functional tests will be needed to exercise the use case of restarting an +agent, persisting state and reestablishing measurements upon its restart. + +Unit tests will be needed to test the new `nudge` API functionality. + +### Upgrade / Downgrade Strategy + + + +We may need to consider the impact of upgrading while an agent is offline, and of the new +TPM code changes interacting with the TPM setup from the previous release. + +## Drawbacks + + + +TBD + +## Alternatives + + + +We evolve the retry handler in the verifier to wait for indefinite periods +instead of having a wake-up API. This is hazardous, as we risk bottlenecks +and need to consider managing more state (for example, a node goes offline and +never returns). + +## Infrastructure Needed (optional) + + + +Some changes may be needed to Travis CI, but none are expected currently. + +No new repos required.
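
As a rough illustration of the verifier-side logic described under "Agent Changes" in Design Details, the sketch below shows how a `/nudge` request might move an agent from `OFFLINE` back to `GET_QUOTE`. It is not Keylime code: the `OperationalState` enum values, the `FAILED` member, the `handle_nudge` function, and the in-memory `agents` dict are all hypothetical stand-ins for the verifier's real state constants and database.

```python
from enum import Enum


class OperationalState(Enum):
    # Illustrative subset only. GET_QUOTE and OFFLINE are named in the
    # proposal; FAILED and all values here are assumptions.
    GET_QUOTE = "get_quote"
    OFFLINE = "offline"
    FAILED = "failed"


def handle_nudge(agents, agent_uuid):
    """Return True if monitoring should resume for agent_uuid.

    `agents` is a hypothetical in-memory stand-in for the verifier
    database: a dict mapping UUID -> record with an "operational_state".
    """
    record = agents.get(agent_uuid)
    if record is None:
        # Unknown agent: it was never added via the tenant, so ignore it.
        return False
    if record["operational_state"] is not OperationalState.OFFLINE:
        # Only an agent previously marked OFFLINE is resumed; any other
        # state (for example FAILED) is left untouched.
        return False
    # Keep the stored whitelist / tpm_policy and go back to quoting.
    record["operational_state"] = OperationalState.GET_QUOTE
    return True
```

Restricting the transition to agents already in `OFFLINE` means a nudge cannot be used to reset an agent that failed attestation, which matches the concern raised under Risks and Mitigations. In a real verifier this function would be called from the body of the new web handler.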