Commit 953b627

Add initial trust quorum support (#487)
To allow encrypted storage on individual sleds without requiring a user to type a password at boot, we use secret sharing across sleds: a threshold number of sleds must communicate in order to reconstruct a `rack secret`. This rack secret can then be used to derive local encryption keys for individual sleds. An attacker who steals only a below-threshold subset of sleds or storage devices therefore cannot obtain any data. In fact, the control plane software does not even boot until the rack secret is reconstructed and the protected storage unlocked.

There are quite a few moving parts required to implement a trust quorum, some of which involve the service processor and hardware root of trust. This commit only implements the part of the trust quorum responsible for retrieving existing key shares over an unfinished SPDM channel. It runs entirely on the host machine as part of the sled-agent. The code builds upon the multicast discovery code in #404, the SPDM negotiation code in #407, and the secret sharing code in #429.

In the "normal" lifetime of an Oxide rack, a rack secret will be generated upon initialization of the new rack by the customer. The shares will then be distributed over SPDM channels to individual sleds so that they can be retrieved and combined later, when an individual sled or the entire rack reboots. The initial generation and distribution of shares is *not* part of this commit. Instead, shares are individually distributed along with metadata as a `ShareDistribution` stored in a `share.json` file in the `sled_agent/pkg` directory under the install directory configured for `omicron-package install`. Share generation must be done manually for now, but a follow-up commit is coming for a deployment system that will generate the rack secret and distribute the shares along with the install of omicron.

If the `share.json` file is not present, the server operates in single-node mode and does not try to form a trust quorum. This behavior is required for backwards compatibility with current development and will eventually be removed.

The SPDM protocol runs over a framed transport on a TCP stream, where each frame is prefixed with a 2-byte size header. We create a client and a server to initialize this transport, perform SPDM negotiation, and then begin share retrieval. As noted in #407, only the negotiation phase of the SPDM protocol is currently implemented, so we simply return the TCP-based transport when negotiation completes and pretend for now that we are operating over a secure channel. This allows us to test the end-to-end behavior before we have a production-ready SPDM implementation integrated.

This commit also makes a small change to the SPDM transport: `send` and `recv` operations now support timeouts, and `recv` no longer requires a logger to be passed on each call.
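To make the framing in the commit message concrete, here is a minimal sketch of a transport that prefixes each message with a 2-byte size header. It is not the code from this commit: the function names `write_frame` and `read_frame` are hypothetical, the header byte order (big-endian) is an assumption since the message only states the header size, and the sketch uses blocking `std::io` traits rather than the async, timeout-aware transport the commit describes.

```rust
use std::io::{self, Read, Write};

/// Write one frame: a 2-byte length header followed by the payload.
/// ASSUMPTION: the header is big-endian; the commit message does not say.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    // A 2-byte header can only describe payloads of up to 65535 bytes.
    if payload.len() > u16::MAX as usize {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "payload does not fit in a 2-byte size header",
        ));
    }
    let len = payload.len() as u16;
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)
}

/// Read one frame: the 2-byte header first, then exactly that many bytes.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 2];
    r.read_exact(&mut header)?;
    let len = u16::from_be_bytes(header) as usize;
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Because both functions are generic over `Read` and `Write`, the same code can be exercised against an in-memory buffer in tests and against a `std::net::TcpStream` in practice.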
1 parent ade6051 commit 953b627

File tree

15 files changed

+618
-96
lines changed

sled-agent/Cargo.toml

Lines changed: 4 additions & 0 deletions
@@ -63,3 +63,7 @@ doc = false
 [[bin]]
 name = "sled-agent"
 doc = false
+
+[[bin]]
+name = "sled-agent-overlay-files"
+doc = false
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+// This Source Code Form is subject to the terms of the Mozilla Public
+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
+//! This binary is used to generate files unique to the sled agent running on
+//! each server. Specifically, the unique files we care about are key shares
+//! used for the trust quorum here. We generate a shared secret then split it,
+//! distributing each share to the appropriate server.
+
+use omicron_sled_agent::bootstrap::trust_quorum::{
+    RackSecret, ShareDistribution,
+};
+
+use anyhow::{anyhow, Result};
+use std::path::PathBuf;
+use structopt::StructOpt;
+
+#[derive(Debug, StructOpt)]
+#[structopt(
+    name = "sled-agent-overlay-files",
+    about = "Generate server unique files for deployment"
+)]
+struct Args {
+    /// The rack secret threshold
+    #[structopt(short, long)]
+    threshold: usize,
+
+    /// A directory per server where the files are output
+    #[structopt(short, long)]
+    directories: Vec<PathBuf>,
+}
+
+// Generate a rack secret and allocate a ShareDistribution to each deployment
+// server folder.
+fn overlay_secret_shares(
+    threshold: usize,
+    server_dirs: &[PathBuf],
+) -> Result<()> {
+    let total_shares = server_dirs.len();
+    let secret = RackSecret::new();
+    let (shares, verifier) = secret
+        .split(threshold, total_shares)
+        .map_err(|e| anyhow!("Failed to split rack secret: {:?}", e))?;
+    for (share, server_dir) in shares.into_iter().zip(server_dirs) {
+        ShareDistribution {
+            threshold,
+            total_shares,
+            verifier: verifier.clone(),
+            share,
+        }
+        .write(&server_dir)?;
+    }
+    Ok(())
+}
+
+fn main() -> Result<()> {
+    let args = Args::from_args_safe().map_err(|err| anyhow!(err))?;
+    overlay_secret_shares(args.threshold, &args.directories)?;
+    Ok(())
+}
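The binary above splits a rack secret into one share per server directory, with any `threshold` of those shares sufficient to rebuild it. As an illustration of that threshold property only, here is a toy Shamir-style sketch over a small prime field. It is an assumption-laden stand-in, not the scheme behind `RackSecret`: the names `split` and `combine` are hypothetical, the polynomial coefficients are supplied by the caller instead of being drawn from a cryptographic RNG, the 31-bit field is far too small for real keys, and the verifier returned by `RackSecret::split` is not modeled at all.

```rust
/// Prime modulus of the toy field (2^31 - 1). Illustration only.
const P: u64 = 2_147_483_647;

/// Modular exponentiation by repeated squaring.
fn mod_pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc: u64 = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Multiplicative inverse via Fermat's little theorem (valid because P is prime).
fn mod_inv(a: u64) -> u64 {
    mod_pow(a, P - 2)
}

/// Produce `total` shares (x, f(x)) for x = 1..=total.
/// `coeffs[0]` is the secret and `coeffs.len()` is the threshold.
/// ASSUMPTION: coefficients are below P; real code must pick them at random.
pub fn split(coeffs: &[u64], total: u64) -> Vec<(u64, u64)> {
    (1..=total)
        .map(|x| {
            // Horner's rule, highest-degree coefficient first.
            let y = coeffs.iter().rev().fold(0u64, |acc, &c| (acc * x + c) % P);
            (x, y)
        })
        .collect()
}

/// Recover f(0) by Lagrange interpolation over the given shares.
pub fn combine(shares: &[(u64, u64)]) -> u64 {
    let mut secret: u64 = 0;
    for (i, &(xi, yi)) in shares.iter().enumerate() {
        let mut num: u64 = 1;
        let mut den: u64 = 1;
        for (j, &(xj, _)) in shares.iter().enumerate() {
            if i != j {
                num = num * (P - xj % P) % P; // factor (0 - xj)
                den = den * ((xi + P - xj % P) % P) % P; // factor (xi - xj)
            }
        }
        secret = (secret + yi * num % P * mod_inv(den)) % P;
    }
    secret
}
```

With a threshold of 3, any three shares interpolate the same degree-2 polynomial and so agree on `f(0)`, while two shares determine only a line and generally yield a different value.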

sled-agent/src/bootstrap/agent.rs

Lines changed: 108 additions & 67 deletions
@@ -4,10 +4,10 @@

 //! Bootstrap-related APIs.

-use super::client::types as bootstrap_types;
-use super::client::Client as BootstrapClient;
 use super::discovery;
-use super::spdm::SpdmError;
+use super::trust_quorum::{
+    self, RackSecret, ShareDistribution, TrustQuorumError,
+};
 use super::views::ShareResponse;
 use omicron_common::api::external::Error as ExternalError;
 use omicron_common::backoff::{
@@ -18,14 +18,11 @@ use omicron_common::packaging::sha256_digest;
 use slog::Logger;
 use std::collections::HashMap;
 use std::fs::File;
-use std::io::{Seek, SeekFrom};
+use std::io::{self, Seek, SeekFrom};
 use std::path::Path;
 use tar::Archive;
 use thiserror::Error;

-const UNLOCK_THRESHOLD: usize = 1;
-const BOOTSTRAP_PORT: u16 = 12346;
-
 /// Describes errors which may occur while operating the bootstrap service.
 #[derive(Error, Debug)]
 pub enum BootstrapError {
@@ -47,11 +44,8 @@ pub enum BootstrapError {
     #[error("Error making HTTP request")]
     Api(#[from] anyhow::Error),

-    #[error("Error running SPDM protocol: {0}")]
-    Spdm(#[from] SpdmError),
-
-    #[error("Not enough peers to unlock storage")]
-    NotEnoughPeers,
+    #[error(transparent)]
+    TrustQuorum(#[from] TrustQuorumError),
 }

 impl From<BootstrapError> for ExternalError {
@@ -60,17 +54,41 @@ impl From<BootstrapError> for ExternalError {
     }
 }

+// Attempt to read a key share file. If the file does not exist, we return
+// `Ok(None)`, indicating the sled is operating in a single node cluster. If
+// the file exists, we parse it and return `Ok(Some(ShareDistribution))`. For
+// any other error, we return the error.
+//
+// TODO: Remove after dynamic key generation. See #513.
+fn read_key_share() -> Result<Option<ShareDistribution>, BootstrapError> {
+    let key_share_dir = Path::new("/opt/oxide/sled-agent/pkg");
+
+    match ShareDistribution::read(&key_share_dir) {
+        Ok(share) => Ok(Some(share)),
+        Err(TrustQuorumError::Io(err)) => {
+            if err.kind() == io::ErrorKind::NotFound {
+                Ok(None)
+            } else {
+                Err(BootstrapError::Io(err))
+            }
+        }
+        Err(e) => Err(e.into()),
+    }
+}
+
 /// The entity responsible for bootstrapping an Oxide rack.
 pub(crate) struct Agent {
     /// Debug log
     log: Logger,
     peer_monitor: discovery::PeerMonitor,
+    share: Option<ShareDistribution>,
 }

 impl Agent {
     pub fn new(log: Logger) -> Result<Self, BootstrapError> {
         let peer_monitor = discovery::PeerMonitor::new(&log)?;
-        Ok(Agent { log, peer_monitor })
+        let share = read_key_share()?;
+        Ok(Agent { log, peer_monitor, share })
     }

     /// Implements the "request share" API.
@@ -89,74 +107,84 @@ impl Agent {

     /// Communicates with peers, sharing secrets, until the rack has been
     /// sufficiently unlocked.
-    ///
-    /// - This method retries until [`UNLOCK_THRESHOLD`] other agents are
-    /// online, and have successfully responded to "share requests".
-    async fn establish_sled_quorum(&self) -> Result<(), BootstrapError> {
-        retry_notify(
+    async fn establish_sled_quorum(
+        &self,
+    ) -> Result<RackSecret, BootstrapError> {
+        let rack_secret = retry_notify(
             internal_service_policy(),
             || async {
                 let other_agents = self.peer_monitor.addrs().await;
-                info!(&self.log, "Bootstrap: Communicating with peers: {:?}", other_agents);
+                info!(
+                    &self.log,
+                    "Bootstrap: Communicating with peers: {:?}", other_agents
+                );
+
+                let share = self.share.as_ref().unwrap();

                 // "-1" to account for ourselves.
-                //
-                // NOTE: Clippy error exists while the compile-time unlock
-                // threshold is "1", because we basically don't require any
-                // peers to unlock.
-                #[allow(clippy::absurd_extreme_comparisons)]
-                if other_agents.len() < UNLOCK_THRESHOLD - 1 {
-                    warn!(&self.log, "Not enough peers to start establishing quorum");
+                if other_agents.len() < share.threshold - 1 {
+                    warn!(
+                        &self.log,
+                        "Not enough peers to start establishing quorum"
+                    );
                     return Err(BackoffError::Transient(
-                        BootstrapError::NotEnoughPeers,
+                        TrustQuorumError::NotEnoughPeers,
                     ));
                 }
-                info!(&self.log, "Bootstrap: Enough peers to start share transfer");
-
-                // TODO-correctness:
-                // - Establish trust quorum.
-                // - Once this is done, "unlock" local storage
-                //
-                // The current implementation sends a stub request to all known sled
-                // agents, but does not actually create a quorum / unlock anything.
-                let other_agents: Vec<BootstrapClient> = other_agents
+                info!(
+                    &self.log,
+                    "Bootstrap: Enough peers to start share transfer"
+                );
+
+                // Retrieve verified rack_secret shares from a quorum of agents
+                let other_agents: Vec<trust_quorum::Client> = other_agents
                     .into_iter()
                     .map(|mut addr| {
-                        addr.set_port(BOOTSTRAP_PORT);
-                        // TODO-correctness:
-                        //
-                        // Many rust crates - such as "URL" - really dislike
-                        // using scopes in IPv6 addresses. Using
-                        // "addr.to_string()" results in an IP address format
-                        // that is rejected when embedded into a URL.
-                        //
-                        // Instead, we merely use IP and port for the moment,
-                        // which loses the scope information. Longer-term, if we
-                        // use ULAs (Unique Local Addresses) the scope shouldn't
-                        // be a factor anyway.
-                        let addr_str = format!("[{}]:{}", addr.ip(), addr.port());
-                        info!(&self.log, "bootstrap: Connecting to {}", addr_str);
-                        BootstrapClient::new(
-                            &format!("http://{}", addr_str),
-                            self.log.new(o!(
-                                "Address" => addr_str,
-                            )),
+                        addr.set_port(trust_quorum::PORT);
+                        trust_quorum::Client::new(
+                            &self.log,
+                            share.verifier.clone(),
+                            addr,
                         )
                     })
                     .collect();
+
+                // TODO: Parallelize this and keep track of whose shares we've already retrieved and
+                // don't resend. See https://github.com/oxidecomputer/omicron/issues/514
+                let mut shares = vec![share.share.clone()];
                 for agent in &other_agents {
-                    agent
-                        .api_request_share(&bootstrap_types::ShareRequest {
-                            identity: vec![],
-                        })
-                        .await
+                    let share = agent.get_share().await
                         .map_err(|e| {
-                            info!(&self.log, "Bootstrap: Failed to share request with peer: {:?}", e);
-                            BackoffError::Transient(BootstrapError::Api(e))
+                            info!(&self.log, "Bootstrap: failed to retrieve share from peer: {:?}", e);
+                            BackoffError::Transient(e)
                         })?;
-                    info!(&self.log, "Bootstrap: Shared request with peer");
+                    info!(
+                        &self.log,
+                        "Bootstrap: retrieved share from peer: {}",
+                        agent.addr()
+                    );
+                    shares.push(share);
                 }
-                Ok(())
+                let rack_secret = RackSecret::combine_shares(
+                    share.threshold,
+                    share.total_shares,
+                    &shares,
+                )
+                .map_err(|e| {
+                    warn!(
+                        &self.log,
+                        "Bootstrap: failed to construct rack secret: {:?}", e
+                    );
+                    // TODO: We probably need to actually write an error
+                    // handling routine that gives up in some cases based on
+                    // the error returned from `RackSecret::combine_shares`.
+                    // See https://github.com/oxidecomputer/omicron/issues/516
+                    BackoffError::Transient(
+                        TrustQuorumError::RackSecretConstructionFailed(e),
+                    )
+                })?;
+                info!(self.log, "RackSecret computed from shares.");
+                Ok(rack_secret)
             },
             |error, duration| {
                 warn!(
@@ -169,7 +197,7 @@ impl Agent {
         )
         .await?;

-        Ok(())
+        Ok(rack_secret)
     }

     async fn launch_local_services(&self) -> Result<(), BootstrapError> {
@@ -200,14 +228,27 @@ impl Agent {
         Ok(())
     }

+    async fn run_trust_quorum_server(&self) -> Result<(), BootstrapError> {
+        let my_share = self.share.as_ref().unwrap().share.clone();
+        let mut server = trust_quorum::Server::new(&self.log, my_share)?;
+        tokio::spawn(async move { server.run().await });
+        Ok(())
+    }
+
     /// Performs device initialization:
     ///
-    /// - TODO: Communicates with other sled agents to establish a trust quorum.
+    /// - Communicates with other sled agents to establish a trust quorum if a
+    /// ShareDistribution file exists on the host. Otherwise, the sled operates
+    /// as a single node cluster.
     /// - Verifies, unpacks, and launches other services.
     pub async fn initialize(&self) -> Result<(), BootstrapError> {
         info!(&self.log, "bootstrap service initializing");

-        self.establish_sled_quorum().await?;
+        if self.share.is_some() {
+            self.run_trust_quorum_server().await?;
+            self.establish_sled_quorum().await?;
+        }
+
         self.launch_local_services().await?;

         Ok(())
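Two pieces of control flow in this file's diff are easy to check in isolation: treating a missing share file as "single-node mode" rather than an error, and deciding whether enough peers are known to reach the threshold. The sketch below restates both using only the standard library. The names `read_optional` and `enough_peers` are hypothetical, and the file is read as a plain string rather than parsed into a `ShareDistribution`. Using `saturating_sub` is a suggestion rather than what the commit does: the diff computes `share.threshold - 1` on a `usize`, which would underflow if a threshold of 0 were ever loaded.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns `Ok(None)` when the file is absent (single-node mode),
/// `Ok(Some(contents))` when it can be read, and the error otherwise.
pub fn read_optional(path: &Path) -> io::Result<Option<String>> {
    match fs::read_to_string(path) {
        Ok(contents) => Ok(Some(contents)),
        // Only "not found" is mapped to None; other failures still surface.
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(err) => Err(err),
    }
}

/// True when the known peers plus our own share can reach `threshold`.
/// The local share counts as one, hence the "minus one".
pub fn enough_peers(peer_count: usize, threshold: usize) -> bool {
    // saturating_sub keeps a threshold of 0 from underflowing.
    peer_count >= threshold.saturating_sub(1)
}
```

For a threshold of 3, two reachable peers are enough because the local share supplies the third.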

sled-agent/src/bootstrap/mod.rs

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ mod discovery;
 mod http_entrypoints;
 mod multicast;
 mod params;
-pub mod rack_secret;
 pub mod server;
 mod spdm;
+pub mod trust_quorum;
 mod views;

sled-agent/src/bootstrap/spdm/error.rs

Lines changed: 7 additions & 0 deletions
@@ -1,3 +1,7 @@
+// This Source Code Form is subject to the terms of the Mozilla Public
+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
 //! Wrap errors returned from the `spdm` crate and std::io::Error.

 use spdm::{requester::RequesterError, responder::ResponderError};
@@ -17,6 +21,9 @@ pub enum SpdmError {

     #[error("invalid state transition: expected {expected}, got {got}")]
     InvalidState { expected: &'static str, got: &'static str },
+
+    #[error("timeout")]
+    Timeout(#[from] tokio::time::error::Elapsed),
 }

 impl From<RequesterError> for SpdmError {
