Commit 2967351

[nexus] fault management situation reports (#9320)
[RFD 603] proposes the _fault management situation report_, or _sitrep_, as the central data structure for the control plane's fault management subsystem. The design, which is discussed in much greater detail in that RFD, draws a lot of inspiration from the blueprint data structure in the Reconfigurator.

Sitreps are generated by the planning phase in a plan-execute pattern. At any time, a single sitrep is considered _current_. Updating the control plane's understanding of the state of the system based on new inputs is done by a new planning step based on the current sitrep along with other inputs, and produces a new sitrep with the current sitrep as its _parent_. A sitrep may then be added to the version history of current sitreps if (and only if) its parent sitrep is still the current sitrep (i.e. the one with the highest version number currently stored in the sitrep history). This ensures that there is a single sequentially consistent history of sitreps. Sitreps generated based on outdated inputs (due to multiple Nexuses generating them concurrently, or a Nexus operating on state that is no longer up to date) may not be made current, and are discarded.

This branch adds the foundation of the sitrep subsystem. In particular, it includes the following:

- Database schemas for the `fm_sitrep` table, which stores metadata for sitreps, and the `fm_sitrep_history` table, which stores the version history
- Models and `nexus_types` types for the same
- Database queries for reading the current sitrep version, reading a sitrep by its ID, and for inserting sitreps, including the "compare and swap" CTE that ensures new versions may only be inserted if they descend directly from the current sitrep
- A `fm_sitrep_loader` task that loads the latest sitrep version and publishes it over a `tokio::sync::watch` channel (which is not presently consumed by other code)
- OMDB commands for looking at sitreps

Right now, a sitrep only contains its top-level metadata. Other tables for storing parts of the sitrep, such as [cases](https://rfd.shared.oxide.computer/rfd/0603#_cases) and records for [updating Problems](https://rfd.shared.oxide.computer/rfd/0603#_problems), will be added later as more of the control plane fault management subsystem is implemented.

Currently, no sitreps are ever created outside of tests, so this code won't really _do_ anything yet. But, it's an important foundation for the rest of the FM work, so I wanted to get it up for review as soon as possible.

[RFD 603]: https://rfd.shared.oxide.computer/rfd/0603
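
The "compare and swap" rule from the commit message (a sitrep may only become current if its parent is still the current sitrep) can be illustrated with a minimal in-memory sketch. This is not the CTE added by this commit: `SitrepHistory`, `InsertError`, `try_make_current`, the `u64` IDs, and the 1-based version numbering are all assumptions made purely for illustration.

```rust
/// Hypothetical in-memory stand-in for the `fm_sitrep_history` table.
#[derive(Debug, Default)]
struct SitrepHistory {
    /// Sitrep IDs in version order; the last entry is the current sitrep.
    versions: Vec<u64>,
}

#[derive(Debug, PartialEq, Eq)]
enum InsertError {
    /// The proposed sitrep's parent is not the current sitrep, so the
    /// proposal was planned from outdated inputs and must be discarded.
    ParentNotCurrent { current: Option<u64> },
}

impl SitrepHistory {
    /// Returns the ID of the current sitrep, if any sitrep has been committed.
    fn current(&self) -> Option<u64> {
        self.versions.last().copied()
    }

    /// Makes `id` the current sitrep only if `parent` matches the current
    /// sitrep. On success, returns the new version number (assumed 1-based).
    fn try_make_current(
        &mut self,
        id: u64,
        parent: Option<u64>,
    ) -> Result<u32, InsertError> {
        let current = self.current();
        if parent != current {
            return Err(InsertError::ParentNotCurrent { current });
        }
        self.versions.push(id);
        Ok(self.versions.len() as u32)
    }
}
```

In this sketch, two planners that both start from sitrep `10` can each propose a child, but only the first call to `try_make_current` succeeds; the second sees that the current sitrep has moved on and is rejected, which keeps the history linear.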
1 parent 0bea14d commit 2967351

32 files changed: +1994 -3 lines changed

dev-tools/omdb/src/bin/omdb/db.rs

Lines changed: 14 additions & 0 deletions
@@ -180,6 +180,7 @@ mod blueprints;
 mod db_metadata;
 mod ereport;
 mod saga;
+mod sitrep;
 mod user_data_export;
 
 const NO_ACTIVE_PROPOLIS_MSG: &str = "<no active Propolis>";
@@ -378,6 +379,13 @@ enum DbCommands {
     RegionSnapshotReplacement(RegionSnapshotReplacementArgs),
     /// Commands for querying and interacting with sagas
     Saga(saga::SagaArgs),
+    /// Commands for querying and interacting with fault management situation
+    /// reports.
+    Sitrep(sitrep::SitrepArgs),
+    /// Show the current history of fault management situation reports.
+    ///
+    /// This is an alias for `omdb db sitrep history`.
+    Sitreps(sitrep::SitrepHistoryArgs),
     /// Print information about sleds
     Sleds(SledsArgs),
     /// Print information about customer instances.
@@ -1297,6 +1305,12 @@ impl DbArgs {
         DbCommands::Saga(args) => {
             args.exec(&omdb, &opctx, &datastore).await
         }
+        DbCommands::Sitrep(args) => {
+            sitrep::cmd_db_sitrep(&opctx, &datastore, &fetch_opts, args).await
+        }
+        DbCommands::Sitreps(args) => {
+            sitrep::cmd_db_sitrep_history(&datastore, &fetch_opts, args).await
+        }
         DbCommands::Sleds(args) => {
             cmd_db_sleds(&opctx, &datastore, &fetch_opts, args).await
         }
Lines changed: 355 additions & 0 deletions
@@ -0,0 +1,355 @@
+// This Source Code Form is subject to the terms of the Mozilla Public
+// License, v. 2.0. If a copy of the MPL was not distributed with this
+// file, You can obtain one at https://mozilla.org/MPL/2.0/.
+
+//! `omdb db sitrep` subcommands
+
+use crate::db::DbFetchOptions;
+use crate::db::check_limit;
+use crate::helpers::const_max_len;
+use crate::helpers::datetime_rfc3339_concise;
+use anyhow::Context;
+use async_bb8_diesel::AsyncRunQueryDsl;
+use chrono::{DateTime, Utc};
+use clap::Args;
+use clap::Subcommand;
+use diesel::prelude::*;
+use nexus_db_queries::context::OpContext;
+use nexus_db_queries::db::DataStore;
+use nexus_db_queries::db::model;
+use nexus_db_queries::db::pagination::paginated;
+use nexus_types::fm;
+use omicron_common::api::external::DataPageParams;
+use omicron_common::api::external::PaginationOrder;
+use omicron_uuid_kinds::GenericUuid;
+use omicron_uuid_kinds::SitrepUuid;
+use tabled::Tabled;
+use uuid::Uuid;
+
+use nexus_db_schema::schema::fm_sitrep::dsl as sitrep_dsl;
+use nexus_db_schema::schema::fm_sitrep_history::dsl as history_dsl;
+use nexus_db_schema::schema::inv_collection::dsl as inv_collection_dsl;
+
+#[derive(Debug, Args, Clone)]
+pub(super) struct SitrepArgs {
+    #[command(subcommand)]
+    command: Commands,
+}
+
+#[derive(Debug, Subcommand, Clone)]
+enum Commands {
+    /// List the current situation report history.
+    History(SitrepHistoryArgs),
+
+    /// Show the current situation report.
+    ///
+    /// This is an alias for `omdb db sitrep info current`.
+    Current(ShowArgs),
+
+    /// Show details on a situation report.
+    #[clap(alias = "show")]
+    Info {
+        /// The UUID of the sitrep to show, or "current" to show the current
+        /// sitrep.
+        sitrep: SitrepIdOrCurrent,
+
+        #[clap(flatten)]
+        args: ShowArgs,
+    },
+}
+
+#[derive(Debug, Args, Clone)]
+pub(super) struct SitrepHistoryArgs {
+    /// If present, start at this sitrep version.
+    ///
+    /// If this is not set, the list will start with the current sitrep. This
+    /// option is useful when the number of sitreps exceeds the database fetch
+    /// limit.
+    #[arg(long, short, alias = "starting_at")]
+    from: Option<u32>,
+}
+
+#[derive(Debug, Args, Clone)]
+struct ShowArgs {}
+
+#[derive(Debug, Clone, Copy)]
+enum SitrepIdOrCurrent {
+    Current,
+    Id(SitrepUuid),
+}
+
+impl std::str::FromStr for SitrepIdOrCurrent {
+    type Err = omicron_uuid_kinds::ParseError;
+
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        let s = s.trim();
+        if s.eq_ignore_ascii_case("current") {
+            Ok(Self::Current)
+        } else {
+            let id = s.parse()?;
+            Ok(Self::Id(id))
+        }
+    }
+}
+
+pub(super) async fn cmd_db_sitrep(
+    opctx: &OpContext,
+    datastore: &DataStore,
+    fetch_opts: &DbFetchOptions,
+    args: &SitrepArgs,
+) -> anyhow::Result<()> {
+    match args.command {
+        Commands::History(ref args) => {
+            cmd_db_sitrep_history(datastore, fetch_opts, args).await
+        }
+        Commands::Info { sitrep, ref args } => {
+            cmd_db_sitrep_show(opctx, datastore, fetch_opts, args, sitrep).await
+        }
+        Commands::Current(ref args) => {
+            cmd_db_sitrep_show(
+                opctx,
+                datastore,
+                fetch_opts,
+                args,
+                SitrepIdOrCurrent::Current,
+            )
+            .await
+        }
+    }
+}
+
+pub(super) async fn cmd_db_sitrep_history(
+    datastore: &DataStore,
+    fetch_opts: &DbFetchOptions,
+    args: &SitrepHistoryArgs,
+) -> anyhow::Result<()> {
+    let ctx = || {
+        if let Some(from) = args.from {
+            format!(
+                "listing fault management sitrep history (starting at {from})"
+            )
+        } else {
+            "listing fault management sitrep history".to_string()
+        }
+    };
+
+    #[derive(Tabled)]
+    #[tabled(rename_all = "SCREAMING_SNAKE_CASE")]
+    struct SitrepRow {
+        v: u32,
+        id: Uuid,
+        #[tabled(display_with = "datetime_rfc3339_concise")]
+        created_at: DateTime<Utc>,
+        comment: String,
+    }
+
+    let conn = datastore.pool_connection_for_tests().await?;
+    let marker = args.from.map(model::SqlU32::new);
+    let pagparams = DataPageParams {
+        marker: marker.as_ref(),
+        direction: PaginationOrder::Descending,
+        limit: fetch_opts.fetch_limit,
+    };
+    let sitreps: Vec<(model::SitrepVersion, model::SitrepMetadata)> =
+        paginated(
+            history_dsl::fm_sitrep_history,
+            history_dsl::version,
+            &pagparams,
+        )
+        .inner_join(
+            sitrep_dsl::fm_sitrep.on(history_dsl::sitrep_id.eq(sitrep_dsl::id)),
+        )
+        .select((
+            model::SitrepVersion::as_select(),
+            model::SitrepMetadata::as_select(),
+        ))
+        .load_async(&*conn)
+        .await
+        .with_context(ctx)?;
+
+    check_limit(&sitreps, fetch_opts.fetch_limit, ctx);
+
+    let rows = sitreps.into_iter().map(|(version, metadata)| {
+        let model::SitrepMetadata {
+            id,
+            time_created,
+            comment,
+            creator_id: _,
+            parent_sitrep_id: _,
+            inv_collection_id: _,
+        } = metadata;
+        SitrepRow {
+            v: version.version.into(),
+            id: id.into_untyped_uuid(),
+            created_at: time_created,
+            comment,
+        }
+    });
+
+    let table = tabled::Table::new(rows)
+        .with(tabled::settings::Style::empty())
+        .with(tabled::settings::Padding::new(0, 1, 0, 0))
+        .to_string();
+    println!("{table}");
+
+    Ok(())
+}
+
+async fn cmd_db_sitrep_show(
+    opctx: &OpContext,
+    datastore: &DataStore,
+    _fetch_opts: &DbFetchOptions,
+    _args: &ShowArgs,
+    sitrep: SitrepIdOrCurrent,
+) -> anyhow::Result<()> {
+    let ctx = || match sitrep {
+        SitrepIdOrCurrent::Current => {
+            "looking up the current fault management sitrep".to_string()
+        }
+        SitrepIdOrCurrent::Id(id) => {
+            format!("looking up fault management sitrep {id:?}")
+        }
+    };
+
+    let current_version = datastore
+        .fm_current_sitrep_version(&opctx)
+        .await
+        .context("failed to look up the current sitrep version")?;
+
+    let conn = datastore.pool_connection_for_tests().await?;
+    let (maybe_version, sitrep) = match sitrep {
+        SitrepIdOrCurrent::Id(id) => {
+            let sitrep =
+                datastore.fm_sitrep_read(opctx, id).await.with_context(ctx)?;
+            let version = history_dsl::fm_sitrep_history
+                .filter(history_dsl::sitrep_id.eq(id.into_untyped_uuid()))
+                .select(model::SitrepVersion::as_select())
+                .first_async(&*conn)
+                .await
+                .optional()
+                .with_context(ctx)?
+                .map(Into::into);
+            (version, sitrep)
+        }
+        SitrepIdOrCurrent::Current => {
+            let Some(version) = current_version.clone() else {
+                anyhow::bail!("no current sitrep exists at this time");
+            };
+
+            let sitrep = datastore
+                .fm_sitrep_read(opctx, version.id)
+                .await
+                .with_context(ctx)?;
+            (Some(version), sitrep)
+        }
+    };
+
+    let fm::Sitrep { metadata } = sitrep;
+    let fm::SitrepMetadata {
+        id,
+        creator_id,
+        time_created,
+        parent_sitrep_id,
+        inv_collection_id,
+        comment,
+    } = metadata;
+
+    const ID: &'static str = "ID";
+    const PARENT_SITREP_ID: &'static str = "parent sitrep ID";
+    const CREATED_BY: &'static str = "created by";
+    const CREATED_AT: &'static str = "created at";
+    const COMMENT: &'static str = "comment";
+    const STATUS: &'static str = "status";
+    const VERSION: &'static str = " version";
+    const MADE_CURRENT_AT: &'static str = " made current at";
+    const INV_COLLECTION_ID: &'static str = "inventory collection ID";
+    const INV_STARTED_AT: &'static str = " started at";
+    const INV_FINISHED_AT: &'static str = " finished at";
+
+    const WIDTH: usize = const_max_len(&[
+        ID,
+        PARENT_SITREP_ID,
+        CREATED_AT,
+        CREATED_BY,
+        COMMENT,
+        STATUS,
+        VERSION,
+        MADE_CURRENT_AT,
+        INV_COLLECTION_ID,
+        INV_STARTED_AT,
+        INV_FINISHED_AT,
+    ]);
+
+    println!("\n{:=<80}", "== FAULT MANAGEMENT SITUATION REPORT ");
+    println!(" {ID:>WIDTH$}: {id:?}");
+    println!(" {PARENT_SITREP_ID:>WIDTH$}: {parent_sitrep_id:?}");
+    println!(" {CREATED_BY:>WIDTH$}: {creator_id}");
+    println!(" {CREATED_AT:>WIDTH$}: {time_created}");
+    if comment.is_empty() {
+        println!(" {COMMENT:>WIDTH$}: N/A\n");
+    } else {
+        println!(" {COMMENT:>WIDTH$}:");
+        println!("{}\n", textwrap::indent(&comment, " "));
+    }
+
+    match maybe_version {
+        None => println!(
+            " {STATUS:>WIDTH$}: not committed to the sitrep history"
+        ),
+        Some(fm::SitrepVersion { version, time_made_current, .. }) => {
+            if matches!(current_version, Some(ref v) if v.id == id) {
+                println!(" {STATUS:>WIDTH$}: this is the current sitrep!",);
+            } else {
+                println!(" {STATUS:>WIDTH$}: in the sitrep history");
+            }
+            println!(" {VERSION:>WIDTH$}: v{version}");
+            println!(" {MADE_CURRENT_AT:>WIDTH$}: {time_made_current}");
+            match current_version {
+                Some(v) if v.id == id => {}
+                Some(fm::SitrepVersion { version, id, .. }) => {
+                    println!(
+                        "(i) note: the current sitrep is {id:?} \
+                         (at v{version})",
+                    );
+                }
+                None => {
+                    eprintln!(
+                        "/!\\ WEIRD: this sitrep is in the sitrep history, \
+                         but there is no current sitrep. this should not \
+                         happen!"
+                    );
+                }
+            };
+        }
+    }
+
+    println!("\n{:-<80}", "== DIAGNOSIS INPUTS ");
+    println!(" {INV_COLLECTION_ID:>WIDTH$}: {inv_collection_id:?}");
+    let inv_collection = inv_collection_dsl::inv_collection
+        .filter(
+            inv_collection_dsl::id.eq(inv_collection_id.into_untyped_uuid()),
+        )
+        .select(model::InvCollection::as_select())
+        .first_async(&*conn)
+        .await
+        .optional();
+    match inv_collection {
+        Err(err) => {
+            eprintln!(
+                "/!\\ failed to fetch inventory collection details: {err}"
+            );
+        }
+        Ok(Some(model::InvCollection { time_started, time_done, .. })) => {
+            println!(" {INV_STARTED_AT:>WIDTH$}: {time_started}");
+            println!(" {INV_FINISHED_AT:>WIDTH$}: {time_done}");
+        }
+        Ok(None) => {
+            println!(
+                " note: this collection no longer exists (perhaps it has \
+                 been pruned?)"
+            )
+        }
+    }
+
+    Ok(())
+}
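
The aligned `label: value` output in `cmd_db_sitrep_show` depends on computing the widest label at compile time and right-aligning every label to that width with `{LABEL:>WIDTH$}`. The body of `const_max_len` is not part of this diff, so the version below is a hypothetical stand-in written only to show the idea, and `field` is an illustrative helper that does not appear in the commit.

```rust
/// Hypothetical stand-in for omdb's `const_max_len` helper: returns the
/// length in bytes of the longest string in `strs`. It uses a `while` loop
/// because iterators cannot be used inside a `const fn`.
const fn const_max_len(strs: &[&str]) -> usize {
    let mut max = 0;
    let mut i = 0;
    while i < strs.len() {
        let len = strs[i].len();
        if len > max {
            max = len;
        }
        i += 1;
    }
    max
}

const ID: &str = "ID";
const PARENT_SITREP_ID: &str = "parent sitrep ID";
const CREATED_BY: &str = "created by";

/// Width of the label column, evaluated at compile time.
const WIDTH: usize = const_max_len(&[ID, PARENT_SITREP_ID, CREATED_BY]);

/// Illustrative helper (not in the commit): formats one field with the
/// label right-aligned to `WIDTH`, so that all the colons line up.
fn field(label: &str, value: &str) -> String {
    format!("    {label:>WIDTH$}: {value}")
}
```

Because `WIDTH` is a constant, adding a longer label to the list widens every row without touching the individual `println!` calls, which appears to be the reason the commit lists every label in the `const_max_len` call.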
