-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distinguish between Scrimlets and Gimlets #1510
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I don't think it matters much now, but we'll definitely need to handle the sled agent coming up without knowing whether it's running on a Scrimlet, and asking Nexus to update the database once that can be determined.
1b3365f
to
d172c9c
Compare
Sounds good. Updated the |
// TODO(https://github.com/oxidecomputer/omicron/issues/823): | ||
// Currently, the prescence or abscence of RSS is our signal | ||
// for "is this a scrimlet or not". | ||
// Longer-term, we should make this call based on the underlying |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that this changes at runtime with respect to hardware. It's not a static thing and the underlying hardware presence will come and go as the Sidecar is power cycled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That matches my understanding. I do think there is more work that'll need to smooth out the case of "a sled has been removed from one location in the rack, and plugged in elsewhere" - perhaps a call to Nexus, to figure out what services should change - but otherwise, this makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, while the movement bit seemed captured, I just wanted to make sure it was clear that you might choose to start a bootstrap agent before this is known and wanted to call that out as part of this given what appeared the initial start the agent piece was encoding this knowledge right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks Robert. I believe Sean added functionality to update the database's record including the is_scrimlet
value in d172c9c.
One question I have though is about when that value is updated. At RSS-time, it's definitely possible that the sled agent comes up before Sidecar, so we'd need to support updating it from false
to true
. But it seems kind of "sticky" to me. That is, I'm not sure it makes sense to set this to false
, at least not right away, when Sidecar is cycled. If that's done as part of an administrative action or by Nexus, it seems like we'd want to keep the knowledge that this is or was a Scrimlet, and report 503s from dpd
or the sled agent. If Sidecar cycles unexpectedly, we probably still want to do that, along with generating an alert of some kind. I don't know if there's any action here, but spelling that out seems useful to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect an initial policy of stickiness across a given boot will be a pretty useful way to go here. That is, when it's rebooted, it'll be something that is determined fresh and resets stickiness.
Crucible changes Add test and fix for replay race condition (#1519) Fix clippy warnings (#1517) Add edition to `crucible-workspace-hack` (#1516) Split out Downstairs-specific stats (#1511) Move remaining `GuestWork` functionality into `Downstairs` (#1510) Track both jobs and bytes in each IO state (#1507) Fix new `rustc` and `clippy` warnings (#1509) Remove IOP/BW limits (for now) (#1506) Move `GuestBlockRes` into `DownstairsIO` (#1502) Update actions/checkout digest to eef6144 (#1499) Update Rust crate hyper-staticfile to 0.10.1 (#1411) Turn off test-up-2region-encrypted.sh (#1504) Add `IOop::Barrier` (#1494) Fix IPv6 addresses in `crutest` (#1503) Add region set options to more tests. (#1496) Simplify `CompleteJobs` (#1493) Removed ignored CI jobs (#1497) Minor cleanups to `print_last_completed` (#1501) Remove remaining `Arc<Volume>` instances (#1500) Add `VolumeBuilder` type (#1492) remove old unused scripts (#1495) More multiple region support. (#1484) Simplify matches (#1490) Move complete job tracker to a helper object (#1489) Expand summary and add documentation references to the README. (#1486) Remove `GuestWorkId` (2/2) (#1482) Remove `JobId` from `DownstairsIO` (1/2) (#1481) Remove unused `#[derive(..)]` (#1483) Update more tests to use dsc (#1480) Crutest now Volume only (#1479) Propolis changes manually impl Deserialize for PciPath for validation purposes (#801) phd: gate OS-specific tests, make others more OS-agnostic (#799) lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795) add marker trait to help check safety of guest memory reads (#794) clippy fixes for 1.82 (#796) lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782) PHD: write efivars in one go (#786) PHD: support guest-initiated reboot (#785) server: accept CPUID values in instance specs and plumb them to bhyve (#780) PHD: allow patched Crucible dependencies (#778) server: add a first-class error type to machine init (#777) PciPath to Bdf conversion is infallible; prove it and refactor (#774) instance spec rework: flatten InstanceSpecV0 (#767) Make PUT /instance/state 503 when waiting to init Less anxiety-inducing `Vm::{get, state_watcher}`
Crucible changes Add test and fix for replay race condition (#1519) Fix clippy warnings (#1517) Add edition to `crucible-workspace-hack` (#1516) Split out Downstairs-specific stats (#1511) Move remaining `GuestWork` functionality into `Downstairs` (#1510) Track both jobs and bytes in each IO state (#1507) Fix new `rustc` and `clippy` warnings (#1509) Remove IOP/BW limits (for now) (#1506) Move `GuestBlockRes` into `DownstairsIO` (#1502) Update actions/checkout digest to eef6144 (#1499) Update Rust crate hyper-staticfile to 0.10.1 (#1411) Turn off test-up-2region-encrypted.sh (#1504) Add `IOop::Barrier` (#1494) Fix IPv6 addresses in `crutest` (#1503) Add region set options to more tests. (#1496) Simplify `CompleteJobs` (#1493) Removed ignored CI jobs (#1497) Minor cleanups to `print_last_completed` (#1501) Remove remaining `Arc<Volume>` instances (#1500) Add `VolumeBuilder` type (#1492) remove old unused scripts (#1495) More multiple region support. (#1484) Simplify matches (#1490) Move complete job tracker to a helper object (#1489) Expand summary and add documentation references to the README. (#1486) Remove `GuestWorkId` (2/2) (#1482) Remove `JobId` from `DownstairsIO` (1/2) (#1481) Remove unused `#[derive(..)]` (#1483) Update more tests to use dsc (#1480) Crutest now Volume only (#1479) Propolis changes manually impl Deserialize for PciPath for validation purposes (#801) phd: gate OS-specific tests, make others more OS-agnostic (#799) lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795) add marker trait to help check safety of guest memory reads (#794) clippy fixes for 1.82 (#796) lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782) PHD: write efivars in one go (#786) PHD: support guest-initiated reboot (#785) server: accept CPUID values in instance specs and plumb them to bhyve (#780) PHD: allow patched Crucible dependencies (#778) server: add a first-class error type to machine init (#777) PciPath to Bdf conversion is infallible; prove it and refactor (#774) instance spec rework: flatten InstanceSpecV0 (#767) Make PUT /instance/state 503 when waiting to init Less anxiety-inducing `Vm::{get, state_watcher}` --------- Co-authored-by: Alan Hanson <alan@oxide.computer>
Part of #823
Fixes #1511
Sled-Agent
Nexus
PUT
notPOST