-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MGS: Start gateway-sp-comms
crate with a ManagementSwitch
#777
Conversation
} | ||
} | ||
|
||
impl From<gateway_sp_comms::SpType> for SpType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two From
impls (and similarly below for SpIdentifer
) are a bit of an eyesore; open to suggestions. Seems a little weird to have them derive JsonSchema
etc from inside gateway_sp_comms
, but maybe that's the simplest thing? I don't love how it would put part of the HTTP interface in another crate. Maybe a macro to derive the from impls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could have it derive JsonSchema
from gateway_sp_comms
under some feature flag so that the Hubris task isn't burdened with it. I don't love it for the same reasons as I think you don't... on the other hand, it seems not-uncommon to have a feature to impl stuff like Serialize
/Deserialize
so maybe this amounts to the same thing
This is a first step to extracting all of `sp_comms` from `gateway` into `gateway-sp-comms`. It removes several placeholder oddities, like the gateway only listening on one UDP socket and identifying SPs by a mix of `SocketAddr`s and ignition target indices. Most communications are now tracked by `SwitchPort`, an opaque type provided by `ManagementSwitch` indicating which port of the management switch the communication is happening on. The switch instance provides methods to convert between logical IDs (type + slot) and ports, look up the SP socket address for a corresponding port, etc; the implementations of those functions are where all the placeholder logic now lives, which should make `gateway` somewhat more realistic.
6a27897
to
4a81672
Compare
for (typ, num) in [ | ||
(SpType::Switch, self.known_sps.switches.len()), | ||
(SpType::Sled, self.known_sps.sleds.len()), | ||
(SpType::Power, self.known_sps.power_controllers.len()), | ||
] { | ||
if n < num { | ||
return SpIdentifier { typ, slot: n }; | ||
} | ||
n -= num; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we imagine this lookup being in e.g. a BTreeMap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, something like that. maybe a map, maybe just a vec? I think it's still TBD (even from RFD 250's point of view) where the product-specific mapping of ports to cubbies lives, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind - RFD 250 now notes that this config is shipped with MGS. So I think this turns into something like a map derived from a toml file?
self.ports[port.0].local_addr() | ||
} | ||
|
||
/// Get the socket to use to communicate with an AP and the socket address |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/AP/SP/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, good catch
// TODO how do we handle errors here (or from `recv()` below)? | ||
// currently just log them and retry, assuming any socket errors are | ||
// ephemeral |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this perhaps too hard. Some errors would be expected (I guess) such as std::io::ErrorKind::WouldBlock
which should be ignored. Other errors would seem indicative of some serious invariant violation. But after scrounging around in tokio a bit, it seems like this can't fail except in strange circumstances. I'd suggest that unless we know we can safely ignore (and log) an error we should bring panic!() and rely on SMF to restart our zone into a known consistent state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I like that better. I'll change this one (and the "some err other than WouldBlock" in the match below it) to panics, with a comment that we don't expect to see errors here in general.
// TODO what do we do here? we received a packet from an | ||
// address that doesn't match what we believe is the SP's | ||
// address. for now, log and discard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a great point and we may want to use TODO-security
as I've seen @davepacheco do. In particular, we might want to consider this situation as a trigger to potentially stop trusting a given sled or SP and force it to re-enter the trust quorum... or something.
Crucible changes Add test and fix for replay race condition (#1519) Fix clippy warnings (#1517) Add edition to `crucible-workspace-hack` (#1516) Split out Downstairs-specific stats (#1511) Move remaining `GuestWork` functionality into `Downstairs` (#1510) Track both jobs and bytes in each IO state (#1507) Fix new `rustc` and `clippy` warnings (#1509) Remove IOP/BW limits (for now) (#1506) Move `GuestBlockRes` into `DownstairsIO` (#1502) Update actions/checkout digest to eef6144 (#1499) Update Rust crate hyper-staticfile to 0.10.1 (#1411) Turn off test-up-2region-encrypted.sh (#1504) Add `IOop::Barrier` (#1494) Fix IPv6 addresses in `crutest` (#1503) Add region set options to more tests. (#1496) Simplify `CompleteJobs` (#1493) Removed ignored CI jobs (#1497) Minor cleanups to `print_last_completed` (#1501) Remove remaining `Arc<Volume>` instances (#1500) Add `VolumeBuilder` type (#1492) remove old unused scripts (#1495) More multiple region support. (#1484) Simplify matches (#1490) Move complete job tracker to a helper object (#1489) Expand summary and add documentation references to the README. (#1486) Remove `GuestWorkId` (2/2) (#1482) Remove `JobId` from `DownstairsIO` (1/2) (#1481) Remove unused `#[derive(..)]` (#1483) Update more tests to use dsc (#1480) Crutest now Volume only (#1479) Propolis changes manually impl Deserialize for PciPath for validation purposes (#801) phd: gate OS-specific tests, make others more OS-agnostic (#799) lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795) add marker trait to help check safety of guest memory reads (#794) clippy fixes for 1.82 (#796) lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782) PHD: write efivars in one go (#786) PHD: support guest-initiated reboot (#785) server: accept CPUID values in instance specs and plumb them to bhyve (#780) PHD: allow patched Crucible dependencies (#778) server: add a first-class error type to machine init (#777) PciPath to Bdf conversion is infallible; prove it and refactor (#774) instance spec rework: flatten InstanceSpecV0 (#767) Make PUT /instance/state 503 when waiting to init Less anxiety-inducing `Vm::{get, state_watcher}`
Crucible changes Add test and fix for replay race condition (#1519) Fix clippy warnings (#1517) Add edition to `crucible-workspace-hack` (#1516) Split out Downstairs-specific stats (#1511) Move remaining `GuestWork` functionality into `Downstairs` (#1510) Track both jobs and bytes in each IO state (#1507) Fix new `rustc` and `clippy` warnings (#1509) Remove IOP/BW limits (for now) (#1506) Move `GuestBlockRes` into `DownstairsIO` (#1502) Update actions/checkout digest to eef6144 (#1499) Update Rust crate hyper-staticfile to 0.10.1 (#1411) Turn off test-up-2region-encrypted.sh (#1504) Add `IOop::Barrier` (#1494) Fix IPv6 addresses in `crutest` (#1503) Add region set options to more tests. (#1496) Simplify `CompleteJobs` (#1493) Removed ignored CI jobs (#1497) Minor cleanups to `print_last_completed` (#1501) Remove remaining `Arc<Volume>` instances (#1500) Add `VolumeBuilder` type (#1492) remove old unused scripts (#1495) More multiple region support. (#1484) Simplify matches (#1490) Move complete job tracker to a helper object (#1489) Expand summary and add documentation references to the README. (#1486) Remove `GuestWorkId` (2/2) (#1482) Remove `JobId` from `DownstairsIO` (1/2) (#1481) Remove unused `#[derive(..)]` (#1483) Update more tests to use dsc (#1480) Crutest now Volume only (#1479) Propolis changes manually impl Deserialize for PciPath for validation purposes (#801) phd: gate OS-specific tests, make others more OS-agnostic (#799) lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795) add marker trait to help check safety of guest memory reads (#794) clippy fixes for 1.82 (#796) lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782) PHD: write efivars in one go (#786) PHD: support guest-initiated reboot (#785) server: accept CPUID values in instance specs and plumb them to bhyve (#780) PHD: allow patched Crucible dependencies (#778) server: add a first-class error type to machine init (#777) PciPath to Bdf conversion is infallible; prove it and refactor (#774) instance spec rework: flatten InstanceSpecV0 (#767) Make PUT /instance/state 503 when waiting to init Less anxiety-inducing `Vm::{get, state_watcher}` --------- Co-authored-by: Alan Hanson <alan@oxide.computer>
This is a first step to extracting all of
sp_comms
fromgateway
intogateway-sp-comms
. It removes several placeholder oddities, like the gateway only listening on one UDP socket and identifying SPs by a mix ofSocketAddr
s and ignition target indices. Most communications are now tracked bySwitchPort
, an opaque type provided byManagementSwitch
indicating which port of the management switch the communication is happening on. The switch instance provides methods to convert between logical IDs (type + slot) and ports, look up the SP socket address for a corresponding port, etc; the implementations of those functions are where all the placeholder logic now lives, which should makegateway
somewhat more realistic.ManagementSwitch
should present an interface that is close to what we expect based on RFD 250, at least after discovery / topology has taken place: each port on the management switch has both a local UDP socket (corresponding to the vnic for the appropriate port's vlan) and a socket address for the SP at the other end (assuming we've discovered that on startup; handling hot swap etc is down the line).PR logistics - this builds on #746, and is currently targeted at that branch. Once it's merged I'll retarget this to main. I'm going to continue the SP comms extraction, but wanted to pause here to keep the PR in "borderline too large" territory before getting to "monstrously large".