MGS: Finish extracting SP communications to its own crate #786
Conversation
// same MGS instance - we have two, and local mutexes won't help if both
// are trying to send to the same SP simultaneously. maybe we need some
// kind of time-limited lock from the SP side where it discards any
// incoming data from other sources?
Note - this comment refers to a behavior change compared to what was in the old version of the code. Open to suggestions about what to do here; I suspect the relatively naive mutex solution I had gave the illusion of addressing a problem without really addressing it. It solves "two simultaneous POSTs to a single SP's console via the same MGS instance", but doesn't address simultaneous requests to two MGS instances, or any of the variety of failure modes (e.g., we time out in the middle of a stream of UDP packets).
If we want to enforce mutual exclusion, the SP would need to play a role in it. One could imagine a few different mechanisms. For example, the SP could issue a console token (time-limited, as you say) that a client (presumably nexus, or something else talking to one MGS or the other) could use to permit console writes. Alternatively, the SP could issue a time-limited token to an MGS, and clients would need to remember to route traffic through the proper MGS... and MGS would need to enforce mutual exclusion between clients.
I think any of that would be overkill. The serial console is really an interface of last resort. Until we've made some progress with RFD 247 (and we should! soon!), I'd argue that we should take the simplest path, which is to provide no mechanism for mutual exclusion whatsoever. My guess is that 247 will land at a spot where that is sufficient for MVP.
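To make the first option concrete, here is a minimal sketch of an SP-side, time-limited lease. Everything here is hypothetical (the type names, the TTL, and the `u64` holder id are all invented for illustration); it is only meant to show the shape of the mechanism we would be signing up for:

```rust
use std::time::{Duration, Instant};

/// Hypothetical lease the SP could hand out to a single console writer.
struct ConsoleLease {
    holder: u64, // opaque id for whichever client/MGS holds the lease
    expires_at: Instant,
}

struct SpConsole {
    lease: Option<ConsoleLease>,
}

impl SpConsole {
    /// Assumed TTL, purely for illustration.
    const LEASE_TTL: Duration = Duration::from_secs(30);

    /// Grant (or refresh) the lease unless someone else holds a live one.
    fn try_acquire(&mut self, holder: u64, now: Instant) -> bool {
        match &self.lease {
            // an unexpired lease held by someone else blocks acquisition
            Some(l) if l.expires_at > now && l.holder != holder => false,
            _ => {
                self.lease = Some(ConsoleLease {
                    holder,
                    expires_at: now + Self::LEASE_TTL,
                });
                true
            }
        }
    }

    /// Discard incoming console writes from anyone without a live lease.
    fn accept_write(&self, holder: u64, now: Instant) -> bool {
        matches!(&self.lease, Some(l) if l.holder == holder && l.expires_at > now)
    }
}
```

Even this toy version implies lease renewal, an identity every client has to carry, and a policy for expiry mid-write, which is a decent illustration of why it feels like overkill ahead of RFD 247.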
Force-pushed from 92e5a5c to 95d8c48
looks great; sorry for the delay!
// `version` is intentionally the first 4 bytes of the packet; we
// could check it before trying to deserialize?
yeah; I think we might kick ourselves if we don't...
Changed in 93b27cd
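For anyone reading along, the check is cheap; here is a sketch of its shape (the constant name, error type, and little-endian byte order are assumptions for illustration, not necessarily what 93b27cd does):

```rust
use std::convert::TryInto;

/// Assumed wire-format version constant; the real crate's value and byte
/// order may differ.
const VERSION: u32 = 1;

#[derive(Debug)]
enum DecodeError {
    TooShort,
    VersionMismatch { expected: u32, got: u32 },
}

/// Peek at the leading 4-byte version field before paying for a full
/// deserialization attempt on a packet that can't possibly match.
fn check_version(packet: &[u8]) -> Result<(), DecodeError> {
    let prefix = packet.get(..4).ok_or(DecodeError::TooShort)?;
    let got = u32::from_le_bytes(prefix.try_into().unwrap());
    if got == VERSION {
        Ok(())
    } else {
        Err(DecodeError::VersionMismatch { expected: VERSION, got })
    }
}
```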
// decide whether this is a response to an outstanding request or an
// unprompted message
thinking out loud: we're going to have other unprompted messages, e.g. for errors. While we're content (I think) to poll for updates to the serial console, we'll need to figure out how we forward errors along. Somehow MGS needs to know how to poke Nexus, and it seems like MGS will need some ability to register a handler, like the response handlers but effectively bound to a message class rather than to a `request_id`
That sounds right to me. MGS needs to be up even if nexus is down, which makes "MGS doesn't know (explicitly) about nexus but allows callers to register for messages" seem correct. I've been assuming MGS will only keep its relevant state in memory, which means callers would have to reregister after a restart. Does that seem reasonable?
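A sketch of what that registration could look like (hypothetical names throughout, with std channels standing in for whatever MGS actually uses):

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;
use std::sync::Mutex;

/// Hypothetical classes of unprompted SP -> MGS messages.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum MessageClass {
    SerialConsole,
    Error,
}

/// In-memory registry routing unprompted messages by class, alongside the
/// existing per-request_id response handlers. Because the state lives only
/// in memory, callers would have to reregister after an MGS restart.
#[derive(Default)]
struct UnpromptedHandlers {
    handlers: Mutex<HashMap<MessageClass, Sender<Vec<u8>>>>,
}

impl UnpromptedHandlers {
    fn register(&self, class: MessageClass, tx: Sender<Vec<u8>>) {
        self.handlers.lock().unwrap().insert(class, tx);
    }

    /// Deliver an unprompted message; if no one has registered for this
    /// class (e.g., nexus is down), it is silently dropped here, though a
    /// real implementation would presumably log it.
    fn dispatch(&self, class: MessageClass, payload: Vec<u8>) {
        if let Some(tx) = self.handlers.lock().unwrap().get(&class) {
            let _ = tx.send(payload);
        }
    }
}
```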
// we don't care to check the return value here; this will be `Some(_)`
// if we're being dropped before a response has been received and
// forwarded on to our caller, and `None` if not (because a caller will
// have already extracted the sending channel)
Would it be valuable to leave behind some contents of `requests` in this case that we could reap later? The ostensible benefit would be that we could distinguish a timeout from a logic error. Or would this potentially add more work than it saves?
I'm not sure what we would leave behind, and/or I'm not sure how we would distinguish between a timeout and a logic error that manifested as an undelivered message (i.e., a timeout). Did you have something specific in mind?
Sorry for the delayed reply. I was imagining changing the type of `requests` to something like `Mutex<HashMap<K, MaybeTimedOut<Sender<T>>>>`, and then in `drop` changing the entry to `MaybeTimedOut::TimedOut(SystemTime::now())`... that said, it sounds like a pain in the ass to then reap etc. Sorry for even saying anything... :-)
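Spelling the idea out anyway, since it's easier to weigh with the types written down (purely illustrative; this was not adopted):

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;
use std::sync::Mutex;
use std::time::SystemTime;

/// Instead of removing a request's entry on timeout, tombstone it with the
/// time it timed out, so a later lookup could distinguish "timed out" from
/// "never existed" (a likely logic error).
enum MaybeTimedOut<T> {
    Pending(T),
    TimedOut(SystemTime),
}

type RequestId = u64; // stand-in for the real key type `K`

struct Requests<T> {
    inner: Mutex<HashMap<RequestId, MaybeTimedOut<Sender<T>>>>,
}

impl<T> Requests<T> {
    /// What the waiter's `Drop` impl would do: leave a tombstone rather than
    /// removing the entry (dropping the `Sender` in the process).
    fn mark_timed_out(&self, id: RequestId) {
        if let Some(entry) = self.inner.lock().unwrap().get_mut(&id) {
            *entry = MaybeTimedOut::TimedOut(SystemTime::now());
        }
    }

    /// The "reap later" part that makes this more work than it saves:
    /// something has to periodically sweep tombstones older than a cutoff.
    fn reap(&self, cutoff: SystemTime) {
        self.inner
            .lock()
            .unwrap()
            .retain(|_, v| !matches!(v, MaybeTimedOut::TimedOut(t) if *t < cutoff));
    }
}
```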
All the raw UDP / management switch logic now lives in `gateway-sp-comms`, on which `gateway` depends. This PR includes some miscellaneous cleanup (e.g., the task for receiving incoming packets is somewhat simpler and has one fewer `Arc`, and there are a handful of new unit tests), but no new functionality.