Ingest ereports from SPs #370
base: main
This commit refactors `gateway_sp_comms::SharedSocket` to make the received message handler a generic trait. This way, the `SharedSocket` type and its associated machinery for discovering SPs and forwarding received messages to per-SP handlers can be used for the ereport ingestion socket as well as for `control-plane-agent` messages.
    Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, SerializedSize,
)]
#[repr(transparent)]
pub struct RestartId(pub u128);
I did consider using the `uuid` crate for this, as it supports `no_std`, but it didn't really seem worth adding another dependency that would have to be compiled in to the Hubris binaries basically just to get UUID-like formatting in `Debug` impls that are only used by MGS...
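For illustration, UUID-like formatting can be had without the `uuid` crate by hand-writing the `Debug` impl over the raw `u128`. The following is a minimal sketch, not the impl from this PR: the output format (`RestartId(8-4-4-4-12 hex)`) is an assumption, and the struct here is a stand-in that omits the serialization derives shown in the diff.

```rust
use core::fmt;

/// Stand-in for the `RestartId` newtype in the diff (serde derives omitted).
#[derive(Clone, Copy, PartialEq, Eq)]
#[repr(transparent)]
pub struct RestartId(pub u128);

impl fmt::Debug for RestartId {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let v = self.0;
        // Split the 128-bit value into the familiar 8-4-4-4-12 hex groups.
        // The exact layout is an assumption made for this sketch.
        write!(
            f,
            "RestartId({:08x}-{:04x}-{:04x}-{:04x}-{:012x})",
            (v >> 96) as u32,
            (v >> 80) as u16,
            (v >> 64) as u16,
            (v >> 48) as u16,
            v & 0xffff_ffff_ffff,
        )
    }
}
```

Because this only uses `core::fmt`, it would also build in a `no_std` crate without pulling in an extra dependency.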
I felt that it was nicer to put all the ereport messages (both the SP-to-MGS messages and the MGS-to-SP messages) in their own module, rather than putting some of them in `sp_to_mgs` and others in `mgs_to_sp`. This way, a reader interested in the ereport stuff need only read this module, and a reader interested in the control-plane-agent protocol doesn't have to scroll past ereport messages. Future additions to the ereport protocol would change only the code in this file.
Similarly to https://github.com/oxidecomputer/management-gateway-service/pull/370/files#r2031615326, it felt nicer to have all the ereport bits defined in their own module, rather than smeared across `shared_socket.rs` and `single_sp.rs`. That way, all the ereport-specific code is in one place and it's easier to see the relationship between the code in the ereport socket receive handler and the single-SP handler, rather than having to trace it between modules.
No argument from me. Do you think it would be clearer to move the control-plane-agent stuff to its own submodule too? (Not as part of this PR, but maybe alongside the renamings I suggested in another comment?)
I'm definitely open to doing that in a subsequent change, I think it seems pretty reasonable (especially if we ever add a third socket/protocol for some other thing). I agree that we shouldn't mess with the control-plane-agent stuff in this PR though.
@@ -27,7 +30,7 @@ thiserror.workspace = true
tlvc.workspace = true
tokio.workspace = true
usdt.workspace = true
uuid.workspace = true
uuid = { workspace = true, features = ["v4"] }
This was necessary for unrelated reasons: the task dump code added in #316 uses `Uuid::new_v4` in `gateway-sp-comms`, but `gateway-sp-comms` doesn't enable the `"v4"` feature (only `faux-mgs` does). So, this didn't compile for me using the v2 cargo feature resolver.
gateway-sp-comms/src/ereport.rs
Outdated
Some(CborValue::Integer(i)) => task_names
    .get(i as usize)
    .cloned()
    .ok_or(DecodeError::BadTaskNameIndex {
        n,
        index: i as usize,
    })?,
@cbiffle this is what I was asking about in https://github.com/oxidecomputer/rfd/pull/849#discussion_r2027376877: right now, this code assumes that when an ereport's task name is an integer, it will always be the index of an ereport that was earlier in the packet than the current one. I wanted to get your confirmation of whether we could rely on that assumption or would need to handle indexes pointing ahead of the current ereport.
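To make the assumption concrete, here is a simplified sketch of back-reference resolution. It is not the code from `ereport.rs`: `TaskField` and `resolve_task_names` are hypothetical names, and the CBOR decoding is replaced by a plain enum. Each ereport's resolved task name is appended to `task_names`, so an integer can only resolve to an ereport that appeared earlier in the packet; an index pointing at the current or a later ereport yields `BadTaskNameIndex`.

```rust
/// Hypothetical, simplified stand-in for an ereport's decoded task-name field.
pub enum TaskField {
    /// The task name was sent inline as a string.
    Name(String),
    /// The task name is the index of an earlier ereport in the same packet.
    Index(usize),
}

#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    /// Ereport `n` referred to index `index`, which has not been seen yet.
    BadTaskNameIndex { n: usize, index: usize },
}

/// Resolves every field to a task name, in packet order.
pub fn resolve_task_names(fields: Vec<TaskField>) -> Result<Vec<String>, DecodeError> {
    let mut task_names: Vec<String> = Vec::with_capacity(fields.len());
    for (n, field) in fields.into_iter().enumerate() {
        let name = match field {
            TaskField::Name(s) => s,
            // Only names pushed so far are visible, so forward references fail.
            TaskField::Index(index) => task_names
                .get(index)
                .cloned()
                .ok_or(DecodeError::BadTaskNameIndex { n, index })?,
        };
        task_names.push(name);
    }
    Ok(task_names)
}
```

Supporting forward references would need a second pass (or deferred resolution) over the packet, which is why confirming the ordering guarantee matters.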
per yesterday's chat with @cbiffle
i figured out how to force Serde to always use the "indefinite" encoding, so we can have some more tests
I didn't look particularly closely at the ereport parsing etc details (I'll defer that to you and Cliff, if that's okay). The structural MGS changes look good; just a handful of nits and questions.
    recv_handler_task: JoinHandle<()>,
    log: Logger,
}

impl Drop for SharedSocket {
// Hand-rolled `Debug` impl as the message type (`T`) needn't be `Debug` for the
// `SharedSocket` to be debug.
This seems fine, but I'll plug https://docs.rs/derive-where/latest/derive_where/ since we use it in omicron, if you want to pull it in here too.
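For context on why the impl is hand-rolled: `#[derive(Debug)]` on a generic struct adds a `T: Debug` bound to the generated impl, even when no field of type `T` is actually printed. A minimal sketch of the manual alternative follows; `SharedSocketSketch` and `NotDebug` are hypothetical stand-ins, not the types from this PR.

```rust
use std::fmt;
use std::marker::PhantomData;

/// A message type that deliberately does not implement `Debug`.
pub struct NotDebug;

/// Hypothetical stand-in for a socket type that is generic over its message type.
pub struct SharedSocketSketch<T> {
    name: String,
    _message: PhantomData<T>,
}

impl<T> SharedSocketSketch<T> {
    pub fn new(name: &str) -> Self {
        Self { name: name.to_string(), _message: PhantomData }
    }
}

// No `T: Debug` bound here: only fields that are themselves `Debug` are printed.
impl<T> fmt::Debug for SharedSocketSketch<T> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("SharedSocketSketch")
            .field("name", &self.name)
            .finish_non_exhaustive()
    }
}
```

With this impl, `format!("{:?}", SharedSocketSketch::<NotDebug>::new("ereport"))` compiles even though `NotDebug` has no `Debug` impl. A derive-macro crate such as the one linked above can generate an equivalent impl without the manual boilerplate.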
@@ -282,15 +284,18 @@ details in `dump.json`."
pub struct SingleSp {
    interface: String,
    cmds_tx: mpsc::Sender<InnerCommand>,
    ereport_req_tx: mpsc::Sender<ereport::WorkerRequest>,
I would not do this as a part of this PR, but I'm curious for your thoughts: there are a bunch of places like this one where we're going from one "thing" (in this case, an `mpsc::Sender`) to two "things". The original "one" is always named generically, and the new one is named indicating it's related to ereports. Do you think we should go back and rename the generic things to indicate they're intended for control-plane-agent?
Yeah, I think it might be worthwhile to do that (especially if we ever add a third port to the management network...). You're right that I just didn't really want to touch all the control-plane-agent code in this PR, but I'd be happy to do it in a separate commit.
i thought i had gotten rid of these...
as per [this comment][1] from @jgallagher. this is similar to the control-plane-agent protocol. wow, it's almost like we're reimplementing TCP (but without flow control because that's hard). [1]: #370 (comment)
This pull request implements the MGS side of the SP ereport ingestion
protocol. For more information on the ereport ingestion protocol, refer
to the following RFDs:
In particular, this branch makes the following changes:

- Types in `gateway-messages` representing the ereport protocol wire messages exchanged between MGS and the SP; these are defined in RFD 545.
- A refactored `shared_socket` module in `gateway-sp-comms`. Currently, the `SharedSocket` code for handling received packets is tightly coupled to the control plane agent message types. Ereport requests and responses are sent on a separate UDP port. Therefore, I've hacked up this code a bit to allow `SharedSocket` to be generic over a `RecvHandler` trait that defines how to handle received packets and dispatch them to single-SP handlers. This is implemented for both the control-plane-agent protocol and, separately, for the ereport protocol.
- Code for decoding ereport packets and a per-SP worker task that tracks the metadata sent by the SP and adds it to each batch of ereports.
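The idea of making the socket generic over a receive handler can be sketched roughly as follows. This is a synchronous toy under stated assumptions, not the actual `gateway-sp-comms` API: only the name `RecvHandler` comes from the description above, while the trait's shape, `Dispatcher`, and both toy handlers (including their parsing rules) are invented for illustration.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical shape for the handler trait: each protocol decides how raw
/// packet bytes become a message destined for a single-SP handler.
pub trait RecvHandler {
    type Message;
    /// Returns `None` if the packet cannot be parsed for this protocol.
    fn parse(&self, data: &[u8]) -> Option<Self::Message>;
}

/// Toy control-plane-agent handler: treats packets as UTF-8 text (invented).
pub struct ToyControlPlaneHandler;
impl RecvHandler for ToyControlPlaneHandler {
    type Message = String;
    fn parse(&self, data: &[u8]) -> Option<String> {
        std::str::from_utf8(data).ok().map(str::to_owned)
    }
}

/// Toy ereport handler: records the length of non-empty packets (invented).
pub struct ToyEreportHandler;
impl RecvHandler for ToyEreportHandler {
    type Message = usize;
    fn parse(&self, data: &[u8]) -> Option<usize> {
        if data.is_empty() { None } else { Some(data.len()) }
    }
}

/// Protocol-agnostic dispatch: the per-SP bookkeeping is written once and
/// reused for any `RecvHandler`. Per-SP handlers are modeled as inboxes
/// keyed by peer address.
pub struct Dispatcher<H: RecvHandler> {
    handler: H,
    per_sp: HashMap<SocketAddr, Vec<H::Message>>,
}

impl<H: RecvHandler> Dispatcher<H> {
    pub fn new(handler: H) -> Self {
        Self { handler, per_sp: HashMap::new() }
    }

    /// Parses a packet and forwards it to the sender's inbox.
    /// Returns `false` if the handler rejected the packet.
    pub fn on_packet(&mut self, peer: SocketAddr, data: &[u8]) -> bool {
        match self.handler.parse(data) {
            Some(msg) => {
                self.per_sp.entry(peer).or_default().push(msg);
                true
            }
            None => false,
        }
    }

    /// Messages received so far from `peer` (empty if none).
    pub fn inbox(&self, peer: &SocketAddr) -> &[H::Message] {
        self.per_sp.get(peer).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

A `Dispatcher<ToyControlPlaneHandler>` and a `Dispatcher<ToyEreportHandler>` would each sit on their own UDP port while sharing all of the dispatch code, which mirrors the motivation given in the description.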
A corresponding Omicron branch, oxidecomputer/omicron#7903, depends on
this branch and integrates the ereport code into the MGS app binary and
the SP simulator.