Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest ereports from SPs #370

Open
wants to merge 42 commits into
base: main
Choose a base branch
from
Open

Ingest ereports from SPs #370

wants to merge 42 commits into from

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Mar 27, 2025

This pull request implements the MGS side of the SP ereport ingestion
protocol. For more information on the ereport ingestion protocol, refer
to the following RFDs:

In particular, this branch makes the following changes:

  • Add types to gateway-messages representing the ereport protocol wire
    messages exchanged between MGS and the SP; these are defined in
    RFD 545.
  • Somewhat substantial refactoring to the shared_socket module in
    gateway-sp-comms. Currently, the SharedSocket code for handling
    received packets is tightly coupled to the control plane agent message
    types. Ereport requests and responses are sent on a separate UDP port.
    Therefore, I've hacked up this code a bit to allow SharedSocket to
    be generic over a RecvHandler trait that defines how to handle
    received packets and dispatch them to single-SP handlers. This is
    implemented for both the control-plane-agent protocol and, separately,
    for the ereport protocol.
  • Actually add said implementation of the ereport protocol, including
    code for decoding ereport packets and a per-SP worker task that tracks
    the metadata sent by the SP and adds it to each batch of ereports.

A corresponding Omicron branch, oxidecomputer/omicron#7903, depends on
this branch and integrates the ereport code into the MGS app binary and
the SP simulator.

@hawkw hawkw requested a review from cbiffle March 27, 2025 23:08
@hawkw hawkw changed the title Ereport protocol messages Ingest ereports from SPs Apr 6, 2025
Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize, SerializedSize,
)]
#[repr(transparent)]
pub struct RestartId(pub u128);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider using the uuid crate for this, as it supports no_std, but it didn't really seem worth adding another dependency that would have to be compiled in to the Hubris binaries basically just to get UUID-like formatting in Debug impls that are only used by MGS...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt that it was nicer to put all the ereport messages (both the SP-to-MGS messages and the MGS-to-SP messages) in their own module, rather than putting some of them in sp_to_mgs and others in mgs_to_sp. This way, a reader interested in the ereport stuff need only read this module, and a reader interested in the control-plane-agent protocol doesn't have to scroll past ereport messages. Future additions to the ereport protocol would change only the code in this file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to https://github.com/oxidecomputer/management-gateway-service/pull/370/files#r2031615326, it felt nicer to have all the ereport bits defined in their own module, rather than smeared across shared_socket.rs and single_sp.rs. That way, all the ereport-specific code is in one place and it's easier to see the relationship between the code in the ereport socket receive handler and the single-SP handler, rather than having to trace it between modules.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No argument from me. Do you think it would be clearer to move the control-plane-agent stuff to its own submodule too? (Not as part of this PR, but maybe alongside the renamings I suggested in another comment?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm definitely open to doing that in a subsequent change, I think it seems pretty reasonable (especially if we ever add a third socket/protocol for some other thing). I agree that we shouldn't mess with the control-plane-agent stuff in this PR though.

@@ -27,7 +30,7 @@ thiserror.workspace = true
tlvc.workspace = true
tokio.workspace = true
usdt.workspace = true
uuid.workspace = true
uuid = { workspace = true, features = ["v4"] }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was necessary for unrelated reasons: the task dump code added in #316 uses Uuid::new_v4 in gateway-sp-comms, but gateway-sp-comms doesn't enable the "v4" feature (only faux-mgs does). So, this didn't compile for me using the v2 cargo feature resolver.

@hawkw hawkw requested review from mkeeter and jgallagher and removed request for mkeeter April 7, 2025 17:41
@hawkw hawkw marked this pull request as ready for review April 7, 2025 17:41
Comment on lines 311 to 317
Some(CborValue::Integer(i)) => task_names
.get(i as usize)
.cloned()
.ok_or(DecodeError::BadTaskNameIndex {
n,
index: i as usize,
})?,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbiffle this is what I was asking about in https://github.com/oxidecomputer/rfd/pull/849#discussion_r2027376877: right now, this code assumes that when an ereport's task name is an integer, it will always be the index of an ereport that was earlier in the packet than the current one. I wanted to get your confirmation of whether we could rely on that assumption or would need to handle indexes pointing ahead of the current ereport.

Copy link
Collaborator

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look particularly closely at the ereport parsing etc details (I'll defer that to you and Cliff, if that's okay). The structural MGS changes look good; just a handful of nits and questions.

recv_handler_task: JoinHandle<()>,
log: Logger,
}

impl Drop for SharedSocket {
// Hand-rolled `Debug` impl as the message type (`T`) needn't be `Debug` for the
// `SharedSocket` to be debug.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine, but I'll plug https://docs.rs/derive-where/latest/derive_where/ since we use it in omicron, if you want to pull it in here too.

@@ -282,15 +284,18 @@ details in `dump.json`."
pub struct SingleSp {
interface: String,
cmds_tx: mpsc::Sender<InnerCommand>,
ereport_req_tx: mpsc::Sender<ereport::WorkerRequest>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not do this as a part of this PR, but I'm curious for your thoughts: there are a bunch of places like this one where we're going from one "thing" (in this case, an mpsc::Sender) to two "things". The original "one" is always named generically, and the new one is named indicating it's related to ereports. Do you think we should go back and rename the generic things to indicate they're intended for control-plane-agent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it might be worthwhile to do that (especially if we ever add a third port to the management network...). You're right that I just didn't really want to touch all the control-plane-agent code in this PR, but I'd be happy to do it in a separate commit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No argument from me. Do you think it would be clearer to move the control-plane-agent stuff to its own submodule too? (Not as part of this PR, but maybe alongside the renamings I suggested in another comment?)

hawkw added 4 commits April 10, 2025 11:29
i thought i had gotten rid of these...
as per [this comment][1] from @jgallagher. this is
similar to the control-plane-agent protocol. wow, it's
almost like we're reimplementing TCP (but without flow
control because thats hard).

[1]: #370 (comment)
@hawkw hawkw requested a review from jgallagher April 10, 2025 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants