[mgs] API for ingesting ereports from SPs #7903

Merged
merged 52 commits into from
May 13, 2025
hawkw
Member

@hawkw hawkw commented Apr 1, 2025

oxidecomputer/management-gateway-service#370 adds code to the
gateway-messages and gateway-sp-comms crates to implement the MGS
side of the ereport ingestion protocol. For more information on the
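protocol itself, refer to the following RFDs:

- RFD 520 Control Plane Fault Ingestion and Data Model
- RFD 544 Embedded E-Report Formats
- RFD 545 Firmware E-Report Aggregation and Evacuation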

This branch integrates the changes from those crates into the actual
MGS application, as well as adding simulated ereports to the SP
simulator. I've added some simple tests based on this.

In addition, this branch restructures the initial implementation of
the control plane ereport API I added in #7833. That branch proposed
a single dropshot API that would be implemented by both sled-agent and
MGS. This was possible because the initial design would have indexed all
ereport producers (reporters) by a UUID. However, per recent
conversations with @cbiffle and @jgallagher, we've determined that Nexus
will instead request ereports from service processors indexed by SP
physical topology (e.g. type and slot), like the rest of the MGS HTTP
API. Therefore, we can no longer have a single HTTP API for ereporters
that's implemented by both MGS and sled-agents, and instead, SP ereport
ingestion should be a new endpoint on the MGS API.

This branch does that, moving the ereport query params into
ereport-types, eliminating the separate ereport-api and
ereport-client crates, and adding an ereport-ingestion-by-SP-location
endpoint to the management gateway API.
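
As a rough illustration of what indexing by physical topology means, here is a minimal sketch in which an SP is addressed by type and slot rather than by a UUID. The names `SpType`, `SpLocation`, `ereports_path`, and `pending_counts`, and the URL layout, are assumptions made up for this example; they are not taken from the actual MGS API.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical SP kinds; the real MGS types may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum SpType {
    Sled,
    Power,
    Switch,
}

impl fmt::Display for SpType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            SpType::Sled => "sled",
            SpType::Power => "power",
            SpType::Switch => "switch",
        })
    }
}

/// An SP identified by where it sits in the rack, not by a UUID.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct SpLocation {
    pub typ: SpType,
    pub slot: u16,
}

impl SpLocation {
    /// Builds an illustrative (assumed) request path for this SP's ereports.
    pub fn ereports_path(&self) -> String {
        format!("/sp/{}/{}/ereports", self.typ, self.slot)
    }
}

/// Counts reports per SP, using the location itself as the map key.
pub fn pending_counts(reports: &[(SpLocation, &str)]) -> BTreeMap<SpLocation, usize> {
    let mut counts: BTreeMap<SpLocation, usize> = BTreeMap::new();
    for (loc, _report) in reports {
        *counts.entry(*loc).or_insert(0) += 1;
    }
    counts
}
```

Because the location is a small `Copy` value with a total order, it can serve directly as a map key or path parameter, which is why a UUID-keyed API shared with sled-agent no longer fits.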

Furthermore, there are some terminology changes. The ereport
protocol has a value which we've variously referred to as an "instance
ID", a "generation ID", and a "restart nonce", all of which have
unfortunate name collisions that are potentially confusing or just
unpleasant. We've agreed to refer to this value everywhere as a
"restart ID", so this commit also changes that.

hawkw added 4 commits April 1, 2025 10:15
Currently, the initial ereport ingestion API I added in #7833 proposed
a single dropshot API that would be implemented by both sled-agent and
MGS. This was possible because the initial design would have indexed all
ereport producers (reporters) by a UUID. However, per recent
conversations with @cbiffle and @jgallagher, we've determined that Nexus
will instead request ereports from service processors indexed by SP
physical topology (e.g. type and slot), like the rest of the MGS HTTP
API. Therefore, we can no longer have a single HTTP API for ereporters
that's implemented by both MGS and sled-agents, and instead, SP ereport
ingestion should be a new endpoint on the MGS API.

This commit does that, moving the ereport query params into
`ereport-types`, eliminating the separate `ereport-api` and
`ereport-client` crates, and adding an ereport-ingestion-by-SP-location
endpoint to the management gateway API.
@hawkw hawkw requested a review from jgallagher April 1, 2025 21:43
@hawkw hawkw self-assigned this Apr 1, 2025
@hawkw
Member Author

hawkw commented Apr 1, 2025

I'm not 100% sure what our disposition on merging this ought to be, as it does add an API to the MGS http_entrypoints that's currently unimplemented and always returns an error. I figured it was good to at least go ahead and open the PR so that other changes can be based upon it...

@hawkw hawkw marked this pull request as draft April 2, 2025 19:19
@hawkw
Member Author

hawkw commented Apr 2, 2025

I'm turning this into a draft as I'm going to keep using this branch to hack up the ereport protocol types a bit more.

hawkw added a commit that referenced this pull request Apr 4, 2025
It turns out that our Git dependency on the
oxidecomputer/management-gateway-service repo hasn't been updated in...
a while. We're currently on a commit from September of last year,
oxidecomputer/management-gateway-service@9bbac47.
This branch updates it to the current HEAD commit,
oxidecomputer/management-gateway-service@f9566e6.

The only changes in MGS that required code changes in Omicron are:

- oxidecomputer/management-gateway-service#291, where I added a new
  `MeasurementKind` for AMD CPU T<sub>ctl</sub> values (which are not
  temperatures in degrees Celsius, but a secret third thing).
- oxidecomputer/management-gateway-service#316 by @mkeeter, adding the
  interface to read SP task dumps over the network. Since this adds
  methods to the `sp_impl::SpHandler` trait, the SP simulator
  implementations need to be updated, or else they will no longer
  compile. For now, I've just made these `unimplemented!()`, as we're
  not currently actually _using_ them.

In my PR #7903 implementing ereport ingestion from SPs, I had to make
these changes as part of changing the MGS dependency to pull in the new
`gateway-sp-comms` code for ereports. Since this isn't actually related,
and is just necessary to update the Git dep, I figured I'd pull that
commit (49973ae) into its own PR.
hawkw added a commit that referenced this pull request Apr 8, 2025
It turns out that our Git dependency on the
oxidecomputer/management-gateway-service repo hasn't been updated in...
a while. We're currently on a commit from September of last year,
oxidecomputer/management-gateway-service@9bbac47.
This branch updates it to the current HEAD commit,
oxidecomputer/management-gateway-service@f9566e6.

The only changes in MGS that required code changes in Omicron are:

- oxidecomputer/management-gateway-service#291, where I added a new
`MeasurementKind` for AMD CPU T<sub>ctl</sub> values (which are not
temperatures in degrees Celsius, but a secret third thing).
- oxidecomputer/management-gateway-service#316 by @mkeeter, adding the
interface to read SP task dumps over the network. Since this adds
methods to the `sp_impl::SpHandler` trait, the SP simulator
implementations need to be updated, or else they will no longer compile.
For now, I've just made these `unimplemented!()`, as we're not currently
actually _using_ them.

In my PR #7903 implementing ereport ingestion from SPs, I had to make
these changes as part of changing the MGS dependency to pull in the new
`gateway-sp-comms` code for ereports. Since this isn't actually related,
and is just necessary to update the Git dep, I figured I'd pull that
commit (49973ae) into its own PR.

---------

Co-authored-by: John Gallagher <john@oxidecomputer.com>
hawkw added 2 commits April 28, 2025 11:19
This necessitates moving the construction of `SpUpdate` earlier in the
sim Gimlet initialization, so that the ereport state can ask it for
Hubris version metadata. Now, it has to be constructed in
`Gimlet::spawn` and passed in to `UdpTask::new` (and thus
`Handler::new`), rather than constructed inside `Handler::new`, so that
the ereport state can also be passed in to `Handler::new`. @dap, let me
know what you think of this --- if you base additional changes on code
with the current factoring, it will probably make the eventual merge
conflicts between our branches much more unpleasant, so maybe we can
pull this out and land it separately, and have my branch depend on that?
Let me know what you think!
hawkw added a commit to oxidecomputer/management-gateway-service that referenced this pull request May 5, 2025
This pull request implements the MGS side of the SP ereport ingestion
protocol. For more information on the ereport ingestion protocol, refer
to the following RFDs:

- [RFD 520 Control Plane Fault Ingestion and Data Model][RFD 520]
- [RFD 544 Embedded E-Report Formats][RFD 544]
- [RFD 545 Firmware E-Report Aggregation and Evacuation][RFD 545]

In particular, this branch makes the following changes:

- Add types to `gateway-messages` representing the ereport protocol wire
  messages exchanged between MGS and the SP; these are defined in 
  [RFD 545].
- Somewhat substantial refactoring to the `shared_socket` module in
  `gateway-sp-comms`. Currently, the `SharedSocket` code for handling
  received packets is tightly coupled to the control plane agent message
  types. Ereport requests and responses are sent on a separate UDP port.
  Therefore, I've hacked up this code a bit to allow `SharedSocket` to
  be generic over a `RecvHandler` trait that defines how to handle
  received packets and dispatch them to single-SP handlers. This is
  implemented for both the control-plane-agent protocol and, separately,
  for the ereport protocol.
- Actually add said implementation of the ereport protocol, including
  code for decoding ereport packets and a per-SP worker task that tracks
  the metadata sent by the SP and adds it to each batch of ereports.
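
The refactoring described in the list above can be pictured with a minimal sketch: a receive path that is generic over a handler trait, instantiated once per protocol. Everything here (the shape of `RecvHandler`, the `Dispatcher` type, and the two toy handlers) is a hypothetical simplification for illustration, not the actual `gateway-sp-comms` code, which is asynchronous and considerably more involved.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical trait: how one protocol turns a raw datagram into a message.
pub trait RecvHandler {
    type Message;
    /// Returns `None` for packets that cannot be decoded.
    fn decode(&self, packet: &[u8]) -> Option<Self::Message>;
}

/// Protocol-agnostic receive side: decodes with `H`, then files the message
/// under the peer that sent it (standing in for a single-SP handler).
pub struct Dispatcher<H: RecvHandler> {
    handler: H,
    per_sp: HashMap<SocketAddr, Vec<H::Message>>,
}

impl<H: RecvHandler> Dispatcher<H> {
    pub fn new(handler: H) -> Self {
        Self { handler, per_sp: HashMap::new() }
    }

    /// Returns `true` if the packet was decoded and dispatched.
    pub fn on_packet(&mut self, peer: SocketAddr, packet: &[u8]) -> bool {
        match self.handler.decode(packet) {
            Some(msg) => {
                self.per_sp.entry(peer).or_default().push(msg);
                true
            }
            None => false,
        }
    }

    /// Messages dispatched so far for one peer.
    pub fn messages_for(&self, peer: &SocketAddr) -> &[H::Message] {
        match self.per_sp.get(peer) {
            Some(msgs) => msgs.as_slice(),
            None => &[],
        }
    }
}

/// Toy stand-in for the control-plane-agent protocol: first byte is a message ID.
pub struct AgentHandler;
impl RecvHandler for AgentHandler {
    type Message = u8;
    fn decode(&self, packet: &[u8]) -> Option<u8> {
        packet.first().copied()
    }
}

/// Toy stand-in for the ereport protocol: any non-empty payload is kept as an opaque blob.
pub struct EreportHandler;
impl RecvHandler for EreportHandler {
    type Message = Vec<u8>;
    fn decode(&self, packet: &[u8]) -> Option<Vec<u8>> {
        if packet.is_empty() { None } else { Some(packet.to_vec()) }
    }
}
```

In this sketch, one `Dispatcher<AgentHandler>` and one `Dispatcher<EreportHandler>` would each sit behind their own UDP port, sharing the dispatch logic while keeping message types separate.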

A corresponding Omicron branch, oxidecomputer/omicron#7903, depends on
this branch and integrates the ereport code into the MGS app binary and
the SP simulator.

[RFD 520]: https://rfd.shared.oxide.computer/rfd/0520
[RFD 544]: https://rfd.shared.oxide.computer/rfd/0544
[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545
@hawkw
Member Author

hawkw commented May 13, 2025

One note to anyone reviewing this: there's a bit more work I'd like to do in the SP simulator and integrating with things like @davepacheco's work on simulating updates, simulated SP non-responsiveness, etc, but I'd kind of prefer to punt that to a subsequent branch. I'd prefer to merge this sooner as picking up the ereport APIs is blocking other MGS repo updates in Omicron.

Contributor

@jgallagher jgallagher left a comment

LGTM, thanks. Just a couple small questions.

@@ -441,8 +435,10 @@ gateway-client = { path = "clients/gateway-client" }
# is "fine", because SP/MGS communication maintains forwards and backwards
# compatibility, but will mean that faux-mgs might be missing new
# functionality.)
Contributor

Less important on this commit than the upcoming one, but we should probably do what this comment says and update package-manifest.toml too.

}

let ereport_start_pos = pos;
buf[pos] = 0x9f; // start list
Contributor

Doesn't need to block this PR, but: are there any cbor crates we could use here (we're already using ciborium elsewhere IIRC) to serialize this list? Or if that doesn't really work, do any of them happen to export this constant?

Member Author

So, the reason that this part is a bit weird is that we want a particular behavior: we would like to put as many ereports as possible into the packet until the packet is full, and not write any bytes for any ereports that don't fit in the packet. This is why we build the list manually: just trying to use serde's serialize_seq with the whole iterator, or trying to serialize a Vec<Ereport>, would fail, but might have written some bytes from the last ereport that didn't fit before it failed, which is not the desired behavior. I think it might be possible to coax the lower-level serde API into letting us do that by serializing each entry individually and seeing if that call succeeds or fails, and tracking the last byte position after each successful serialization and chomping off anything past that. But, this felt like an easier way to do the same thing.

Unfortunately, none of the CBOR crates I looked at seemed to have constants for these bytes that were publicly exposed; per the ciborium docs it seems like they are planning to expose a lower-level library with things like that, but (AFAICT) it doesn't exist yet.
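
To make the packing behavior concrete, here is a minimal sketch of the idea. It assumes each ereport has already been encoded to CBOR bytes (the simulator serializes into the buffer instead), and it assumes the list is closed with the CBOR break byte `0xff`; the function name `pack_ereports` and both constants are hypothetical names for this example.

```rust
/// CBOR initial byte that opens an indefinite-length array.
const CBOR_ARRAY_INDEFINITE: u8 = 0x9f;
/// CBOR "break" stop code that closes an indefinite-length item.
const CBOR_BREAK: u8 = 0xff;

/// Packs pre-encoded CBOR items into `buf` as one indefinite-length list.
///
/// Items are written in order until the next one would not fit; no bytes are
/// written for an item that does not fit. Returns `(bytes_written, items_packed)`,
/// or `None` if `buf` cannot even hold an empty list.
pub fn pack_ereports(buf: &mut [u8], encoded: &[Vec<u8>]) -> Option<(usize, usize)> {
    // An empty list still needs the start byte and the break byte.
    if buf.len() < 2 {
        return None;
    }
    let mut pos = 0;
    buf[pos] = CBOR_ARRAY_INDEFINITE;
    pos += 1;

    let mut packed = 0;
    for item in encoded {
        let end = pos + item.len();
        // Always keep one byte in reserve for the trailing break.
        if end + 1 > buf.len() {
            break;
        }
        buf[pos..end].copy_from_slice(item);
        pos = end;
        packed += 1;
    }

    buf[pos] = CBOR_BREAK;
    pos += 1;
    Some((pos, packed))
}
```

An indefinite-length list suits this because the number of entries does not have to be known before the first one is written; the writer simply stops and emits the break once the next entry would overflow the packet.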

Member Author

It might be worth changing this code and the MGS code to use ciborium instead of serde-cbor as it seems to provide most of the same APIs we need and seems more actively maintained, but I'd kind of rather do that in a separate branch that also updates MGS.

@hawkw hawkw enabled auto-merge (squash) May 13, 2025 20:44
@hawkw hawkw merged commit 5982442 into main May 13, 2025
19 checks passed
@hawkw hawkw deleted the eliza/ereport-sp-api branch May 13, 2025 23:16
hawkw added a commit that referenced this pull request Jun 25, 2025
This branch adds a Nexus background task for ingesting ereports from
service processors via MGS, using the MGS API endpoint added in #7903.
These APIs in turn expose the MGS/SP ereport ingestion protocol added in
oxidecomputer/management-gateway-service#370.

For more information on the protocol itself, refer to the following
RFDs:

- [RFD 520 Control Plane Fault Ingestion and Data Model][RFD 520]
- [RFD 544 Embedded E-Report Formats][RFD 544]
- [RFD 545 Firmware E-Report Aggregation and Evacuation][RFD 545]

In addition to the ereport ingester background task, this branch also
adds database tables for storing ereports from SPs, which are necessary
to implement the ingestion task. I've also added a table for storing
ereports from the sled host OS, which will eventually be ingested via
sled-agent. While there isn't currently anything that populates that
table, I wanted to begin sketching out how we would represent the two
categories of ereports we expect to deal with, and how we would query
both tables for ereports.

Finally, this branch also adds OMDB commands for querying the ereports
stored in the database. These OMDB commands may be useful both for
debugging the ereport ingestion subsystem itself *and* for diagnosing
issues once the SP firmware actually emits ereports. At present, the
higher-level components of the fault-management subsystem, which will
process ereports, diagnose faults, and generate alerts, have yet to be
implemented. Therefore, the OMDB ereport commands serve as an interim
solution for accessing the lower-level data, which may be useful for
debugging such faults until the higher-level FMA components exist.

[RFD 520]: https://rfd.shared.oxide.computer/rfd/0520
[RFD 544]: https://rfd.shared.oxide.computer/rfd/0544
[RFD 545]: https://rfd.shared.oxide.computer/rfd/0545

---------

Co-authored-by: Sean Klein <sean@oxide.computer>