Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read SP task crash dumps over the network #316

Merged
merged 10 commits into from
Dec 10, 2024
Merged

Conversation

mkeeter
Copy link
Contributor

@mkeeter mkeeter commented Nov 18, 2024

This PR adds new messages and functions to read SP task crash dumps over the network and write them to a canonical file, without the corresponding Hubris archive being available at debug time.

These "dry" dumps can then be hydrated with humility hydrate (oxidecomputer/humility#520). In the meantime, they are significantly smaller (RAM only, no ELF or debug info) and easier to pass around (we can using the existing MGS connection instead of talking to dump-agent and knowing how humpty works).

Copy link
Collaborator

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; some minor questions about the messages.

@@ -419,6 +419,12 @@ enum Command {
#[clap(subcommand)]
cmd: MonorailCommand,
},

/// List and read per-task crash dumps
Dump {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this take a component or is this only ever going to be an SP thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably SP-only? It's pretty specific to Hubris, and we don't do task dumps on the RoT for security reasons.

@mkeeter mkeeter force-pushed the mkeeter/sp-task-dump branch from 6cf8e7c to 1dc56e1 Compare November 18, 2024 19:23
@mkeeter mkeeter force-pushed the mkeeter/sp-task-dump branch 2 times, most recently from 7fed014 to dd40e70 Compare November 20, 2024 16:16
@mkeeter mkeeter force-pushed the mkeeter/sp-task-dump branch from dd40e70 to 63072f4 Compare November 20, 2024 16:18
@mkeeter mkeeter merged commit f11406f into main Dec 10, 2024
10 checks passed
mkeeter added a commit to oxidecomputer/hubris that referenced this pull request Dec 10, 2024
Hubris counterpart to
oxidecomputer/management-gateway-service#316,
allowing us to read task dumps over the network via MGS.

These MGS messages – unlike `dump-agent` – require more intelligence
into the Hubris side. The external API is just "start read" and
"continue read"; Hubris is responsible for knowing how `humpty` writes
stuff into RAM.

We allow for up to 4 simultaneous readers, distinguished by
caller-provided keys. If there are more than four simultaneous readers,
then (1) what the heck, and (2) some readers will return an error
indicating that their key is no longer valid.

There are also minor tweaks to `dump-agent` and `jefe` implementations:
- `dump-agent` learns how to read dump data into a lease (instead of
  returning a 256-byte value)
- `jefe` caches the most recent dump area → memory address lookup, so we
  don't always need to start from the beginning of the linked list. This
  isn't useful if we have multiple readers, but makes the single-reader
  case less Accidentally Quadratic.
mkeeter added a commit to oxidecomputer/humility that referenced this pull request Dec 10, 2024
See `oxidecomputer/management-gateway-service#316` and
`oxidecomputer/hubris#1921` for details

This PR adds a new `humility hydrate` subcommand, which takes a "dry" task dump
and staples it to a Hubris archive, creating a normal Hubris core file.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 4, 2025
It turns out that our Git dependency on the
oxidecomputer/management-gateway-service repo hasn't been updated in...
a while. We're currently on a commit from September of last year,
oxidecomputer/management-gateway-service@9bbac47.
This branch updates it to the current HEAD commit,
oxidecomputer/management-gateway-service@f9566e6.

The only changes in MGS that required code changes in Omicron are:

- oxidecomputer/management-gateway-service#291, where I added a new
  `MeasurementKind` for AMD CPU T<sub>ctl</sub> values (which are not temperatures in degrees Celcius, but a secret third thing).
- oxidecomputer/management-gateway-service#316 by @mkeeter, adding the
  interface to read SP task dumps over the network. Since this adds
  methods to the `sp_impl::SpHandler` trait, the SP simulator
  implementations need to be updated, or else they will no longer
  compile. For now, I've just made these `unimplemented!()`, as we're
  not currently actually _using_ them.

In my PR #7903 implementing ereport ingestion from SPs, I had to make
these changes as part of changing the MGS dependency to pull in the new
`gateway-sp-comms` code for ereports. Since this isn't actually related,
and is just necessary to update the Git dep, I figured I'd pull that
commit (49973ae) into its own PR.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 8, 2025
It turns out that our Git dependency on the
oxidecomputer/management-gateway-service repo hasn't been updated in...
a while. We're currently on a commit from September of last year,
oxidecomputer/management-gateway-service@9bbac47.
This branch updates it to the current HEAD commit,
oxidecomputer/management-gateway-service@f9566e6.

The only changes in MGS that required code changes in Omicron are:

- oxidecomputer/management-gateway-service#291, where I added a new
`MeasurementKind` for AMD CPU T<sub>ctl</sub> values (which are not
temperatures in degrees Celcius, but a secret third thing).
- oxidecomputer/management-gateway-service#316 by @mkeeter, adding the
interface to read SP task dumps over the network. Since this adds
methods to the `sp_impl::SpHandler` trait, the SP simulator
implementations need to be updated, or else they will no longer compile.
For now, I've just made these `unimplemented!()`, as we're not currently
actually _using_ them.

In my PR #7903 implementing ereport ingestion from SPs, I had to make
these changes as part of changing the MGS dependency to pull in the new
`gateway-sp-comms` code for ereports. Since this isn't actually related,
and is just necessary to update the Git dep, I figured I'd pull that
commit (49973ae) into its own PR.

---------

Co-authored-by: John Gallagher <john@oxidecomputer.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants