-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read SP task crash dumps over the network #316
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; some minor questions about the messages.
@@ -419,6 +419,12 @@ enum Command { | |||
#[clap(subcommand)] | |||
cmd: MonorailCommand, | |||
}, | |||
|
|||
/// List and read per-task crash dumps | |||
Dump { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this take a component or is this only ever going to be an SP thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably SP-only? It's pretty specific to Hubris, and we don't do task dumps on the RoT for security reasons.
6cf8e7c
to
1dc56e1
Compare
7fed014
to
dd40e70
Compare
dd40e70
to
63072f4
Compare
Hubris counterpart to oxidecomputer/management-gateway-service#316, allowing us to read task dumps over the network via MGS. These MGS messages – unlike `dump-agent` – require more intelligence into the Hubris side. The external API is just "start read" and "continue read"; Hubris is responsible for knowing how `humpty` writes stuff into RAM. We allow for up to 4 simultaneous readers, distinguished by caller-provided keys. If there are more than four simultaneous readers, then (1) what the heck, and (2) some readers will return an error indicating that their key is no longer valid. There are also minor tweaks to `dump-agent` and `jefe` implementations: - `dump-agent` learns how to read dump data into a lease (instead of returning a 256-byte value) - `jefe` caches the most recent dump area → memory address lookup, so we don't always need to start from the beginning of the linked list. This isn't useful if we have multiple readers, but makes the single-reader case less Accidentally Quadratic.
See `oxidecomputer/management-gateway-service#316` and `oxidecomputer/hubris#1921` for details This PR adds a new `humility hydrate` subcommand, which takes a "dry" task dump and staples it to a Hubris archive, creating a normal Hubris core file.
It turns out that our Git dependency on the oxidecomputer/management-gateway-service repo hasn't been updated in... a while. We're currently on a commit from September of last year, oxidecomputer/management-gateway-service@9bbac47. This branch updates it to the current HEAD commit, oxidecomputer/management-gateway-service@f9566e6. The only changes in MGS that required code changes in Omicron are: - oxidecomputer/management-gateway-service#291, where I added a new `MeasurementKind` for AMD CPU T<sub>ctl</sub> values (which are not temperatures in degrees Celcius, but a secret third thing). - oxidecomputer/management-gateway-service#316 by @mkeeter, adding the interface to read SP task dumps over the network. Since this adds methods to the `sp_impl::SpHandler` trait, the SP simulator implementations need to be updated, or else they will no longer compile. For now, I've just made these `unimplemented!()`, as we're not currently actually _using_ them. In my PR #7903 implementing ereport ingestion from SPs, I had to make these changes as part of changing the MGS dependency to pull in the new `gateway-sp-comms` code for ereports. Since this isn't actually related, and is just necessary to update the Git dep, I figured I'd pull that commit (49973ae) into its own PR.
It turns out that our Git dependency on the oxidecomputer/management-gateway-service repo hasn't been updated in... a while. We're currently on a commit from September of last year, oxidecomputer/management-gateway-service@9bbac47. This branch updates it to the current HEAD commit, oxidecomputer/management-gateway-service@f9566e6. The only changes in MGS that required code changes in Omicron are: - oxidecomputer/management-gateway-service#291, where I added a new `MeasurementKind` for AMD CPU T<sub>ctl</sub> values (which are not temperatures in degrees Celcius, but a secret third thing). - oxidecomputer/management-gateway-service#316 by @mkeeter, adding the interface to read SP task dumps over the network. Since this adds methods to the `sp_impl::SpHandler` trait, the SP simulator implementations need to be updated, or else they will no longer compile. For now, I've just made these `unimplemented!()`, as we're not currently actually _using_ them. In my PR #7903 implementing ereport ingestion from SPs, I had to make these changes as part of changing the MGS dependency to pull in the new `gateway-sp-comms` code for ereports. Since this isn't actually related, and is just necessary to update the Git dep, I figured I'd pull that commit (49973ae) into its own PR. --------- Co-authored-by: John Gallagher <john@oxidecomputer.com>
This PR adds new messages and functions to read SP task crash dumps over the network and write them to a canonical file, without the corresponding Hubris archive being available at debug time.
These "dry" dumps can then be hydrated with
humility hydrate
(oxidecomputer/humility#520). In the meantime, they are significantly smaller (RAM only, no ELF or debug info) and easier to pass around (we can using the existing MGS connection instead of talking todump-agent
and knowing howhumpty
works).