gimlet-seq: Record why the power state changed #1953
Currently, when the Gimlet CPU sequencer changes the system's power state, no information about *why* the power state changed is recorded by the SP firmware. A system may power off or reboot for a variety of reasons: it may be requested by the host OS over IPCC, by the control plane over the management network, or triggered by the thermal task due to an overheat condition. This makes debugging an unexpected reboot or power off difficult, as the SP ringbuffers and other diagnostics do not indicate why an unexpected power state change occurred. See #1950 for a motivating example.

This commit resolves this as described in #1950 by adding a new field to the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so that the reason a power state change occurred can be recorded. Clients of the `cpu_seq` IPC API must now provide a `StateChangeReason` when calling `Sequencer.set_state`, along with the desired power state, and the sequencer task will record the provided reason in its ringbuffer. This way, we can distinguish between the various reasons a power state change may have occurred when debugging such issues.

The `StateChangeReason` enum also generates counters, so that the total number of power state changes can be tracked.

Fixes #1950
Currently, the `Trace::SetState` ringbuf entry in the sequencer is a tuple-like enum variant. This entry includes two `PowerState` fields, one recording the previous power state and the other recording the new power state that has been set. IMHO, using a tuple-like variant to represent this is a bit unfortunate, as in Humility, we'll see two values of the same type and it's not immediately obvious which is the previous state and which is the new state. This must be determined based on the order of the fields in the ringbuf entry, which requires referencing the Hubris code to determine.

I felt like it was nicer to just use a struct-like variant with named fields for this. That way, the semantic meaning of the two `PowerState`s is actually encoded in the debug info, and Humility can just indicate which is the previous state and which is the new state when displaying the ring buffer. I also think it's a bit nicer to name the timestamp field --- otherwise, it just looks like some arbitrary integer, and you need to look at the code to determine that it's the timestamp of the power state change.

If this is controversial for some reason, I'm happy to land it in a separate PR, but I figured it was nice to do while I was messing with the sequencer ringbuf.
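To illustrate the readability argument, here is a minimal sketch contrasting the two shapes. The field names (`prev`, `next`, `why`, `now`), the two-variant `PowerState`, and the trimmed-down `StateChangeReason` are assumptions made for this example, not necessarily what the PR uses; `Debug` output stands in for what a debugger could show from debug info.

```rust
// Hypothetical stand-ins for the real Hubris types; only two states are
// modelled here to keep the sketch short.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PowerState {
    A2,
    A0,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateChangeReason {
    ControlPlane,
    Overheat,
}

// Tuple-like shape: two values of the same type, distinguished only by
// position, plus an unlabelled integer timestamp.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TupleTrace {
    SetState(PowerState, PowerState, u64),
}

// Struct-like shape: each field carries its meaning in its name.
// The field names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Trace {
    SetState {
        prev: PowerState,
        next: PowerState,
        why: StateChangeReason,
        now: u64,
    },
}

// Render an entry the way a tool with access to field names might.
pub fn render_tuple(entry: &TupleTrace) -> String {
    format!("{:?}", entry)
}

pub fn render_named(entry: &Trace) -> String {
    format!("{:?}", entry)
}
```

With the tuple-like variant, a reader has to know the field order to tell the previous state from the new one; with the struct-like variant the rendered entry labels each value.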
drv/cpu-seq-api/src/lib.rs
Outdated
pub enum StateChangeReason {
    /// The system has just received power, so the sequencer has booted the
    /// host CPU.
    InitialPowerOn = 1,
    /// A power state change was requested by the control plane.
    ControlPlane,
    /// The host OS requested that the system power off without rebooting.
    HostPowerOff,
    /// The host OS requested that the system reboot.
    HostReboot,
    /// The system powered off because a component has overheated.
    Overheat,
}
I'm very open to suggestions about the naming of these variants, if anyone dislikes the ones I came up with...
Thanks for this! I'm not sure what to suggest about your (very good) question about hiffy backwards compatibility - might be worth checking with mfg and host-os folks who are probably the biggest users of that?
Since manufacturing and test automation currently uses the `Sequencer.set_state` IPC via Hiffy, let's avoid breaking it and instead introduce a separate `Sequencer.set_state_with_reason`. Now, calls to `set_state` without a reason will get `StateChangeReason::Other`. In practice, this means Hiffy, as all Hubris-internal callers now use `set_state_with_reason`.
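The compatibility approach described above can be sketched roughly as follows. This is a plain in-process model, not the real IPC server: the `Sequencer` struct, its `log` vector (standing in for the ringbuffer), and the reduced enums are all assumptions made for illustration.

```rust
// Hypothetical, reduced versions of the real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PowerState {
    A2,
    A0,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateChangeReason {
    // Recorded when a caller does not supply a reason.
    Other,
    ControlPlane,
    Overheat,
}

// In-process stand-in for the sequencer task; `log` plays the role of
// the ringbuffer in this sketch.
pub struct Sequencer {
    state: PowerState,
    log: Vec<(PowerState, StateChangeReason)>,
}

impl Sequencer {
    pub fn new() -> Self {
        Sequencer { state: PowerState::A2, log: Vec::new() }
    }

    // Legacy entry point: the signature is unchanged, so existing
    // callers keep working. The reason defaults to `Other`.
    pub fn set_state(&mut self, state: PowerState) {
        self.set_state_with_reason(state, StateChangeReason::Other);
    }

    // New entry point: records why the state change was requested.
    pub fn set_state_with_reason(
        &mut self,
        state: PowerState,
        reason: StateChangeReason,
    ) {
        self.state = state;
        self.log.push((state, reason));
    }

    pub fn state(&self) -> PowerState {
        self.state
    }

    pub fn log(&self) -> Vec<(PowerState, StateChangeReason)> {
        self.log.clone()
    }
}
```

Keeping the old call as a thin wrapper means there is a single code path that records state changes, and entries with `Other` identify callers that have not been updated to pass a reason.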
Instead of just setting it to `HostReboot` always, hang onto the last power off until reaching A0, so that the reason can be sent for the `set_state` call to reboot, as well as the power off. Also, just use `StateChangeReason` here instead of our own enum, and add it to the `host-sp-comms` ringbuf as well.
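A rough sketch of the "hang onto the reason until A0" idea, under stated assumptions: the `HostComms` struct, its `last_power_off` field, the `calls` vector (standing in for the calls made to the sequencer), and the fallback to `Other` when no reason was remembered are all hypothetical and only meant to show the shape of the logic.

```rust
// Hypothetical, reduced versions of the real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PowerState {
    A2,
    A0,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateChangeReason {
    Other,
    HostPowerOff,
    HostReboot,
}

// Stand-in for the task that handles host requests. `calls` records
// the (state, reason) pairs it would send to the sequencer.
pub struct HostComms {
    last_power_off: Option<StateChangeReason>,
    calls: Vec<(PowerState, StateChangeReason)>,
}

impl HostComms {
    pub fn new() -> Self {
        HostComms { last_power_off: None, calls: Vec::new() }
    }

    // Power the host off and remember why, so the same reason can be
    // reused if the host is powered back on as part of a reboot.
    pub fn power_off(&mut self, reason: StateChangeReason) {
        self.last_power_off = Some(reason);
        self.calls.push((PowerState::A2, reason));
    }

    // Bring the host back to A0, reusing the remembered reason. The
    // remembered reason is cleared once it has been used. Falling back
    // to `Other` is an assumption of this sketch.
    pub fn power_on(&mut self) {
        let reason = self
            .last_power_off
            .take()
            .unwrap_or(StateChangeReason::Other);
        self.calls.push((PowerState::A0, reason));
    }

    pub fn pending_reason(&self) -> Option<StateChangeReason> {
        self.last_power_off
    }

    pub fn calls(&self) -> Vec<(PowerState, StateChangeReason)> {
        self.calls.clone()
    }
}
```

In this model a host-requested reboot produces two entries carrying the same reason, one for the power off and one for the return to A0, rather than a hard-coded `HostReboot` on the second call.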
@jgallagher Alright, I think all the review feedback has been addressed and I'd love another review whenever you've got the chance!
LGTM 👍
0939d49 to 007470b
Currently, when the Gimlet CPU sequencer changes the system's power
state, no information about why the power state changed is recorded by
the SP firmware. A system may power off or reboot for a variety of
reasons: it may be requested by the host OS over IPCC, by the control
plane over the management network, or triggered by the thermal task due
to an overheat condition. This makes debugging an unexpected reboot or
power off difficult, as the SP ringbuffers and other diagnostics do not
indicate why an unexpected power state change occurred. See #1950 for a
motivating example.
This commit resolves this as described in #1950 by adding a new field to
the `SetState` variant in the `drv-gimlet-seq-server` ringbuffer, so
that the reason a power state change occurred can be recorded. A new IPC
function, `Sequencer.set_state_with_reason`, is added to the `cpu_seq`
IPC API. This is equivalent to `Sequencer.set_state`, but takes a
`StateChangeReason` argument in addition to the desired power state, and
the sequencer task will record the provided reason in its ringbuffer.
This way, we can distinguish between the various reasons a power state
change may have occurred when debugging such issues.

All Hubris-internal callers of `Sequencer.set_state` are updated to
instead use `Sequencer.set_state_with_reason`. In particular,
`host-sp-comms` will record a variety of different `StateChangeReason`s,
allowing us to indicate whether the host requested a normal
power-off/reboot, the host OS panicked or failed to boot, or the host
CPU reset itself. Other callers like `control-plane-agent` and `thermal`
are simpler and just say "it was the control plane" or "overheat",
respectively. For backwards compatibility with existing callers of
`Sequencer.set_state` via `hiffy`, the `set_state` IPC is left as-is,
and will be recorded in the ringbuffer with `StateChangeReason::Other`.
Since all Hubris tasks now use the new API, `Other` basically just
means `hiffy`.

The `StateChangeReason` enum also generates counters, so that the total
number of power state changes can be tracked.
Also, while I was here, I've changed the `Trace::SetState` entry in the
`drv-gimlet-seq-server` ringbuf from a tuple-like enum variant to a
struct-like enum variant with named fields. This entry includes two
`PowerState` fields, one recording the previous power state and the
other recording the new power state that has been set. IMHO, using a
tuple-like variant to represent this is a bit unfortunate, as in
Humility, we'll see two values of the same type and it's not immediately
obvious which is the previous state and which is the new state. This
must be determined based on the order of the fields in the ringbuf
entry, which requires referencing the Hubris code to determine.

I felt like it was nicer to just use a struct-like variant with named
fields for this. That way, the semantic meaning of the two `PowerState`s
is actually encoded in the debug info, and Humility can just indicate
which is the previous state and which is the new state when displaying
the ring buffer. I also think it's a bit nicer to name the timestamp
field --- otherwise, it just looks like some arbitrary integer, and you
need to look at the code to determine that it's the timestamp of the
power state change.
Fixes #1950