-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
service zones may have device/link requirements #1820
Conversation
- Add support for mapping physical devices into a service zone - Add support for mapping existing links into a service zone - Don't start the switch zone if necessary devices are missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small comments, but overall looks good!
sled-agent/src/services.rs
Outdated
// When running on a real sidecar, we need the /dev/tofino device | ||
// to talk to the tofino ASIC. | ||
// TODO: this really needs to be checked periodically rather | ||
// than once at startup. Since the sidecar is in a difference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling nit: "difference" -> "different"
) -> Result<Vec<String>, Error> { | ||
// When running on a real sidecar, we need the /dev/tofino device | ||
// to talk to the tofino ASIC. | ||
// TODO: this really needs to be checked periodically rather |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is certainly true. Is it possible to start the zone unconditionally, and add the device file in as it becomes available? I think Dendrite should be able to communicate a failure to find the device, by returning some error condition to requests which require that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe so, but I'll dig further. Devices in a zone are part of the zone config, which is fixed at zone startup. We could add tfpkt0
to a running zone, as it is a "link" rather than a "device".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There would need to be some OS work, but I don't think it's insurmountable. We'd need to be able to trigger reevaluation of the match rules in the configuration, probably in response to devfsadm or other sysevent activity, and create any missing nodes in the zone /dev
directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a short-term workaround, would it work to reboot the Dendrite zone if a new tofino
device appears? We could monitor for this in the GZ, and react by rebooting the zone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When built for tofino-asic
, dpd
will currently crash if /dev/tofino
isn't present. I can change that easily enough. With that change, then rebooting when the tofino device (dis)appears should work.
Then the question becomes: if we're not keying off the presence of the device, when do we launch a switch
zone? Do we really want that to be running on every gimlet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When built for
tofino-asic
,dpd
will currently crash if/dev/tofino
isn't present. I can change that easily enough. With that change, then rebooting when the tofino device (dis)appears should work.
Sure, though would this change even be necessary? Presumably, the GZ could wait until such a device appears before launching the switch zone.
Then the question becomes: if we're not keying off the presence of the device, when do we launch a
switch
zone? Do we really want that to be running on every gimlet?
I do not think so - I think we want the switch
zone software to exist on all gimlets (in case one is moved to the scrimlet location), but I think we'd prefer to refrain from launching the zone until we have confirmation this is a scrimlet device.
In #1510 I added a hard-coded configuration to make this distinction (if you wanted to test this now), but I do think the decision of "scrimlet vs gimlet" should be dynamically detectable, as discussed in https://github.com/orgs/oxidecomputer/projects/19/views/1
@@ -86,6 +86,7 @@ impl<DL: VnicSource + Clone> VnicAllocator<DL> { | |||
/// communicating with Oxide services. | |||
#[derive(Debug, Clone, Copy, PartialEq)] | |||
pub enum VnicKind { | |||
Physical, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems odd to have a variant of the virtial NIC type that represents a physical NIC. This appears to be bigger than this PR, so I don't think we should block it on fixing this, but we should probably change InstalledZone::install
to accept zero or more PhysicalLink
s as well as the Vnic
s we're currently passing in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic in the zone creation is identical for vnics and physical nics, so I didn't want to complicate the interface for no functional reason. In fact, the zone install()
calls the argument physical_nic: Option<Vnic>
. I'm going to rename Vnic -> Link, which will resolve the existing ambiguity and prevent me from making it worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that change is beyond the scope of this PR. The addition of a Physical
variant to the Vnic
enum just made clear that it's not really the right type or abstraction here. Nothing for you to do, but it might make sense to create an issue to clean this up.
let nic_needed = match self.nic_needed(req) { | ||
Ok(n) => n, | ||
Err(e) => { | ||
info!(self.log, "skipping zone {}: {:?}", req.zone_name, e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be at warn!
or error!
log level? We may also want to specifically say that this is skipped because a TF-port variant zone requires the /dev/tfport
device file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warn!
would be a good idea.
The error message already tells you which zone isn't being created and why with "failed to find device {device}", which I think is all you're asking for.
Rename illumos::vnic to illumos::link to reflect addition of physical nics Update to pull the latest dendrite bits, built with omicron-package 0.5.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me; biggest question is the point raised here: https://github.com/oxidecomputer/omicron/pull/1820/files#r997306229
/// Represents the kind of a VNIC, such as whether it's for guest networking or | ||
/// communicating with Oxide services. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Represents the kind of a VNIC, such as whether it's for guest networking or | |
/// communicating with Oxide services. | |
/// Represents the kind of a Link, such as whether it's for guest networking or | |
/// communicating with Oxide services. |
) -> Result<Vec<String>, Error> { | ||
// When running on a real sidecar, we need the /dev/tofino device | ||
// to talk to the tofino ASIC. | ||
// TODO: this really needs to be checked periodically rather |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a short-term workaround, would it work to reboot the Dendrite zone if a new tofino
device appears? We could monitor for this in the GZ, and react by rebooting the zone.
- Expand the service manager to support initializing the switch zone as a background task. - Reserve the Switch Zone address as the second address in the sled's subnet. The first address is already reserved for the sled agent, but this ensures that any sled can "become" a scrimlet without needing to provision a new IP through Nexus. - This has some fall-out effects - we bump all the other addresses up, which are currently statically allocated. - Miscellaneous cleanups - Identify zones by a "ZoneType" enum, rather than a name. - Incorporate the portions of #1820 responsible for verifying links and devices when launching zones. Fixes oxidecomputer/minimum-upgradable-product#16 Part of oxidecomputer/minimum-upgradable-product#18
- Expand the service manager to support initializing the switch zone as a background task. - Reserve the Switch Zone address as the second address in the sled's subnet. The first address is already reserved for the sled agent, but this ensures that any sled can "become" a scrimlet without needing to provision a new IP through Nexus. - This has some fall-out effects - we bump all the other addresses up, which are currently statically allocated. - Miscellaneous cleanups - Identify zones by a "ZoneType" enum, rather than a name. - Incorporate the portions of #1820 responsible for verifying links and devices when launching zones. Fixes oxidecomputer/minimum-upgradable-product#16 Part of oxidecomputer/minimum-upgradable-product#18
PSA: I've incorporated the salient parts of this PR into #1933 - I think that once that merged, this PR could be marked obsolete. |
- Expand the service manager to support initializing the switch zone as a background task. - Reserve the Switch Zone address as the second address in the sled's subnet. The first address is already reserved for the sled agent, but this ensures that any sled can "become" a scrimlet without needing to provision a new IP through Nexus. - This has some fall-out effects - we bump all the other addresses up, which are currently statically allocated. - Miscellaneous cleanups - Identify zones by a "ZoneType" enum, rather than a name. - Incorporate the portions of #1820 responsible for verifying links and devices when launching zones. Fixes oxidecomputer/minimum-upgradable-product#16 Part of oxidecomputer/minimum-upgradable-product#18
Closed - obsoleted by 1918 and 1933 |
No description provided.