Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

service zones may have device/link requirements #1820

Closed
wants to merge 3 commits into from
Closed

Conversation

Nieuwejaar
Copy link
Contributor

No description provided.

  - Add support for mapping physical devices into a service zone
  - Add support for mapping existing links into a service zone
  - Don't start the switch zone if necessary devices are missing
Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments, but overall looks good!

// When running on a real sidecar, we need the /dev/tofino device
// to talk to the tofino ASIC.
// TODO: this really needs to be checked periodically rather
// than once at startup. Since the sidecar is in a difference
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling nit: "difference" -> "different"

) -> Result<Vec<String>, Error> {
// When running on a real sidecar, we need the /dev/tofino device
// to talk to the tofino ASIC.
// TODO: this really needs to be checked periodically rather
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is certainly true. Is it possible to start the zone unconditionally, and add the device file in as it becomes available? I think Dendrite should be able to communicate a failure to find the device, by returning some error condition to requests which require that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe so, but I'll dig further. Devices in a zone are part of the zone config, which is fixed at zone startup. We could add tfpkt0 to a running zone, as it is a "link" rather than a "device".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There would need to be some OS work, but I don't think it's insurmountable. We'd need to be able to trigger reevaluation of the match rules in the configuration, probably in response to devfsadm or other sysevent activity, and create any missing nodes in the zone /dev directory.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a short-term workaround, would it work to reboot the Dendrite zone if a new tofino device appears? We could monitor for this in the GZ, and react by rebooting the zone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When built for tofino-asic, dpd will currently crash if /dev/tofino isn't present. I can change that easily enough. With that change, then rebooting when the tofino device (dis)appears should work.

Then the question becomes: if we're not keying off the presence of the device, when do we launch a switch zone? Do we really want that to be running on every gimlet?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When built for tofino-asic, dpd will currently crash if /dev/tofino isn't present. I can change that easily enough. With that change, then rebooting when the tofino device (dis)appears should work.

Sure, though would this change even be necessary? Presumably, the GZ could wait until such a device appears before launching the switch zone.

Then the question becomes: if we're not keying off the presence of the device, when do we launch a switch zone? Do we really want that to be running on every gimlet?

I do not think so - I think we want the switch zone software to exist on all gimlets (in case one is moved to the scrimlet location), but I think we'd prefer to refrain from launching the zone until we have confirmation this is a scrimlet device.

In #1510 I added a hard-coded configuration to make this distinction (if you wanted to test this now), but I do think the decision of "scrimlet vs gimlet" should be dynamically detectable, as discussed in https://github.com/orgs/oxidecomputer/projects/19/views/1

@@ -86,6 +86,7 @@ impl<DL: VnicSource + Clone> VnicAllocator<DL> {
/// communicating with Oxide services.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VnicKind {
Physical,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems odd to have a variant of the virtial NIC type that represents a physical NIC. This appears to be bigger than this PR, so I don't think we should block it on fixing this, but we should probably change InstalledZone::install to accept zero or more PhysicalLinks as well as the Vnics we're currently passing in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in the zone creation is identical for vnics and physical nics, so I didn't want to complicate the interface for no functional reason. In fact, the zone install() calls the argument physical_nic: Option<Vnic>. I'm going to rename Vnic -> Link, which will resolve the existing ambiguity and prevent me from making it worse.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that change is beyond the scope of this PR. The addition of a Physical variant to the Vnic enum just made clear that it's not really the right type or abstraction here. Nothing for you to do, but it might make sense to create an issue to clean this up.

let nic_needed = match self.nic_needed(req) {
Ok(n) => n,
Err(e) => {
info!(self.log, "skipping zone {}: {:?}", req.zone_name, e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be at warn! or error! log level? We may also want to specifically say that this is skipped because a TF-port variant zone requires the /dev/tfport device file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warn! would be a good idea.

The error message already tells you which zone isn't being created and why with "failed to find device {device}", which I think is all you're asking for.

Rename illumos::vnic to illumos::link to reflect addition of physical nics
Update to pull the latest dendrite bits, built with omicron-package 0.5.1
Copy link
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me; biggest question is the point raised here: https://github.com/oxidecomputer/omicron/pull/1820/files#r997306229

Comment on lines 85 to 86
/// Represents the kind of a VNIC, such as whether it's for guest networking or
/// communicating with Oxide services.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Represents the kind of a VNIC, such as whether it's for guest networking or
/// communicating with Oxide services.
/// Represents the kind of a Link, such as whether it's for guest networking or
/// communicating with Oxide services.

) -> Result<Vec<String>, Error> {
// When running on a real sidecar, we need the /dev/tofino device
// to talk to the tofino ASIC.
// TODO: this really needs to be checked periodically rather
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a short-term workaround, would it work to reboot the Dendrite zone if a new tofino device appears? We could monitor for this in the GZ, and react by rebooting the zone.

smklein added a commit that referenced this pull request Nov 10, 2022
- Expand the service manager to support initializing the switch zone as
a background task.
- Reserve the Switch Zone address as the second address in the sled's
subnet. The first address is already reserved for the sled agent, but
this ensures that any sled can "become" a scrimlet without needing to
provision a new IP through Nexus.
  - This has some fall-out effects - we bump all the other addresses up,
  which are currently statically allocated.
- Miscellaneous cleanups
  - Identify zones by a "ZoneType" enum, rather than a name.
  - Incorporate the portions of
  #1820 responsible for
  verifying links and devices when launching zones.

Fixes oxidecomputer/minimum-upgradable-product#16
Part of oxidecomputer/minimum-upgradable-product#18
smklein added a commit that referenced this pull request Nov 10, 2022
- Expand the service manager to support initializing the switch zone as
a background task.
- Reserve the Switch Zone address as the second address in the sled's
subnet. The first address is already reserved for the sled agent, but
this ensures that any sled can "become" a scrimlet without needing to
provision a new IP through Nexus.
  - This has some fall-out effects - we bump all the other addresses up,
  which are currently statically allocated.
- Miscellaneous cleanups
  - Identify zones by a "ZoneType" enum, rather than a name.
  - Incorporate the portions of
  #1820 responsible for
  verifying links and devices when launching zones.

Fixes oxidecomputer/minimum-upgradable-product#16
Part of oxidecomputer/minimum-upgradable-product#18
@smklein
Copy link
Collaborator

smklein commented Nov 10, 2022

PSA: I've incorporated the salient parts of this PR into #1933 - I think that once that merged, this PR could be marked obsolete.

smklein added a commit that referenced this pull request Nov 11, 2022
- Expand the service manager to support initializing the switch zone as a background task.
- Reserve the Switch Zone address as the second address in the sled's subnet. The first address is already reserved for the sled agent, but this ensures that any sled can "become" a scrimlet without needing to provision a new IP through Nexus.
  - This has some fall-out effects - we bump all the other addresses up, which are currently statically allocated.
- Miscellaneous cleanups
  - Identify zones by a "ZoneType" enum, rather than a name.
  - Incorporate the portions of #1820 responsible for verifying links and devices when launching zones.

Fixes oxidecomputer/minimum-upgradable-product#16
Part of oxidecomputer/minimum-upgradable-product#18
@Nieuwejaar
Copy link
Contributor Author

Closed - obsoleted by 1918 and 1933

@Nieuwejaar Nieuwejaar closed this Nov 12, 2022
@Nieuwejaar Nieuwejaar deleted the tfport branch November 12, 2022 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants