Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sled-agent] Launch switch zone automatically on scrimlets #1933

Merged
merged 11 commits into from
Nov 11, 2022

Conversation

smklein
Copy link
Collaborator

@smklein smklein commented Nov 10, 2022

  • Expand the service manager to support initializing the switch zone as a background task.
  • Reserve the Switch Zone address as the second address in the sled's subnet. The first address is already reserved for the sled agent, but this ensures that any sled can "become" a scrimlet without needing to provision a new IP through Nexus.
    • This has some fall-out effects - we bump all the other addresses up, which are currently statically allocated.
  • Miscellaneous cleanups

Fixes https://github.com/oxidecomputer/minimum-upgradable-product/issues/16
Part of https://github.com/oxidecomputer/minimum-upgradable-product/issues/18

- Expand the service manager to support initializing the switch zone as
a background task.
- Reserve the Switch Zone address as the second address in the sled's
subnet. The first address is already reserved for the sled agent, but
this ensures that any sled can "become" a scrimlet without needing to
provision a new IP through Nexus.
  - This has some fall-out effects - we bump all the other addresses up,
  which are currently statically allocated.
- Miscellaneous cleanups
  - Identify zones by a "ZoneType" enum, rather than a name.
  - Incorporate the portions of
  #1820 responsible for
  verifying links and devices when launching zones.

Fixes oxidecomputer/minimum-upgradable-product#16
Part of oxidecomputer/minimum-upgradable-product#18
@smklein smklein changed the title Launch switch zone automatically when /dev/tofino detected [sled-agent] Launch switch zone automatically when /dev/tofino detected Nov 10, 2022
@smklein smklein added the Sled Agent Related to the Per-Sled Configuration and Management label Nov 10, 2022
@smklein smklein changed the title [sled-agent] Launch switch zone automatically when /dev/tofino detected [sled-agent] Launch switch zone automatically on scrimlets Nov 10, 2022
@smklein
Copy link
Collaborator Author

smklein commented Nov 10, 2022

Throwing @jgallagher on the review list too, since this also impacts when MGS gets launched

Base automatically changed from scrimlet-autodetect to main November 10, 2022 20:11
sled-agent/src/services.rs Outdated Show resolved Hide resolved
sled-agent/src/services.rs Show resolved Hide resolved
sled-agent/src/sled_agent.rs Show resolved Hide resolved
}
(SwitchZone::Initializing { initializer, .. }, None) => {
info!(log, "Disabling switch zone (was initializing)");
initializer.abort();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this introduce a chance for the switch zone to be orphaned mid-startup? If we're inside the self.initialize_zone(&request).await point inside initialize_switch_zone and we abort the task, will the zone only be torn down if we've made it far enough to have instantiated a RunningZone that will perform cleanup when it's dropped?

Copy link
Collaborator Author

@smklein smklein Nov 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both this ensure_switch function and the initialize_switch_zone task perform operations on the "zone" while holding onto the self.inner.switch_zone lock, so the zone should "fully exist (in SwitchZone::Running) or not exist" at any point that we try to abort that initializer task.

That was my intent at least; does that look right to you?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debating using https://docs.rs/tokio/latest/tokio/sync/oneshot/index.html when implementing this, to avoid any concerns of cancellation safety, if this seems too perilous to you - I could go down that route if this seems unsafe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed that we were holding the lock the whole time zone initialization is running. I think this looks right then, but just to make sure my understanding matches yours: The only ways we can get to this branch to abort the initializer is if either

a) someone calls ensure_switch(Some(_)) and then immediately calls ensure_switch(None), and the second ensure_switch happens to acquire the lock before the initialize_switch_zone() spawned by the first call can

b) zone initialization fails, in which case initialize_switch_zone() sleeps for 1 second before retrying (so any call to ensure_switch(None) during that sleep would hit this branch)

In either case, the abort is only cancelling a task that isn't doing anything meaningful (it either hasn't started yet or is sleeping between retries).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what's here is safe because of the lock, but I think it would be clearer with message passing. Holding locks across .await points is somewhat confusing to me personally. Most of this is just personal preference though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added 282baee - I believe this is functionally the same, but it operates by asking the worker task to terminate, rather than invoking abort.

Does this seem preferable to y'all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like that better, but maybe just because I'm wary of cancellation issues in general? If others liked the abort version better I'd be fine going back to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit clearer now, yes. It makes it very explicit inside the task that it will only cancel at this point (or at least unless something else calls abort :-) )

Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I don't personally have any major concerns.

sled-agent/src/services.rs Show resolved Hide resolved
sled-agent/src/sled_agent.rs Show resolved Hide resolved
}
(SwitchZone::Initializing { initializer, .. }, None) => {
info!(log, "Disabling switch zone (was initializing)");
initializer.abort();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what's here is safe because of the lock, but I think it would be clearer with message passing. Holding locks across .await points is somewhat confusing to me personally. Most of this is just personal preference though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sled Agent Related to the Per-Sled Configuration and Management
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants