Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

omicron-package uninstall leaves a bunch of networking stuff around #1212

Closed
davepacheco opened this issue Jun 15, 2022 · 5 comments
Closed

Comments

@davepacheco
Copy link
Collaborator

Summary

While preparing for the demo on lab machine "sock" the other week, I tried tearing down the stack and building it up again to make sure that the setup stuff works as expected. With a bunch of help in chat, we found that there's nothing today that tears down a bunch of stuff, which causes other teardown scripts to fail. The summary (mostly thanks to @bnaecker) was that omicron-package uninstall should:

  • delete IP interfaces created by Sled Agent (these seem to be prefixed with net)
  • delete all service zone VNICs, which are prefixed by ox
  • delete all guest VNICs, which look like vopteX
  • delete all xde devices. This could be a library call or shell out to opteadm delete-xde.
  • unload the xde driver, in the absence of an xde ioctl that does the opposite of opteadm set-xde-underlay

It makes me a little nervous to have automation automatically remove stuff by prefix, especially a generic prefix like net, but I don't know enough about how this state is managed to suggest a safer approach.

How you know you've hit this

Most of this was found by trying to run destroy_virtual_hardware.sh until it succeeded. Until these issues are resolved, you'll see errors like this:

 dladm delete-vnic net0
dladm: vnic deletion failed: link busy
+ RC=1
+ [[ 1 -eq 0 ]]
+ warn 'Failed to delete VNIC link net0'
+ echo -e '\e[1;31mFailed to delete VNIC link net0\e[0m'
Failed to delete VNIC link net0

Workarounds

While I was setting up the stack, I worked around these issues by:

  • finding the IP interfaces and L2 devices that need to be removed with ipadm and dladm show-link, respectively
  • manually removing IP interfaces that start with net using pfexec ipadm delete-if net0
  • manually removing the "vopte" VNICs with pfexec dladm delete-vnic vopte0
  • manually removing the "opte" xde devices with pfexec /opt/oxide/opte/bin/opteadm delete-xde opte0
  • manually unloading the "opte" kernel module. Use modinfo | grep xde to find its id (first column) and then pfexec modunload -i <id> to unload it.

note: if you leave the pfexec off the ipadm delete-if command, you may then run into illumos bug 14724 (since fixed, but "sock" has not been updated as of this writing).

@bnaecker
Copy link
Collaborator

I'm taking a look at this now. For the most part, the bullet-list at the top is good. Those addresses and links are created by different parts of the sled agent, and we can just remove them. There's now one outlier though, which is the linklocal address created as part of the integration with Maghemite. As part of #1249, we now create a link-local address on a physical link on the system. All of the other addresses the sled agent creates are either over the etherstub VNIC (e.g., underlay0/linklocal) or the VNICs created by create_virtual_hardware.sh (e.g., net0/linklocal). In contrast, I now see igb0/linklocal, which is created by the bootstrap server and provided as an argument to the Maghemite SMF service for use in advertising routes.

This makes it a bit harder for the uninstall process to cleanup. We can still remove

  • all addresses like underlay0/{addr}
  • the VNIC underlay0
  • all Oxide "control" VNICS, e.g. oxControlService0
  • the etherstub stub0
  • the OPTE VNICs, e.g., vopte0
  • the OPTE port itself, e.g, opte0
  • remove all addresses over the simulated Chelsios, e.g., net0/linklocal

What do we do with igb0/linklocal? This was created by getting either a datalink specified in the sled agent config at smf/sled-agent/config.toml. I don't believe anyone actually uses that however, and it defaults to the first enumerated physical link on the system. Do we use that same logic (what's in the config file or first link) to find an address to delete?

I think it might make more sense to have the bootstrap agent use a different link. In particular, the net{0,1} VNICs that are supposed to stand in for the Chelsio NICs for OPTE. I feel like we should use that approach more completely, and have all networking resources in the control plane (at least those that are there now) created over these.

@jgallagher
Copy link
Contributor

I think it might make more sense to have the bootstrap agent use a different link. In particular, the net{0,1} VNICs that are supposed to stand in for the Chelsio NICs for OPTE. I feel like we should use that approach more completely, and have all networking resources in the control plane (at least those that are there now) created over these.

Agreed, and I created the igb0/linklocal address based on maghemite ultimately wanting to be given cxgbeN/linklocal, not realizing that net0/net1 are the stand-ins for cxgbe0/cxgbe1. I can take a stab at changing the maghemite setup to share logic with opte (find_chelsio_links(), and maybe also the creation of the linklocal address objects from them).

@bnaecker
Copy link
Collaborator

bnaecker commented Jul 1, 2022

That's perfect, thanks John.

@bnaecker
Copy link
Collaborator

bnaecker commented Jul 5, 2022

Note: Remove this reference to this issue once #1352 is merged.

@bnaecker
Copy link
Collaborator

bnaecker commented Jul 8, 2022

Resolved by #1352

@bnaecker bnaecker closed this as completed Jul 8, 2022
leftwo pushed a commit that referenced this issue Apr 30, 2024
Propolis:
Update oximeter dependency to pull in automatic producer registration (#689)
Propagate ReplaceResult up; return disk status (#687)
Enable clippy warnings for lossless casts
Update rustls deps for CVE-2024-32650
migration: refrain from offering all pages when possible (#682)

Crucible:
DTrace probes for IO on/off the network (#1284)
Update oximeter dep to pull in automatic producer registration (#1279)
Remove `ReadResponse` in favor of `RawReadResponse` (#1212)
Fix typo in DTrace upstairs_info (#1276)
replace needing no work should not be an error (#1275)
Add some DTrace scripts to the package. (#1274)
More Pantry updates for Region replacement (#1269)
Send the correct task count for reconciliations (#1271)
Raw extent cleanup (#1268)
leftwo added a commit that referenced this issue Apr 30, 2024
Propolis:
Update oximeter dependency to pull in automatic producer registration (#689)
Propagate ReplaceResult up; return disk status (#687)
Enable clippy warnings for lossless casts
Update rustls deps for CVE-2024-32650
migration: refrain from offering all pages when possible (#682)

Crucible:
DTrace probes for IO on/off the network (#1284)
Update oximeter dep to pull in automatic producer registration (#1279)
Remove `ReadResponse` in favor of `RawReadResponse` (#1212)
Fix typo in DTrace upstairs_info (#1276)
replace needing no work should not be an error (#1275)
Add some DTrace scripts to the package. (#1274)
More Pantry updates for Region replacement (#1269)
Send the correct task count for reconciliations (#1271)
Raw extent cleanup (#1268)

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants