Collaborator

@bnaecker bnaecker commented Nov 6, 2025

  • Add APIs in the sled-agent for creating / deleting probes, and have Nexus use them when managing probes from the external API, especially replacing the entire set of probes with a PUT.
  • Rework the probe manager to accept the list of expected probes from Nexus, and drive the state toward that, rather than periodically polling Nexus.
  • Add background task for periodically pushing probes to sleds, and omdb innards for reporting its state.
  • Closes invert sled-agent's probe manager #9157
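
As a rough illustration of what "drive the state toward that" means, here is a minimal sketch of the set-difference step, assuming probes are identified by a plain integer ID. `ProbeId`, `Plan`, and `plan` are hypothetical names for illustration only; they are not the types used in the sled-agent.

```rust
use std::collections::BTreeSet;

/// Hypothetical probe identifier; the real code uses UUID-based IDs.
type ProbeId = u32;

/// Actions needed to move from the running set to the desired set.
#[derive(Debug, PartialEq)]
struct Plan {
    /// Probes that Nexus expects but that are not running yet.
    to_start: BTreeSet<ProbeId>,
    /// Probes that are running but that Nexus no longer expects.
    to_stop: BTreeSet<ProbeId>,
}

/// Compare what is running against what was pushed and compute the difference.
fn plan(running: &BTreeSet<ProbeId>, desired: &BTreeSet<ProbeId>) -> Plan {
    Plan {
        to_start: desired.difference(running).copied().collect(),
        to_stop: running.difference(desired).copied().collect(),
    }
}
```

Because the caller always sends the full expected set, applying the same list twice yields an empty plan, which is what makes a periodic re-push safe.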

@bnaecker bnaecker marked this pull request as draft November 6, 2025 06:45
@bnaecker bnaecker force-pushed the invert-probe-api-direction branch from 08bcc6a to 72fa123 Compare November 6, 2025 17:57
Collaborator Author

bnaecker commented Nov 6, 2025

I did some manual testing on my local developer machine. I set up the full Omicron environment (virtual hardware, omicron-package install, IP Pool) and then created a few probes manually through the API:

bnaecker@flint : ~/file-cabinet/oxide/oxide.rs $ ./target/release/oxide --profile recovery experimental system probe create --project my-proj --name my-probe --description foo --sled 5824ea91-f422-4270-8182-13f90c448526
WARNING: 644 permissions on "/Users/bnaecker/.config/oxide/credentials.toml" may allow other users to access your login credentials.
{
  "description": "foo",
  "id": "e6a47756-c568-4450-9960-13ee5a787ff5",
  "name": "my-probe",
  "sled": "5824ea91-f422-4270-8182-13f90c448526",
  "time_created": "2025-11-06T17:38:40.346265Z",
  "time_modified": "2025-11-06T17:38:40.346265Z"
}
bnaecker@flint : ~/file-cabinet/oxide/oxide.rs $ ./target/release/oxide --profile recovery experimental system probe list --project my-proj
WARNING: 644 permissions on "/Users/bnaecker/.config/oxide/credentials.toml" may allow other users to access your login credentials.
[
  {
    "external_ips": [
      {
        "first_port": 0,
        "ip": "192.168.1.30",
        "kind": "ephemeral",
        "last_port": 65535
      }
    ],
    "id": "e6a47756-c568-4450-9960-13ee5a787ff5",
    "interface": {
      "id": "68730add-982c-4a73-8652-f3ca719a07b5",
      "ip": "172.30.0.5",
      "kind": {
        "type": "probe",
        "id": "e6a47756-c568-4450-9960-13ee5a787ff5"
      },
      "mac": "A8:40:25:F0:00:00",
      "name": "my-probe",
      "primary": true,
      "slot": 0,
      "subnet": "172.30.0.0/22",
      "vni": 12526184
    },
    "name": "my-probe",
    "sled": "5824ea91-f422-4270-8182-13f90c448526"
  }
]
bnaecker@flint : ~/file-cabinet/oxide/oxide.rs $ ./target/release/oxide --profile recovery experimental system probe create --project my-proj --name my-probe2 --description foo --sled 5824ea91-f422-4270-8182-13f90c448526
WARNING: 644 permissions on "/Users/bnaecker/.config/oxide/credentials.toml" may allow other users to access your login credentials.
{
  "description": "foo",
  "id": "803000cb-c2db-480b-a3f8-bdb4db41731f",
  "name": "my-probe2",
  "sled": "5824ea91-f422-4270-8182-13f90c448526",
  "time_created": "2025-11-06T17:44:34.832957Z",
  "time_modified": "2025-11-06T17:44:34.832957Z"
}

Nexus asked the sled-agent to create the probes, and it launched the zones with OPTE ports for them:

bnaecker@shale : ~/omicron $ zoneadm list
global
sidecar_softnpu
oxz_switch
oxz_internal_dns_3c96ff58-6abe-4a61-8836-b46b96ba0ba5
oxz_internal_dns_053dd812-449c-42ab-9573-c8db6727f13f
oxz_internal_dns_3a7e8f7b-4493-4cad-9324-d10875c786aa
oxz_ntp_1a85949b-bd73-4add-919d-309e2253c9e1
oxz_cockroachdb_0d9cf552-c13e-4931-82fc-aef2fa94e222
oxz_cockroachdb_272e801b-8e88-4ad8-92bd-622565da9f42
oxz_cockroachdb_1f1560bd-60c2-4be6-914c-5565daa2b4d6
oxz_cockroachdb_84ff6527-cea0-42fc-acb5-7067233e408e
oxz_cockroachdb_81608d04-3fec-43e7-a41b-9745747d67e0
oxz_crucible_c32e09b5-02b1-4a35-b6c8-65a99e8ae706
oxz_crucible_90fd9e6f-7815-443b-81a6-b2b0fa14f9bb
oxz_crucible_f910421d-d6a6-4f1a-8720-380b54387e09
oxz_external_dns_7a4ca070-b47b-499d-b198-db60230e9e49
oxz_crucible_fea0037a-7220-4c2b-b3fb-fbfadf2f3a73
oxz_crucible_4c9a9a74-f750-4632-bd20-6431123da399
oxz_nexus_c4804ae6-397b-4090-9195-1b5a71a258ab
oxz_crucible_pantry_907c5a64-ba23-4435-a32a-834b531699b7
oxz_crucible_8eb9528a-09e2-49f0-9af5-0a1333672e05
oxz_crucible_4adb2ed1-985d-40a8-be39-1adc61306951
oxz_crucible_pantry_b1b94151-35b6-4694-bded-698b5b02214f
oxz_nexus_f46b6256-dc2b-462a-a451-4e075d58b154
oxz_nexus_7d0743cc-8c60-40ff-9149-16b9cc51e76d
oxz_crucible_pantry_b77b0907-0afd-4f4a-8b61-3d7e181a2ba2
oxz_crucible_ab607daa-355d-45bd-9a23-de6929d7574d
oxz_crucible_39632cde-2107-4f39-b1c2-01510f747d2e
oxz_oximeter_85e6f8b0-81ee-4bd2-9276-6b6c3327f74b
oxz_external_dns_c562aeef-61e8-45ba-8c80-c07b902edcd6
oxz_clickhouse_cbb69528-8b4a-498c-a0de-d7d66eeb7d94
oxz_probe_e6a47756-c568-4450-9960-13ee5a787ff5
oxz_probe_803000cb-c2db-480b-a3f8-bdb4db41731f
bnaecker@shale : ~/omicron $ pfexec opteadm list-ports
LINK   MAC ADDRESS        IPv4 ADDRESS  EPHEMERAL IPv4  FLOATING IPv4  IPv6 ADDRESS  EXTERNAL IPv6  FLOATING IPv6  STATE
opte0  A8:40:25:FF:9D:65  172.30.3.5    None            None           None          None           None           running
opte1  A8:40:25:FF:81:C0  172.30.1.6    None            192.168.1.21   None          None           None           running
opte2  A8:40:25:FF:FE:87  172.30.2.6    None            192.168.1.23   None          None           None           running
opte3  A8:40:25:FF:9D:97  172.30.2.5    None            192.168.1.22   None          None           None           running
opte4  A8:40:25:FF:F2:90  172.30.1.5    None            192.168.1.20   None          None           None           running
opte5  A8:40:25:FF:C6:E3  172.30.2.7    None            192.168.1.24   None          None           None           running
opte6  A8:40:25:F0:00:00  172.30.0.5    192.168.1.30    None           None          None           None           running
opte7  A8:40:25:F0:00:01  172.30.0.6    192.168.1.31    None           None          None           None           running

We can also see the new background task in Nexus periodically pushing the full set of probes:

bnaecker@shale : ~/omicron $ cargo run --bin omdb -- nexus background-tasks show probe_distributor
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.10s
     Running `target/debug/omdb nexus background-tasks show probe_distributor`
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::b]:12232
task: "probe_distributor"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 16, triggered by a periodic timer firing
    started at 2025-11-06T17:44:34.924Z (50s ago) and ran for 655ms
    succesfully-pushed probes: 1 total
      sled_id=5824ea91-f422-4270-8182-13f90c448526 n_probes=1
    errors while pushing probes: 0 total

bnaecker@shale : ~/omicron $ cargo run --bin omdb -- nexus background-tasks show probe_distributor
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.18s
     Running `target/debug/omdb nexus background-tasks show probe_distributor`
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::b]:12232
task: "probe_distributor"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 17, triggered by a periodic timer firing
    started at 2025-11-06T17:45:34.925Z (2s ago) and ran for 391ms
    succesfully-pushed probes: 2 total
      sled_id=5824ea91-f422-4270-8182-13f90c448526 n_probes=2
    errors while pushing probes: 0 total
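
The per-sled counts above suggest the task groups probes by sled before pushing. A minimal sketch of that grouping step, assuming a simplified `Probe` record with string IDs (`Probe` and `group_by_sled` are hypothetical names, not the Nexus types):

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified probe record; the real types carry UUIDs and
/// networking details.
#[derive(Debug, Clone, PartialEq)]
struct Probe {
    name: String,
    sled_id: String,
}

/// Group the full probe list by sled so that each sled can be sent its
/// complete expected set in one request.
fn group_by_sled(probes: &[Probe]) -> BTreeMap<String, Vec<Probe>> {
    let mut by_sled: BTreeMap<String, Vec<Probe>> = BTreeMap::new();
    for probe in probes {
        by_sled
            .entry(probe.sled_id.clone())
            .or_default()
            .push(probe.clone());
    }
    by_sled
}
```

Note that a sled whose last probe was deleted would not appear in this map at all, so a real implementation would need to handle sending such a sled an empty list.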

And when a probe is deleted, the sled-agent still tears down the zone and the Nexus background task now propagates just the one remaining probe:

bnaecker@shale : ~/omicron $ zoneadm list
global
sidecar_softnpu
oxz_switch
oxz_internal_dns_3c96ff58-6abe-4a61-8836-b46b96ba0ba5
oxz_internal_dns_053dd812-449c-42ab-9573-c8db6727f13f
oxz_internal_dns_3a7e8f7b-4493-4cad-9324-d10875c786aa
oxz_ntp_1a85949b-bd73-4add-919d-309e2253c9e1
oxz_cockroachdb_0d9cf552-c13e-4931-82fc-aef2fa94e222
oxz_cockroachdb_272e801b-8e88-4ad8-92bd-622565da9f42
oxz_cockroachdb_1f1560bd-60c2-4be6-914c-5565daa2b4d6
oxz_cockroachdb_84ff6527-cea0-42fc-acb5-7067233e408e
oxz_cockroachdb_81608d04-3fec-43e7-a41b-9745747d67e0
oxz_crucible_c32e09b5-02b1-4a35-b6c8-65a99e8ae706
oxz_crucible_90fd9e6f-7815-443b-81a6-b2b0fa14f9bb
oxz_crucible_f910421d-d6a6-4f1a-8720-380b54387e09
oxz_external_dns_7a4ca070-b47b-499d-b198-db60230e9e49
oxz_crucible_fea0037a-7220-4c2b-b3fb-fbfadf2f3a73
oxz_crucible_4c9a9a74-f750-4632-bd20-6431123da399
oxz_nexus_c4804ae6-397b-4090-9195-1b5a71a258ab
oxz_crucible_pantry_907c5a64-ba23-4435-a32a-834b531699b7
oxz_crucible_8eb9528a-09e2-49f0-9af5-0a1333672e05
oxz_crucible_4adb2ed1-985d-40a8-be39-1adc61306951
oxz_crucible_pantry_b1b94151-35b6-4694-bded-698b5b02214f
oxz_nexus_f46b6256-dc2b-462a-a451-4e075d58b154
oxz_nexus_7d0743cc-8c60-40ff-9149-16b9cc51e76d
oxz_crucible_pantry_b77b0907-0afd-4f4a-8b61-3d7e181a2ba2
oxz_crucible_ab607daa-355d-45bd-9a23-de6929d7574d
oxz_crucible_39632cde-2107-4f39-b1c2-01510f747d2e
oxz_oximeter_85e6f8b0-81ee-4bd2-9276-6b6c3327f74b
oxz_external_dns_c562aeef-61e8-45ba-8c80-c07b902edcd6
oxz_clickhouse_cbb69528-8b4a-498c-a0de-d7d66eeb7d94
oxz_probe_803000cb-c2db-480b-a3f8-bdb4db41731f
bnaecker@shale : ~/omicron $ pfexec opteadm list-ports
LINK   MAC ADDRESS        IPv4 ADDRESS  EPHEMERAL IPv4  FLOATING IPv4  IPv6 ADDRESS  EXTERNAL IPv6  FLOATING IPv6  STATE
opte0  A8:40:25:FF:9D:65  172.30.3.5    None            None           None          None           None           running
opte1  A8:40:25:FF:81:C0  172.30.1.6    None            192.168.1.21   None          None           None           running
opte2  A8:40:25:FF:FE:87  172.30.2.6    None            192.168.1.23   None          None           None           running
opte3  A8:40:25:FF:9D:97  172.30.2.5    None            192.168.1.22   None          None           None           running
opte4  A8:40:25:FF:F2:90  172.30.1.5    None            192.168.1.20   None          None           None           running
opte5  A8:40:25:FF:C6:E3  172.30.2.7    None            192.168.1.24   None          None           None           running
opte6  A8:40:25:F0:00:00  172.30.0.5    192.168.1.30    None           None          None           None           ready
opte7  A8:40:25:F0:00:01  172.30.0.6    192.168.1.31    None           None          None           None           running
bnaecker@shale : ~/omicron $ cargo run --bin omdb -- nexus background-tasks show probe_distributor
   Compiling illumos-utils v0.1.0 (/home/bnaecker/omicron/illumos-utils)
   Compiling sled-hardware-types v0.1.0 (/home/bnaecker/omicron/sled-hardware/types)
   Compiling nexus-sled-agent-shared v0.1.0 (/home/bnaecker/omicron/nexus-sled-agent-shared)
   Compiling bootstore v0.1.0 (/home/bnaecker/omicron/bootstore)
   Compiling sled-hardware v0.1.0 (/home/bnaecker/omicron/sled-hardware)
   Compiling sled-storage v0.1.0 (/home/bnaecker/omicron/sled-storage)
   Compiling nexus-types v0.1.0 (/home/bnaecker/omicron/nexus/types)
   Compiling sled-agent-types v0.1.0 (/home/bnaecker/omicron/sled-agent/types)
   Compiling sled-agent-client v0.1.0 (/home/bnaecker/omicron/clients/sled-agent-client)
   Compiling sled-agent-zone-images-examples v0.1.0 (/home/bnaecker/omicron/sled-agent/zone-images-examples)
   Compiling nexus-config v0.1.0 (/home/bnaecker/omicron/nexus-config)
   Compiling nexus-client v0.1.0 (/home/bnaecker/omicron/clients/nexus-client)
   Compiling sp-sim v0.1.0 (/home/bnaecker/omicron/sp-sim)
   Compiling nexus-lockstep-client v0.1.0 (/home/bnaecker/omicron/clients/nexus-lockstep-client)
   Compiling nexus-inventory v0.1.0 (/home/bnaecker/omicron/nexus/inventory)
   Compiling nexus-db-model v0.1.0 (/home/bnaecker/omicron/nexus/db-model)
   Compiling omicron-test-utils v0.1.0 (/home/bnaecker/omicron/test-utils)
   Compiling oximeter-producer v0.1.0 (/home/bnaecker/omicron/oximeter/producer)
   Compiling omicron-gateway v0.1.0 (/home/bnaecker/omicron/gateway)
   Compiling gateway-test-utils v0.1.0 (/home/bnaecker/omicron/gateway-test-utils)
   Compiling nexus-db-fixed-data v0.1.0 (/home/bnaecker/omicron/nexus/db-fixed-data)
   Compiling nexus-auth v0.1.0 (/home/bnaecker/omicron/nexus/auth)
   Compiling nexus-db-errors v0.1.0 (/home/bnaecker/omicron/nexus/db-errors)
   Compiling nexus-db-lookup v0.1.0 (/home/bnaecker/omicron/nexus/db-lookup)
   Compiling nexus-db-queries v0.1.0 (/home/bnaecker/omicron/nexus/db-queries)
   Compiling nexus-saga-recovery v0.1.0 (/home/bnaecker/omicron/nexus/saga-recovery)
   Compiling nexus-reconfigurator-preparation v0.1.0 (/home/bnaecker/omicron/nexus/reconfigurator/preparation)
   Compiling omicron-omdb v0.1.0 (/home/bnaecker/omicron/dev-tools/omdb)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 51.96s
     Running `target/debug/omdb nexus background-tasks show probe_distributor`
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::b]:12232
task: "probe_distributor"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 24, triggered by a periodic timer firing
    started at 2025-11-06T17:52:35.426Z (39s ago) and ran for 291ms
    succesfully-pushed probes: 1 total
      sled_id=5824ea91-f422-4270-8182-13f90c448526 n_probes=1
    errors while pushing probes: 0 total

I did notice that the OPTE port itself is not really gone, but in the ready state. Looking at the sled-agent log, we see that the OpteHdl::release_xde() ioctl itself seems to have failed:

17:46:03.801Z INFO SledAgent: removing probe e6a47756-c568-4450-9960-13ee5a787ff5
    file = sled-agent/src/probe_manager.rs:426
    sled_id = 5824ea91-f422-4270-8182-13f90c448526
17:46:03.801Z DEBG SledAgent: stopped tracking zone datalinks
    sled_id = 5824ea91-f422-4270-8182-13f90c448526
    zone_name = oxz_probe_e6a47756-c568-4450-9960-13ee5a787ff5
17:46:03.801Z DEBG SledAgent: removing target
    id = 14698109219025132861
    sled_id = 5824ea91-f422-4270-8182-13f90c448526
17:46:03.801Z DEBG SledAgent (PortManager): Removed OPTE port from manager
    id = 68730add-982c-4a73-8652-f3ca719a07b5
    kind = Probe { id: e6a47756-c568-4450-9960-13ee5a787ff5 }
    port = Port { inner: PortInner(PortData { name: "opte6", ip: 172.30.0.5, mac: MacAddr6([168, 64, 37, 240, 0, 0]), slot: 0, vni: Vni { inner: 12526184 }, subnet: V4(Ipv4Net { addr: 172.30.0.0, width: 22 }), gateway: Gateway { mac: MacAddr6([168, 64, 37, 255, 119, 119]), ip: 172.30.0.1 } }) }
17:46:03.801Z DEBG SledAgent: removed VNIC from tracked links
    link_name = oxControlprobe0
    sled_id = 5824ea91-f422-4270-8182-13f90c448526
17:46:03.801Z DEBG SledAgent: removing target
    id = 15815384577377418323
    sled_id = 5824ea91-f422-4270-8182-13f90c448526
17:46:03.801Z DEBG SledAgent: removed VNIC from tracked links
    link_name = opte6
    sled_id = 5824ea91-f422-4270-8182-13f90c448526
WARNING: Failed to delete the xde device. It must be deleted
            out of band, and it will not be possible to recreate the xde device until then. Error: CommandError(DeleteXde, System { errno: 16, msg: "failed to destroy DLS devnet: 16" })

I'm not sure exactly why this happened. I haven't modified any of the code that actually deletes the OPTE devices, so presumably this has always been the case. That might be worth more investigation.

Collaborator Author

bnaecker commented Nov 6, 2025

Will need to rebase on #9358 to build the TUF repo.

Collaborator Author

bnaecker commented Nov 6, 2025

With regard to the OPTE deletion failure, we might want to move this over to the Destructor object @smklein added a while back. I don't know if we've really made much use of that, but it could at least get us past a transient failure in this case.
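
To make the "transient failure" idea concrete, here is a minimal sketch of a bounded retry on errno 16 (the value in the log above). It is not the Destructor API; `retry_while_busy` and its parameters are hypothetical, and the operation is modeled as a closure returning a raw errno.

```rust
use std::{thread, time::Duration};

/// errno 16, as reported in the sled-agent log (EBUSY).
const EBUSY: i32 = 16;

/// Call `op` until it succeeds, fails with something other than EBUSY, or
/// `max_attempts` is reached. `op` is always called at least once.
/// Returns the number of attempts made on success, or the last errno on failure.
fn retry_while_busy<F>(mut op: F, max_attempts: u32, delay: Duration) -> Result<u32, i32>
where
    F: FnMut() -> Result<(), i32>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(()) => return Ok(attempt),
            // Only a busy device is treated as transient and retried.
            Err(EBUSY) if attempt < max_attempts => {
                thread::sleep(delay);
                attempt += 1;
            }
            // Any other error, or running out of attempts, is final.
            Err(errno) => return Err(errno),
        }
    }
}
```

Whether the failure seen here actually clears on retry is not established by the log; the sketch only shows the shape such handling could take.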

- Add APIs in the sled-agent for creating / deleting probes, and have
  Nexus use them when managing probes from the external API, especially
  replacing the entire set of probes with a PUT.
- Rework the probe manager to accept the list of expected probes from
  Nexus, and drive the state toward that, rather than periodically
  polling Nexus.
- Add background task for periodically pushing probes to sleds, and omdb
  innards for reporting its state.
- Closes #9157
@bnaecker bnaecker force-pushed the invert-probe-api-direction branch from 72fa123 to 7f2bcd8 Compare November 6, 2025 19:50
@bnaecker bnaecker marked this pull request as ready for review November 6, 2025 19:50
Collaborator Author

bnaecker commented Nov 7, 2025

I've updated the failing OMDB tests, but this will still fail test_apis_up_to_date(). I'd like someone to gut check that we can and should simply overwrite the blessed OpenAPI document to the one which removes these APIs. We could also keep them in the document, but have the handlers do nothing or fail with a 501 or similar.

Contributor

iliana commented Nov 7, 2025

I'd like someone to gut check that we can and should simply overwrite the blessed OpenAPI document to the one which removes these APIs. We could also keep them in the document, but have the handlers do nothing or fail with a 501 or similar.

The OpenAPI manager is written to prevent this exact case, heh :) They need to still be in the document, but I think in this case it's fine for them to fail with 400 Bad Request or 410 Gone. You can change the documentation to reflect that the endpoints no longer do anything; I believe that is considered "compatible".

Collaborator Author

bnaecker commented Nov 7, 2025

Thanks @iliana! I've restored the APIs and have them fail in b43358e. This should be good for a review.

Contributor

@rcgoodfellow rcgoodfellow left a comment

Thank you so much for taking this on @bnaecker.

I've run commtest a few times in an a4x2 setup and things check out:

$ pfexec ./target/debug/commtest \
    --api-timeout 30m \
    http://198.51.100.23 run \
    --ip-pool-begin 198.51.100.40 \
    --ip-pool-end 198.51.100.70 \
    --icmp-loss-tolerance 500 \
    --test-duration 200s \
    --packet-rate 10
the api is up
logging in ... done
classone project already exists
default ip pool already exists
ip range already exists
getting sled ids ... done
checking if probe0 exists
probe0 already exists
checking if probe1 exists
probe1 already exists
checking if probe2 exists
probe2 already exists
checking if probe3 exists
probe3 does not exist, creating ... done
testing connectivity to probes
addr            low     avg     high    last    sent    received  lost
198.51.100.41   0.657   1.378   42.539  2.640   1998    1989      8
198.51.100.40   0.691   1.423   40.797  2.245   1998    1998      0
198.51.100.43   0.808   1.408   40.519  1.672   1998    1818      0
198.51.100.42   0.932   1.592   43.530  1.817   1998    1998      0
all connectivity tests within loss tolerance

Just a few minor comments on the code.

//
// # Panics
//
// This panics if the `pagparams` is not in ascending order. This method
Contributor

Looks like this is no longer true as the assert is commented out below?

Collaborator Author

Nice catch, thanks. That was from a WIP; I'll remove this and the dead code you pointed out below.

Collaborator Author

Done in 1bc7af6

.inner_join(
vpc::dsl::vpc.on(vpc::dsl::id.eq(vpc_subnet::dsl::vpc_id)),
)
//.filter(probe::dsl::id.gt(marker))
Contributor

Remove the commented-out filter/order/limit?

Collaborator Author

Done in 1bc7af6

Collaborator Author

bnaecker commented Nov 8, 2025

Thanks @rcgoodfellow for the comments and the extra testing! Very nice to have more confidence in it.
