-
Notifications
You must be signed in to change notification settings - Fork 61
Push probe zones to sled-agent instead of pulling them from Nexus #9353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
bnaecker
commented
Nov 6, 2025
- Add APIs in the sled-agent for creating / deleting probes, and have Nexus use them when managing probes from the external API, especially replacing the entire set of probes with a PUT.
- Rework the probe manager to accept the list of expected probes from Nexus, and drive the state toward that, rather than periodically pollling Nexus.
- Add background task for periodically pushing probes to sleds, and omdb innards for reporting its state.
- Closes invert sled-agent's probe manager #9157
08bcc6a to
72fa123
Compare
|
I did some manual testing on my local developer machine. I set up the full Omicron environment (virtual hardware, Nexus asked the sled-agent to create the probes, and it launched the zones with OPTE ports for them: We can also see the new background task in Nexus periodically pushing the full set of probes: And when a probe is deleted, the sled-agent still tears down the zone and the Nexus background task now propagates just the one remaining probe: I did notice that the OPTE port itself is not really gone, but in the I'm not sure exactly why this happened. I haven't modified any of the code that actually deletes the OPTE devices, so presumably this has always been the case. That might be worth more investigation. |
|
Will need to rebase on #9358 to build the TUF repo. |
|
With regards to the OPTE deletion failure, we might want to move this over to the |
- Add APIs in the sled-agent for creating / deleting probes, and have Nexus use them when managing probes from the external API, especially replacing the entire set of probes with a PUT. - Rework the probe manager to accept the list of expected probes from Nexus, and drive the state toward that, rather than periodically pollling Nexus. - Add background task for periodically pushing probes to sleds, and omdb innards for reporting its state. - Closes #9157
72fa123 to
7f2bcd8
Compare
|
I've updated the failing OMDB tests, but this will still fail |
The OpenAPI manager is written to prevent this exact case, heh :) They need to still be in the document but I think in this case it's fine for them to fail with 400 Bad Request or 410 Gone. You can change the documentation to reflect that the endpoints no longer do anything, I believe that is considered "compatible". |
rcgoodfellow
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for taking this on @bnaecker.
I've run commtest a few times in an a4x2 setup and things check out
$ $ pfexec ./target/debug/commtest \
--api-timeout 30m \
http://198.51.100.23 run \
--ip-pool-begin 198.51.100.40 \
--ip-pool-end 198.51.100.70 \
--icmp-loss-tolerance 500 \
--test-duration 200s \
--packet-rate 10
the api is up
logging in ... done
classone project already exists
default ip pool already exists
ip range already exists
getting sled ids ... done
checking if probe0 exists
probe0 already exists
checking if probe1 exists
probe1 already exists
checking if probe2 exists
probe2 already exists
checking if probe3 exists
probe3 does not exist, creating ... done
testing connectivity to probes
addr low avg high last sent received lost
198.51.100.41 0.657 1.378 42.539 2.640 1998 1989 8
198.51.100.40 0.691 1.423 40.797 2.245 1998 1998 0
198.51.100.43 0.808 1.408 40.519 1.672 1998 1818 0
198.51.100.42 0.932 1.592 43.530 1.817 1998 1998 0
all connectivity tests within loss tolerance
Just a few minor comments on the code.
| // | ||
| // # Panics | ||
| // | ||
| // This panics if the `pagparams` is not in ascending order. This method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is no longer true as the assert is commented out below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch thanks. That was from a WIP, I'll remove this and the dead-code you pointed out below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 1bc7af6
| .inner_join( | ||
| vpc::dsl::vpc.on(vpc::dsl::id.eq(vpc_subnet::dsl::vpc_id)), | ||
| ) | ||
| //.filter(probe::dsl::id.gt(marker)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented out filter/order/limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 1bc7af6
|
Thanks @rcgoodfellow for the comments and the extra testing! Very nice to have more confidence on it. |