
[DRAFT] Allocate static Ipv6 addresses for propolis zones #801

Closed
wants to merge 9 commits

Conversation

@jmpesp (Contributor) commented Mar 23, 2022

Add the ability for Nexus to hand out static IPv6 addresses for use by propolis
zones.

TODO still draft

@jmpesp jmpesp requested a review from smklein March 23, 2022 12:49
@jmpesp (Contributor, Author) commented Mar 23, 2022

There are a few XXX markers to deal with before merging; I'd like some suggestions on them, please :)

@smklein smklein requested a review from bnaecker March 23, 2022 15:16
common/src/sql/dbinit.sql (outdated; resolved)
common/src/sql/dbinit.sql (outdated; resolved)
@@ -5,6 +5,12 @@
# Identifier for this instance of Nexus
id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"

[static_v6_allocation_config]
# Subnet to allocate static v6 IPs from
subnet = "fd00:1234::/96"
Collaborator:

I have two questions about this subnet:

  1. What differentiates this from the subnet on the underlay assigned to the sled during rack setup?
  2. It seems like this should be a /64, as with other sled-specific subnetworks, but I'm not 100% clear on the context.

nexus/src/db/datastore.rs (outdated; resolved)
@smklein (Collaborator) left a comment

heh, lots of ben's comments mirror my own. Left 'em in for posterity though.

@@ -5,6 +5,12 @@
# Identifier for this instance of Nexus
id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"

[static_v6_allocation_config]
# Subnet to allocate static v6 IPs from
subnet = "fd00:1234::/96"
Collaborator:

If I'm reading https://rfd.shared.oxide.computer/rfd/0063#_ipv6_addressing correctly:

An availability zone is represented by an IPv6 /48.
A rack and its resources are represented by an IPv6 /56.
A server and its resources are represented by an IPv6 /64.

It's unclear to me whether this should change the allocation scheme right now - we aren't really using the ULAs yet (though maybe we should), but it makes sense to consider a subsection of the /56 "rack-wide" physical address space for this purpose.
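
To make that /48 -> /56 -> /64 carving concrete, here's a rough sketch in Rust; the prefix value and helper name are made up for illustration and aren't from omicron:

use std::net::Ipv6Addr;

/// Hypothetical helper: derive a sled's /64 underlay prefix from an AZ-wide
/// /48 ULA prefix plus a rack index and a sled index, following the
/// /48 -> /56 -> /64 carving described above.
fn sled_prefix(az_prefix: Ipv6Addr, rack: u8, sled: u8) -> Ipv6Addr {
    let mut octets = az_prefix.octets();
    octets[6] = rack; // octet 6 selects the rack within the AZ (/48 -> /56)
    octets[7] = sled; // octet 7 selects the sled within the rack (/56 -> /64)
    Ipv6Addr::from(octets)
}

fn main() {
    let az: Ipv6Addr = "fd00:1122:3344::".parse().unwrap();
    // Sled 3 in rack 1 gets the fd00:1122:3344:103::/64 subnet.
    println!("{}/64", sled_prefix(az, 1, 3));
}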


self.pool()
    .transaction(move |conn| {
        // First, search for released addresses
Collaborator:

I added @bnaecker to this review because this definitely looks like it overlaps with

fn push_select_next_available_ip_subquery(
somewhat.

@@ -277,3 +280,38 @@ async fn cpapi_artifact_download(

Ok(Response::builder().status(StatusCode::OK).body(body.into())?)
}

/// Endpoint used by Sled Agents to allocate static V6 addresses
Collaborator:

I think we may simplify the allocation space by allocating within the context of a sled.

This API implies that we're allocating from an IP space that's basically rack-wide - but if we're allocating within an individual sled's /64 space (which I think we planned to do anyway, according to RFD 63), we'll have less risk of collision, which should make the allocation problem a little easier?

Basically: WDYT about providing an identifier for the Sled as an input parameter here, and to the allocator? Then Nexus can basically make a "per-sled" subnet from which to allocate.
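
For illustration only, the request/response for such a per-sled allocation might look something like this (type and field names are invented, not the actual Nexus internal API):

use std::net::Ipv6Addr;
use uuid::Uuid;

// Hypothetical shapes for a per-sled static-address allocation call.
pub struct StaticV6AllocationRequest {
    // The sled whose /64 underlay subnet the address should come from.
    pub sled_id: Uuid,
    // The resource (e.g. a propolis zone) the address is being allocated for.
    pub owner_id: Uuid,
}

pub struct StaticV6AllocationResponse {
    pub address: Ipv6Addr,
}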

Collaborator:

That's a nice idea, and I think fits better with the rack setup service's behavior too. The operator will choose an AZ and rack number, and then each sled will have a further ID within those. That'll all combine to make an fd00::/64 for each sled, so a per-sled allocation here seems good.

It also suggests that the database table might need the sled UUID in it, I believe.

Collaborator:

Yeah, that tracks. Being able to predictably access all IP addresses allocated to a sled would also be important when we, for example, decommission a sled (or cope with it failing).

nexus/src/db/schema.rs (outdated; resolved)
nexus/src/db/datastore.rs (outdated; resolved)
nexus/src/db/datastore.rs (outdated; resolved)
}

Err(_) => {
// XXX should match on primary key conflict
Collaborator:

@bnaecker 's subnet allocation has an example of this:

// Primary key constraint violation. See notes above.
Some(constraint) if constraint == PRIMARY_KEY_CONSTRAINT => {
    NetworkInterfaceError::DuplicatePrimaryKey(
        interface.identity.id,
    )
}
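
For comparison, a minimal sketch of that pattern using diesel's error types; the error enum and the "primary" constraint name are assumptions for illustration, not code from the linked allocator:

use diesel::result::{DatabaseErrorKind, Error as DieselError};

// Hypothetical error type for this sketch.
pub enum StaticV6AllocError {
    AddressAlreadyAllocated,
    Other(String),
}

// Turn a primary-key (unique) violation on the static_v6_address insert into
// a dedicated error instead of a generic database error. The constraint name
// "primary" is an assumption about CockroachDB's default primary-key name.
pub fn map_insert_error(e: DieselError) -> StaticV6AllocError {
    match e {
        DieselError::DatabaseError(DatabaseErrorKind::UniqueViolation, ref info)
            if info.constraint_name() == Some("primary") =>
        {
            StaticV6AllocError::AddressAlreadyAllocated
        }
        other => StaticV6AllocError::Other(other.to_string()),
    }
}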

Collaborator:

@jmpesp If you find yourself checking specific database error conditions or messages, let me know. I've been prototyping some tooling to make that kind of thing easier. That hasn't been too valuable yet, with only 2 use sites, but if we add more...

nexus/src/db/datastore.rs (outdated; resolved)
sled-agent/src/instance.rs (outdated; resolved)
Comment on lines +3408 to +3420
self.pool()
    .transaction(move |conn| {
Collaborator:

I'm not sure how important this is, but it'll definitely be feasible to allocate addresses in a single query, without a transaction. The push_next_available_ip_subquery function is probably a good place to start. One could use a left outer join with the table itself, joining on the address, and take the first generated address that has no matching row (i.e., where the joined address is NULL).

For example, this query should work OK:

select 'fd00::'::INET + off as addr
from generate_series(0, 10) as off
left outer join static_v6_address on (address = 'fd00::'::INET + off)
where address is NULL
limit 1;

That's basically the same thing we have in the linked function, so it might be nice to factor it out if possible. That takes the first address for which there is no corresponding row already in the static_v6_address table. It's effectively the query you have here in the Rust transaction, just as a single statement. It's still linear in the number of allocated addresses, but requires no transaction or multiple attempts. Hope that helps!

Collaborator:

Building on top of this: I wonder how long we could survive with a (simple) linear allocator that just starts at the last allocated address, rather than always starting from the beginning. This would basically let us comb around the full address space, rather than repeatedly checking the first few addresses.

(Could definitely be a punted-until-later optimization, but seems like it would help both our address allocators)
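
A toy, in-memory sketch of that idea (not the actual allocator, and ignoring persistence): remember the last offset handed out and scan forward from it, wrapping around the sled's /64 rather than always re-checking the lowest addresses.

use std::collections::HashSet;
use std::net::Ipv6Addr;

struct LinearV6Allocator {
    prefix: u128,          // the sled's /64 network bits
    last_offset: u64,      // offset of the most recently handed-out address
    in_use: HashSet<u64>,  // stand-in for the static_v6_address table
}

impl LinearV6Allocator {
    fn allocate(&mut self) -> Option<Ipv6Addr> {
        // Start just past the last allocation and wrap around the /64.
        for step in 1..=u64::MAX {
            let candidate = self.last_offset.wrapping_add(step);
            if self.in_use.insert(candidate) {
                self.last_offset = candidate;
                return Some(Ipv6Addr::from(self.prefix | candidate as u128));
            }
        }
        None // the entire /64 is allocated
    }
}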

Collaborator:

Yeah, that's true. Part of the reason we're using IPv6 is that each sled has an entire /64 to work with. That's...a lot of addresses, so just picking the "next" rather than the lowest unallocated seems OK in the short term. Neither is great in the long term, but the "next after the last allocated address" approach is definitely simpler. The "first fit" approach in the linked function makes more sense there, because we're talking about a customer's VPC Subnet, which can be as small as an IPv4 /26, I believe. That's only 64 addresses, 6 of which we reserve for ourselves!

@bnaecker (Collaborator) left a comment

This is a great start!

Besides my comments in the source, I'm not sure I understand the whole picture here. Are these just for propolis zones? Or for any Oxide service that the sled-agent starts and manages? The other main question is about the actual subnet these come from. IIUC, these should be on the underlay network, which means they must be allocated out of the /64 that the sled gets during rack setup. Not that we know what that subnet is, mind you, just that we know the size is /64, not /96.

@smklein (Collaborator) commented Mar 23, 2022

Besides my comments in the source, I'm not sure I understand the whole picture here. Are these just for propolis zones? Or for any Oxide service that the sled-agent starts and manages?

@jmpesp , please correct me if I'm wrong, but I believe this is attempting to address this chunk of code in instance creation:

let running_zone = RunningZone::boot(installed_zone).await?;
let network = running_zone.ensure_address(AddressRequest::Dhcp).await?;
info!(inner.log, "Created address {} for zone: {}", network, zname);

Currently, when we allocate Propolis instances, we are giving them IPv4 addresses assigned via DHCP.

I think we want the underlay to actually be on IPv6, and chatting with James, I think Propolis has faced some difficulties just trying to reach other IPv6 addresses when it doesn't have one of its own. I thiiiink this particularly popped up when trying to access Crucible services via IPv6 addresses - our workaround had been "allocate an arbitrary IPv6 address to Propolis, so it can speak to other IPv6 addresses", but that doesn't really scale.

@smklein (Collaborator) commented Mar 23, 2022

I'm brainstorming a bit, but:

The proposal for the "bootstrap network" in RFD 63 suggests taking a sequence of unique bits (i.e., a Sled's MAC address) and prepending a well-known prefix to form a unique IPv6 address.

What do you think about doing something similar here?

If we start with:

  • A server and its resources are represented by an IPv6 /64...
  • and we know that MAC addresses are 48 bits long...
  • and we know that each Propolis Zone gets a dedicated VNIC for control-plane communication...

Couldn't we give static addresses to Propolis like the following:

128-bit IPv6 address =
[ 64 bits of Sled-Specific Underlay Prefix ] [ 48 bits of Propolis Zone VNIC MAC address ] [ 16 bits of whatever we want ]

This has some advantages:

  • The address remains unique within the physical network, because of the prefix
  • The address remains unique on the Sled, because of the VNIC's MAC address
  • We don't need to allocate anything / track any of this state in Nexus
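
A rough sketch of assembling an address with the layout above (the helper name is made up); the low 16 bits are what remains after the 64-bit prefix and the 48-bit MAC:

use std::net::Ipv6Addr;

// Illustrative only: combine a sled's /64 underlay prefix, the Propolis
// zone's VNIC MAC, and 16 free low-order bits into a single IPv6 address.
fn propolis_static_addr(sled_prefix: Ipv6Addr, vnic_mac: [u8; 6], low: u16) -> Ipv6Addr {
    let prefix = u128::from(sled_prefix) & !0xffff_ffff_ffff_ffffu128; // keep only the /64
    let mut mac = 0u128;
    for byte in vnic_mac {
        mac = (mac << 8) | byte as u128;
    }
    Ipv6Addr::from(prefix | (mac << 16) | low as u128)
}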

@rmustacc

IPv6 already has a well-formed way of transforming a MAC address into an EUI-64. So what I'd recommend is that, in general, we don't try to allocate any specific ordering of addresses, and instead always conjoin things based on the MAC address. This means the lower 64 bits will always be unique and can simply be appended to the prefix. It still comes down to how the MAC address for an underlying propolis instance is chosen (as opposed to how the guest-visible ones are done), but that just makes it an AZ-wide random value that we want to be unique.
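
For reference, the standard modified EUI-64 construction from RFC 4291 looks roughly like this (a sketch, not code from this PR): insert ff:fe into the middle of the MAC, flip the universal/local bit, and append the result to the sled's /64 prefix.

use std::net::Ipv6Addr;

// Build the low 64 bits from a MAC address per RFC 4291 (modified EUI-64).
fn eui64_addr(prefix: Ipv6Addr, mac: [u8; 6]) -> Ipv6Addr {
    let mut octets = prefix.octets();
    octets[8] = mac[0] ^ 0x02; // flip the universal/local bit
    octets[9] = mac[1];
    octets[10] = mac[2];
    octets[11] = 0xff; // ff:fe inserted between the OUI and the NIC-specific bytes
    octets[12] = 0xfe;
    octets[13] = mac[3];
    octets[14] = mac[4];
    octets[15] = mac[5];
    Ipv6Addr::from(octets)
}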

Add the ability for Nexus to hand out static IPv6 addresses for use by propolis
zones.

Allocate the address as part of the instance create saga, and the
instance migrate saga.

Send to sled-agent as part of the instance ensure request so it can
attach to the control vnic.

TODO: internal API endpoints were left in; are they still required?
TODO: should instance destroy free the IP, or should that go through the sled agent?
TODO: associate the IP with the instance?
@jmpesp (Contributor, Author) commented Mar 24, 2022

Took another swing at this today:

  • addresses are allocated as part of instance create and instance migrate sagas
  • they're sent to the sled agent as part of the instance ensure request

Lots of XXX markers (i.e., things for me to resolve) still remain.

@jmpesp (Contributor, Author) commented Mar 24, 2022

Thoughts for tomorrow: I'm pretty sure control_ip is going to have to be a column in the instance, but I was hoping that the static_v6_address table would be a single location for allocation to avoid having to check many places.

@jmpesp (Contributor, Author) commented Mar 25, 2022

Thoughts for tomorrow: I'm pretty sure control_ip is going to have to be a column in the instance, but I was hoping that the static_v6_address table would be a single location for allocation to avoid having to check many places.

This should have been titled "thoughts for hack night" :)

8aa5773 adds an associated UUID to the row, meaning callers that allocate a static IP now have to supply their UUID. It's difficult to go from this column back to whatever actually allocated the address (read: one has to search every table for the ID), but it means multiple types of resources can allocate IPs for themselves, and conflicts are localized to one table.
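
Roughly, the row shape after that commit (column names are assumed from the discussion, not copied from the schema):

use std::net::Ipv6Addr;
use uuid::Uuid;

// Hypothetical model of a static_v6_address row: the address plus the UUID of
// whatever allocated it. The UUID deliberately isn't a foreign key into any
// one table, so any resource type can allocate addresses for itself.
pub struct StaticV6Address {
    pub address: Ipv6Addr,
    pub owner_id: Uuid,
}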

@jmpesp (Contributor, Author) commented Apr 8, 2022

Obviated by #891, closing

@jmpesp jmpesp closed this Apr 8, 2022
@jmpesp jmpesp deleted the allocate_v6 branch April 8, 2022 20:11
jmpesp added a commit that referenced this pull request Jul 6, 2023
Bump crucible rev to pick up:

- fix build break caused by merging #801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Note that #3456 bumped package-manifest, but this commit bumps both that
and Cargo.toml. All Propolis commits between 04a27573 and 25111a88 are
listed above.
jmpesp added a commit that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging #801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. This will change shortly, so set the MTU of $DATALINK in
the Clickhouse and CockroachDB method scripts instead. Also remove the
code in `RunningZone::boot` that performs commands on self-assembling
zones.
jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging oxidecomputer#801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. Remove the code in `RunningZone::boot` that performs
commands on self-assembling zones, and set the MTU of $DATALINK in the
Clickhouse and CockroachDB method scripts.
jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging oxidecomputer#801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. Remove the code in `RunningZone::boot` that performs
commands on self-assembling zones, and set the MTU of $DATALINK in the
Clickhouse and CockroachDB method scripts.

Fixes oxidecomputer#3512
leftwo pushed a commit that referenced this pull request Oct 25, 2024
Crucible changes
Add test and fix for replay race condition (#1519)
Fix clippy warnings (#1517)
Add edition to `crucible-workspace-hack` (#1516)
Split out Downstairs-specific stats (#1511)
Move remaining `GuestWork` functionality into `Downstairs` (#1510)
Track both jobs and bytes in each IO state (#1507)
Fix new `rustc` and `clippy` warnings (#1509)
Remove IOP/BW limits (for now) (#1506)
Move `GuestBlockRes` into `DownstairsIO` (#1502)
Update actions/checkout digest to eef6144 (#1499)
Update Rust crate hyper-staticfile to 0.10.1 (#1411)
Turn off test-up-2region-encrypted.sh (#1504)
Add `IOop::Barrier` (#1494)
Fix IPv6 addresses in `crutest` (#1503)
Add region set options to more tests. (#1496)
Simplify `CompleteJobs` (#1493)
Removed ignored CI jobs (#1497)
Minor cleanups to `print_last_completed` (#1501)
Remove remaining `Arc<Volume>` instances (#1500)
Add `VolumeBuilder` type (#1492)
remove old unused scripts (#1495)
More multiple region support. (#1484)
Simplify matches (#1490)
Move complete job tracker to a helper object (#1489)
Expand summary and add documentation references to the README. (#1486)
Remove `GuestWorkId` (2/2) (#1482)
Remove `JobId` from `DownstairsIO` (1/2) (#1481)
Remove unused `#[derive(..)]` (#1483)
Update more tests to use dsc (#1480)
Crutest now Volume only (#1479)

Propolis changes
manually impl Deserialize for PciPath for validation purposes (#801)
phd: gate OS-specific tests, make others more OS-agnostic (#799)
lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795)
add marker trait to help check safety of guest memory reads (#794)
clippy fixes for 1.82 (#796)
lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782)
PHD: write efivars in one go (#786)
PHD: support guest-initiated reboot (#785)
server: accept CPUID values in instance specs and plumb them to bhyve (#780)
PHD: allow patched Crucible dependencies (#778)
server: add a first-class error type to machine init (#777)
PciPath to Bdf conversion is infallible; prove it and refactor (#774)
instance spec rework: flatten InstanceSpecV0 (#767)
Make PUT /instance/state 503 when waiting to init
Less anxiety-inducing `Vm::{get, state_watcher}`
leftwo added a commit that referenced this pull request Oct 30, 2024
Crucible changes
Add test and fix for replay race condition (#1519)
Fix clippy warnings (#1517)
Add edition to `crucible-workspace-hack` (#1516)
Split out Downstairs-specific stats (#1511)
Move remaining `GuestWork` functionality into `Downstairs` (#1510)
Track both jobs and bytes in each IO state (#1507)
Fix new `rustc` and `clippy` warnings (#1509)
Remove IOP/BW limits (for now) (#1506)
Move `GuestBlockRes` into `DownstairsIO` (#1502)
Update actions/checkout digest to eef6144 (#1499)
Update Rust crate hyper-staticfile to 0.10.1 (#1411)
Turn off test-up-2region-encrypted.sh (#1504)
Add `IOop::Barrier` (#1494)
Fix IPv6 addresses in `crutest` (#1503)
Add region set options to more tests. (#1496)
Simplify `CompleteJobs` (#1493)
Removed ignored CI jobs (#1497)
Minor cleanups to `print_last_completed` (#1501)
Remove remaining `Arc<Volume>` instances (#1500)
Add `VolumeBuilder` type (#1492)
remove old unused scripts (#1495)
More multiple region support. (#1484)
Simplify matches (#1490)
Move complete job tracker to a helper object (#1489)
Expand summary and add documentation references to the README. (#1486)
Remove `GuestWorkId` (2/2) (#1482)
Remove `JobId` from `DownstairsIO` (1/2) (#1481)
Remove unused `#[derive(..)]` (#1483)
Update more tests to use dsc (#1480)
Crutest now Volume only (#1479)

Propolis changes
manually impl Deserialize for PciPath for validation purposes (#801)
phd: gate OS-specific tests, make others more OS-agnostic (#799)
lib: log vCPU diagnostics on triple fault and for some unhandled exit types (#795)
add marker trait to help check safety of guest memory reads (#794)
clippy fixes for 1.82 (#796)
lib: move cpuid::Set to cpuid_utils; prevent semantic subleaf conflicts (#782)
PHD: write efivars in one go (#786)
PHD: support guest-initiated reboot (#785)
server: accept CPUID values in instance specs and plumb them to bhyve (#780)
PHD: allow patched Crucible dependencies (#778)
server: add a first-class error type to machine init (#777)
PciPath to Bdf conversion is infallible; prove it and refactor (#774)
instance spec rework: flatten InstanceSpecV0 (#767)
Make PUT /instance/state 503 when waiting to init
Less anxiety-inducing `Vm::{get, state_watcher}`

---------

Co-authored-by: Alan Hanson <alan@oxide.computer>