add CPU platforms to instances #8728
RFD 505 proposes that instances should be able to set a "minimum hardware platform" or "minimum CPU platform" that allows users to constrain an instance to run on sleds that have a specific set of CPU features available. This allows a user to opt a VM into advanced hardware features (e.g. AVX-512 support) by constraining it to run only on sleds that support those features. For this to work, Nexus needs to understand what CPUs are present in which sleds. Have sled-agent query CPUID to get CPU vendor and family information and report this to Nexus as part of the sled hardware manifest.
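A sketch of the kind of decoding this implies, assuming the AMD APM's family/model encoding in CPUID leaf 1 EAX (the function and struct names here are illustrative, not the actual sled-agent code):

```rust
// Illustrative decode of CPUID leaf 1 EAX into family/model/stepping,
// per the AMD APM. Not the actual sled-agent code.
#[derive(Debug, PartialEq)]
struct DecodedFamily {
    family: u32,
    model: u32,
    stepping: u32,
}

fn decode_leaf1_eax(eax: u32) -> DecodedFamily {
    let stepping = eax & 0xF;
    let base_model = (eax >> 4) & 0xF;
    let base_family = (eax >> 8) & 0xF;
    let ext_model = (eax >> 16) & 0xF;
    let ext_family = (eax >> 20) & 0xFF;
    // Per the APM, the extended fields only apply when the base family
    // is 0xF - which it always is on Zen-era AMD parts.
    let (family, model) = if base_family == 0xF {
        (base_family + ext_family, (ext_model << 4) | base_model)
    } else {
        (base_family, base_model)
    };
    DecodedFamily { family, model, stepping }
}

fn main() {
    // 0x00A00F11 is a leaf-1 EAX as reported by a Milan (Zen 3) part:
    // family 0x19, model 0x01, stepping 1.
    let d = decode_leaf1_eax(0x00A00F11);
    assert_eq!(d, DecodedFamily { family: 0x19, model: 0x01, stepping: 0x1 });
    println!("{:x?}", d);
}
```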
fb1e0a7 to 620d7c9
the existing plumbing was sufficient for sled-agent to report the CPU family at startup, but did not provide the CPU family when Nexus calls later for inventory collections. when you've upgraded to this version, the database migration sets the sled CPU family to `unknown`, expecting that the next inventory collection will figure things out. this doesn't happen, and the initial check-in doesn't update the CPU type either (presumably because the sled is already known and initialized from the control plane's perspective?) this does... most of the plumbing to report a sled's CPU family for inventory collection, but it doesn't actually work.

`SledCpuFamily` being in both `omicron-common` and `nexus-client` is kind of unworkable. probably need a `ConvertInto` or something to transform the shared type into the `nexus-client` one when needed..? i've been trying to figure out what exactly is necessary and what is just building a mess for myself for two hours, and this feels like it's going nowhere.
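For what it's worth, the `ConvertInto` idea could look something like a plain `From` impl; the enum definitions below are hypothetical stand-ins for the real types in `omicron-common` and the progenitor-generated `nexus-client`:

```rust
// Hypothetical sketch of the conversion the comment is reaching for:
// a `From` impl between the shared type and the generated client type.
// Both modules here are stand-ins; the real definitions live in their
// respective crates.
mod omicron_common {
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum SledCpuFamily {
        Unknown,
        AmdMilan,
        AmdTurin,
    }
}

mod nexus_client {
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum SledCpuFamily {
        Unknown,
        AmdMilan,
        AmdTurin,
    }
}

impl From<omicron_common::SledCpuFamily> for nexus_client::SledCpuFamily {
    fn from(f: omicron_common::SledCpuFamily) -> Self {
        use omicron_common::SledCpuFamily as Shared;
        match f {
            Shared::Unknown => Self::Unknown,
            Shared::AmdMilan => Self::AmdMilan,
            Shared::AmdTurin => Self::AmdTurin,
        }
    }
}

fn main() {
    let shared = omicron_common::SledCpuFamily::AmdMilan;
    let client: nexus_client::SledCpuFamily = shared.into();
    assert_eq!(client, nexus_client::SledCpuFamily::AmdMilan);
}
```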
iximeow left a comment
finally got to taking a fine-tooth comb through the CPUID bits here and the differences between hardware and what guests currently see. for the most part this is in line with what guests already get from bhyve defaults, but i've noticed a few typos that unsurprisingly do not pose an issue booting at least Alpine guests. i'll clean that up and update 314 appropriately tomorrow.
nexus/src/app/instance_platform.rs (outdated)
```rust
// See [RFD 314](https://314.rfd.oxide.computer/) section 6 for all the
// gnarly details.
const MILAN_CPUID: [CpuidEntry; 32] = [
    cpuid_leaf!(0x0, 0x0000000D, 0x68747541, 0x444D4163, 0x69746E65),
```
a guest currently sees eax=0x10 here, where leaves 0xe, 0xf, and 0x10 are all zeroes. 0xe is reserved and zero on the host. 0xf and 0x10 are zeroed because of the default bhyve masking behavior. setting eax=0xd is just more precise: leaves 0xf and 0x10 being "present" but zero does not communicate any feature presence.
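For reference, the other registers in that leaf-0 entry spell out the vendor string (a quick sketch; the helper name is mine):

```rust
// The CPUID leaf-0 vendor string is laid out across EBX, EDX, ECX
// (in that order), little-endian. Helper name is illustrative.
fn vendor_string(ebx: u32, ecx: u32, edx: u32) -> String {
    let mut bytes = Vec::with_capacity(12);
    for reg in [ebx, edx, ecx] {
        bytes.extend_from_slice(&reg.to_le_bytes());
    }
    String::from_utf8(bytes).unwrap()
}

fn main() {
    // The register values from the MILAN_CPUID leaf-0 entry quoted above.
    let s = vendor_string(0x68747541, 0x444D4163, 0x69746E65);
    assert_eq!(s, "AuthenticAMD");
    // ...and EAX = 0x0000000D caps the highest standard leaf at 0xD.
}
```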
i know we reference RFD 314 but it kinda seems like it would be nice to have the comment explaining this here?
i went a bit hard on the comment and now there's like 600 more lines of more-a-code-than-a-comment (https://github.com/oxidecomputer/omicron/pull/8728/files#diff-0c754f739cd972f46b539d2a2e5a6220cd0b72e8c9fe7f20a2013fab3f28aa21R167)
nexus/src/app/instance_platform.rs (outdated)
```rust
        0xB, 0x0, 0x00000001, 0x00000002, 0x00000100, 0x00000000
    ),
    cpuid_subleaf!(
        0xB, 0x1, 0x00000000, 0x00000000, 0x00000201, 0x00000000
```
this is the same as what RFD 314 says, but eax ought to be different in the RFD. Propolis fixes up B.1.EBX here to the number of vCPUs, but an eax of 0 implies that the subleaf is actually invalid (from the APM: "If this function is executed with an unimplemented level (passed in ECX), the instruction returns all zeroes in the EAX register."). also, taken faithfully, this implies the VM's topology is vCPU-many sockets with SMT pairs across each socket pair. oops.
we should probably update the RFD to reflect that? also lol Oxide is the ONLY platform that lets you make a 64-socket VM...
well, it turns out the RFD is fine here, there's a later sentence that says "Index 1's eax and ebx depend on the VM shape". But Propolis is wrong here: oxidecomputer/propolis#939
So I'm going to hide this leaf in the proposed Milan v1 platform, fix the Propolis bug, and we can expose leaf Bh in a v2.
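To make the subleaf encoding above concrete, here's a small decode of the quoted values, assuming the standard leaf-0xB layout (EAX[4:0] = x2APIC ID shift to the next level, EBX[15:0] = logical processors at this level, ECX[15:8] = level type, where 1 = SMT and 2 = core); the helper is illustrative:

```rust
// Illustrative decode of CPUID leaf 0xB subleaves:
// returns (x2apic_id_shift, logical_processors, level_type).
fn decode_leaf_b(eax: u32, ebx: u32, ecx: u32) -> (u32, u32, u8) {
    (eax & 0x1F, ebx & 0xFFFF, ((ecx >> 8) & 0xFF) as u8)
}

fn main() {
    // Subleaf 0 from the quoted table: shift 1, 2 threads, type 1 (SMT).
    assert_eq!(decode_leaf_b(0x00000001, 0x00000002, 0x00000100), (1, 2, 1));
    // Subleaf 1: type 2 (core), but EAX = 0 is what the APM reserves for
    // an *unimplemented* subleaf - the bug discussed above.
    assert_eq!(decode_leaf_b(0x00000000, 0x00000000, 0x00000201), (0, 0, 2));
}
```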
RFD 505 proposes that instances should be able to set a "minimum hardware platform" or "minimum CPU platform" that allows users to constrain an instance to run on sleds that have a specific set of CPU features available. Previously, actually-available CPU information was plumbed from sleds to Nexus. This actually adds a `min_cpu_platform` setting for instance creation and uses it to drive selection of guest CPUID leaves.

As-is, this moves VMs on Gimlets away from the bhyve-default CPUID leaves (which are effectively "host CPUID information, but features that are not or cannot be virtualized are masked out") and instead uses the specific CPUID information set out in RFD 505. There is no provision for Turin yet, which instead gets CPUID leaves that look like Milan. Adding a set of CPUID information to advertise for an `amd_turin` CPU platform, from here, is fairly straightforward.

This does not have a mechanism to enforce specific CPU platform use or disuse, either in a silo or rack-wide. One could imagine a simple system oriented around "this silo is permitted to specify these minimum CPU platforms", but that leaves uncomfortable questions. For example: if silo A permits only Milan, silo B permits Milan and Turin, all Milan CPUs are already allocated, and someone is attempting to create a new Milan-based VM in silo A, should that succeed by using Turin CPUs, potentially starving silo B?
…platform as defined in Nexus
620d7c9 to 5cf7b9c
For a reason I can't figure out, when this was

```sql
AND sled.cpu_family IN (...)
```

binding a parameter made that query end up like

```sql
AND sled.cpu_family IN ($1,)
```

which then expects a single `sled_cpu_family` enum for later binding. Binding an array then yielded a type error like

```
DatabaseError(Unknown, "invalid cast: sled_cpu_family[] -> sled_cpu_family")
```

... but writing it like `AND sled.cpu_family = ANY (...)` does not get the extra comma and instead renders like `ANY ($1)`, which happily takes a `sled_cpu_family[]`.
hawkw left a comment
I'm happy to leave my approval on this now --- I like the use of raw-cpuid. I will freely admit, however, that my disposition towards reviewing all the CPUID bits that get constructed here is basically "take ixi's word for it". I think it could be worthwhile to get @pfmooney to take a look at this change as well?
```diff
-// TODO(gjc): eww. the correct way to do this is to write this as
-//
-//     "AND sled.cpu_family = ANY ("
-//
-// and then just have one `param` which can be bound to a
-// `sql_types::Array<SledCpuFamilyEnum>`
-if let Some(families) = sled_families {
-    query.sql(" AND sled.cpu_family IN (");
-    for i in 0..families.len() {
-        if i > 0 {
-            query.sql(", ");
-        }
-        query.param();
-    }
-    query.sql(")");
+if sled_families.is_some() {
+    query.sql(" AND sled.cpu_family = ANY (");
+    query.param();
+    query.sql(") ");
```
you love to see it
```rust
/// feature-compatible ahead of time. Instead, this is to check details like
/// "`clflush` operates on the same number of words".
#[allow(dead_code)]
pub fn functionally_same(base: CpuIdDump, target: CpuIdDump) -> bool {
```
Broadly: i wonder if we want logging or to return a Result or something here, so that we can explain why two CPUIDs are not "functionally_same". At least we could have it return a Result<(), &'static str> or something that says "clflush operates on different sized cache lines" or whatever
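A sketch of what that could look like, with `CpuIdDump` stubbed as a plain map and a single clflush-line-size check standing in for the real comparisons (all names and checks here are illustrative, not the Omicron code):

```rust
// Sketch of hawkw's suggestion: return Result<(), String> naming the
// first mismatch. `CpuIdDump` is stubbed as a map from (leaf, subleaf)
// to [eax, ebx, ecx, edx]; the real type lives in instance_platform.rs.
use std::collections::BTreeMap;

type CpuIdDump = BTreeMap<(u32, u32), [u32; 4]>;

// CLFLUSH line size is leaf 1 EBX[15:8], in units of 8 bytes.
fn clflush_size(dump: &CpuIdDump) -> Option<u32> {
    dump.get(&(0x1, 0x0)).map(|regs| ((regs[1] >> 8) & 0xFF) * 8)
}

fn functionally_same(base: &CpuIdDump, target: &CpuIdDump) -> Result<(), String> {
    let (b, t) = (clflush_size(base), clflush_size(target));
    if b != t {
        return Err(format!(
            "clflush operates on different sized cache lines: {:?} vs {:?}",
            b, t
        ));
    }
    // ... further checks (TSC invariance, address sizes, etc.) would go here ...
    Ok(())
}

fn main() {
    let mut milan = CpuIdDump::new();
    // EBX[15:8] = 8 quadwords, i.e. a 64-byte clflush line.
    milan.insert((0x1, 0x0), [0x00A00F11, 0x0000_0800, 0, 0]);
    let mut other = milan.clone();
    // EBX[15:8] = 4 quadwords: a 32-byte line, so not functionally the same.
    other.insert((0x1, 0x0), [0x00A00F11, 0x0000_0400, 0, 0]);
    assert!(functionally_same(&milan, &milan).is_ok());
    assert!(functionally_same(&milan, &other).is_err());
}
```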
```rust
/// The Platonic ideal Milan. This is what we would "like" to define as "The
/// Milan vCPU platform" absent any other constraints. This is a slightly
/// slimmer version of the Milan platform defined in RFD 314, with
/// justifications there.
```
World's Normalest Milan
```rust
/// where possible. This CPUID configuration as-is is untested; guests may not
/// boot, this may be too reductive, etc.
fn milan_ideal() -> CpuIdDump {
    let mut cpuid = CpuId::with_cpuid_reader(CpuIdDump::new());
```
because this whole function is just constructing a pile of bits, it really feels like it should be able to be done in const-eval, but it looks like the raw-cpuid API won't let you do that. ah well.
yeah, i think it'd be reasonable to flip this and have the const arrays be the source of truth as-is. but it feels like we're closer than not to needing to poke a bit here and there based on some VM configuration so that'd have limited utility.
on the upside, i think most of the bit twiddling here gets mushed together when optimized, so this is something like 20 allocs, ~800 stores, and maybe 30 map inserts? i haven't really looked, though.
```rust
// Extended APIC space support was originally provided to guests because the
// host supports it and it was passed through. The extended space is not
// supported in Bhyve, but we leave it set here to not change it from under
// guests.
//
// Bhyve now supports all six performance counters, so we could set the perf
// counter extension bit here, but again it is left as-is to not change
// CPUID from under a guest.
//
// RDTSCP requires some Bhyve and Propolis work to support, so it is masked
// off for now.
```
extended APIC support? that's a quirk. performance counter extension? quirk. RDTSCP? believe it or not, straight to quirks.
```rust
// Entry order does not actually matter. Sort here because it's fast (~30-35
// leaves) and looking at the vec in logs or on the wire is *so* much nicer.
```
👍
```rust
let migrate_url =
    format!("/instances/{}/migrate", &instance_id.to_string());
let instance = NexusRequest::new(
    RequestBuilder::new(internal_client, Method::POST, &migrate_url)
        .body(Some(&InstanceMigrateRequest {
            dst_sled_id: dst_sled_id.into_untyped_uuid(),
        }))
        .expect_status(Some(StatusCode::OK)),
)
.authn_as(AuthnMode::PrivilegedUser)
.execute()
.await
.unwrap()
.parsed_body::<Instance>()
.unwrap();
```
considered not important: there's enough of these now that i kinda wonder if we should make a test helper for it.
yeah, we've got like half an informal API client in instances.rs too. i kinda wanna take a pass over the whole file and split it up a bit..
```rust
// If the Turin-requiring instance isn't on the Turin sled, either the
// default simulated sled is now Turin-compatible or something is very
// wrong.
```
could we perhaps construct a default simulated sled and assert it isn't Turin? so that we know if the test failed for that reason?
that'd involve either skipping nexus_test here (really don't want to do that for this test; it'll definitely end up with weird test-environment skew that way) or plumbing parameters through nexus_test (which i'm lukewarm on if we'd do it in exactly one test.. i'm not sure if it's useful anywhere else). so that's how i ended up at the assert to point future-us the right way :(
FFXSR is used by Windows Server 2022 if it is advertised, and causes it to die very early in boot: the feature is not supported by illumos bhyve, and we return a #GP if the guest sets the EFER bit. WBNOINVD is also masked by default bhyve CPUID masking, though this one is probably fine to pass through. Either way, because it is typically masked, guests using this bit are basically untested (though as some data points, Windows/Linux/illumos seem fine on bhyve with it passed through..)
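For illustration, masking those two bits would look something like the following, assuming the APM bit positions (FFXSR in Fn8000_0001 EDX bit 25, WBNOINVD in Fn8000_0008 EBX bit 9); the function name is hypothetical:

```rust
// Illustrative masking of the two bits discussed. Bit positions are my
// reading of the AMD APM; the function name is hypothetical.
const FFXSR_BIT: u32 = 1 << 25; // Fn8000_0001 EDX[25]
const WBNOINVD_BIT: u32 = 1 << 9; // Fn8000_0008 EBX[9]

fn mask_unsupported(leaf: u32, regs: [u32; 4]) -> [u32; 4] {
    let [eax, ebx, ecx, edx] = regs;
    match leaf {
        // Hide FFXSR: illumos bhyve #GPs if the guest sets the EFER bit.
        0x8000_0001 => [eax, ebx, ecx, edx & !FFXSR_BIT],
        // Hide WBNOINVD to match default bhyve masking.
        0x8000_0008 => [eax, ebx & !WBNOINVD_BIT, ecx, edx],
        _ => [eax, ebx, ecx, edx],
    }
}

fn main() {
    let masked = mask_unsupported(0x8000_0001, [0, 0, 0, u32::MAX]);
    assert_eq!(masked[3] & FFXSR_BIT, 0);
}
```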
I should mention here: I've taken the [...] and they've all at least come up and tolerated me puttering around in a shell.
when run on its own (at least on my workstation), the test instance is always placed on the first test sled with a known CPU type. then, migration fails because there is no available sled with a compatible CPU, thus 507. when run with other tests (such as... in CI), the test instance may be placed on the unknown-CPU test sled, in which case it is unmigratable regardless of capacity by virtue of its VMM having an unknown CPU profile.
this materializes RFD 314 and in some respects, 505.

builds on #8725 for CPU family information, which is a stand-in for the notion of sled families and generations described in RFD 314. There are a few important details here where CPU platforms differ from the sled CPU family, and I've refreshed RFD 314 (and 505) as appropriate.

We can (and do, in RFD 314) define Milan restrictively enough that we can present Turin (and probably later!) CPUs to guests "as if" they were Milan. Similarly I'd expect that Turin would be defined as roughly "Milan-plus-some-AVX-512-features" and pretty forward-compatible. Importantly these are related to but not directly representative of real CPUs; as an example I'd expect "Turin"-the-instance-CPU-platform to be able to run on a Turin Dense CPU. Conversely, there's probably not a reason _to_ define a "Turin Dense" CPU platform since from a guest perspective they'd look about the same.

But at the same time the lineage through the AMD server part family kind of splits at Zen 4, with Zen 4- vs Zen 4c-based parts, and similarly with Zen 5/c. It's somewhat hard (I think) to predict what workloads would be sensitive to this. And as #8730 gets into a bit, the details of a processor's packaging (core topology, frequency, cache size) can vary substantially even inside one CPU family. The important part here is that we do not expect CPU platforms to cover these details, and it would probably be cumbersome to try; if the instance's constraint is "I want AVX256, and I want to be on high-frequency-capable processors only", then it doesn't actually matter if it's run on a Turin or a Milan, and tying it to that CPU platform may be overly restrictive.

On instance CPU platforms, the hope is that by focusing on CPU features we're able to present a more linear path as the microarchitectures grow.

I've walked back the initial description of an instance's CPU platform as the "minimum CPU platform". As present in other systems, "minimum CPU platform" would more analogously mean "can we put you on a Rome Gimlet or must we put you on a Milan Gimlet?", or "Genoa Cosmo vs Turin Cosmo?" - it doesn't seem _possible_ to say "this instance must have AVX 512, but otherwise I don't care what kind of hardware it runs on", but that's more what _we mean_ by CPU platform.

In a "minimum CPU platform" interpretation, we _could_ provide a bunch of Turin CPUID bits to a VM that said it wanted Milan. But since there's no upper bound here, if an OS has an issue with a future "Zen 14" or whatever, a user would discover that by their "minimum-Milan" instance getting scheduled on the new space-age processor and exploding on boot or something. OSes _shouldn't_ do that, but...

Implementation-wise, this is really just about the names right now. You always get Milan CPUID leaves for the time being. When there are Turin CPUID leaves defined for the instance CPU platform, and Cosmos on which they make sense, this becomes more concrete.

RFD 314 has a section now, and I've added a stub function, covering some more obvious ways that CPU platforms would be *incompatible*. This is particularly fraught if we consider being incorrect about topology an incompatibility, but even setting that aside, several bits in CPUID are descriptive of architectural behaviors and are not easily (or at all) able to be emulated.

`functionally_same()` and the CPUID profiles here may be fated to move out of Omicron and into another crate which can be shared with Propolis, where it can ensure that a requested profile is consistent with the hardware on which Propolis would create a VM (not to mention test uses).

---------

Co-authored-by: Greg Colombo <greg@oxidecomputer.com>
I'd not noticed the sheer volume of comments on https://github.com/oxidecomputer/rfd/pull/515, so I'm taking a pass through those, and the exact bits in `MILAN_CPUID` may be further tweaked. I suspect the fixed array needs at least a few more tweaks anyway: cross-referencing RFD 314 turns out to make for awkward review, and it's hard to eyeball the semantics of bits here (or which are to be filled in by some later component of the stack!). As-is: I think this would be OK to merge, but it is not quite as polished as I'd like it to be, so it's a real PR but I expect further changes.
"are these CPU platforms compatible?"
RFD 314 deserves a section outlining how we determine a new CPU can be a stand-in for an older platform, talking about this and other leaves (particularly if we pass topology information through, elsewhere). I've added a few words about the problem of a CPUID profile being reasonable to present on some hardware's different CPUID profile, but I've also outlined what I've spotted as potential incompatibilities in the currently-dead `functionally_same()`. This is one of those things that will have the shape of "very boring and straightforward for eight years, and then a bit will change (or be defined) that's kind of annoying".