Handle virtual instruction traps for hypervisor CSRs during KVM initialization #102

Open
abrestic-rivos opened this issue Sep 20, 2022 · 15 comments
Labels: kvm (Needed for the AP-TEE KVM port.)

@abrestic-rivos
Collaborator

KVM will expect to be able to access a few of the hypervisor CSRs (hideleg, hgatp, etc.) at module load, i.e. not part of the run loop (which will use shared-memory for TVM CSR access). Salus needs to handle the resulting virtual instruction traps for these without blowing up the host VM. Most of these accesses we can ignore since we're not doing nested virtualization of legacy VMs and a TSM (regardless of deployment model) will overwrite these CSRs anyway. hgatp is the main exception as we need to allow KVM to discover the MMU mode and VMID bits.
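
Rough sketch of what I'm imagining on the Salus side for the hgatp case (the struct, constants, and helpers below are illustrative placeholders, not existing Salus code): on the virtual instruction trap we echo back a value that reflects the G-stage mode and VMID bits we're willing to expose, and simply discard writes to the CSRs we don't care about.

const HGATP_MODE_SHIFT: u64 = 60;
const HGATP_MODE_SV39X4: u64 = 8; // G-stage Sv39x4 mode encoding
const HGATP_VMID_SHIFT: u64 = 44;
const HGATP_VMID_MASK: u64 = 0x3fff << HGATP_VMID_SHIFT; // up to 14 VMID bits on RV64

// Hypothetical per-host-vCPU shadow state; not an actual Salus type.
struct HostCsrShadow {
    hgatp: u64,
}

impl HostCsrShadow {
    // Emulates a CSRRW to hgatp. KVM discovers the MMU mode and VMID width by
    // writing a candidate value and reading the CSR back, so it's enough to
    // reflect only the capabilities we support.
    fn emulate_hgatp_swap(&mut self, new_val: u64) -> u64 {
        let old = self.hgatp;
        let mode = if (new_val >> HGATP_MODE_SHIFT) != 0 {
            HGATP_MODE_SV39X4 << HGATP_MODE_SHIFT
        } else {
            0
        };
        // Keep only the VMID bits that are actually implemented.
        self.hgatp = mode | (new_val & HGATP_VMID_MASK);
        old
    }

    // Writes to hideleg, hedeleg, hcounteren, etc. can be dropped entirely;
    // the TSM programs the real values before any guest runs.
    fn emulate_ignored_csr_write(&mut self, _csr_num: u16, _val: u64) {}
}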

cc @atishp04

abrestic-rivos self-assigned this Sep 20, 2022
@dgreid
Collaborator

dgreid commented Sep 20, 2022

Can we tell KVM not to read them instead? Providing the MMU mode and VMID bits in a "get_info" call would be more consistent with how other TSM-related things happen.

@abrestic-rivos
Collaborator Author

Can we tell KVM not to read them instead? Providing the MMU mode and VMID bits in a "get_info" call would be more consistent with how other TSM-related things happen.

I would prefer this too, but if we decide that KVM should be able to support multiple "modes" at runtime (e.g. in a TSM-in-HS deployment model one would imagine that KVM could support legacy VMs with native virtualization plus TVMs using the AP-TEE interface) you'd have separate initialization paths where either or both would need to run and discover the capabilities of that particular virtualization mode. That said, this is what pKVM appears to be doing and we'll have to bifurcate most of the other KVM code paths for TEE vs non-TEE VMs anyway. Also, IMO, for PoC purposes it doesn't seem horrible to force KVM into a particular mode at startup.

@atishp04 thoughts?

@atishp04
Contributor

Yes, that was my concern: having multiple divergent paths. We have an opportunity to align on the same path for TEE and nested.
I was thinking we could just add all the H-mode CSRs to the shared memory instead of trap-and-emulate.
The TSM can just choose to ignore the updated values for a TVM. The host hypervisor in nested virtualization will use the shared memory to update the actual CSRs.

Here are the different paths from which this can be invoked:

  1. Legacy VM (native)
  2. Legacy VM (nested)
  3. TVM while host is in VS mode
  4. TVM while host is in HS mode

It would be good to minimize the divergence between 2, 3 & 4. Except for #1, everything else can just rely on the shared-memory approach if it is available. Otherwise, we create three different paths:

#1 Direct H CSRs
#2 Shared memory
#3 & #4 Do not access

From a PoC standpoint, KVM can just ignore the access. No issues with that. I am trying to understand how the final code flow should look across all the cases.
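
To make the shared-memory idea concrete, something like the sketch below is what I have in mind for the per-CPU area (the layout and field names are purely illustrative, not a proposed ABI):

// Illustrative per-(physical-)CPU shared-memory area holding the H-mode CSR
// values the guest hypervisor wants. For nested legacy VMs the host hypervisor
// applies them to the real CSRs; for TVMs the TSM is free to ignore them.
#[repr(C)]
struct HCsrSharedArea {
    hstatus: u64,
    hedeleg: u64,
    hideleg: u64,
    hcounteren: u64,
    hvip: u64,
    htimedelta: u64,
    hgatp: u64,
    hgeie: u64,
    // Bitmap of which fields changed since the last sync call, so the consumer
    // knows what to (re)apply.
    dirty: u64,
}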

@abrestic-rivos
Collaborator Author

Another idea: could we just treat hedeleg, hideleg, hcounteren, and hvip as per-vCPU registers and not touch them in the global kvm_arch_hardware_enable() path? The latter two should probably be per-vCPU anyway.

That still leaves hgatp (for VMID/mode detection) and hgeie (for guest interrupt file detection), but we could probably expose the necessary information via an SBI call in the nested virt and AP-TEE cases.

Speaking of, we'll need some way for KVM to detect that it's running in a nested mode and shouldn't try to access the H* CSRs. (For AP-TEE we can just probe the SBI extension.)
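
For the SBI route, the information KVM needs at init could be as small as the sketch below (the struct is made up for illustration; there's no such SBI definition today):

// Illustrative payload for a hypothetical "get virtualization info" SBI call
// in a nested-virt / AP-TEE extension.
#[repr(C)]
struct VirtInfo {
    // Supported G-stage translation mode, using the hgatp MODE encoding.
    gstage_mode: u64,
    // Number of implemented VMID bits (what KVM otherwise probes via hgatp).
    vmid_bits: u64,
    // Number of guest external interrupt files (what hgeie probing gives).
    geilen: u64,
}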

@atishp04
Contributor

Another idea: could we just treat hedeleg, hideleg, hcounteren, and hvip as per-vCPU registers and not touch them in the global kvm_arch_hardware_enable() path? The latter two should probably be per-vCPU anyway.

That still leaves hgatp (for VMID/mode detection) and hgeie (for guest interrupt file detection), but we could probably expose the necessary information via an SBI call in the nested virt and AP-TEE cases.

That would work, but is there any benefit in doing it this way compared to saving everything in the shared-memory state?
We probably need some sort of sync call for the path once the shared memory is updated.

I feel KVM code would be much cleaner this way.

Speaking of, we'll need some way for KVM to detect that it's running in a nested mode and shouldn't try to access the H* CSRs. (For AP-TEE we can just probe the SBI extension.)

Probing the SBI extension will work for case #3 (in my previous comment). For #4, technically that's incorrect.
Nested mode can be detected using the hstatus.SPV bit, correct?

For the TVM-specific path, we have the VM type anyway.

@abrestic-rivos
Collaborator Author

That would work, but is there any benefit in doing it this way compared to saving everything in the shared-memory state?
We probably need some sort of sync call for the path once the shared memory is updated.

I feel KVM code would be much cleaner this way.

What's the "shared-memory state" in this context? The shared-memory region is per-vCPU; there are no vCPUs yet at KVM initialization. Unless you're proposing the shared-memory region is per physical CPU? (Or both per-physical-CPU and per-vCPU shared memory?)

It might seem cleaner based on the current KVM implementation, but what about the host hypervisor / TSM side? Whatever CSRs it gets blasted with through this other interface (i.e. outside of running a vCPU) it's just going to stash away (or completely discard) until the next time the guest hypervisor goes to run a vCPU. In other words, this just seems like an interface to accommodate the current KVM initialization flow, rather than something that immediately results in CSRs being updated.

Nested mode can be detected using the hstatus.SPV bit, correct?

That implies we're trapping-and-emulating hstatus, which puts us back to where we started :)

How about this for determining available virtualization modes:

  • In the nested virtualization case, the host hypervisor doesn't expose the H extension in the ISA string (or whatever other ISA discovery mechanism).
  • During the guest hypervisor's kvm_arch_init():
    • If H extension is advertised, native virtualization is available and we can proceed as normal for legacy VMs.
    • If H extension is not advertised but a NESTED_VIRT (or whatever) SBI extension is available, we can use that interface to start legacy VMs.
    • If the AP_TEE extension is available, we can start TVMs.

We then discover the capabilities of each mode through the appropriate means: CSR accesses for native virtualization, and SBI calls for AP-TEE / nested virtualization.
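
As a Rust-flavored sketch of that flow (the real change would be C in kvm_arch_init(); the helpers and extension IDs below are placeholders):

enum LegacyVirtMode {
    Native,    // H extension advertised: legacy VMs use the real H CSRs.
    SbiNested, // No H extension, but a NESTED_VIRT-style SBI extension exists.
    None,      // No way to run legacy VMs.
}

const NESTED_VIRT_EXT_ID: u64 = 0; // placeholder; no extension ID is assigned yet
const AP_TEE_EXT_ID: u64 = 0;      // placeholder; no extension ID is assigned yet

fn isa_has_h_extension() -> bool { false }             // stand-in for ISA discovery
fn sbi_probe_extension(_ext_id: u64) -> bool { false } // stand-in for SBI probing

fn detect_virt_modes() -> (LegacyVirtMode, bool) {
    let legacy = if isa_has_h_extension() {
        LegacyVirtMode::Native
    } else if sbi_probe_extension(NESTED_VIRT_EXT_ID) {
        LegacyVirtMode::SbiNested
    } else {
        LegacyVirtMode::None
    };
    // TVM support is orthogonal: available whenever AP-TEE probes successfully.
    let tvm_capable = sbi_probe_extension(AP_TEE_EXT_ID);
    (legacy, tvm_capable)
}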

@dgreid
Collaborator

dgreid commented Sep 20, 2022

I like the direction this is going.

What if AP_TEE is a sub-feature of NESTED_VIRT? It would signal if adding confidential memory is supported. All the other paths remain the same; they need to work in the confidential case, and that should cover the non-confidential case as well. Probably want to use different feature names; NESTED_VIRT => SBI_VIRT with an optional confidential feature would better reflect this potential model.

Certainly have different paths in the kernel KVM code, but only two: one for native, and one for NESTED/TVM. Then do detection similar to abrestic's kvm_arch_init example above, but with the H extension presence determining how VM_CREATE works without the PROTECTED flag set. H extension + nested w/o protected would be an error (kernel's choice which one to use), as we don't have a way to expose that to userspace (and probably don't want to).

@atishp04
Contributor

That would work, but is there any benefit in doing it this way compared to saving everything in the shared-memory state?
We probably need some sort of sync call for the path once the shared memory is updated.
I feel KVM code would be much cleaner this way.

What's the "shared-memory state" in this context? The shared-memory region is per-vCPU; there are no vCPUs yet at KVM initialization. Unless you're proposing the shared-memory region is per physical CPU? (Or both per-physical-CPU and per-vCPU shared memory?)

During the initialization phase, it would be per physical CPU. At runtime, it will be per vCPU. However, I agree it would be overkill to define the shared memory just for the initialization phase; it's only used once per KVM initialization.

It might seem cleaner based on the current KVM implementation, but what about the host hypervisor / TSM side? Whatever CSRs it gets blasted with through this other interface (i.e. outside of running a vCPU) it's just going to stash away (or completely discard) until the next time the guest hypervisor goes to run a vCPU. In other words, this just seems like an interface to accommodate the current KVM initialization flow, rather than something that immediately results in CSRs being updated.

There are also:

htimedelta and hvip, which are accessed in the vcpu_load/put path.
vcpu_load is also called in the reset_vcpu path (for hotplug).

Then there are H-mode CSRs accessed for in-kernel AIA emulation.

Nested mode can be detected using the hstatus.SPV bit, correct?

That implies we're trapping-and-emulating hstatus, which puts us back to where we started :)

How about this for determining available virtualization modes:

  • In the nested virtualization case, the host hypervisor doesn't expose the H extension in the ISA string (or whatever other ISA discovery mechanism).

  • During the guest hypervisor's kvm_arch_init():

    • If H extension is advertised, native virtualization is available and we can proceed as normal for legacy VMs.
    • If H extension is not advertised but a NESTED_VIRT (or whatever) SBI extension is available, we can use that interface to start legacy VMs.
    • If the AP_TEE extension is available, we can start TVMs.

We then discover the capabilities of each mode through the appropriate means: CSR accesses for native virtualization, and SBI calls for AP-TEE / nested virtualization.

Sounds good to me.

@atishp04
Contributor

I like the direction this is going.

What if AP_TEE is a sub-feature of NESTED_VIRT? It would signal if adding confidential memory is supported. All the other paths remain the same; they need to work in the confidential case, and that should cover the non-confidential case as well. Probably want to use different feature names; NESTED_VIRT => SBI_VIRT with an optional confidential feature would better reflect this potential model.

AP-TEE & SBI_VIRT need to be independent of each other to support model 4 (the host running in HS mode), correct?

Certainly have different paths in the kernel KVM code, but only two: one for native, and one for NESTED/TVM. Then do detection similar to abrestic's kvm_arch_init example above, but with the H extension presence determining how VM_CREATE works without the PROTECTED flag set. H extension + nested w/o protected would be an error (kernel's choice which one to use), as we don't have a way to expose that to userspace (and probably don't want to).

@atishp04
Contributor

Maybe you meant that, in addition to regular AP-TEE detection (which will be used for the host in HS), we provide another sub-feature in SBI_VIRT?

@abrestic-rivos
Collaborator Author

What if AP_TEE is a sub-feature of NESTED_VIRT? It would signal if adding confidential memory is supported. All the other paths remain the same; they need to work in the confidential case, and that should cover the non-confidential case as well.

It's a little more complicated than just confidential memory as we also require the VMM to declare which parts of the address space are used for which purpose (confidential vs shared vs MMIO), which I assume is something we don't want to impose on the (non-confidential) nested virtualization case. The device assignment and AIA flows will have some differences as well. But otherwise I agree, there's a lot of overlap.
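
For reference, the kind of per-range declaration I mean is roughly the sketch below (the types and the call are illustrative, not the actual AP-TEE interface):

// Illustrative declaration the VMM would make for each guest-physical range
// while constructing a TVM.
enum RegionKind {
    Confidential, // backed by converted/confidential memory
    Shared,       // non-confidential memory shared with the host
    Mmio,         // emulated or passed-through MMIO
}

struct RegionDecl {
    guest_phys_base: u64,
    len: u64,
    kind: RegionKind,
}

fn declare_region(_tvm_id: u64, _decl: RegionDecl) -> Result<(), ()> {
    // A real implementation would forward this to the TSM via an SBI call.
    Ok(())
}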

htimedelta and hvip, which are accessed in the vcpu_load/put path.
vcpu_load is also called in the reset_vcpu path (for hotplug).

Then there are H-mode CSRs accessed for in-kernel AIA emulation.

Ok, so thinking more about it there are a couple of "interesting" cases revolving around interrupts:

  1. The guest hypervisor chooses not to delegate all VS-level interrupts to its nested guest, e.g. it wants its virtualized hideleg.VSTI to be clear, indicating that it wants to receive its nested-guest timer interrupts even while the guest isn't running.
  2. The guest hypervisor wants to select which nested guests' external interrupts should cause it to trap in hgeie.

Case (1) gets kinda complicated -- the host/TSM would basically have to virtualize Sstc for the guest hypervisor. Fortunately KVM doesn't rely on this, though it's something we might want to solve for. Or maybe we just say that VS-level interrupts always get delegated to the nested guest.

Case (2) is something KVM actually uses, though it seems like it should be easier to virtualize.

Also cc @rsahita

@dgreid
Collaborator

dgreid commented Sep 21, 2022

What if AP_TEE is a sub-feature of NESTED_VIRT? It would signal if adding confidential memory is supported. All the other paths remain the same; they need to work in the confidential case, and that should cover the non-confidential case as well.

It's a little more complicated than just confidential memory as we also require the VMM to declare which parts of the address space are used for which purpose (confidential vs shared vs MMIO), which I assume is something we don't want to impose on the (non-confidential) nested virtualization case. The device assignment and AIA flows will have some differences as well. But otherwise I agree, there's a lot of overlap.

For shared vs confidential => default shared. And for MMIO, the hypervisor already has to have those ranges figured out at some point (and to run a TVM it will have to have them figured out early). So I don't see that as a big imposition.

In general, the cost imposed on the hypervisor/VMM by TVM requirements will have to be paid. Leveraging it for nested seems like good reuse, even if it's slightly more than nested needs.

htimedelta and hvip, which are accessed in the vcpu_load/put path.
vcpu_load is also called in the reset_vcpu path (for hotplug).
Then there are H-mode CSRs accessed for in-kernel AIA emulation.

Ok, so thinking more about it there are a couple of "interesting" cases revolving around interrupts:

  1. The guest hypervisor chooses not to delegate all VS-level interrupts to its nested guest, e.g. it wants its virtualized hideleg.VSTI to be clear, indicating that it wants to receive its nested-guest timer interrupts even while the guest isn't running.
  2. The guest hypervisor wants to select which nested guests' external interrupts should cause it to trap in hgeie.

Case (1) gets kinda complicated -- the host/TSM would basically have to virtualize Sstc for the guest hypervisor. Fortunately KVM doesn't rely on this, though it's something we might want to solve for. Or maybe we just say that VS-level interrupts always get delegated to the nested guest.

Case (2) is something KVM actually uses, though it seems like it should be easier to virtualize.

Also cc @rsahita

@dgreid
Collaborator

dgreid commented Sep 21, 2022

Ok, so thinking more about it there are a couple of "interesting" cases revolving around interrupts:

  1. The guest hypervisor chooses not to delegate all VS-level interrupts to its nested guest, e.g. it wants its virtualized hideleg.VSTI to be clear, indicating that it wants to receive its nested-guest timer interrupts even while the guest isn't running.
  2. The guest hypervisor wants to select which nested guests' external interrupts should cause it to trap in hgeie.

Case (1) gets kinda complicated -- the host/TSM would basically have to virtualize Sstc for the guest hypervisor. Fortunately KVM doesn't rely on this, though it's something we might want to solve for. Or maybe we just say that VS-level interrupts always get delegated to the nested guest.

Is it likely that this would ever be needed for a TVM? For nested I'm OK not supporting that directly. Optimized nested is going to be great for most use cases, but there still might be some non-KVM ones that want full trap-and-emulate with features such as the above.

Case (2) is something KVM actually uses, though it seems like it should be easier to virtualize.

Guest external interrupt traps seem like something we should handle for TVMs too, right?

@abrestic-rivos
Collaborator Author

Is it likely that this would ever be needed for a TVM? For nested I'm OK not supporting that directly. Optimized nested is going to be great for most use cases, but there still might be some non-KVM ones that want full trap-and-emulate with features such as the above.

Yeah, to be clear this doesn't apply to TVMs and it's not something that KVM makes use of presently. I agree that it's something we can say is unsupported for nested virtualization. We aren't advertising the H-extension to the guest hypervisor; we're providing a way to accelerate virtualization via calls to the underlying execution environment, which itself may or may not be making use of the H-extensions. Host hypervisors can instead do trap & emulate if they want to support genuine nested virtualization.

Other than SGEIs (mentioned below) and the H{L,S}V instructions (which we'd accelerate via an SBI call anyway), I'm having a hard time coming up with other examples where a hypervisor would want to modify its own execution environment using the H-extension CSRs.

Guest external interrupt traps seem like something we should handle for TVMs too, right?

Yes, we for sure need to virtualize SGEIs. We could trap & emulate hgeie/hgeip/hie/hip, or we could use SBI calls to set which guest external interrupts should be enabled and check which are pending. I'm tempted to go with the latter, but:

  • We'd have to inject virtual SGEIs to the guest hypervisor (via hvictl). Looks a little weird to the guest hypervisor that we're injecting interrupts for which they can't directly access the corresponding enable bit, but maybe that's ok?
  • The APIs don't really make sense in the TSM-in-HS case since the host can manipulate the CSRs directly in order to decide which TVMs it wants SGEIs from.

Also, for posterity, pasting the TODO I left in vm_cpu.rs about virtualizing SGEIs:

// TODO, HGEIE programming:
//  - Track which guests the host wants interrupts from (by trapping HGEIE accesses from
//    VS level) and update HGEIE[2:] appropriately.
//  - If this is the host: clear HGEIE[1] on entry; inject SGEI into host VM if we receive
//    any SGEI at HS level.
//  - If this is a guest: set HGEIE[1] on entry; switch to the host VM for any SGEI that
//    occur, injecting an SEI for the host interrupts and SGEI for guest VM interrupts.
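
And a very rough sketch of what that programming could look like (the state struct and CSR-write helper are placeholders, not Salus APIs):

struct SgeiState {
    // Guest-interrupt-file bits (HGEIE[2:]) the host VM asked for by writing its
    // virtualized hgeie, captured via the virtual instruction trap.
    host_requested: u64,
    // Whether the vCPU we're about to enter belongs to the host VM.
    entering_host: bool,
}

impl SgeiState {
    fn on_virtual_hgeie_write(&mut self, val: u64) {
        // Bits 0 and 1 are reserved / used for the host VM itself.
        self.host_requested = val & !0b11;
    }

    fn program_hgeie_on_entry(&self) {
        // Host VM: leave HGEIE[1] clear and inject an SGEI into the host VM for
        // any SGEI taken at HS level. Guest VM: set HGEIE[1] so SGEIs destined
        // for the host cause an exit back to it.
        let host_file_bit = if self.entering_host { 0 } else { 1 << 1 };
        write_hgeie(self.host_requested | host_file_bit);
    }
}

// Stand-in for the real CSR write.
fn write_hgeie(_val: u64) {}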

abrestic-rivos added the kvm (Needed for the AP-TEE KVM port.) label Sep 26, 2022
@atishp04
Contributor

@rajnesh-kanwal
