
PV MMU Design #13

Open
bysui opened this issue Oct 8, 2024 · 1 comment

Comments


bysui commented Oct 8, 2024

Background

For the shadow MMU, all guest page tables are write-protected in order to intercept modifications to them. Each modification therefore triggers a #PF VM-exit and an instruction emulation, resulting in poor performance when the guest modifies its page tables frequently, such as when a guest application frequently allocates and frees memory.

The synchronized/unsynchronized pages optimization improves on this: an L1 guest page table is allowed to remain writable after a write page fault and is write-protected again when the guest performs a TLB flush. This reduces the number of #PFs and emulations when the guest modifies multiple gptes in an L1 page table. However, it also has drawbacks, including the following:

  1. All upper-level SPs must be marked as unsynchronized when the L1 SP is allowed to be unsynchronized. This may increase lock hold time when there are multiple processes in a Linux guest, all of which share the kernel-mapping PGDs: a modification to the kernel mapping marks all root SPs as unsynchronized, and each root SP must be synchronized again when it is loaded, since root SP synchronization only marks the current root SP as synchronized.
  2. INVLPG emulation only synchronizes the modified gptes, but the SP remains marked as unsynchronized. If the guest uses INVLPG to flush the TLB one entry at a time, only one vCPU actually needs to perform spte synchronization, while the other vCPUs usually do nothing after acquiring the MMU lock. This increases MMU lock contention.

Purpose

First, not all page table modifications need to be notified; only changes that require a TLB flush afterwards matter, i.e., only permission demotions.

Second, inspired by the synchronized/unsynchronized pages optimization, page table modifications do not need to be reported to the hypervisor immediately. They can be delayed until the subsequent TLB flush, which allows the guest to cache the modifications and commit them together, reducing the number of notifications (hypercalls).

Finally, without write protection, the guest must notify the hypervisor to free an SP when it frees the corresponding page table. Otherwise, KVM quickly reaches its SP limit and reclaims SPs frequently, leading to poor performance.

Design

  • Pagetable operations
    All page table modifications in the guest must go through the specific PTE operation functions.

    • set_pgd/set_p4d/set_pud/set_pmd/set_pte
      These functions should be used when the guest modifies a PGD/P4D/PUD/PMD/PTE entry. If the modification requires a TLB flush later, the gptep can be cached for later commitment.
    • release_pgd/release_p4d/release_pud/release_pmd/release_pte
      These functions should be used when the guest frees page table memory.
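
    A minimal sketch of what the guest-side set_pte path could look like (names such as pv_set_pte and is_demotion are hypothetical, not from this design; the buffer length is shortened for illustration). Only permission demotions are cached, since only they require a later TLB flush:

    ```c
    #include <stdint.h>
    #include <stddef.h>

    #define _PAGE_PRESENT (1ULL << 0)
    #define _PAGE_RW      (1ULL << 1)
    #define _PAGE_NX      (1ULL << 63)

    #define PV_MMU_PTEPS_BUFFER_LEN 64      /* shortened for the sketch */

    struct pv_mmu_buffer {
        uint64_t pteps[PV_MMU_PTEPS_BUFFER_LEN];
    };

    static struct pv_mmu_buffer buf;
    static size_t buf_head;

    /* A demotion is any change that can leave a stale, more-permissive TLB entry. */
    static int is_demotion(uint64_t old_pte, uint64_t new_pte)
    {
        if ((old_pte & _PAGE_PRESENT) && !(new_pte & _PAGE_PRESENT))
            return 1;                       /* P  -> NP */
        if ((old_pte & _PAGE_RW) && !(new_pte & _PAGE_RW))
            return 1;                       /* RW -> RO */
        if (!(old_pte & _PAGE_NX) && (new_pte & _PAGE_NX))
            return 1;                       /* X  -> NX */
        return 0;
    }

    /* Hypothetical PV set_pte: write the gpte, cache its gptep for later commit. */
    static void pv_set_pte(uint64_t *ptep, uint64_t new_pte)
    {
        uint64_t old_pte = *ptep;

        *ptep = new_pte;
        if (is_demotion(old_pte, new_pte) && buf_head < PV_MMU_PTEPS_BUFFER_LEN)
            buf.pteps[buf_head++] = (uint64_t)(uintptr_t)ptep;
    }
    ```

    Promotions (e.g., RO -> RW) bypass the buffer entirely, since a stale, less-permissive TLB entry is corrected by the resulting #PF anyway.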
  • Lazy mode
    Following the synchronized/unsynchronized pages design, the guest can cache the modified gpteps in its ring buffer while modifying the page table. During TLB flushing, all cached gpteps are committed to the hypervisor, which can then synchronize the associated spteps directly.

    • Global ring buffer vs Per-CPU ring buffer
      Although the TLB is per-CPU, the page table is shared between all CPUs. This argues for a global ring buffer: when one CPU performs a TLB flush, all cached gpteps should be committed and their spteps synchronized, so that the CPU sees PTE changes made by other CPUs as well as the updated shadow page table.

      A global ring buffer complicates things and requires protection semantics. Therefore, we are considering a per-CPU ring buffer, where each vCPU caches the modified gpteps in its own ring buffer. A vCPU must commit its cached gpteps before sending IPIs to other vCPUs for TLB shootdown.

      However, this only works when PTE modification and TLB flushing are paired atomically on the same CPU. The Linux kernel's memory management has TLB-flush deferral mechanisms (e.g., mmu_gather and batched unmap TLB flush) that require extra changes to work correctly. When one CPU defers a TLB flush and marks it as pending, another CPU in the munmap/mprotect path checks the pending status if it sees that the PTE is not present, and if a flush is pending it attempts to flush the TLB for that memory. However, the CPU that modified the page table may not have that mm loaded, so it cannot receive the TLB-flush IPI; its cached gpteps are therefore never committed, the shadow page table is not updated, and that CPU still observes a stale TLB. Therefore, a CPU that modifies the page table of an mm other than the one it currently has loaded should either commit the gpteps immediately, or add itself to the TLB-flush IPI target set so that the CPU performing the flush can send it the IPI.
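
      The required ordering for the per-CPU buffer can be sketched as follows (function names are hypothetical, and the hypercall is stubbed out so the commit-before-IPI ordering is observable outside a guest):

      ```c
      static int committed;       /* cached gpteps committed to the hypervisor? */
      static int ipi_sent;
      static int ordering_ok;     /* IPIs only went out after the commit */

      /* Stub standing in for the KVM_HC_PV_MMU_SET_PTE hypercall. */
      static void pv_mmu_commit_buffer(void)
      {
          committed = 1;
      }

      static void send_tlb_shootdown_ipis(void)
      {
          ordering_ok = committed;    /* the commit must already have happened */
          ipi_sent = 1;
      }

      /*
       * Hypothetical PV flush_tlb_others: commit this vCPU's cached gpteps
       * first, so remote vCPUs observe an up-to-date shadow page table as
       * soon as they handle the shootdown IPI.
       */
      static void pv_flush_tlb_others(void)
      {
          pv_mmu_commit_buffer();
          send_tlb_shootdown_ipis();
      }
      ```

      Reversing the two calls would let a remote vCPU complete its flush against a shadow page table that has not yet seen the demotions, which is exactly the stale-TLB hazard described above.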

  • Detect and Setup

    • Detect:
      • CPUID
        KVM_FEATURE_PV_MMU
    • Setup:
      • MSR_KVM_PV_MMU
        A new virtual MSR records the GPA of the per-CPU ring buffer; one bit indicates that PV MMU mode is enabled.
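
      One plausible MSR layout (an assumption for illustration; the issue only says one bit indicates enablement, and the MSR index below is a placeholder, not an allocated KVM MSR) reuses the low bit of the page-aligned buffer GPA as the enable flag:

      ```c
      #include <stdint.h>

      #define MSR_KVM_PV_MMU  0x4b564dff   /* placeholder MSR index */
      #define PV_MMU_ENABLED  (1ULL << 0)

      static uint64_t fake_msr;   /* stands in for wrmsrl() in this sketch */

      /* Enable PV MMU mode: the buffer GPA is page-aligned, so bit 0 is free. */
      static void pv_mmu_enable(uint64_t buffer_gpa)
      {
          fake_msr = buffer_gpa | PV_MMU_ENABLED;
      }

      /* Hypervisor side: recover the buffer GPA from the written value. */
      static uint64_t pv_mmu_buffer_gpa(uint64_t msr_val)
      {
          return msr_val & ~PV_MMU_ENABLED;
      }
      ```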
  • Ring buffer
    For simplicity, the first version implements the buffer as a simple linear buffer rather than a true ring buffer (such as the perf ring buffer).

#define PV_MMU_PTEPS_BUFFER_LEN	(PAGE_SIZE / sizeof(u64))

struct pv_mmu_buffer {
    u64 pteps[PV_MMU_PTEPS_BUFFER_LEN];
};

/*
 * Only permission demotions need to be notified, and only 3 bits are
 * available in the gptep.
 *
 * P  -> NP
 * Also covers changes to the page frame number (PFN), dropping the
 * accessed bit, setting a reserved bit, and User (U) -> Supervisor (S).
 *
 * RW -> RO
 * Also covers dropping the dirty bit.
 *
 * X  -> NX
 */
#define PV_MMU_SET_PTE_NP   _BITUL(0)
#define PV_MMU_SET_PTE_RO   _BITUL(1)
#define PV_MMU_SET_PTE_NX   _BITUL(2)
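
Since gpteps are 8-byte aligned, the low 3 bits of a cached gptep are free to carry the demotion type, which is presumably why only 3 bits are available. A sketch of the packing and unpacking (the helper names are hypothetical):

```c
#include <stdint.h>

#define PV_MMU_SET_PTE_NP   (1ULL << 0)
#define PV_MMU_SET_PTE_RO   (1ULL << 1)
#define PV_MMU_SET_PTE_NX   (1ULL << 2)
#define PV_MMU_PTE_FLAGS    (PV_MMU_SET_PTE_NP | PV_MMU_SET_PTE_RO | \
                             PV_MMU_SET_PTE_NX)

/* Pack the demotion flags into the low bits of the 8-byte-aligned gptep. */
static uint64_t pv_mmu_pack(uint64_t gptep, uint64_t flags)
{
    return (gptep & ~PV_MMU_PTE_FLAGS) | (flags & PV_MMU_PTE_FLAGS);
}

static uint64_t pv_mmu_gptep(uint64_t entry)
{
    return entry & ~PV_MMU_PTE_FLAGS;
}

static uint64_t pv_mmu_flags(uint64_t entry)
{
    return entry & PV_MMU_PTE_FLAGS;
}
```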
  • Hypercall
    • KVM_HC_PV_MMU_SET_PTE
      Notify the hypervisor about the GPTE update.
      • a0: the start index within the buffer
      • a1: the count of cached gpteps within the buffer
      • a2: the TLB flushing related flags
      #define PV_MMU_FLUSH_TLB_CURRENT  _BITUL(0)
      #define PV_MMU_FLUSH_TLB          _BITUL(1)
    • KVM_HC_PV_MMU_RELEASE_PT
      Notify the hypervisor that a GPT will be released.
      • a0: the GPA of the GPT
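
    On the guest side this could be issued through the standard kvm_hypercall3() path (the hypercall number below is a placeholder, and the hypercall itself is stubbed here so the argument marshalling runs outside a guest):

    ```c
    #include <stdint.h>

    #define KVM_HC_PV_MMU_SET_PTE     13  /* placeholder hypercall number */
    #define PV_MMU_FLUSH_TLB_CURRENT  (1ULL << 0)
    #define PV_MMU_FLUSH_TLB          (1ULL << 1)

    static uint64_t last_nr, last_a0, last_a1, last_a2;

    /* Stub standing in for kvm_hypercall3() (normally a VMCALL). */
    static long kvm_hypercall3(uint64_t nr, uint64_t a0, uint64_t a1,
                               uint64_t a2)
    {
        last_nr = nr;
        last_a0 = a0;
        last_a1 = a1;
        last_a2 = a2;
        return 0;
    }

    /* Commit [start, start + count) cached gpteps, with TLB-flush flags. */
    static long pv_mmu_set_pte(uint64_t start, uint64_t count, uint64_t flags)
    {
        return kvm_hypercall3(KVM_HC_PV_MMU_SET_PTE, start, count, flags);
    }
    ```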
  • Guest
    • PVOPs
      • set_pte
      • release_pte
      • lazy_mode
      • start_context_switch/end_context_switch
    • PTE modification
      All PTE modifications should go through the set_pte PVOP once PV MMU mode is enabled. However, some places in the Linux guest do not follow this, so they need to be changed:
      • ptep_get_and_clear
      • ptep_set_wrprotect
      • ptep_test_and_clear_young
    • TLB shootdown
      The cached gpteps in the buffer should be committed before sending the TLB shootdown IPIs.
      • inc_mm_tlb_gen
      • flush_tlb_all/flush_tlb_kernel_range/arch_tlbbatch_flush
  • Hypervisor
    • PV MMU mode
      The PV MMU mode is an enlightened shadow MMU mode. After the guest enables it, write protection for the guest page table and the synchronized/unsynchronized SP mechanism are dropped.
    • SPTE synchronization delay
      During the KVM_HC_PV_MMU_SET_PTE hypercall, all committed gpteps are cached in a VM-global buffer, and spte synchronization is delayed until the guest requires a TLB flush, i.e., until kvm_mmu_sync_roots().
  • Debug
    • Problem
      How can all PTE modifications be intercepted, now that the guest page table is no longer write-protected, if some code forgets to use the operations above?
@bysui bysui mentioned this issue Oct 8, 2024
@lkml-likexu

  1. Is there any POC-level code showing the details of your design, especially the PTEPS buffer?
  2. Is there any performance data to help us buy into the design?
  3. How does this PV MMU mode coexist with the legacy shadow MMU mode, e.g., for handling host/guest shared memory or DMA buffers?
  4. How do you ensure guest spte consistency in the PVM MMU context when multiple vCPUs trigger the TLB synchronization delay mechanism on the same address space in parallel?
