forked from torvalds/linux
PV MMU Design #13
Background
For shadow MMU, in order to intercept modifications to the guest page table, all guest page tables are write-protected. This means that each modification will trigger a #PF VM-exit and an instruction emulation, resulting in poor performance when the guest frequently modifies the page table, such as when the guest application frequently allocates and frees memory.
Even with the synchronized and unsynchronized pages optimization, the L1 guest page table is allowed to be writable after a write page fault and is made write-protected again when the guest performs TLB flushing. This reduces the #PFs and emulations when the guest modifies multiple gptes in the L1 page table. However, this also has some drawbacks, including the following:
- L1 SP to be unsynchronized. This may increase the lock hold time when there are multiple processes in the Linux guest, all of which share the kernel mapping PGDs. A modification to the kernel mapping will mark all root SPs as unsynchronized, and each root SP will need to be synchronized again when it is loaded, as root SP synchronization only marks the current root SP as synchronized.
- INVLPG emulation only synchronizes modified gptes, but the SP is still marked as unsynchronized. If the guest uses INVLPG to do TLB flushing one by one, then only one vCPU needs to perform spte synchronization, but other vCPUs usually do nothing after acquiring the MMU lock. This increases the MMU lock contention.
Purpose
Firstly, not all page table modifications need to be reported to the hypervisor; only changes that require a later TLB flush matter, which means only permission demotions need to be reported.
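As a rough illustration of the "only permission demotions" rule, the sketch below classifies a PTE change by comparing the old and new values. The bit positions are the architectural x86-64 ones, but the helper name `pv_mmu_pte_needs_notify` and the exact set of checks are assumptions made for this example, not part of the proposal.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86-64 PTE bits (architectural positions). */
#define PTE_PRESENT  (1ULL << 0)
#define PTE_RW       (1ULL << 1)
#define PTE_USER     (1ULL << 2)
#define PTE_NX       (1ULL << 63)
#define PTE_PFN_MASK 0x000FFFFFFFFFF000ULL

/*
 * Hypothetical helper: returns true when changing a PTE from old_pte to
 * new_pte removes something that a stale TLB entry could still grant,
 * i.e. when the change requires a later TLB flush.
 */
static bool pv_mmu_pte_needs_notify(uint64_t old_pte, uint64_t new_pte)
{
	/* A non-present entry is never cached, so any new value is a promotion. */
	if (!(old_pte & PTE_PRESENT))
		return false;

	/* Unmapping a present entry. */
	if (!(new_pte & PTE_PRESENT))
		return true;

	/* Remapping to a different page frame. */
	if ((old_pte ^ new_pte) & PTE_PFN_MASK)
		return true;

	/* Writable -> read-only. */
	if ((old_pte & PTE_RW) && !(new_pte & PTE_RW))
		return true;

	/* User -> supervisor. */
	if ((old_pte & PTE_USER) && !(new_pte & PTE_USER))
		return true;

	/* Executable -> no-execute. */
	if (!(old_pte & PTE_NX) && (new_pte & PTE_NX))
		return true;

	return false;
}
```

With this classification, installing a new mapping or upgrading a read-only mapping to writable would not be reported, while unmapping or write-protecting a present entry would be.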
Secondly, inspired by the synchronized and unsynchronized pages optimization, page table modifications do not need to be immediately notified to the hypervisor. They can be delayed until TLB flushing later. This allows the guest to cache the modifications and commit them together during TLB flushing, reducing the number of notifications (hypercalls).
Finally, without write protection, the guest needs to notify the hypervisor to free the SP when the guest frees the page table. Otherwise, KVM will reach the SP limit quickly and reclaim the SP frequently, leading to bad performance.
Design
Pagetable operations
All pagetable modifications in the guest need to use the specific PTE operation functions.
These functions should be used when the guest modifies a PGD/P4D/PUD/PMD/PTE entry. If the modification requires TLB flushing later, the gptep can be cached and committed later.
These functions should be used when the guest frees the page table memory.
Lazy mode
Following the synchronized and unsynchronized pages design, the guest can cache the modified gpteps in its ring buffer during pagetable modification. During TLB flushing, all cached gpteps are committed to the hypervisor, and the hypervisor can synchronize the associated spteps directly.
Global ring buffer vs Per-CPU ring buffer
Although the TLB is CPU scoped, the page table is shared between all CPUs. Therefore, the ring buffer should be global: when one CPU attempts to perform TLB flushing, all cached gpteps should be committed so that the spteps can be synchronized. The CPU can then see the PTE changes made by other CPUs and the updated shadow page table.
A global ring buffer complicates things and requires protection semantics. Therefore, we are considering a Per-CPU ring buffer, where each vCPU caches the modified gpteps in its own ring buffer. The vCPU should commit the cached gpteps before sending IPIs to other vCPUs to shoot down TLBs.
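The per-CPU scheme can be sketched in plain C as follows. This is a userspace model under stated assumptions, not guest kernel code: the structure layout, the capacity `PV_MMU_BUF_ENTRIES`, all function names, and the "commit early when the buffer is full" policy are hypothetical, and the hypercall and IPI are replaced by counters so that the ordering can be observed.

```c
#include <stddef.h>
#include <stdint.h>

#define PV_MMU_BUF_ENTRIES 4	/* hypothetical capacity, kept small for the demo */

/* Hypothetical per-vCPU buffer of guest PTE pointers awaiting commit. */
struct pv_mmu_buf {
	uint64_t gpteps[PV_MMU_BUF_ENTRIES];
	size_t nr;
};

/* Counters standing in for the real commit hypercall and shootdown IPI. */
static unsigned int nr_commits, nr_committed, nr_ipis;
static unsigned int commits_seen_at_ipi;

/* Stand-in for the commit hypercall: hand over all cached gpteps at once. */
static void pv_mmu_commit(struct pv_mmu_buf *buf)
{
	if (!buf->nr)
		return;
	nr_commits++;
	nr_committed += buf->nr;
	buf->nr = 0;
}

/* Cache one modified gptep; assumption: commit early if the buffer is full. */
static void pv_mmu_cache_gptep(struct pv_mmu_buf *buf, uint64_t gptep)
{
	if (buf->nr == PV_MMU_BUF_ENTRIES)
		pv_mmu_commit(buf);
	buf->gpteps[buf->nr++] = gptep;
}

/* Stand-in for sending TLB shootdown IPIs to the other vCPUs. */
static void send_tlb_shootdown_ipis(void)
{
	commits_seen_at_ipi = nr_commits;
	nr_ipis++;
}

/* Commit first, then shoot down: remote vCPUs must see synchronized spteps. */
static void pv_mmu_flush_tlb_others(struct pv_mmu_buf *buf)
{
	pv_mmu_commit(buf);
	send_tlb_shootdown_ipis();
}
```

The point of the ordering in `pv_mmu_flush_tlb_others()` is that the shadow page table is already up to date by the time a remote vCPU handles the IPI and refills its TLB.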
However, this only works when PTE modification and TLB flushing are atomic. The Linux kernel memory management has some mechanisms that delay TLB flushing (e.g., mmu_gather and batched unmap TLB flush), which require some extra changes to work correctly. When one CPU delays TLB flushing and marks TLB flushing as pending, another CPU in the munmap/mprotect path will check the TLB flushing pending status if it sees that the PTE is not present. If it sees that TLB flushing is pending, it will attempt to do the TLB flushing for this memory (mm). However, the CPU that modified the page table may not have this memory loaded, so it does not receive the TLB flushing IPI, which means its cached gpteps are not committed and the shadow page table is not updated. As a result, this CPU will still see an outdated TLB. Therefore, a CPU that modifies the page table of a memory other than the currently loaded one should commit the gpteps immediately, or it should add itself to the target CPU range of the TLB flushing IPI, so that the CPU that needs to do the TLB flushing can send the IPI to it.
Detect and Setup
KVM_FEATURE_PV_MMU
A new virtual MSR is used to record the GPA of the per-CPU ring buffer. One bit is used to indicate that the PV MMU mode is enabled.
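One possible encoding of such an MSR value is sketched below. The proposal only states that the MSR holds the buffer GPA and that one bit signals enablement; the choice of bit 0, the 64-byte alignment that frees the low bits, and all names here are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PV_MMU_MSR_ENABLED (1ULL << 0)	/* hypothetical enable bit */
#define PV_MMU_BUF_ALIGN   64ULL	/* hypothetical buffer alignment */

/* Pack an aligned buffer GPA and the enable bit into one MSR value. */
static bool pv_mmu_msr_encode(uint64_t buf_gpa, uint64_t *val)
{
	/* The low bits must be free so they can carry flags. */
	if (buf_gpa & (PV_MMU_BUF_ALIGN - 1))
		return false;
	*val = buf_gpa | PV_MMU_MSR_ENABLED;
	return true;
}

/* Hypervisor-side view: is PV MMU mode enabled in this MSR value? */
static bool pv_mmu_msr_enabled(uint64_t val)
{
	return (val & PV_MMU_MSR_ENABLED) != 0;
}

/* Hypervisor-side view: recover the buffer GPA by masking off the flags. */
static uint64_t pv_mmu_msr_buf_gpa(uint64_t val)
{
	return val & ~(PV_MMU_BUF_ALIGN - 1);
}
```

Under this layout, each vCPU would write its own buffer address once during setup, and writing a value with the enable bit clear would turn the mode off for that vCPU.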
Ring buffer
For simplicity, the buffer in the first version is implemented as a simple buffer rather than a true ring buffer such as the perf ring buffer.
Notify the hypervisor about the GPTE update.
Notify the hypervisor that GPT would be released.
All PTE modifications should use the set_pte PVOPs after enabling PV MMU mode. However, some places in the Linux guest do not follow this, so we need to change them.
The cached gpteps in the buffer should be committed before sending the TLB shootdown IPIs.
The PV MMU mode is an enlightened shadow MMU mode. After the guest enables it, write protection for the guest page table and synchronized/unsynchronized SP are dropped.
During the KVM_HC_PV_MMU_SET_PTE hypercall, all committed gpteps are cached in the VM global buffer, and spte synchronization is delayed until the guest needs the TLB flushing, which is kvm_mmu_sync_roots().
How to intercept all PTE modifications if someone forgets to use the previous operations, given that there is no write protection for the guest page table?