Unable to use HPA >= 4GB when vTLB is active #33
Ok, thanks for the notice and the explanation. I'll take a look at that if I have time. |
A patch for this is in the vtlb branch of my tree. @Nils-TUD @parthy feel free to give this a spin. You should boot the hypervisor with the "vtlb" parameter to force vTLB operation and map some host-physical region beyond 4GB into the guest (obviously you'd need a 64bit host system for that). Note that this causes 3-level PAE paging to be used instead of 2-level paging, so I would expect some performance impact. It would be good to run some benchmark with vTLB enabled, once with and once without the patch, to see how much overhead we add for this. |
I ran a single-core kernel compile today with vanilla NRE (without the hypervisor changes) and the two vTLB versions. I got 522.4s for the old version and 562.6s for the new one. That would be around 7.7% overhead introduced by the patch. It would be interesting to see some other machine's results, too. |
What machine are you using? If you configure a boot entry on erwin, I can test it on Ivy Bridge. |
I'm on a SandyBridge. I'm not sure I can provide such a boot entry easily.. I just build NRE with |
@blitz - it would be more interesting to run this on an old machine, because for the new ones with EPT, you wouldn't want to use vTLB anyway. I'd be more interested in performance numbers for P4, Yonah, Merom, Penryn machines. AFAIR Carsten had a P4-based Presler machine (CPUID F:6:2), where vTLB would be mandatory. Nonetheless, performance numbers for newer machines are welcome as well, to get a wider picture. |
@udosteinberg I'll see what I can find. @parthy Can you mail me the pulsar config and all the binaries? (Yes, I am lazy.) |
I've just tested it with my i5:
The old version needs 615s and the new version 656s. That's an overhead of 6.67%. |
So far I only got it to work on my laptop (i7 L640: 770s -> 819s 6%). For vtlb-only boxes, I could only find a Pentium D, but I couldn't get the benchmark to run right away. Working on it. |
Based on the reported numbers so far, overhead seems to be in the 6-8 percent range. Because that is significant, I would rather not enable PAE for the vTLB by default. We could make it a compile-time option. What do others think? |
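For reference, the overhead figures quoted in this thread can be recomputed from the reported kernel-compile times. This is only a minimal Python sketch: `overhead_percent` is a hypothetical helper name, and the "SandyBridge" label assumes that the 522.4s/562.6s report came from the SandyBridge machine mentioned right after it.

```python
# Kernel-compile times in seconds as reported in this thread:
# (label, time without the PAE patch, time with the PAE patch).
# The "SandyBridge" label is an assumption about which machine
# produced the first report.
reports = [
    ("SandyBridge", 522.4, 562.6),
    ("i5", 615.0, 656.0),
    ("i7 L640", 770.0, 819.0),
]


def overhead_percent(old_s, new_s):
    """Relative slowdown of the patched run versus the old run, in percent."""
    return (new_s - old_s) / old_s * 100.0


for label, old_s, new_s in reports:
    print(f"{label}: {overhead_percent(old_s, new_s):.2f}%")
```

All three results land between roughly 6 and 8 percent, which matches the range stated above.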
Is it possible to make it a runtime option? |
Sure, but it would inflate the binary because the code would have to include both vTLB versions. We could then even go as far as using the 2-level vTLB for VMs with HPA below 4GB and the 3-level vTLB for those with HPA above 4GB. I'm reluctant to do that because it results in non-deterministic VM performance for all sorts of benchmarks. |
On 08/03/2013 01:51 AM, Udo Steinberg wrote:
It would be nice, if NOVA Just Works™ even on old boxes. Maybe have the Julian |
It's not that you can't use NOVA on those machines ;) You just cannot assign memory beyond 4GiB (physical) to the guest. It is a limitation that you can circumvent, but not a killer. So to me it would make sense to do it the other way round: Only have a special case if you want to compile NOVA for use on such an older machine with >4GiB RAM. Maybe you could have a compile switch saying "build and use both vTLB versions"? With that, the default version advocated by the README would still be deterministic in this sense, but if you choose to overcome the limitation, you get to use higher addresses and only pay the performance overhead where necessary, but have to compile the kernel differently. |
It only works if a) the box has no memory beyond 4G or b) the userspace is aware of this limitation and will not use memory beyond 4G for VMs. |
Another option would be to configure the vTLB as follows:
This would work because the 32bit API does not support addresses beyond 4GB anyway. So 32bit vTLB will be ~7% faster than 64bit vTLB, and performance will be deterministic for all VMs. Another benefit is that this configuration would exercise both vTLB variants regularly. |
Sounds reasonable. |
I had to increase kernel memory to avoid out of memory on NRE bootstrap for both versions. Weird. Overhead stays at roughly 6%. I can't see a difference to the older patch. |
The older patch did not flag the guest's global pages correctly, thus address space switches would flush them from the vTLB. So now that we can keep them, I would expect decreasing overheads. But maybe Linux' use of global pages is not very significant. |
I can see a slight, but unstable improvement of 0.3-1%. So I would not expect a noticeable difference, or at least not in a kernel compile. Maybe we should try a different workload as well? |
It should be quite a bit more noticeable if you run L4/Fiasco + Pingpong in a VM on top of NOVA. Especially the Inter-AS benchmark should see some improvement. |
For me it doesn't make any difference. I've run the kernel-compile-test twice and both times it took 656s, i.e. the same time as with the previous vtlb-version. |
I've just pushed another version of the vtlb branch. This time it should really make a difference. |
With this version, I get almost exactly the same results as without PAE. |
The same here. Without PAE it takes 615s, with PAE 618s :) |
The patch seems to have introduced a bug, though. If I try to boot escape, it hangs in an endless vTLB-miss-loop @ 0xc0131623 (see nre/dist/imgs/escape.bin). This is the instruction behind the one that enables paging. This does not occur with the master branch of NOVA. |
How do I get the escape binaries? They are not in my tree. |
Just execute ./dist/download.sh in the directory nre. |
If it matters...I've tested it with qemu. So, for example by |
For some reason NRE does not seem to find its ISO image in the file system...
The vmconfig file looks like this...
|
That's the output of Escape, which isn't able to find the ROM-disk, i.e. the ISO-image here.
|
NOVA gets the TLB miss address from VMCB->exitinfo2 in It would be interesting to see what real AMD HW does. Obviously I don't have any here. Or if anyone can dig up the relevant part of the SVM spec that clarifies what happens to the high bits of exitinfo2 in 32bit mode, that would also help. |
Interesting. It's the same here on qemu. Unfortunately, I don't have an AMD box either. |
FWIW, NRE + Escape works on Intel CPUs with vTLB. So it's clearly related to the SVM interface code. It shouldn't have anything to do with the recent vTLB changes and older versions should show the same symptoms. |
According to the recollection of an SVM architect, the intercepted page-fault VM exit should write to all 64 bits of the exitinfo2 field in the VMCB. Could someone with AMD hardware please verify that this is the case? Just add It will blow up in QEMU, and the question is whether it survives on AMD hardware. If it does, then we may need to file a QEMU bug. |
I'll look for one and report back. |
I've just seen that Björn has an AMD box here:
Therefore, I've tested it now and it works, i.e. the upper bits of cr2 are always zero. |
So this is a qemu bug? |
If the very same setup works on real HW and does not work in QEMU, then it is indeed a QEMU bug. Interestingly enough the problem is only exposed by Escape running as guest OS. I have not spent the time figuring out why, but I suspect that Escape causes some VM exit that sets high bits in exitinfo2 and those bits then remain set during vTLB-related page-fault exits. Linux does not expose the issue. So you really need to test identical setups with Escape as guest OS. |
It's not some other VM exit, it's the page-fault exit itself. With Escape I'm seeing several instances of...
The relevant code in QEMU seems to be
in |
I guess the reason why it does not occur with e.g. Linux is that Linux does not use the "GDT trick" that Escape uses: Escape configures the GDT to have a base address of 0x40000000, so that the virtual address 0xC0000000 ends up at physical address 0x0. If you look at the addresses in exitinfo2, you see that bit 32 is set: 0x00000001001316b0. |
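The arithmetic behind this explanation can be checked with a few lines of Python. This is a sketch rather than QEMU's or NOVA's actual code: `linear_address` is a hypothetical helper, and the offset 0xC01316B0 is an assumption, inferred by subtracting the segment base from the reported value 0x00000001001316b0.

```python
MASK32 = 0xFFFFFFFF


def linear_address(segment_base, offset, truncate_to_32_bits=True):
    """Segment base plus offset; a 32-bit guest wraps the sum at 4 GiB."""
    total = segment_base + offset
    return total & MASK32 if truncate_to_32_bits else total


GDT_BASE = 0x40000000      # segment base used by Escape, per the comment above
KERNEL_VADDR = 0xC0000000  # virtual address that should land at physical 0x0
FAULT_OFFSET = 0xC01316B0  # assumed offset, inferred from the reported value

# With 32-bit wrap-around the kernel base maps to 0x0, as Escape intends.
print(hex(linear_address(GDT_BASE, KERNEL_VADDR)))
# The faulting address as a 32-bit guest should see it.
print(hex(linear_address(GDT_BASE, FAULT_OFFSET)))
# Without truncation, bit 32 stays set, matching the value seen under QEMU.
print(hex(linear_address(GDT_BASE, FAULT_OFFSET, truncate_to_32_bits=False)))
```

Linux uses flat segments with base 0, so the sum never carries into bit 32 there, which would explain why only Escape exposes the problem.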
Yep, that sounds like a reasonable explanation for what we're seeing. So feel free to open a QEMU bug for this and maybe put a link here for future references. |
I've filed a bug: https://bugs.launchpad.net/qemu/+bug/1211910 |
This is not really an NRE bug, but something that NRE needs to be aware of, since we have seen 64-bit NRE use high addresses for VMs, which has exposed the problem...
For 32-bit guests, the vTLB uses a 2-level shadow page table with 4-byte PTEs. Even though a 64-bit VMM can install a GPA-to-HPA mapping where HPA >= 4GB, the vTLB cannot store an HPA wider than 32 bits in its shadow PTEs. The most recent version of the microhypervisor catches such cases and terminates the vCPU.
To be able to make use of HPA beyond 4GB when the vTLB is active, the microhypervisor would have to use a 3-level PAE shadow page table with 8-byte PTEs, which has performance implications.
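To illustrate the difference between the two shadow page table formats, here is a rough Python sketch of the standard x86 address split for 2-level and PAE paging, plus the check that decides whether an HPA fits into a 4-byte PTE. The function names are hypothetical and this is not the microhypervisor's code; the sample virtual address is borrowed from the thread purely as input.

```python
FOUR_GIB = 1 << 32


def fits_in_legacy_pte(hpa):
    """A 4-byte PTE holds a 32-bit frame address, so the HPA must be below 4 GiB."""
    return 0 <= hpa < FOUR_GIB


def split_2level(vaddr):
    """2-level paging: 10-bit directory index, 10-bit table index, 12-bit offset."""
    return ((vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF)


def split_pae(vaddr):
    """PAE paging: 2-bit PDPT index, 9-bit directory index, 9-bit table index, 12-bit offset."""
    return ((vaddr >> 30) & 0x3, (vaddr >> 21) & 0x1FF,
            (vaddr >> 12) & 0x1FF, vaddr & 0xFFF)


vaddr = 0xC0131623  # sample 32-bit guest virtual address
print(split_2level(vaddr))  # two table lookups per walk
print(split_pae(vaddr))     # three table lookups per walk
print(fits_in_legacy_pte(0xFFFFF000), fits_in_legacy_pte(0x100000000))
```

The extra lookup level and the 8-byte entries are the source of the additional cost when the vTLB has to fill shadow entries in PAE format.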
For the time being, NRE should avoid using HPA >= 4GB for VMs that use the vTLB.
The vTLB is used when "vtlb" is specified on the hypervisor command line. Otherwise it is used...