-
Notifications
You must be signed in to change notification settings - Fork 408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is vmxoff path really safe/correct? #3
Comments
Thank you for the through description of questions. As for the second question, VMXOFF does not change a value of CR3. However, it does not mean that VMXON and VMXOFF can only be done on the SYSTEM process. As long as a value of CR3 after VMXOFF remains the same as that of before VM-exit, it will be fine. In case of HyperPlatform, it simply sets the same CR3 value for both the SYSTEM process (VmcsField::kGuestCr3) and the VMM (VmcsField::kHostCr3), and does not restore the CR3 that was being used at the time of VMCALL after VMXOFF. For this reason, VMXOFF can only be safely performed from the SYSTEM process in current implementation. The following are changes of CR3 w.r.t. VMXOFF:
So, if VMCALL leading to VMXOFF is executed with a different CR3 from that of VmcsField::kHostCr3, this would cause crashes, as an original CR3 is not going to be restored even after VMXOFF. Does this answer your question clear enough? As for the first question, I do not have an answer right now, but it sounds good pointing out. I will look into it and will reply here shortly. Thank you, |
As for the restoring registers, you are right. I believe that the VMM should save and restore non-interger registers too if they can be modified by the VMM since they are not automatically saved and restored by a processor on VM-exit and VM-enter. In case of HyperPlatform, it does not appear to be an issue because those unhandled registers are not modified by the VMM; or VM-exit does not happen under a situation where consistency of those registers is required. For this reason, I would hold off updates unless an concrete issue is seen. Thank you, |
Hi, Thank you for the explanation on CR3. That makes a lot of sense. It looks like the requirement for SYSTEM could go away, if, after the vmcall, the CR3 would be captured, and overwritten in the kGuestCr3 field before the vmxoff. But I don't think it's worth doing that. As for non-integer registers, on x64, the calling convention does support the compiler auto-generating function calls which use xmm registers. This will happen automatically in certain cases, in other cases, one must use __vectorcall as the calling convention to force this. Because of this, it is certainly possible for someone to have a vector function that is using volatile XMM registers, and hit a VMM Exit. Because you are using Visual Studio and because Windows Kernel x64 allows XMM registers, I think it's technically possible for hyperplatform.sys to end up using XMM registers -- for example if a memcpy becomes inlined + vectorized. So I think this is taking a risk that certain optimization settings/further code might hit. It may be OK now, but not if someone adds more routines without realizing them. Thanks for your insight! |
Thank you for sharing your knowledge on use of XMM registers. I was unaware of most of them. I agree with your thoughts on potential risk of not fixing it. While the issue may not hit at this time, it would be tough to figure out a bug is caused by XMM registers for someone who are not aware of it. Given the fact that this project is meant to be a platform, fix should be considered. For now, I have created #5 for reference and potential future work. I appreciate your thorough risk assessment and suggestion. |
while working on bareflank.org, I ran into a similar issue with XMM, as I do believe they need to be addressed. The old vmxcpu code never saved these registers because it only ran on 32bit, which did not included XMM registers in the ABI, which is why we never included them in the MoRE hypervisor (as it's based on vmxcpu). On Bareflank, we use Libc++ which touches XMM and FPU registers, so these absolutely need to be saved in our case. Our current solution is to use the fxsave and fxrstor functions. In our state save area. Figured this info might help you guys in the future if you choose to fix this problem. Here is our PR that adds this code if you want an example |
I highly recommend you use XSAVE/XRESTORE if available, otherwise you risk to miss certain registers and have hard to debug issues :)
|
Thanks for sharing your findings. While I am not entirely familiar with the extended states, looks like an XSAVE feature set is more comprehensive ones, as far as I can see from this slides from Intel. I originally thought a chance of hitting this issue was small and negligible, but it does not seem convincing anymore since Libc++, which one of the most common libraries, is affected. Libc++ may not going to be used inside HyperPlatform, but I would not surprise if other libraries triggered the same issue. I will study the extended states first (ie, supported processor generations, difference between XSAVE, -OPT, -S and -C, and any implications) and then will work on implementation for fix. |
Thanks for the advice, yeah we will be using XSAVE as well. The PR that is there will be updated with a couple of additional fixes as I found a nasty issue with -O3 as GCC uses SSE a lot with the heavy optimizations enabled. The problem is, GCC assumes the stack is 16 byte aligned, and so we were seeing segfaults as a result. The other thing I was missing was a restore on "promote" when coming out of VMX-non root. I should have my fixes in today for these. |
As far as preservation of x87, SSE and AVX are concerned (and that assumption is fine I think--need better understanding), the simplest way seems to use XSAVE and XRESTORE. It will require only checks for:
If either of them is not supported, using FXSAVE and FXRSTOR is the cleanest solution as they "save and restore the states of the XMM registers and the MXCSR register (SSE state), along with x87 state" (FXSAVE AND FXRSTOR INSTRUCTIONS), or RtlCaptureContext() as they seem to save SSE state too. Some notes below. Terms and Basics (XSAVE-SUPPORTED FEATURES AND STATE-COMPONENT BITMAPS)
Availability (ENUMERATION OF CPU SUPPORT FOR XSAVE INSTRUCTIONS AND XSAVESUPPORTED FEATURES)
Data structures (XSAVE AREA)
|
I am writing the code right now so I'll update the PR to bareflank that has at least what we are working on. With any luck I should have ours fixed by tonight. The approach we are taking is assuming that XSAVE is available as we only support SandyBridge and above. Once you have XSAVE, you can get the size of the save area by doing a
One thing I learned (took a bit to debug) was that xrstor will GPF if you don't zero out this memory. Since we have multi-core support, we do a lot of work on the stack to keep things thread-safe, and so not clearing the state save area was causing bits in the state save area to randomly trigger GPFs on restore. Also don't forget to make sure the state save area is 64byte aligned. The harder part is how to handle XSETBV and CPUID. There does not seem to be a way to fake the result of XGETBV. At least, a call to XSETBV always traps, so we can ignore writes that disable bits we are using (will likely add MPX support in the future). Not really sure how best to handle XGETBV, but hopefully Windows (as Linux is ok) just enables all of the XCR0 bits and leaves it alone. |
Ok, I updated the PR. It's not done yet as I still have to plum in the rest of the entry points, and I need to add the XSAVE logic to the exit handler, but this at least shows one way to use it |
Hi, Note that for Windows, there are APIs such as RtlGetProcessorExtendedFeatures and KeSaveExtendedProcessorState/KeRestoreExtendedProcessorState which should take care of most of this. |
Thank you for APIs, Alex. I did not realize XSTATE_MASK_GSSE meant the AVX state. Those functions are very helpful. I will write tests and update code shortly. @rianquinn, thanks for sharing gotchas and code. Thanks to API provided by Windows, I am likely to be able to avoid direct use of the XSAVE feature set instructions. As for XCR0, if I recall correctly, I hardly saw VM-exit via XSETBV on Windows, perhaps, only at the time of system shutdown or something under a non-regular situation. So it may not be a big issue on Windows too. |
I was going to use APIs, but KeSaveExtendedProcessorState() could be used only when IRQL <= DISPATCH_LEVE, while VM-exit could be executed at any IRQL. I will use xsave/xrestore directly once availability of those instructions and AVX support is validated with RtlGetEnabledExtendedFeatures() (for some reasons, I did not find declaration of the API and had to write a prototype of it manually). @rianquinn, you may be able to use intrinsics functions for xsave/xrestore etc depeding on compiler's support. At least MSVC supports them well enough for my porpose. |
Sadly in my case, the entry point for both the Kernel -> C++ and the entry point for the exit handler need to start with a simple ASM block as I have to control each register access very carefully, so I have a bit of hand written ASM to pull that off. But yeah... One thing that I will likely work on at some point is porting some of the intrinsics code that we do have to the Intel calls, as GCC does support these, and reducing the amount of ASM code we have would be a good thing. The code I linked above not only has to run xsave, but it also has to do MS64 ABI to System V ABI conversions (not in there yet), so some of that ASM work that you see is preperation for that code. |
I have not faced an issue and found documents w.r.t zero-ed out memory on XRSTOR either. Instead, I had to debug a bit to noticed that XSAVE/XRSTOR raises an exception when CR0.TS[bit 3] is 1. I only saw this situation on x86 Windows for now. As workaround, code like below worked.
It is kind of ugly though. |
In our case, we only support 64bit, so we don't have to worry about TS as it has no purpose on x86_64 since task switching is not available. That being said, Windows doesn't use x86 tasks, so you should be ok disabling it. |
Actually Windows (like Linux and OS X) uses tasks for Double Faults and Additionally, video mode switching is implemented, in some cases, with the Satoshi: IRQL requirements don't mean much when running in VMM context. For Best regards, On Mon, Jul 4, 2016 at 11:16 AM, Rian Quinn notifications@github.com
|
Thank you for the note, Alex. I do not think I understand why you cannot call API when interrupts are disabled. The eflags.IF is cleared when VM-exit happened but the IF only affects hardware interrupt, and exceptions can still occur. I tested that even page-in was processed fine in the VMM-context if IRQL is PASSIVE_LEVEL. To my knowledge, the IRQL requirement mostly stems from if page-in can be processed--in other words, if the process can enter wait state--, and interrupts are irrelevant. I bet that I am missing something and appreciate if you could explain a bit more about why disabling interrupts is technically the same as being IRQL==HIGH_LEVEL. |
A test program attached. ConsoleApplication1.zip |
Hi Satoshi, Wow -- I cannot believe you were crazy enough to try page-in from VMM
In other words, if you call, for example, ExAllocatePoolWithTag, and this Hope this makes sense. Best regards, On Mon, Jul 4, 2016 at 9:21 PM, Satoshi Tanda notifications@github.com
|
Wow, thank you for the clear explanation. I understand what you meant now. I knew that VMM context should not be context switched and that it did not receive the clock interrupt and IPI, but was not able to think of a connection between IPI and interrupt. It is obvious that IPI relies on a hardware interrupt. And if IPI did not work, TLB flush would not work neither. While I started to write an initial version of VMM, I noticed that calling API from VMM-context made the VMM noticeably unstable or caused funny symptoms like this even if I obeyed requirements of IRQL. So I decided to avoid using them as much as possible, but I was not able to explain well how it could cause trouble. The scenario of ExAllocatePoolWithTag + NonPagedPool is a such good example. By the way, the test for paging-in was needed to make my question clearer to ask you for explanation. Besides, such experiments are always fun :) |
Second that!!! |
Oh yeah, I meant "good crazy", not "stupid crazy" ;-) Best regards, On Tue, Jul 5, 2016 at 9:57 AM, Rian Quinn notifications@github.com wrote:
|
By the way @tandasat there is new Windows 10 API: KeGetEffectiveIrql() which will return HIGH_LEVEL at all times while in VMM mode ;-) Proving my point :) |
That is beautiful :) I should probably document that fact. |
Hi,
I experimented with turning VMX off by using DPC routine targeted to each processor, with synchronization barrier. If DPCs were delivered as part of SYSTEM process (such as from idle thread, or from system thread), everything worked fine. However, if a process was currently running, and the DPC interrupted the process, many times the process crashes with invalid memory access, or sometimes a kernel driver itself crashes (such as win32kfull running on the context of dwm.exe), when accessing session memory.
These problems go away if doing VMX off the way you do it, by sticking with a single thread that is already running under SYSTEM process. However, even if you have a comment that suggests you saw some strangeness.
So therefore, I have two questions:
Must vmx_launch and vmx_off be done in the same address space? In other words, will vmx_off restore the CR3 that was saved during vmx_launch? Because if so, I can understand the danger of capturing CR3 in system process (session -1), but then doing vmx_off in dwm.exe (session 0). The totally different CR3 will certainly cause a crash. Does this mean that it's only safe to do on/off from within SYSTEM process?
Thank you.
The text was updated successfully, but these errors were encountered: