I have a question about NCR3 switching (AMD) #12

Closed
ghost opened this issue Feb 21, 2021 · 4 comments

ghost commented Feb 21, 2021

Hello, thank you for the awesome idea of swapping NCR3 on vmexit as opposed to switching the permissions of each individual page. I am currently trying to implement this in my own AMD hypervisor for Windows, and I have run into a very strange issue.

I have two tables:
1 primary table, with all pages set to allow RWX, except for 1 hooked page which is RW only
1 secondary table, with all pages set to RW only, except for 1 hooked page, which is RWX and points to my modified copy of the original page
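
For readers following along, the two-table layout above can be sketched as a small model. This is only an illustration: `ModelNptEntry`, `model_build_tables`, and the `NPT_*` flags are hypothetical names, and real AMD nested page tables encode permissions through present, write, and no-execute bits in a multi-level structure rather than a flat array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission flags for the model (not the hardware encoding). */
#define NPT_R 0x1u
#define NPT_W 0x2u
#define NPT_X 0x4u

/* Hypothetical flat model: one entry per guest-physical page. */
typedef struct {
    uint8_t  perms;     /* combination of NPT_R / NPT_W / NPT_X */
    uint64_t host_pfn;  /* host page frame backing this guest page */
} ModelNptEntry;

/*
 * Fill both tables for a single hook.
 *   primary:   every page RWX, hooked page RW (no execute)
 *   secondary: every page RW,  hooked page RWX and remapped to shadow_pfn
 * Returns false if hooked_pfn is outside the modelled range.
 */
static bool model_build_tables(ModelNptEntry *primary, ModelNptEntry *secondary,
                               size_t page_count, uint64_t hooked_pfn,
                               uint64_t shadow_pfn)
{
    if (hooked_pfn >= page_count) {
        return false;
    }
    for (size_t i = 0; i < page_count; i++) {
        primary[i].perms = NPT_R | NPT_W | NPT_X;
        primary[i].host_pfn = i;           /* identity mapping */
        secondary[i].perms = NPT_R | NPT_W;
        secondary[i].host_pfn = i;         /* identity mapping */
    }
    primary[hooked_pfn].perms = NPT_R | NPT_W;
    secondary[hooked_pfn].perms = NPT_R | NPT_W | NPT_X;
    secondary[hooked_pfn].host_pfn = shadow_pfn;  /* modified copy */
    return true;
}
```

With this layout, exactly one of the two tables allows execution for any given page, which is why every transition between the hooked page and the rest of memory produces a nested page fault.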

The guest RIP seems to be "stuck" when I swap to the secondary table, and strangely this doesn't happen when I allow all pages to be RWX in the secondary table. By "stuck", I mean that the guest RIP constantly switches back and forth between the hooked page and a non-hooked page, in an infinite loop, without executing anything. I know this doesn't have anything to do with instructions being split across pages, I flush the TLB properly, and I also clear the VMCB clean bits.

[image]

My question: Have you had any similar problems when you implemented NCR3 switching?

Here is a snippet of my code, in vmexit handler:

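        // Check whether the faulting guest-physical address belongs to a hooked page.
        NPTHOOK_ENTRY* nptHook = GetHookByPhysicalPage(g_HvData, FailAddress);

        if (nptHook) {
            // Fault on the hooked page: use the table where only that page is executable.
            VpData->GuestVmcb.ControlArea.NCr3 = g_HvData->SecondaryNCr3;
        }
        else {
            // Fault anywhere else: use the table where everything but the hooked page is executable.
            VpData->GuestVmcb.ControlArea.NCr3 = g_HvData->PrimaryNCr3;
        }

        KeInvalidateAllCaches();

        // Clear VMCB clean bit 4 (NP) so the new NCr3 is reloaded, then flush the entire TLB.
        VpData->GuestVmcb.ControlArea.VmcbClean &= 0xFFFFFFEF;
        VpData->GuestVmcb.ControlArea.TlbControl = 1;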

Thanks in advance
@Zero-Tang
Owner

Hi, I didn't run into this issue. I admit, though, that this feature can sometimes cause high CPU usage because of the VM-Exits it generates, so it should only be applied to rarely invoked functions.
[screenshot: Windows 7 x64-2021-02-22-08-24-07]
The screenshot is taken from a Windows 7 x64 VM in VMware Workstation 15.5.7.
"The printer is out of paper" is the result of an inline hook on NtSetInformationFile. ARK tools like PCHunter and Win64AST won't show my hook on it.
From my perspective, I don't see anything suspicious in your debug print. However, I do think VM-Exits for a stealth inline hook are frequent on AMD-V. You might want to remove the debug prints; as you probably know, debug prints freeze the debuggee.

@tandasat

It could be due to a timer interrupt firing immediately after returning to the guest. The interrupt handler runs on another page, which leads to switching back to the primary table. Then, after the interrupt handler completes, the guest attempts to retry execution on the page that is now non-executable, which switches to the secondary table again, resulting in an infinite loop.

This can happen when #VMEXIT handling takes too long. I too suggest removing the debug print if it is synchronous (e.g., DbgPrint).
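
To make the hypothesis above concrete, here is a toy model of the ping-pong. It is a sketch under stated assumptions, not a measurement of any real system: time is counted in abstract units, the function name `model_guest_makes_progress` and its parameters are hypothetical, and the interrupt handler is assumed to cost one unit of guest time.

```c
#include <stdbool.h>

/*
 * Toy model: the guest wants to execute one instruction on the hooked page.
 * Every table switch costs exit_cost time units inside the hypervisor.
 * A periodic timer becomes pending every timer_period units and is
 * delivered as soon as the guest resumes.
 * Returns true if the hooked instruction executes within max_exits exits.
 */
static bool model_guest_makes_progress(unsigned exit_cost,
                                       unsigned timer_period,
                                       unsigned max_exits)
{
    unsigned long long now = 0;
    unsigned long long next_timer = timer_period;
    unsigned exits = 0;

    if (timer_period == 0) {
        return false;  /* a timer that is always pending never lets the guest run */
    }

    while (exits < max_exits) {
        /* Execute fault on the hooked page: switch to the secondary table. */
        now += exit_cost;
        exits++;

        if (now < next_timer) {
            return true;  /* no interrupt pending: the hooked code runs */
        }

        /* Interrupt delivered first; its handler lives on a normal page,
         * which is not executable in the secondary table: switch back. */
        now += exit_cost;
        exits++;
        while (next_timer <= now) {
            next_timer += timer_period;
        }
        now += 1;  /* handler runs, then returns to the hooked page */
    }
    return false;
}
```

In this model, `model_guest_makes_progress(10, 5, 1000)` never gets past the loop because an interrupt is always pending by the time the guest resumes, while `model_guest_makes_progress(1, 100, 1000)` succeeds after the first switch.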

Side note:
I also think KeInvalidateAllCaches() should be removed. It issues an IPI if there is more than one core, and an IPI (#INTR) is held pending while #VMEXIT is handled (GIF==0). So it could cause a deadlock if another processor also issues an IPI within #VMEXIT at the same time (e.g., CPU#0 waits for CPU#1 to process its IPI while CPU#1 waits for CPU#0 to process its IPI, both in #VMEXIT and not receiving #INTR). Even if I am mistaken on this, it does not sound like you have caches that need to be flushed, since you are not changing the contents of physical memory, just the translation.
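
A version of the switch without the cache invalidation might look roughly like the sketch below. The struct is a stand-in I made up for illustration (`ModelControlArea`, `model_switch_ncr3`, and the two constants are hypothetical names, not the real VMCB layout); only the three fields from the original snippet are modelled. The constants mirror the values used in that snippet: clean bit 4 covers the nested paging state, and a TLB control value of 1 requests a flush of the entire TLB.

```c
#include <stdint.h>

/* Names are hypothetical; values match the snippet in the question. */
#define MODEL_VMCB_CLEAN_NP        (1u << 4)  /* same bit that 0xFFFFFFEF clears */
#define MODEL_TLB_CONTROL_FLUSH_ALL 1u

/* Stand-in for the relevant part of the VMCB control area. */
typedef struct {
    uint64_t NCr3;
    uint32_t VmcbClean;
    uint8_t  TlbControl;
} ModelControlArea;

/*
 * Point the guest at a different nested page table.
 * No KeInvalidateAllCaches(): only the translation changes, not the
 * contents of physical memory, so no IPI is sent from the exit handler.
 */
static void model_switch_ncr3(ModelControlArea *ctl, uint64_t new_ncr3)
{
    if (ctl->NCr3 == new_ncr3) {
        return;  /* already on the requested table, nothing to reload */
    }
    ctl->NCr3 = new_ncr3;
    ctl->VmcbClean &= ~MODEL_VMCB_CLEAN_NP;           /* force NCr3 reload */
    ctl->TlbControl = MODEL_TLB_CONTROL_FLUSH_ALL;    /* drop stale translations */
}
```

Skipping the update when the table is unchanged is a design choice of this sketch; it avoids an unnecessary TLB flush on exits that do not actually switch tables.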

@ghost
Author

ghost commented Feb 23, 2021

@Zero-Tang @tandasat Thank you guys very much for the reply!
Tandasat, you were correct about the timer interrupt problem. I looked at the location the guest RIP was stuck at, and it was stuck here:
[image]

I have significantly sped up my vmexit handling: I limited my project to 1 hook and removed all NT function calls from the vmexit handler. I basically reduced the VMEXIT path all the way down to a few variable assignments, and this is the entire VMEXIT handler:


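void
HandleNestedPageFault(VPROCESSOR_DATA* VpData, GUEST_REGISTERS* GuestContext)
{
    NPF_EXITINFO1 ExitInfo1;

    // EXITINFO1 holds the nested page fault error code.
    ExitInfo1.AsUInt64 = VpData->GuestVmcb.ControlArea.ExitInfo1;

    // EXITINFO2 holds the faulting guest-physical address.
    ULONG64 FailAddress = VpData->GuestVmcb.ControlArea.ExitInfo2;

    // Current NCr3; read here but not used below.
    PHYSICAL_ADDRESS NCr3;

    NCr3.QuadPart = VpData->GuestVmcb.ControlArea.NCr3;

    // Only instruction-fetch faults trigger a table switch.
    if (ExitInfo1.Fields.Execute == 1) {
        // nptHook is not declared in this snippet; it refers to the single hook entry.
        if (nptHook->nptEntry->PageFrameNumber == FailAddress >> PAGE_SHIFT) {
            VpData->GuestVmcb.ControlArea.NCr3 = g_HvData->SecondaryNCr3;
        }
        else {
            VpData->GuestVmcb.ControlArea.NCr3 = g_HvData->PrimaryNCr3;
        }

        // Clear VMCB clean bit 4 (NP) and request a full TLB flush.
        VpData->GuestVmcb.ControlArea.VmcbClean &= 0xFFFFFFEF;
        VpData->GuestVmcb.ControlArea.TlbControl = 1;
    }
}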

However, now it still appears to be stuck at that exception handler. I will try to gather more information.
I also found a similar issue opened which I think is relevant to this discussion:
tandasat/HyperPlatform#11

EDIT: I should also say that I am testing on 1 core

@ghost
Author

ghost commented Feb 28, 2021

I finally fixed it. The issue was due to an instruction being split across 2 pages.
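
For anyone landing here with the same symptom: with the two-table layout from the first post, an instruction whose bytes straddle the hooked page and a neighbouring page can never be fetched, because each table makes one of the two pages non-executable. The thread does not say how the fix was implemented, so the following is only a sketch of the detection step, assuming 4 KiB pages; `model_spans_two_pages` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12  /* 4 KiB pages assumed */

/*
 * Returns true if the byte range [addr, addr + len) touches more than
 * one page. For an instruction at addr with encoded length len, this
 * means the fetch needs both pages to be executable at the same time.
 */
static bool model_spans_two_pages(uint64_t addr, unsigned len)
{
    if (len == 0) {
        return false;
    }
    uint64_t first_page = addr >> MODEL_PAGE_SHIFT;
    uint64_t last_page = (addr + len - 1) >> MODEL_PAGE_SHIFT;
    return first_page != last_page;
}
```

For example, a 5-byte instruction at 0x1FFE ends at 0x2002 and therefore spans two pages, while the same instruction at 0x1FFB ends at 0x1FFF and stays within one.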

ghost closed this as completed Feb 28, 2021