I have a question about NCR3 switching (AMD) #12

ghost · 2021-02-21T22:35:09Z

Hello, thank you for the awesome idea of swapping the nested CR3 (NCR3) on vmexit as opposed to switching the permissions of each individual page. I am currently trying to implement this in my own AMD hypervisor for Windows, and I have run into a very strange issue.

I have two tables:
1 primary table, with all pages set to allow RWX, except for 1 hooked page which is RW only
1 secondary table, with all pages set to RW only, except for 1 hooked page, which is RWX and points to my modified copy of the original page
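
To make the two-table layout above concrete, here is a minimal sketch that models each table as one permission byte per guest physical page. Everything in it (`build_views`, `views_are_complementary`, the `PERM_*` flags) is hypothetical and only illustrates the layout described in this post; real nested page table entries express "RW only" through the NX bit rather than a separate byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission flags for the model, not real NPT entry bits. */
enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };
#define PERM_RW  (PERM_R | PERM_W)
#define PERM_RWX (PERM_R | PERM_W | PERM_X)

/* Fill both views for page_count pages with a single hooked page. */
static void build_views(uint8_t *primary, uint8_t *secondary,
                        size_t page_count, size_t hooked_page)
{
    assert(hooked_page < page_count);
    for (size_t i = 0; i < page_count; i++) {
        bool hooked = (i == hooked_page);
        /* Primary: everything executable except the hooked page. */
        primary[i] = hooked ? PERM_RW : PERM_RWX;
        /* Secondary: only the hooked page is executable. */
        secondary[i] = hooked ? PERM_RWX : PERM_RW;
    }
}

/* True when every page is executable in exactly one of the two views. */
static bool views_are_complementary(const uint8_t *primary,
                                    const uint8_t *secondary,
                                    size_t page_count)
{
    for (size_t i = 0; i < page_count; i++) {
        if (((primary[i] ^ secondary[i]) & PERM_X) == 0) {
            return false;
        }
    }
    return true;
}
```

Because the two views are exact complements for execution, any instruction fetch from a page that is not executable in the active view raises a nested page fault, and that fault is what drives the switch.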

The guest RIP seems to be "stuck" when I swap to secondary table, and strangely this doesn't happen when I allow all pages to be RWX in the secondary table. By being "stuck", I mean that the guest RIP constantly switches back and forth between hooked page and non hooked page, in an infinite loop without even executing anything. I know this doesn't have anything to do with instructions being split across pages, I flushed TLB properly, and I also cleaned VMCB cache bits.

My question: Have you had any similar problems when you implemented NCR3 switching?

Here is a snippet of my code, in vmexit handler:

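        NPTHOOK_ENTRY* nptHook = GetHookByPhysicalPage(g_HvData, FailAddress);

        // Fault on a hooked page: use the table where only that page is executable.
        if (nptHook) {
            VpData->GuestVmcb.ControlArea.NCr3 = g_HvData->SecondaryNCr3;
        }
        // Fault anywhere else: go back to the table where the hooked page is not executable.
        else {
            VpData->GuestVmcb.ControlArea.NCr3 = g_HvData->PrimaryNCr3;
        }

        KeInvalidateAllCaches();

        // Clear VMCB clean bit 4 (nested paging state) so the new NCr3 is reloaded.
        VpData->GuestVmcb.ControlArea.VmcbClean &= 0xFFFFFFEF;
        // TlbControl = 1 requests a full TLB flush on the next VMRUN.
        VpData->GuestVmcb.ControlArea.TlbControl = 1;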

Thanks in advance


Zero-Tang · 2021-02-22T00:36:12Z

Hi, I didn't have this issue, though I admit this feature could sometimes cause high CPU usage due to VM-Exits, so this feature should only be applied to rarely-invoked functions.

The screenshot is taken from a Windows 7 x64 VM in VMware Workstation 15.5.7.
"The printer is out of paper" is a result of NtSetInformationFile Inline Hook. From ARK tools like PCHunter and Win64AST, you won't see my hook on it.
From my perspective, I don't see anything suspicious in your debug print. However, I do think that VM-Exits for stealth inline hooks are frequent on AMD-V. You might want to remove the debug prints; as you probably know, debug prints freeze the debuggee.

tandasat · 2021-02-22T04:46:50Z

It could be due to a timer interrupt firing immediately after returning to the guest. The interrupt handler runs on the other (non-hooked) page, which leads to switching back to the primary table. Then, after the interrupt handler completes, the guest retries execution of the now non-executable hooked page, which switches to the secondary table again, resulting in an infinite loop.

This can happen when #VMEXIT handling takes too long. I too suggest removing the debug print if it is synchronous (e.g., DbgPrint).
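
The ping-pong described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not a model of real hardware: time is counted in abstract ticks, each #VMEXIT costs `exit_cost` ticks, a timer becomes pending every `timer_period` ticks, and the interrupt handler itself is treated as instantaneous. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { VIEW_PRIMARY, VIEW_SECONDARY } npt_view;

typedef struct {
    unsigned vmexits;   /* number of simulated nested page faults */
    bool reached_hook;  /* did the guest ever execute the hooked page? */
} sim_result;

/* Guest tries to execute the hooked page, starting in the primary view. */
static sim_result simulate_hook_entry(unsigned exit_cost,
                                      unsigned timer_period,
                                      unsigned max_exits)
{
    assert(timer_period > 0);
    npt_view view = VIEW_PRIMARY;
    unsigned since_timer = 0;  /* ticks since the last timer delivery */
    bool in_handler = false;   /* guest is in the timer handler (non-hooked page) */
    sim_result r = { 0, false };

    while (r.vmexits < max_exits) {
        /* On guest entry, a pending timer is delivered before anything else. */
        if (!in_handler && since_timer >= timer_period) {
            in_handler = true;
            since_timer = 0;
        }
        bool wants_hooked_page = !in_handler;
        npt_view needed = wants_hooked_page ? VIEW_SECONDARY : VIEW_PRIMARY;

        if (view == needed) {
            if (wants_hooked_page) {
                r.reached_hook = true;  /* forward progress at last */
                return r;
            }
            in_handler = false;         /* handler ran and returned */
            continue;
        }
        /* Execute fault in the active view: swap tables, pay the exit cost. */
        view = needed;
        r.vmexits++;
        since_timer += exit_cost;
    }
    return r;
}
```

In this model, `simulate_hook_entry(1, 10, 1000)` reaches the hooked page after a single exit, while `simulate_hook_entry(10, 10, 1000)` burns all 1000 exits without ever executing it, because a new timer is already pending every time the guest is resumed.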

Side note:
I also think KeInvalidateAllCaches() should be removed. It issues an IPI if there is more than one core, and an IPI (#INTR) is held pending while #VMEXIT is handled (GIF==0). So it could cause a deadlock if any other processor also issues an IPI within #VMEXIT at the same time (e.g., CPU#0 waits for CPU#1 to process the IPI while CPU#1 waits for CPU#0 to process the IPI, both in #VMEXIT and not receiving #INTR). Even if I am mistaken on this, it does not sound like you have caches that need to be flushed, since you are not changing the contents of physical memory, just the translation.
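
As a rough sketch of what the switch could look like without the cache invalidation, the helper below only touches VMCB fields. `CONTROL_AREA_SKETCH` and `SwitchNestedPageTable` are hypothetical stand-ins and do not reflect the real VMCB layout; the two constants follow my reading of the AMD64 Architecture Programmer's Manual (clean bit 4 covers nested paging state, TLB_CONTROL value 1 flushes the whole TLB) and should be checked against the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed from the AMD64 APM, Volume 2 (SVM); verify before relying on them. */
#define VMCB_CLEAN_NP          (1u << 4)  /* nested paging state, includes NCR3 */
#define TLB_CONTROL_FLUSH_ALL  0x01u      /* flush entire TLB on next VMRUN */

/* Hypothetical, trimmed-down control area; NOT the real VMCB layout. */
typedef struct {
    uint64_t NCr3;
    uint32_t VmcbClean;
    uint8_t  TlbControl;
} CONTROL_AREA_SKETCH;

/* Point the guest at another nested page table. No IPIs, no cache flush. */
static void SwitchNestedPageTable(CONTROL_AREA_SKETCH *ca, uint64_t new_ncr3)
{
    assert(ca != NULL);
    ca->NCr3 = new_ncr3;
    /* Mark the nested paging state dirty so the new NCr3 is reloaded. */
    ca->VmcbClean &= ~VMCB_CLEAN_NP;
    /* Drop stale translations cached under the previous table. */
    ca->TlbControl = TLB_CONTROL_FLUSH_ALL;
}
```

The mask `~VMCB_CLEAN_NP` is the same value as the literal `0xFFFFFFEF` used in the original snippet; naming it just makes the intent easier to read.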

ghost · 2021-02-23T02:33:01Z

@Zero-Tang @tandasat Thank you guys very much for the reply!
Tandasat, you were correct about the timer interrupt problem. I looked at the location the guest RIP was stuck at, and it was stuck here:

I have significantly sped up my vmexit handling, I limited my project to use 1 hook, and I removed all NT function calls in vmexit handler. I basically reduced VMEXIT all the way down to a few variable assignments, and this is the entire VMEXIT handler:


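void
HandleNestedPageFault(VPROCESSOR_DATA* VpData, GUEST_REGISTERS* GuestContext)
{
    NPF_EXITINFO1 ExitInfo1;

    // For #NPF, ExitInfo1 is the fault error code and ExitInfo2 is the
    // faulting guest physical address.
    ExitInfo1.AsUInt64 = VpData->GuestVmcb.ControlArea.ExitInfo1;

    ULONG64 FailAddress = VpData->GuestVmcb.ControlArea.ExitInfo2;

    PHYSICAL_ADDRESS NCr3;

    NCr3.QuadPart = VpData->GuestVmcb.ControlArea.NCr3;

    // Only instruction-fetch faults trigger a table switch.
    if (ExitInfo1.Fields.Execute == 1) {
        // nptHook is not declared in this snippet; presumably it is a global
        // describing the single remaining hook.
        if (nptHook->nptEntry->PageFrameNumber == FailAddress >> PAGE_SHIFT) {
            VpData->GuestVmcb.ControlArea.NCr3 = g_HvData->SecondaryNCr3;
        }
        else {
            VpData->GuestVmcb.ControlArea.NCr3 = g_HvData->PrimaryNCr3;
        }

        // Clear VMCB clean bit 4 (nested paging state) and request a full TLB flush.
        VpData->GuestVmcb.ControlArea.VmcbClean &= 0xFFFFFFEF;
        VpData->GuestVmcb.ControlArea.TlbControl = 1;
    }
}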

However, now it still appears to be stuck at that exception handler. I will try to gather more information.
I also found a similar issue opened which I think is relevant to this discussion:
tandasat/HyperPlatform#11

EDIT: I should also say that I am testing on 1 core

ghost · 2021-02-28T17:42:42Z

I finally fixed it. The issue was caused by an instruction being split across 2 pages.
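
For readers hitting the same loop: the thread does not say how the fix was implemented, but the failure condition itself is easy to express. The sketch below uses hypothetical helper names and assumes 4 KiB pages. It checks whether an instruction straddles a page boundary and whether either of the two tables described earlier can fetch all of its bytes. If exactly one of the two pages is the hooked one, neither table allows the whole instruction to be fetched, so the handler keeps switching forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT       12  /* assumes 4 KiB pages */
#define MAX_INSTRUCTION_LENGTH  15  /* architectural x86 limit */

/* True when the instruction's first and last bytes are on different pages. */
static bool SpansPageBoundary(uint64_t address, unsigned length)
{
    assert(length >= 1 && length <= MAX_INSTRUCTION_LENGTH);
    return (address >> SKETCH_PAGE_SHIFT) !=
           ((address + length - 1) >> SKETCH_PAGE_SHIFT);
}

/* True when at least one of the two tables can fetch the whole instruction.
   Primary executes everything except the hooked page; secondary executes
   only the hooked page. */
static bool FetchableInSomeView(uint64_t address, unsigned length,
                                uint64_t hooked_pfn)
{
    assert(length >= 1 && length <= MAX_INSTRUCTION_LENGTH);
    uint64_t first = address >> SKETCH_PAGE_SHIFT;
    uint64_t last = (address + length - 1) >> SKETCH_PAGE_SHIFT;

    bool primary_ok = (first != hooked_pfn) && (last != hooked_pfn);
    bool secondary_ok = (first == hooked_pfn) && (last == hooked_pfn);
    return primary_ok || secondary_ok;
}
```

One possible remedy, stated here as an assumption rather than as what was done in this thread, is to also mark the neighbouring page executable in the secondary table whenever such an instruction exists, so that the whole instruction can be fetched from one table.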

ghost closed this as completed Feb 28, 2021
