PVM guest kernel panic after restore from snapshot #2

zhuangel · 2024-03-11T03:44:38Z

Description

Cloud Hypervisor support take a snapshot for running virtual machine, then restore virtual machine from snapshot file. We try to test this feature with virtual machine running with PVM guest kernel on PVM host kernel.
After we restore the virtual machine from the snapshot file, the virtual machine panic with such exception.

[ 36.865628] kernel BUG at fs/buffer.c:1309!
[ 36.866930] invalid opcode: 0000 [#1] SMP PTI
[ 36.868247] CPU: 0 PID: 274 Comm: systemd-rc-loca Not tainted 6.7.0-rc6-virt-pvm-guest+ torvalds#55
[ 36.870556] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0
[ 36.872259] RIP: 0010:__find_get_block+0x1f2/0x2c0
[ 36.873659] Code: 5f c3 31 db e8 ff 7e 62 00 90 48 85 db 0f 84 6f fe ff ff 48 8b 7b 10 e8 9c 74 f3 ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 4c 89 ff e8 b4 59 f3 ff e9 ee fe ff ff 3e ff 43 60 e9 d2 fe
[ 36.878668] RSP: 0018:ffffd2000084fa78 EFLAGS: 00010046
[ 36.880195] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000108c48
[ 36.882268] RDX: 0000000000001000 RSI: 000000000000402c RDI: ffffc9800308e580
[ 36.884237] RBP: ffffc9800308e580 R08: 0000000000004021 R09: 0000000000105cfb
[ 36.886105] R10: ffffd2000084fab8 R11: 0000000000000000 R12: 0000000000000000
[ 36.887960] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 36.889867] FS: 00007f1f91c3f900(0000) GS:fffff0003ec00000(0000) knlGS:0000000000000000
[ 36.891976] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 36.893598] CR2: 000055c387859065 CR3: 00000000041c8003 CR4: 0000000000070ef0
[ 36.895722] Call Trace:
[ 36.896521]
[ 36.897192] ? __die_body+0x15/0x50
[ 36.898212] ? die+0x33/0x50
[ 36.899034] ? do_trap+0x100/0x110
[ 36.900007] ? __find_get_block+0x1f2/0x2c0
[ 36.901217] ? do_error_trap+0x65/0x80
[ 36.902302] ? __find_get_block+0x1f2/0x2c0
[ 36.903483] ? exc_invalid_op+0x49/0x60
[ 36.904621] ? __find_get_block+0x1f2/0x2c0
[ 36.905741] ? pvm_kernel_exception_entry+0x4b/0x100
[ 36.907046] ? __find_get_block+0x1f2/0x2c0
[ 36.908215] ? ext4_es_lookup_extent+0x101/0x150
[ 36.909450] ? __find_get_block+0xf/0x2c0
[ 36.910525] bdev_getblk+0x20/0x220
[ 36.911551] ext4_getblk+0xc2/0x2c0
[ 36.912612] ext4_bread_batch+0x4b/0x150
[ 36.913776] __ext4_find_entry+0x150/0x420
[ 36.914995] ? __d_alloc+0x11c/0x1c0
[ 36.916009] ? d_alloc_parallel+0xab/0x360
[ 36.917146] ext4_lookup+0x7d/0x1d0
[ 36.918119] __lookup_slow+0x8a/0x130
[ 36.919245] ? __legitimize_path.isra.46+0x27/0x60
[ 36.920616] walk_component+0x7e/0x160
[ 36.921686] path_lookupat.isra.53+0x62/0x130
[ 36.922943] filename_lookup.part.71+0xbe/0x180
[ 36.924297] ? strncpy_from_user+0x96/0x110
[ 36.925544] user_path_at_empty+0x4c/0x50
[ 36.926715] do_faccessat+0xf1/0x2f0
[ 36.927765] do_syscall_64+0x4d/0xf0
[ 36.928889] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 36.930371] RIP: 0033:0x7f1f9263a94b
[ 36.931399] Code: 77 05 c3 0f 1f 40 00 48 8b 15 e1 54 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa b8 15 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 b1 54 10 00 f7 d8
[ 36.936338] RSP: 002b:00007ffd6caff888 EFLAGS: 00000202 ORIG_RAX: 0000000000000015
[ 36.938320] RAX: ffffffffffffffda RBX: 000055c387859065 RCX: 00007f1f9263a94b
[ 36.940191] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000055c387859065
[ 36.942067] RBP: 00007ffd6caff9b8 R08: 000055c3897c54d0 R09: 00000000ffffffff
[ 36.944034] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffd6caff9b8
[ 36.946046] R13: 000055c387858240 R14: 000055c38785ad18 R15: 00007f1f92a48040
[ 36.948088]
[ 36.948759] ---[ end trace 0000000000000000 ]---
[ 36.950000] RIP: 0010:__find_get_block+0x1f2/0x2c0
[ 36.951291] Code: 5f c3 31 db e8 ff 7e 62 00 90 48 85 db 0f 84 6f fe ff ff 48 8b 7b 10 e8 9c 74 f3 ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 4c 89 ff e8 b4 59 f3 ff e9 ee fe ff ff 3e ff 43 60 e9 d2 fe
[ 36.956169] RSP: 0018:ffffd2000084fa78 EFLAGS: 00010046
[ 36.957689] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000108c48
[ 36.959772] RDX: 0000000000001000 RSI: 000000000000402c RDI: ffffc9800308e580
[ 36.961827] RBP: ffffc9800308e580 R08: 0000000000004021 R09: 0000000000105cfb
[ 36.963903] R10: ffffd2000084fab8 R11: 0000000000000000 R12: 0000000000000000
[ 36.965982] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 36.968078] FS: 00007f1f91c3f900(0000) GS:fffff0003ec00000(0000) knlGS:0000000000000000
[ 36.970375] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 36.972021] CR2: 000055c387859065 CR3: 00000000041c8003 CR4: 0000000000070ef0
[ 36.974080] note: systemd-rc-loca[274] exited with irqs disabled
[ 36.999660] ------------[ cut here ]------------
[ 37.000909] kernel BUG at fs/buffer.c:1309!
[ 37.002140] invalid opcode: 0000 [#2] SMP PTI
[ 37.003425] CPU: 0 PID: 276 Comm: systemd-system- Tainted: G D 6.7.0-rc6-virt-pvm-guest+ torvalds#55
[ 37.006174] Hardware name: Cloud Hypervisor cloud-hypervisor, BIOS 0
[ 37.008054] RIP: 0010:__find_get_block+0x1f2/0x2c0
[ 37.009464] Code: 5f c3 31 db e8 ff 7e 62 00 90 48 85 db 0f 84 6f fe ff ff 48 8b 7b 10 e8 9c 74 f3 ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 4c 89 ff e8 b4 59 f3 ff e9 ee fe ff ff 3e ff 43 60 e9 d2 fe
[ 37.014586] RSP: 0018:ffffd2000085fa78 EFLAGS: 00010046
[ 37.016040] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000108c48
[ 37.018085] RDX: 0000000000001000 RSI: 0000000000004021 RDI: ffffc9800308e580
[ 37.020095] RBP: ffffc9800308e580 R08: 0000000000004021 R09: 0000000000105cfb
[ 37.022142] R10: ffffc98006ccc6c0 R11: 0000000000000005 R12: 0000000000000000
[ 37.024102] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 37.026007] FS: 00007faf42db1900(0000) GS:fffff0003ec00000(0000) knlGS:0000000000000000
[ 37.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.029963] CR2: 000055a2ec223000 CR3: 00000000070c4001 CR4: 0000000000070ef0
[ 37.032008] Call Trace:
[ 37.032772]
[ 37.033443] ? __die_body+0x15/0x50
[ 37.034477] ? die+0x33/0x50
[ 37.035376] ? do_trap+0x100/0x110
[ 37.036339] ? __find_get_block+0x1f2/0x2c0
[ 37.037559] ? do_error_trap+0x65/0x80
[ 37.038631] ? __find_get_block+0x1f2/0x2c0
[ 37.039871] ? exc_invalid_op+0x49/0x60
[ 37.040949] ? __find_get_block+0x1f2/0x2c0
[ 37.042109] ? pvm_kernel_exception_entry+0x4b/0x100
[ 37.043444] ? __find_get_block+0x1f2/0x2c0
[ 37.044622] ? ext4_es_lookup_extent+0x101/0x150
[ 37.046005] ? __find_get_block+0xf/0x2c0
[ 37.047195] bdev_getblk+0x20/0x220
[ 37.048154] ext4_getblk+0xc2/0x2c0
[ 37.049200] ext4_bread_batch+0x4b/0x150
[ 37.050279] __ext4_find_entry+0x150/0x420
[ 37.051403] ? __d_alloc+0x11c/0x1c0
[ 37.052397] ? d_alloc_parallel+0xab/0x360
[ 37.053540] ext4_lookup+0x7d/0x1d0
[ 37.054579] __lookup_slow+0x8a/0x130
[ 37.055688] ? __legitimize_path.isra.46+0x27/0x60
[ 37.057056] walk_component+0x7e/0x160
[ 37.058181] path_lookupat.isra.53+0x62/0x130
[ 37.059414] filename_lookup.part.71+0xbe/0x180
[ 37.060693] ? strncpy_from_user+0x96/0x110
[ 37.061889] user_path_at_empty+0x4c/0x50
[ 37.063081] do_faccessat+0xf1/0x2f0
[ 37.064191] do_syscall_64+0x4d/0xf0
[ 37.065208] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 37.066659] RIP: 0033:0x7faf437acaf4
[ 37.067715] Code: 89 cd 41 54 41 89 d4 55 53 48 81 ec a8 00 00 00 64 48 8b 04 25 28 00 00 00 48 89 84 24 98 00 00 00 31 c0 b8 b7 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 40 01 00 00 41 89 c0 85 c0 0f 84 15 01 00
[ 37.072853] RSP: 002b:00007ffe493719a0 EFLAGS: 00000246 ORIG_RAX: 00000000000001b7
[ 37.074871] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007faf437acaf4
[ 37.076815] RDX: 0000000000000000 RSI: 000055a2ec223000 RDI: 00000000ffffff9c
[ 37.078732] RBP: 00007ffe49371aa0 R08: 000055a2ede694d0 R09: 00000000ffffffff
[ 37.080761] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000000
[ 37.082720] R13: 0000000000000100 R14: 000055a2ec224d08 R15: 00007faf43bba040
[ 37.084608]
[ 37.085240] ---[ end trace 0000000000000000 ]---
[ 37.086555] RIP: 0010:__find_get_block+0x1f2/0x2c0
[ 37.087948] Code: 5f c3 31 db e8 ff 7e 62 00 90 48 85 db 0f 84 6f fe ff ff 48 8b 7b 10 e8 9c 74 f3 ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 4c 89 ff e8 b4 59 f3 ff e9 ee fe ff ff 3e ff 43 60 e9 d2 fe
[ 37.093082] RSP: 0018:ffffd2000084fa78 EFLAGS: 00010046
[ 37.094588] RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000108c48
[ 37.096597] RDX: 0000000000001000 RSI: 000000000000402c RDI: ffffc9800308e580
[ 37.098620] RBP: ffffc9800308e580 R08: 0000000000004021 R09: 0000000000105cfb
[ 37.100562] R10: ffffd2000084fab8 R11: 0000000000000000 R12: 0000000000000000
[ 37.102429] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 37.104331] FS: 00007faf42db1900(0000) GS:fffff0003ec00000(0000) knlGS:0000000000000000
[ 37.106629] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.108329] CR2: 000055a2ec223000 CR3: 00000000070c4001 CR4: 0000000000070ef0
[ 37.110371] note: systemd-system-[276] exited with irqs disabled

Step to reproduce

Build PVM host kernel and PVM guest kernel follow pvm-get-started-with-kata.md
Guest VM resource from Guide
cloud-hypervisor v37
VM image from Guide
Start VM and Create snapshot
#Start VM
cloud-hypervisor.v37 \
--api-socket ch.sock \
--log-file vmm.log \
--cpus boot=1,max_phys_bits=43 \
--kernel vmlinux.virt-pvm-guest \
--cmdline 'console=ttyS0 root=/dev/vda1 rw clocksource=kvm-clock pti=off' \
--memory size=1G,hugepages=off,shared=false,prefault=off \
--disk id=disk_0,path=ubuntu-22.04-pvm-kata.raw \
-v --console off --serial tty

#Pause VM in another shell
curl --unix-socket ch.sock -i -X PUT 'http://localhost/api/v1/vm.pause'

#Generate snapshot
curl --unix-socket $path -i \
-X PUT 'http://localhost/api/v1/vm.snapshot' \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"destination_url":"file:///tmp/snapshot"
}'

#Shutdown the paused VM
curl --unix-socket ch.sock -i -X PUT 'http://localhost/api/v1/vmm.shutdown'
Restore VM
#Restore VM from snapshot
cloud-hypervisor.v37 \
--api-socket ch.sock \
--restore source_url=file:///tmp/snapshot \
-v --log-file vmm.log

#Resume the VM in another shell
curl --unix-socket ch.sock -i -X PUT 'http://localhost/api/v1/vm.pause'

The text was updated successfully, but these errors were encountered:

…onally During the VM restore process, if the VMM (e.g., Cloud Hypervisor) restores MSRs before adding the user memory region, it can result in a failure in kvm_gpc_activate() because no memslot has been added yet. As a consequence, the VM will panic after the VM restore since the GPC is not active. However, if we store the value even if kvm_gpc_activate() fails later when the GPC is active, it can be refreshed by the addition of the user memory region before the VM entry. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: #2

bysui · 2024-04-02T12:40:47Z

Thank you for your report! I apologize for the delayed response. I am currently working on resolving a similar migration issue in QEMU.

Based on the vmm.log, it appears that the restoration of MSR_PVM_VCPU_STRUCT has encountered a failure.

cloud-hypervisor: 834.108417ms: <vmm> INFO:arch/src/x86_64/mod.rs:579 -- Running under nested virtualisation. Hypervisor string: KVMKVMKVM
cloud-hypervisor: 834.127740ms: <vmm> INFO:arch/src/x86_64/mod.rs:585 -- Generating guest CPUID for with physical address size: 43
cloud-hypervisor: 834.202578ms: <vmm> INFO:vmm/src/cpu.rs:833 -- Request to create new vCPUs: desired = 1, max = 1, allocated = 0, present = 0
cloud-hypervisor: 834.232237ms: <vmm> INFO:vmm/src/cpu.rs:758 -- Creating vCPU: cpu_id = 0
cloud-hypervisor: 835.112786ms: <vmm> WARN:hypervisor/src/kvm/mod.rs:2074 -- Detected faulty MSR 0x4b564d02 while setting MSRs
cloud-hypervisor: 835.156128ms: <vmm> WARN:hypervisor/src/kvm/mod.rs:2074 -- Detected faulty MSR 0x4b564d04 while setting MSRs
cloud-hypervisor: 835.181187ms: <vmm> WARN:hypervisor/src/kvm/mod.rs:2074 -- Detected faulty MSR 0x4b564df1 while setting MSRs
cloud-hypervisor: 835.289422ms: <vmm> INFO:vmm/src/pci_segment.rs:104 -- Adding PCI segment: id=0, PCI MMIO config address: 0xe8000000, mem32 area [0xc0000000-0xe7ffffff, mem64 area [0x100000000-0x7feffffffff

The MSR_PVM_VCPU_STRUCT is crucial for PVM as it stores the address of PVCS (PVM VCPU Structure). The reason for the restoration failure is that kvm_gpc_activate() failed due to the absence of a memslot during the MSR setting process. It seems that Cloud Hypervisor attempts to restore guest MSRs before restoring guest memory. In fact, there are other MSRs associated with guest memory that also fail to restore. To resolve this issue, the following fix can be implemented:

diff --git a/arch/x86/kvm/pvm/pvm.c b/arch/x86/kvm/pvm/pvm.c
index d76f731d0b0d..f71290816e5f 100644
--- a/arch/x86/kvm/pvm/pvm.c
+++ b/arch/x86/kvm/pvm/pvm.c
@@ -1149,12 +1149,21 @@ static int pvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
        case MSR_PVM_VCPU_STRUCT:
                if (!PAGE_ALIGNED(data))
                        return 1;
+               /*
+                * During the VM restore process, if the VMM restores MSRs
+                * before adding the user memory region, it can result in a
+                * failure in kvm_gpc_activate() because no memslot has been
+                * added yet. As a consequence, the VM will panic after the VM
+                * restore since the GPC is not active. However, if we store
+                * the value even if kvm_gpc_activate() fails later when the
+                * GPC is active, it can be refreshed by the addition of the
+                * user memory region before the VM entry.
+                */
+               pvm->msr_vcpu_struct = data;
                if (!data)
                        kvm_gpc_deactivate(&pvm->pvcs_gpc);
                else if (kvm_gpc_activate(&pvm->pvcs_gpc, data, PAGE_SIZE))
                        return 1;
-
-               pvm->msr_vcpu_struct = data;
                break;
        case MSR_PVM_SUPERVISOR_RSP:

Similar to the failed MSR_KVM_ASYNC_PF_EN, record the value before the kvm_gpc_activate() call. Then, when the memslot is created, the GPC can be refreshed with the recorded value.

However, I suspect that this issue may be specific to Cloud Hypervisor rather than KVM. In the case of the failed MSR_KVM_PV_EOI_EN, it does not record the value before the kvm_gfn_to_hva_cache_init() call, potentially leading to a broken feature.

pojntfx · 2024-04-07T02:04:51Z

We've gotten snapshot/resume to work in the @loopholelabs Firecracker fork with a simple patch that includes the PVM MSRs in the snapshot - should be the same procedures for other hypervisors, esp. Cloud Hypervisor: loopholelabs/firecracker@e266c12

bysui · 2024-04-08T11:43:21Z

Hi @pojntfx, thank you for trying and testing PVM with Firecracker. Yes, we encountered the same issue with QEMU snapshots. It doesn't save or restore PVM MSRs. However, Cloud Hypervisor does the right thing by using the supported MSRs index acquired from the KVM to perform MSRs save/restore. I believe this is the correct approach for VMMs, utilizing the information provided by the KVM.

zhuangel · 2024-04-10T08:19:55Z

@bysui Thanks for the fix!

I have verified the issue on my test environment, after update host kernel code with pvm-fix branch, I could restore the VM successfully, sorry for missed to check the failure message in cloud-hypervisor log file :).

…onally During the VM restore process, if the VMM (e.g., Cloud Hypervisor) restores MSRs before adding the user memory region, it can result in a failure in kvm_gpc_activate() because no memslot has been added yet. As a consequence, the VM will panic after the VM restore since the GPC is not active. However, if we store the value even if kvm_gpc_activate() fails later when the GPC is active, it can be refreshed by the addition of the user memory region before the VM entry. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Link: #2

pojntfx · 2024-04-17T00:36:34Z

I just re-tested this issue with 6a03a61 on Firecracker. Without our patch, Firecracker still fails with the same error (since it doesn't store the MSRs). Would the clean fix for this be to port Cloud Hypervisor's MSR handling logic to Firecracker instead of hard-coding these additional MSRs to save like we currently do?

bysui · 2024-04-17T06:08:03Z

I just re-tested this issue with 6a03a61 on Firecracker. Without our patch, Firecracker still fails with the same error (since it doesn't store the MSRs). Would the clean fix for this be to port Cloud Hypervisor's MSR handling logic to Firecracker instead of hard-coding these additional MSRs to save like we currently do?

Yes, I believe this is a long-term and reasonable approach in case there are new MSRs added in the future. It is the responsibility of the VMM to use the information provided by KVM to do the right thing.

zhuangel changed the title ~~Guest kernel Panic after restore from snapshot~~ PVM guest kernel panic after restore from snapshot Mar 13, 2024

bysui added the bug Something isn't working label Apr 2, 2024

bysui self-assigned this Apr 2, 2024

bysui mentioned this issue Apr 10, 2024

Question: what's the state and next step for this new feature? #5

Closed

zhuangel closed this as completed Apr 11, 2024

bysui mentioned this issue Apr 26, 2024

PVM host kernel panic after restore from snapshot in Cloud Hypervisor on EC2 #7

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PVM guest kernel panic after restore from snapshot #2

PVM guest kernel panic after restore from snapshot #2

zhuangel commented Mar 11, 2024

bysui commented Apr 2, 2024

pojntfx commented Apr 7, 2024 •

edited

Loading

bysui commented Apr 8, 2024

zhuangel commented Apr 10, 2024

pojntfx commented Apr 17, 2024

bysui commented Apr 17, 2024

PVM guest kernel panic after restore from snapshot #2

PVM guest kernel panic after restore from snapshot #2

Comments

zhuangel commented Mar 11, 2024

Description

Step to reproduce

bysui commented Apr 2, 2024

pojntfx commented Apr 7, 2024 • edited Loading

bysui commented Apr 8, 2024

zhuangel commented Apr 10, 2024

pojntfx commented Apr 17, 2024

bysui commented Apr 17, 2024

pojntfx commented Apr 7, 2024 •

edited

Loading