
ACS is not able to restart VM during HA process #10435

Open
GerorgeEG opened this issue Feb 20, 2025 · 5 comments

Comments

@GerorgeEG

Problem

We are testing HA by shutting down the KVM hypervisor through the BMC. The host status changed to Down, and ACS tried to start the VM on another host, but the start failed.

Versions

ACS Version : 4.19.1.2

KVM : RHEL 8

Storage : NFS v3

The steps to reproduce the bug

  1. Ensure both the host and the VM are HA-enabled.
  2. Shut down the KVM hypervisor through the BMC.
  3. Wait for the status of the host to change to Down (the host and VM states can be confirmed as shown below).
  4. ACS tries to start the VM, but the start fails.
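As a sanity check while reproducing this, the host and VM states can be queried with the cloudmonkey CLI (cmk). This is a minimal sketch, assuming cmk is already configured against the management server; the VM UUID below is the one from the logs in this report.

# Confirm the failed host is reported as Down
cmk list hosts type=Routing filter=name,state,resourcestate

# Confirm the VM state and which host (if any) it landed on
cmk list virtualmachines id=130c856a-d5e4-4745-9a6a-c41c2508573a filter=name,state,hostname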

Below is the log:

2025-02-18 02:33:48,554 DEBUG [c.c.c.CapacityManagerImpl] (Work-Job-Executor-3:ctx-dae42fc9 job-17413/job-17446 ctx-06b6ee00) (logid:6675dfa0) VM instance {"id":541,"instanceName":"i-19-541-VM","type":"User","uuid":"130c856a-d5e4-4745-9a6a-c41c2508573a"} state transited from [Starting] to [Stopped] with event [OperationFailed]. VM's original host: Host {"id":85,"name":" host1.xx.xxx.xxx ","type":"Routing","uuid":"804bcf95-e073-462e-810a-aa64e85c78bd"}, new host: null, host before state transition: Host {"id":127,"name":"host2.xx.xxx.xxx","type":"Routing","uuid":"a9698e0c-9c63-4392-ae28-b7dbdceffd9d"}

2025-02-18 02:33:48,580 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-3:ctx-dae42fc9 job-17413/job-17446 ctx-06b6ee00) (logid:6675dfa0) Invocation exception, caused by: com.cloud.exception.InsufficientServerCapacityException: Unable to create a deployment for VM instance {"id":541,"instanceName":"i-19-541-VM","type":"User","uuid":"130c856a-d5e4-4745-9a6a-c41c2508573a"}Scope=interface com.cloud.dc.DataCenter; id=1

2025-02-18 02:33:48,580 INFO [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-3:ctx-dae42fc9 job-17413/job-17446 ctx-06b6ee00) (logid:6675dfa0) Rethrow exception com.cloud.exception.InsufficientServerCapacityException: Unable to create a deployment for VM instance {"id":541,"instanceName":"i-19-541-VM","type":"User","uuid":"130c856a-d5e4-4745-9a6a-c41c2508573a"}Scope=interface com.cloud.dc.DataCenter; id=1

com.cloud.exception.InsufficientServerCapacityException: Unable to create a deployment for VM instance {"id":541,"instanceName":"i-19-541-VM","type":"User","uuid":"130c856a-d5e4-4745-9a6a-c41c2508573a"}Scope=interface com.cloud.dc.DataCenter; id=1

2025-02-18 02:33:48,639 WARN [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-10bdb53f work-1129) (logid:2a0083a1) Unable to restart VM instance {"id":541,"instanceName":"i-19-541-VM","type":"User","uuid":"130c856a-d5e4-4745-9a6a-c41c2508573a"} due to Unable to create a deployment for VM instance {"id":541,"instanceName":"i-19-541-VM","type":"User","uuid":"130c856a-d5e4-4745-9a6a-c41c2508573a"}

What to do about it?

We need the HA functionality to make sure a VM gets restarted whenever its KVM host goes down for any reason.


boring-cyborg bot commented Feb 20, 2025

Thanks for opening your first issue here! Be sure to follow the issue template!

@shwstppr
Contributor

@GerorgeEG you seem to be getting an InsufficientServerCapacityException when the VM is being started on a different host. Do you have free compute capacity (in the same cluster, if using cluster-scope storage)?
Check whether the issue is related to the host/storage tags in use.
You may also have to check the logs around this InsufficientServerCapacityException.
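To trace why the deployment planner rejected every host, the management-server log around the failing job can be searched. A minimal sketch, assuming the default CloudStack log location and the job ID from the excerpt above:

# Pull everything logged for the failing work job, then look for the allocator's reasoning
grep "job-17446" /var/log/cloudstack/management/management-server.log | grep -iE "allocator|deploy|capacity|avoid"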

@GerorgeEG
Author

Hi @shwstppr, yes, the capacity is there, and we are using storage tags to place VMs on the correct storage type, but we figured out something else.

VMs that were down due to the host failure could not be powered on, even by an admin, on other hosts in the same cluster. We did some more log analysis and found out it is an issue with a storage lock.

org.libvirt.LibvirtException: internal error: process exited while connecting to monitor: 2025-02-22T06:42:45.557175Z qemu-kvm: -blockdev {"node-name":"libvirt-3-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-3-storage","backing":"libvirt-4-format"}: Failed to get "write" lock.

We remounted the NFS share with the options below, and then HA started working as expected.

NFS mount options:
rw,vers=3,sync,hard,_netdev,nolock
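For reference, a full remount with those options would look roughly like the following. This is a sketch with a hypothetical export path and mount point; substitute the actual NFS server, export, and the mount point CloudStack uses for the primary storage pool.

# Hypothetical paths: replace nfs-server:/export/primary and /mnt/primary with your own
umount /mnt/primary
mount -t nfs -o rw,vers=3,sync,hard,_netdev,nolock nfs-server:/export/primary /mnt/primary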

Now, can you please help? We are not sure if we should continue with the nolock option or if there is a better approach. Looking forward to suggestions on the best way forward.

@shwstppr
Contributor

@GerorgeEG my understanding of NFS configuration is limited. Maybe others can help. cc @rohityadavcloud @NuxRo @andrijapanicsb

@andrijapanicsb
Contributor

@GerorgeEG that is not enough log/info to understand what kind of write lock could not be obtained. Is it a write lock on the NFS share (which sounds like a system-wide issue), or is it that QEMU could not get a write lock on the qcow2 file?

In recent versions of KVM/QEMU, a lock is placed on the qcow2 file, so you can't even run a basic qemu-img info command without passing the -U (--force-share) parameter to force access to the qcow2. QEMU puts the lock there to protect against another host mounting/using the same qcow2 disk, which makes sense, but you might be hitting some bug where the lock somehow stays on the qcow2, so other hosts cannot access the qcow2 file.
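A quick way to check this on the new host is to inspect the disk directly. A minimal sketch, with a hypothetical UUID-named qcow2 path on the NFS primary storage mount:

# Without -U this fails with a lock error if another QEMU process holds the image open
qemu-img info /mnt/primary/<volume-uuid>.qcow2

# -U / --force-share bypasses the lock for read-only inspection
qemu-img info -U /mnt/primary/<volume-uuid>.qcow2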

If you enable debug logging on a specific host and then try to start the "crashed" VM on that host specifically (I think ACS allows you to do that in the most recent releases), you might be able to see more details logged.
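A minimal sketch of turning up agent logging on a KVM host, assuming the default agent paths (verify them on your install, as the logging config format has changed between releases):

# Switch the agent's log threshold from INFO to DEBUG, then restart the agent
sed -i 's/INFO/DEBUG/g' /etc/cloudstack/agent/log4j-cloud.xml
systemctl restart cloudstack-agent

# Watch the agent log while retrying the VM start on this host
tail -f /var/log/cloudstack/agent/agent.log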
