With the default configuration, CloudStack takes 15-20 minutes to determine that a KVM host is down, and HA-enabled instances are started on another host only after that determination. While reviewing where the time goes in the host state investigation that follows the ping timeout, I found that a single command, 'com.cloud.agent.api.CheckOnHostCommand', accounts for about 10 minutes: it is sent at 06:22:30 and times out at 06:32:14, logging 'timed out after 3600'. After that, the host is quickly determined to be down via the neighbouring host.
I suspect an issue in this specific implementation. If it is fixed, the VM HA delay on KVM could be reduced by about 10 minutes.
2025-01-28 06:22:30,041 DEBUG [c.c.a.t.Request] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Sending { Cmd , MgmtId: 32988184186020, via: 2(ref-trl-5786-k-Mu22-jithin-raju-kvm2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"439751ba-a6eb-3103-b60d-8321f53224fb-LibvirtComputingResource","privateNetwork":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"},"storageNetwork1":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"}},"reportCheckFailureIfOneStorageIsDown":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2025-01-28 06:32:14,792 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Waiting some more time because this is the current command
2025-01-28 06:32:14,792 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Waiting some more time because this is the current command
2025-01-28 06:32:14,792 WARN [c.c.a.m.AgentAttache] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Timed out on Seq 2-4979573812988215360: { Cmd , MgmtId: 32988184186020, via: 2(ref-trl-5786-k-Mu22-jithin-raju-kvm2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"439751ba-a6eb-3103-b60d-8321f53224fb-LibvirtComputingResource","privateNetwork":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"},"storageNetwork1":{"ip":"10.1.33.180","netmask":"255.255.240.0","mac":"1e:00:a9:00:0a:72","isSecurityGroupEnabled":"false"}},"reportCheckFailureIfOneStorageIsDown":"false","wait":"0","bypassHostMaintenance":"false"}}] }
2025-01-28 06:32:14,793 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Seq 2-4979573812988215360: Cancelling.
2025-01-28 06:32:14,793 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-4:ctx-1ece8add) (logid:49d74f9a) Operation timed out: Commands 4979573812988215360 to Host 2 timed out after 3600
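To make the "10 minutes" figure concrete, here is a minimal Python sketch that computes the gap between the "Sending" line and the final "Operation timed out" line. The two timestamps are copied from the log excerpt above; the variable names and the format string are only illustrative assumptions, not part of CloudStack.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt: "YYYY-MM-DD HH:MM:SS,mmm".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Copied from the "Sending" line and the "Operation timed out" line.
sent_at = datetime.strptime("2025-01-28 06:22:30,041", LOG_TIME_FORMAT)
timed_out_at = datetime.strptime("2025-01-28 06:32:14,793", LOG_TIME_FORMAT)

# Elapsed wall-clock time spent waiting on CheckOnHostCommand.
elapsed = (timed_out_at - sent_at).total_seconds()
print(f"CheckOnHostCommand waited {elapsed:.3f} s ({elapsed / 60:.1f} min)")
```

This gives roughly 585 seconds, i.e. just under 10 minutes. That is well short of 3600 seconds, so the "3600" in the log message does not match the elapsed time if it is read as seconds. The log itself does not state the unit.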
problem
https://gist.github.com/rajujith/9a51c52163eb4862b497057a40e8b812#file-acs-kvm-vm-ha-host-down
versions
4.19.1.3
The steps to reproduce the bug
...
What to do about it?
Reduce the delay in the VM HA on KVM.