Description
A VM live migration succeeds, but ONE receives exit code "1" from the migrate script and assumes the migration has not succeeded, while in practice the VM has been live migrated successfully (only the SYNC_TIME part failed).
To Reproduce
Unsure what conditions lead up to this bug. We only have this behavior on one cluster. The error message is:
Jun 26 10:43:38 oned3 oned[481791]: [VM 0][Z0][VMM][I]: Failed to execute virtualization driver operation: migrate.
Jun 26 10:43:38 oned3 oned[481791]: [Z0][VMM][D]: Message received: MIGRATE FAILURE 0 virsh --connect qemu:///system migrate --live DOMAIN-ID qemu+ssh://some_host/system (9.769216494s) Error mirgating VM DOMAIN-ID to host some_host: undefined method `upcase' for nil:NilClass ["/var/tmp/one/vmm/kvm/migrate:234:in `<main>'"] ExitCode: 1
Jun 26 10:43:38 oned3 oned[481791]: [VM 0][Z0][VMM][E]: MIGRATE: virsh --connect qemu:///system migrate --live DOMAIN-ID qemu+ssh://some_host/system (9.769216494s) Error mirgating VM DOMAIN-ID to host some_host: undefined method `upcase' for nil:NilClass ["/var/tmp/one/vmm/kvm/migrate:234:in `<main>'"] ExitCode: 1
The piece of code that fails:
/var/lib/one/remotes/vmm/kvm/migrate
    # Sync guest time
    if ENV['SYNC_TIME'].upcase == 'YES'
        cmds =<<~EOS
            (
                for I in $(seq 4 -1 1); do
                    if #{virsh} --readonly dominfo #{@deploy_id}; then
                        #{virsh} domtime --sync #{@deploy_id} && exit
                        [ "\$I" -gt 1 ] && sleep 5
                    else
                        exit
                    fi
                done
            ) &>/dev/null &
        EOS
It turns out that the virsh command does not return a list of domains, so the result is empty, and that error is not handled properly (the traceback shows ENV['SYNC_TIME'] is nil, so calling upcase on it raises NoMethodError). Ideally the piece of code above gets fixed so it checks whether the hash is empty; if it is empty, a clear error message should be returned: unable to obtain $domain with the virsh command.
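For illustration, a minimal nil-safe sketch (my own assumption about a possible fix, not the actual upstream patch) that avoids the NoMethodError when SYNC_TIME is missing from the environment:

    # Hypothetical nil-safe variant of the SYNC_TIME check in vmm/kvm/migrate.
    # ENV['SYNC_TIME'] can be nil when the kvmrc variables are not exported,
    # so default to an empty string before calling upcase.
    sync_time = ENV['SYNC_TIME'].to_s.upcase

    if sync_time == 'YES'
        # run the existing domtime --sync loop here
    elsif sync_time.empty?
        STDERR.puts 'SYNC_TIME is not set (kvmrc not sourced?), skipping guest time sync'
    end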
Ideally ONE should also not assume that the VM is still running on the source host, as it currently does:
Jun 27 15:55:18 oned1 oned[1434842]: [VM 0][Z0][LCM][I]: Fail to live migrate VM. Assuming that the VM is still RUNNING.
It will then detect that the VM is in poweroff state, and the hypervisor the VM actually runs on will detect a zombie. Ideally this is handled better: do some extra checks to see where the VM is running.
I can reproduce the behavior of virsh not listing any domains by executing the ssh command without LIBVIRT_URI exported. This sounds like it could be related to this bug: LIBVIRT_URI.
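For reference, this is roughly how the two cases can be compared (a reproduction sketch with placeholder host and URI values, not part of the driver):

    # Reproduction sketch: run `virsh list` over ssh with and without an explicit
    # connection URI. Without it, virsh may default to qemu:///session for a
    # non-root user and list no domains at all.
    require 'open3'

    host = 'some_host'        # placeholder hypervisor host
    uri  = 'qemu:///system'   # the LIBVIRT_URI normally set in kvmrc

    without_uri, = Open3.capture2e('ssh', host, 'virsh list --name')
    with_uri,    = Open3.capture2e('ssh', host, "virsh --connect #{uri} list --name")

    puts "without LIBVIRT_URI:\n#{without_uri}"
    puts "with explicit URI:\n#{with_uri}"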
Strangely enough, I do not hit this issue when I explicitly export SYNC_TIME, e.g. export SYNC_TIME=yes. It does not matter what value SYNC_TIME is set to: no, yes, or even broken will then not trigger the issue. So maybe, if this is indeed the cause, the kvmrc ENV vars are somehow read when SYNC_TIME is exported (and otherwise not?).
Expected behavior
I expect ONE to handle the error gracefully instead of crashing on an unhandled exception in the script, and to double-check where a live-migrated VM ended up running when a non-zero exit code is returned (instead of assuming the VM keeps running on the source hypervisor).
Details
Hypervisor: KVM
Version: 6.8.3
Additional context
There seems to be a specific pre-condition that has to be true to hit this bug as we see it only happen on one specific (dedicated) cloud but as of now it is unclear what this is. I can reproduce this issue so if further debug information has to be gathered, please let me know.
Ubuntu 22.04 on the ONE front-end and the hypervisors.
Progress Status
Code committed
Testing - QA
Documentation (Release notes - resolved issues, compatibility, known issues)
So, regarding the state update: the monitor process will update the VM state to poweroff if the VM is not running (libvirt should keep it running). In the same way, if the migration fails, the VM will be running on the source host.
Yes, I understand how this is currently handled, and on the destination host a "zombie" VM will be detected. This still leaves a time window for error (as long as storage fencing is not implemented). I think this behavior could be improved to catch failure scenarios like this one: ONE could perform a bit of extra checking on the destination host, double-check whether the VM is there, and raise an error with a helpful message instead. A rough sketch follows below. In this particular case it would not have helped, as virsh wasn't able to detect any running domains.
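As an illustration only (placeholder host and domain names; an assumption on my part, not the actual ONE driver code), such a post-failure check could look roughly like this:

    # Hypothetical post-failure check: after a failed live migration, ask both
    # hypervisors whether the domain is known, instead of assuming it still
    # runs on the source host.
    require 'open3'

    def domain_present?(host, deploy_id)
        # --readonly dominfo exits 0 only if libvirt knows the domain on that host
        _out, status = Open3.capture2e(
            'ssh', host, "virsh --connect qemu:///system --readonly dominfo #{deploy_id}"
        )
        status.success?
    end

    src = 'source_host'   # placeholders
    dst = 'some_host'
    dom = 'DOMAIN-ID'

    on_src = domain_present?(src, dom)
    on_dst = domain_present?(dst, dom)

    if on_dst && !on_src
        puts "Migration actually succeeded: #{dom} only exists on #{dst}"
    elsif on_src && !on_dst
        puts "#{dom} is still on the source host #{src}, the RUNNING assumption is safe"
    else
        abort "Inconsistent state for #{dom} (source: #{on_src}, destination: #{on_dst})"
    end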
For my understanding: this issue has been closed. Was this already fixed in another PR for the 6.10 release? To be clear: the main issues here are: 1) Hitting the live-migration issue (ONE unable to detect running domains), 2) graceful handling of this error by the code (in case no domains are detected).