ONE incorrectly handles "failed" live migration #6634

hydro-b · 2024-06-27T14:20:10Z

Description
A VM live migration succeeds, but ONE receives exit code 1 from the migrate script and assumes the migration has failed. In practice the VM has been live migrated successfully; only the SYNC_TIME part failed.

To Reproduce

Unsure what conditions lead up to this bug. We only have this behavior on one cluster. The error message is:

Jun 26 10:43:38 oned3 oned[481791]: [VM 0][Z0][VMM][I]: Failed to execute virtualization driver operation: migrate.
Jun 26 10:43:38 oned3 oned[481791]: [Z0][VMM][D]: Message received: MIGRATE FAILURE 0 virsh --connect qemu:///system migrate --live  DOMAIN-ID qemu+ssh://some_host/system (9.769216494s) Error mirgating VM DOMAIN-ID to host some_host: undefined method `upcase' for nil:NilClass ["/var/tmp/one/vmm/kvm/migrate:234:in `<main>'"] ExitCode: 1
Jun 26 10:43:38 oned3 oned[481791]: [VM 0][Z0][VMM][E]: MIGRATE: virsh --connect qemu:///system migrate --live  DOMAIN-ID qemu+ssh://some_host/system (9.769216494s) Error mirgating VM DOMAIN-ID to host some_host: undefined method `upcase' for nil:NilClass ["/var/tmp/one/vmm/kvm/migrate:234:in `<main>'"] ExitCode: 1

The piece of code that fails:

/var/lib/one/remotes/vmm/kvm/migrate

    # Sync guest time
    if ENV['SYNC_TIME'].upcase == 'YES'
        cmds =<<~EOS
            (
              for I in $(seq 4 -1 1); do
                if #{virsh} --readonly dominfo #{@deploy_id}; then
                  #{virsh} domtime --sync #{@deploy_id} && exit
                  [ "\$I" -gt 1 ] && sleep 5
                else
                  exit
                fi
              done
            ) &>/dev/null &
        EOS

It turns out that the virsh command does not return a list of domains; the result is therefore empty, and that case is not handled properly. Ideally the code above is fixed so that it checks whether the hash is empty. If it is empty, a clear error message should be returned: unable to obtain $domain with virsh command here.
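For reference, the stack trace above reports `upcase` being called on nil, and `ENV['SYNC_TIME']` evaluates to nil whenever the variable is not set in the driver's environment. A minimal sketch of a nil-safe check follows. The helper name `sync_time_enabled?` is hypothetical, and treating an unset variable as "no" is an assumption, not necessarily the default the driver intends.

```ruby
# Hypothetical helper (not part of the OpenNebula driver): returns true only
# when SYNC_TIME is explicitly set to "yes" (any case). An unset variable
# falls back to `default` instead of raising NoMethodError on nil.
# Assumption: "no" is an acceptable fallback when the variable is missing.
def sync_time_enabled?(env = ENV, default = 'no')
  env.fetch('SYNC_TIME', default).to_s.strip.upcase == 'YES'
end

# The unguarded expression fails like this when the variable is missing:
begin
  {}['SYNC_TIME'].upcase
rescue NoMethodError => e
  warn "unguarded check raised: #{e.class}"
end
```

With a guard like this, the condition in the snippet above would read `if sync_time_enabled?` and an unset SYNC_TIME would skip the time sync instead of aborting the script.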

Ideally, ONE should also not assume that the VM is still running, as it currently does:

Jun 27 15:55:18 oned1 oned[1434842]: [VM 0][Z0][LCM][I]: Fail to live migrate VM. Assuming that the VM is still RUNNING.

ONE will then detect the VM as being in poweroff state, and the hypervisor the VM is actually running on detects a zombie. Ideally this is handled better: do some extra checks to see where the VM is running.

I can reproduce the behavior of virsh not listing domains by executing the ssh command without LIBVIRT_URI exported. This sounds like it could be related to this bug: LIBVIRT_URI

Strangely enough, I do not hit this issue when I explicitly run export SYNC_TIME=yes. It does not matter what value SYNC_TIME is set to: no, yes, or even broken will then not trigger this issue. So maybe, if this is indeed the issue, the kvmrc ENV vars are somehow only read when SYNC_TIME is exported (and otherwise not?).

Expected behavior
I expect ONE to handle the error gracefully instead of hitting an assert in the code, and to double-check where a live-migrated VM ended up if a non-zero exit code is returned (instead of assuming the VM keeps running on the source hypervisor).
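One way to picture the graceful handling described here is to keep the optional time sync from deciding the exit code of the whole action. The following is only a sketch under stated assumptions: `migrate_with_optional_sync` and its callable arguments are hypothetical stand-ins for the real driver steps, not the actual OpenNebula code.

```ruby
# Hypothetical sketch: `migrate` and `sync_time` are callables standing in
# for the real driver steps. Returns the exit code the script would report.
def migrate_with_optional_sync(migrate:, sync_time:, log: ->(msg) { warn msg })
  # A failed migration is still reported as a failure.
  return 1 unless migrate.call

  begin
    sync_time.call
  rescue StandardError => e
    # The VM has already moved; log the problem without failing the action.
    log.call("time sync skipped: #{e.class}: #{e.message}")
  end
  0
end

# Usage: the migration succeeds, the sync step raises, the result is still 0.
code = migrate_with_optional_sync(
  migrate:   -> { true },
  sync_time: -> { nil.upcase }
)
puts code
```

The design choice is that only the step that actually moves the VM determines success; anything that runs afterwards can warn but cannot turn a completed migration into a reported failure.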

Details

Hypervisor: KVM
Version: 6.8.3

Additional context
There seems to be a specific pre-condition that has to be true to hit this bug, as we see it happen on only one specific (dedicated) cloud, but as of now it is unclear what that pre-condition is. I can reproduce this issue, so if further debug information has to be gathered, please let me know.

Ubuntu 22.04 ONE / hypervisors

Progress Status

Code committed
Testing - QA
Documentation (Release notes - resolved issues, compatibility, known issues)


rsmontero · 2024-07-31T09:07:59Z

As for updating the state: the monitor process will set the VM state to poweroff if the VM is not running (libvirt should keep it running). In the same way, if the migration fails, the VM will be running on the source host.

hydro-b · 2024-07-31T09:38:05Z

> So for the updating state, the monitor process will update the VM state if it is not running to poweroff (libvirt should keep it running). In the same way if the migration fails, the VM will be running in the source host.

Yes, I understand how this is currently handled, and that a "zombie" VM will be detected on the destination host. This still leaves a time window for error (as long as storage fencing is not implemented). I think this behavior could be improved by checking for failure scenarios like this one. One thing ONE could do is perform some extra checking on the destination host to verify that the VM is not there, and raise an error (with a helpful message) if it is. In this particular case it would not have helped, as ONE was not able to detect any running domains.
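The extra check suggested here could be sketched as a small decision function over the domain names each host reports (for example, the lines printed by `virsh list --name`). This is an illustration only: `locate_domain` is a hypothetical name, and how the two lists are collected is left out.

```ruby
# Hypothetical helper: decide where a domain lives from the domain names
# reported by the source and destination hosts.
def locate_domain(deploy_id, source_domains, dest_domains)
  on_source = source_domains.include?(deploy_id)
  on_dest   = dest_domains.include?(deploy_id)

  if on_source && on_dest
    :both          # migration still in flight, or a leftover domain
  elsif on_dest
    :destination   # the VM moved, even if the script exited non-zero
  elsif on_source
    :source        # the VM really stayed where it was
  else
    :unknown       # no host reports the domain; do not assume RUNNING
  end
end

# Usage: empty lists on both hosts, as in the case described in this issue.
p locate_domain('DOMAIN-ID', [], [])
```

The `:unknown` result corresponds to the situation in this report, where no domains could be listed at all; in that case raising an error with a clear message seems safer than assuming either location.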

For my understanding: this issue has been closed. Was it already fixed in another PR for the 6.10 release? To be clear, the main issues here are: 1) hitting the live-migration issue (ONE unable to detect running domains), and 2) graceful handling of this error by the code (in case no domains are detected).

rsmontero · 2024-07-31T09:41:44Z

Yes, it is fixed in 6.10, but the fix only addresses the SYNC_TIME part of the migration.

So maybe we can keep this open for future improvements, as we did not make any change to the current behavior.

hydro-b added the Type: Bug label Jun 27, 2024

hydro-b mentioned this issue Jul 2, 2024: Implement Ceph RBD fencing #6640 (Open, 3 tasks)

tinova added the Sponsored label Jul 30, 2024

tinova modified the milestones: Release 7.0, Release 6.10.0 Jul 30, 2024

tinova assigned dgarcia18 Jul 30, 2024

tinova added Category: KVM Status: Accepted Priority: Normal labels Jul 30, 2024

rsmontero closed this as completed Jul 31, 2024

rsmontero reopened this Jul 31, 2024

tinova modified the milestones: Release 6.10.0, Release 6.10.1 Aug 29, 2024

rsmontero closed this as completed Sep 24, 2024
