Migration failure on 6.10.0 - missing to_s. How to recover?

Hi all,

I have just upgraded from 6.8 to 6.10.0 (CE) and wanted to reboot all nodes. During onehost flush I saw migrating VMs fail with the following message in the VM log:

Mon Feb 3 14:18:16 2025 [Z0][VMM][I]: Command execution fail (exit code: 1): cat << 'EOT' | /var/lib/one/tmp/vmm/kvm/migrate '0b62ee41-3530-459d-9f92-ab0de19d826a' 'node5' 'node4' 3853 node4
Mon Feb 3 14:18:16 2025 [Z0][VMM][I]: virsh --connect qemu:///system migrate --live 0b62ee41-3530-459d-9f92-ab0de19d826a qemu+ssh://node5/system (23.462960391s)
Mon Feb 3 14:18:16 2025 [Z0][VMM][I]: Error mirgating VM 0b62ee41-3530-459d-9f92-ab0de19d826a to host node5: undefined method `upcase' for nil:NilClass
Mon Feb 3 14:18:16 2025 [Z0][VMM][I]: ["/var/lib/one/tmp/vmm/kvm/migrate:255:in `<main>'"]
Mon Feb 3 14:18:16 2025 [Z0][VMM][I]: ExitCode: 1
Mon Feb 3 14:18:16 2025 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_failmigrate.
Mon Feb 3 14:18:16 2025 [Z0][VMM][I]: Failed to execute virtualization driver operation: migrate.
Mon Feb 3 14:18:16 2025 [Z0][VMM][E]: MIGRATE: virsh --connect qemu:///system migrate --live 0b62ee41-3530-459d-9f92-ab0de19d826a qemu+ssh://node5/system (23.462960391s) Error mirgating VM 0b62ee41-3530-459d-9f92-ab0de19d826a to host node5: undefined method `upcase' for nil:NilClass ["/var/lib/one/tmp/vmm/kvm/migrate:255:in `<main>'"] ExitCode: 1
Mon Feb 3 14:18:16 2025 [Z0][VM][I]: New LCM state is RUNNING
Mon Feb 3 14:18:16 2025 [Z0][LCM][I]: Fail to live migrate VM. Assuming that the VM is still RUNNING.
Mon Feb 3 14:18:47 2025 [Z0][LCM][I]: VM running but monitor state is POWEROFF

Now the VM seems to be running on node5 (i.e. it migrated successfully), but OpenNebula reports that it is in POWEROFF state.
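For anyone wanting to double-check the same thing, this is roughly how the real state can be verified (a sketch only; the UUID, host names and VM ID below are simply the ones from the log above):

    # the source host should no longer have the domain; the destination should report it as running
    ssh node4 'virsh --connect qemu:///system domstate 0b62ee41-3530-459d-9f92-ab0de19d826a'
    ssh node5 'virsh --connect qemu:///system domstate 0b62ee41-3530-459d-9f92-ab0de19d826a'
    # what ONe itself believes about the VM (3853 is the VM ID from the migrate command line)
    onevm show 3853 | grep -E 'STATE|HOST'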

The fix seems to be simple:

--- /var/lib/one/remotes-6.10.0-1.el9-dist/vmm/kvm/migrate	2024-08-27 18:27:44.000000000 +0200
+++ /var/lib/one/remotes/vmm/kvm/migrate	2025-02-03 14:58:18.190160184 +0100
@@ -252,7 +252,7 @@
 
     # Compact memory
     # rubocop:disable Layout/LineLength
-    if ENV['CLEANUP_MEMORY_ON_STOP'].upcase == 'YES'
+    if ENV['CLEANUP_MEMORY_ON_STOP'].to_s.upcase == 'YES'
         `(sudo -l | grep -q sysctl) && sudo -n sysctl vm.drop_caches=3 vm.compact_memory=1 &>/dev/null &`
     end
     # rubocop:enable Layout/LineLength
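The root cause is that ENV['CLEANUP_MEMORY_ON_STOP'] returns nil when the variable is not set, and nil has no upcase method; the added to_s turns nil into an empty string first. A minimal illustration (assuming a stock Ruby, as used by the driver scripts):

    # an unset variable yields nil, so calling .upcase raises NoMethodError
    env -u CLEANUP_MEMORY_ON_STOP ruby -e 'ENV["CLEANUP_MEMORY_ON_STOP"].upcase'
    # with the .to_s guard, nil becomes "" and the comparison is simply false
    env -u CLEANUP_MEMORY_ON_STOP ruby -e 'puts ENV["CLEANUP_MEMORY_ON_STOP"].to_s.upcase == "YES"'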

But how can I recover the VMs without disruption? As I said, they are running on the new hosts, so I just need to tell ONe about it. How can I do this? Thanks!

-Yenya

Update: trying to recover the running QEMU processes while ONe thinks the VMs are POWEROFF, I did the following (a consolidated sketch follows the list):

  • figure out the host where QEMU is really running

  • figure out the sequence number of the last placement (history record), something like onevm show --json $VM_ID | jq .VM.HISTORY_RECORDS.HISTORY[-1].SEQ

  • onedb update-history --id $VM_ID --seq $LAST_SEQ (there really should be a --last-seq switch instead of just --seq N), then edit the entry to reflect the hostname and ID of the host where the QEMU process is actually running.

  • onedb update-body vm --id $VM_ID – set all of STATE, LCM_STATE, PREV_STATE and PREV_LCM_STATE to 3.

  • onevm resched $VM_ID to set up a new QEMU process based on what ONe expects
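Consolidated, the procedure was roughly this (a rough sketch only; the VM ID is the one from the log above, and both onedb commands drop you into an editor where the actual changes are made by hand):

    VM_ID=3853                                    # the affected VM
    # sequence number of the last history record (assumes HISTORY is returned as an array)
    LAST_SEQ=$(onevm show --json $VM_ID | jq -r '.VM.HISTORY_RECORDS.HISTORY[-1].SEQ')
    # edit HOSTNAME/HID in that history record to point at the host where QEMU really runs
    onedb update-history --id $VM_ID --seq $LAST_SEQ
    # edit the VM body: set STATE, LCM_STATE, PREV_STATE and PREV_LCM_STATE to 3
    onedb update-body vm --id $VM_ID
    # finally, let ONe act on the corrected placement
    onevm resched $VM_ID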

This more or less works, but these VMs now do not have a VNC console. Clicking on the console icon in Sunstone opens a new browser tab, but instead of the VNC session it shows the connection error "Something went wrong, connection is closed". Some of the VMs now have "None" set as the console type under VM tab → Conf → Update Configuration → Input/Output. But even when I enable VNC manually, the console is still inaccessible.

So, what is the correct way to tell ONe about a running QEMU process that ONe seems to have forgotten?
Thanks,

-Yenya

To recover the VNC console for those VMs (in the new Sunstone, on port 2616), I also needed to restart opennebula-guacd.
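For completeness, the restart itself (unit names as shipped in the 6.10 packages; restarting the FireEdge service is just a guess for cases where guacd alone is not enough):

    # restart the Guacamole proxy that the new Sunstone uses for VNC sessions
    systemctl restart opennebula-guacd
    # if the console is still unreachable, the FireEdge service (serving port 2616) may need a restart too
    systemctl restart opennebula-fireedge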

@jorel: ok, thanks! We still use the original Ruby Sunstone by default.

Anyway, does the fix in the first post look reasonable? Can you apply it?

Thanks,

-Yenya

It was actually already fixed on Sep 2, but it was only released in the EE hotfixes.

@jorel: thanks.

It is pretty sad to leave such a critical bug in CE and fix it only in EE. But never mind, it is not my project.

-Yenya