Hello Support,
Our plan is to migrate from Opennebula 5.2 to 5.4. We are in the lucky situation, that we have some spare servers to test the new version. So we set up a frontend cluster with three (the new HA mode) nodes and added two KVM nodes. We used GlusterFS, so on all frontends and kvms are shared datastores. We deployed two vm on each node for testing purposes. Live migration and cold migration works fine, when triggered from the frontend. Everything is working fine except the host error hook.
Our assumption is the following:
- The host goes down initiated by a IPMI poweroff
- Opennebula recognizes that the host is down (we can see this via āonehost listā and in the oned.log)
- Since only cold migration could work on host down we assume that this should happen
- The host hook is launched, the fencing works and the vms seem to be reschuled (we can see this in sched.log and host_error.log). Whats confusing us, is that the oned.log shows āLive-migrating VM XYā
- Sometimes only one vm is āmigratedā to the other sometimes both but allways they are stuck in state MIGRATE
oned.log:
Mon Aug 28 13:28:45 2017 [Z0][InM][D]: Host one-mgmrz-01 (0) successfully monitored.
Mon Aug 28 13:28:46 2017 [Z0][InM][I]: Command execution fail: āif [ -x ā/var/tmp/one/im/run_probesā ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fiā
Mon Aug 28 13:28:46 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: Connection timed out
Mon Aug 28 13:28:46 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:28:48 2017 [Z0][InM][I]: Command execution fail: āif [ -x ā/var/tmp/one/im/run_probesā ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fiā
Mon Aug 28 13:28:48 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:28:48 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:28:51 2017 [Z0][InM][I]: Command execution fail: āif [ -x ā/var/tmp/one/im/run_probesā ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fiā
Mon Aug 28 13:28:51 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:28:51 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:28:54 2017 [Z0][InM][I]: Command execution fail: āif [ -x ā/var/tmp/one/im/run_probesā ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fiā
Mon Aug 28 13:28:54 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:28:54 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:28:54 2017 [Z0][ONE][E]: Error monitoring Host one-mgmrz-02 (1): -
Mon Aug 28 13:29:05 2017 [Z0][InM][D]: Host one-mgmrz-01 (0) successfully monitored.
Mon Aug 28 13:29:12 2017 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore default (1)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore files (2)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore system-qcow2 (100)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore image-qcow2 (101)
Mon Aug 28 13:29:13 2017 [Z0][InM][D]: Monitoring datastore image-raw (102)
Mon Aug 28 13:29:13 2017 [Z0][ImM][D]: Datastore default (1) successfully monitored.
Mon Aug 28 13:29:13 2017 [Z0][ImM][D]: Datastore image-raw (102) successfully monitored.
Mon Aug 28 13:29:14 2017 [Z0][ImM][D]: Datastore image-qcow2 (101) successfully monitored.
Mon Aug 28 13:29:14 2017 [Z0][ImM][D]: Datastore files (2) successfully monitored.
Mon Aug 28 13:29:14 2017 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.
Mon Aug 28 13:29:14 2017 [Z0][VMM][D]: VM 76 successfully monitored: DISK_SIZE=[ID=0,SIZE=8] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=8] DISK_SIZE=[ID=1,SIZE=1]
Mon Aug 28 13:29:14 2017 [Z0][ImM][D]: Datastore system-qcow2 (100) successfully monitored.
Mon Aug 28 13:29:25 2017 [Z0][InM][D]: Host one-mgmrz-01 (0) successfully monitored.
Mon Aug 28 13:29:29 2017 [Z0][DiM][D]: Setting rescheduling flag on VM 76
Mon Aug 28 13:29:29 2017 [Z0][HKM][D]: Message received: EXECUTE SUCCESS 1 error:
Mon Aug 28 13:29:35 2017 [Z0][DiM][D]: Live-migrating VM 76
Mon Aug 28 13:29:37 2017 [Z0][InM][D]: Monitoring host one-mgmrz-02 (1)
Mon Aug 28 13:29:40 2017 [Z0][InM][I]: Command execution fail: āif [ -x ā/var/tmp/one/im/run_probesā ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fiā
Mon Aug 28 13:29:40 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:29:40 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:29:42 2017 [Z0][MKP][D]: Monitoring marketplace OpenNebula Public (0)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore default (1)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore files (2)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore system-qcow2 (100)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore image-qcow2 (101)
Mon Aug 28 13:29:43 2017 [Z0][InM][D]: Monitoring datastore image-raw (102)
Mon Aug 28 13:29:43 2017 [Z0][InM][I]: Command execution fail: āif [ -x ā/var/tmp/one/im/run_probesā ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fiā
Mon Aug 28 13:29:43 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:29:43 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:29:43 2017 [Z0][ImM][D]: Datastore files (2) successfully monitored.
Mon Aug 28 13:29:43 2017 [Z0][ImM][D]: Datastore default (1) successfully monitored.
Mon Aug 28 13:29:43 2017 [Z0][ImM][D]: Datastore image-qcow2 (101) successfully monitored.
Mon Aug 28 13:29:44 2017 [Z0][ImM][D]: Datastore image-raw (102) successfully monitored.
Mon Aug 28 13:29:44 2017 [Z0][VMM][D]: VM 76 successfully monitored: DISK_SIZE=[ID=0,SIZE=8] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=8] DISK_SIZE=[ID=1,SIZE=1]
Mon Aug 28 13:29:44 2017 [Z0][ImM][D]: Datastore system-qcow2 (100) successfully monitored.
Mon Aug 28 13:29:44 2017 [Z0][MKP][D]: Marketplace OpenNebula Public (0) successfully monitored.
Mon Aug 28 13:29:46 2017 [Z0][InM][I]: Command execution fail: āif [ -x ā/var/tmp/one/im/run_probesā ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fiā
Mon Aug 28 13:29:46 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:29:46 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:29:49 2017 [Z0][InM][I]: Command execution fail: āif [ -x ā/var/tmp/one/im/run_probesā ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 20 1 one-mgmrz-02; else exit 42; fiā
Mon Aug 28 13:29:49 2017 [Z0][InM][I]: ssh: connect to host one-mgmrz-02 port 22: No route to host
Mon Aug 28 13:29:49 2017 [Z0][InM][I]: ExitCode: 255
Mon Aug 28 13:29:49 2017 [Z0][ONE][E]: Error monitoring Host one-mgmrz-02 (1): -
Mon Aug 28 13:29:50 2017 [Z0][InM][D]: Host one-mgmrz-01 (0) successfully monitored.
vm.log:
Mon Aug 28 13:10:40 2017 [Z0][VM][I]: New LCM state is RUNNING
Mon Aug 28 13:28:54 2017 [Z0][VM][I]: New LCM state is UNKNOWN
Mon Aug 28 13:29:14 2017 [Z0][LCM][I]: VM found again by the drivers
Mon Aug 28 13:29:14 2017 [Z0][VM][I]: New LCM state is RUNNING
Mon Aug 28 13:29:35 2017 [Z0][VM][I]: New LCM state is MIGRATE
sched.log:
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 0.01s
Mon Aug 28 13:29:34 2017 [Z0][VM][D]: Pending/rescheduling VM and capacity requirements:
ACTION VM CPU Memory PCI System DS Image DS
RESCHED 76 50 524288 0 2048 DS 101: 0
Mon Aug 28 13:29:34 2017 [Z0][HOST][D]: Discovered Hosts (enabled):
ID : 0
CLUSTER_ID : 0
MEM_USAGE : 0
CPU_USAGE : 0
MAX_MEM : 8127736
MAX_CPU : 400
FREE_DISK : 442067
RUNNING_VMS : 0
PUBLIC : 0
DSID FREE_MB
0 442067
PCI ADDRESS CLASS VENDOR DEVICE VMID
Mon Aug 28 13:29:34 2017 [Z0][VM][D]: VMs in VMGroups:
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Getting VM and Host information. Total time: 0.04s
Mon Aug 28 13:29:34 2017 [Z0][VMGRP][D]: VM Group Scheduling information
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Setting VM groups placement constraints. Total time: 0.00s
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Match Making statistics:
Number of VMs: 1
Total time: 0s
Total Cluster Match time: 0s
Total Host Match time: 0.00s
Total Host Ranking time: 0.00s
Total DS Match time: 0s
Total DS Ranking time: 0s
Mon Aug 28 13:29:34 2017 [Z0][SCHED][D]: Scheduling Results:
Virtual Machine: 76
PRI ID - HOSTS
0 0
PRI ID - DATASTORES
0 100
Mon Aug 28 13:29:35 2017 [Z0][SCHED][D]: Dispatching VMs to hosts:
VMID Priority Host System DS
76 0 0 100
Mon Aug 28 13:29:35 2017 [Z0][SCHED][D]: Dispatching VMs to hosts. Total time: 0.55s
Mon Aug 28 13:30:05 2017 [Z0][SCHED][D]: Getting scheduled actions information. Total time: 0.01s
Mon Aug 28 13:30:05 2017 [Z0][SCHED][D]: Getting VM and Host information. Total time: 0.00s
host_error.log:
[2017-08-28 13:28:54 +0200][HOST 1][I] Hook launched
[2017-08-28 13:28:54 +0200][HOST 1][I] hostname: one-mgmrz-02
[2017-08-28 13:28:54 +0200][HOST 1][I] Wait 1 cycles.
[2017-08-28 13:28:54 +0200][HOST 1][I] Sleeping 30 seconds.
[2017-08-28 13:29:24 +0200][HOST 1][I] Fencing enabled
[2017-08-28 13:29:29 +0200][HOST 1][I] Success: Rebooted
[2017-08-28 13:29:29 +0200][HOST 1][I] Fencing success
[2017-08-28 13:29:29 +0200][HOST 1][I] states: 3, 5, 8
[2017-08-28 13:29:29 +0200][HOST 1][I] vms: [ā76ā]
[2017-08-28 13:29:29 +0200][HOST 1][I] resched 76
[2017-08-28 13:29:29 +0200][HOST 1][I] Hook finished
Some Questions:
- Why is the oned.log showing this āMon Aug 28 13:29:14 2017 [Z0][VMM][D]: VM 76 successfully monitored: DISK_SIZE=[ID=0,SIZE=8] SNAPSHOT_SIZE=[ID=0,DISK_ID=0,SIZE=8] DISK_SIZE=[ID=1,SIZE=1]ā, but the host is down where the vm was running? The vm.log show at the same time āVM found again by the driversā. This could not be possible
- Why is the oned.log showing āMon Aug 28 13:29:29 2017 [Z0][DiM][D]: Setting rescheduling flag on VM 76
Mon Aug 28 13:29:29 2017 [Z0][HKM][D]: Message received: EXECUTE SUCCESS 1 error:
Mon Aug 28 13:29:35 2017 [Z0][DiM][D]: Live-migrating VM 76ā? Shouldnt this be a cold migration? What means āEXECUTIVE SUCCESS 1 errorā? - Why is the VM stuck in state MIGRATE?
- How can we fix this behaviour?
THX 4 UR HELP