[SOLVED] Error monitoring Host (2): Error executing probes

Hello,

last friday at about 3:13AM a kvm host somehow something went wrong and all VMs on that host are in UNKNOWN state.
I’ve checked ssh from frontend and a onehost show gives information.
oned.log just tells me that Error monitoring Host MYHOSTNAME (2): Error executing probes.
This host has been up for 50 days now and hadn’t shown any problem since last friday.

ONE version is 4.6.2 under centOS 6.

What can I do to manually execute these probes from frontend server and see what’s missing or not working on the host?

Thanks,
Sergi

Look at your probe location as configured as SCRIPTS_REMOTE_DIR
on the host. (default is /var/tmp/one)
Make sure that the scripts didn’t get deleted off of the host for some reason.
By default on redhattish linux, they do, and you have to change tmpwatch config to make that not happen.
You can restore the scripts usually with a onehost disable /onehost enable

Steve Timm

Hello Steve,

thanks for your answer … and is it safe to onahost disable / onehost enable if the host has vps?

Thanks,
regards,
Sergi

Hello,

I read that disable/enable does nothing to running vm, so I did it.
Host keeps in err state.
onehost show can see what vm’s has running, but no more information about why this error :
ERROR="Thu May 21 11:27:44 2015 : Error monitoring Host puigmal.lacuinadel.cat (2): Error executing probes"
Acces to datastores seems OK.
How can I manually execute the probes?

Thanks,

Hello,

well, enable disable didn’t work.
I had to delete /var/tmp/one on affected host and then :
su - oneadmin
onehost sync <host_id> --force

I also deleted /etc/cron.daily/tmpwatch

Regards,

i have encountered the same problem, could not resolve by above steps

Hi,

Could you provide some more info? This thread is almost three years old and it is hard to believe that you are using same version and setup of OS and OpenNebula as described.

BR,
Anton Todorov

i had same problem like this

ERROR=“Wed Jan 30 03:28:53 2019 : Error monitoring Host xxx.xxx.com (21): Timeout executing ‘if [ -x “/var/tmp/one/im/run_probes” ]; then /var/tmp/one/im/run_probes kvm /var/lib/one//datastores 4124 60 21 mwncloud5.maintenis.com; else exit 42; fi’”
HOSTNAME=“xxx.xxx.com
HYPERVISOR=“kvm”

and still could not resolve by above. is there any other solution for this ?

I have the same kind of problem. I’m running opennebula 5.8.0 on debian 9.8 for the master/frontend and on the node according to dpkg it is 5.8.0-1 for opennebula-common and opennebula-node on debian 9.

It all started by key access issues because the hostname of the master/frontend was not synced on the node.
After fixing the key issue the my VM was FAILED with errors related to checkpoint.xml that was empty when I checked it.
After failing to recover the failed state of the VM by running many times “onehost sync --force” I decided to delete the VM as this is not yet a production VM. I also delete /var/tmp/one and /var/lib/one/datastores to ensure a fresh start.
After re-adding the node I can recreate VMs however the nodes keeps getting in failing state, it seems like there are outdated informations in the master/frontend related to old VMs I already delete and that are not being displayed i the UI anymore that are used to launch the commands that fails since the folder doesn’t exist anymore …

It is worrisome to have a unusable VM because of some out of sync data, am I doing something wrong ?

this is the current error being displayed :

Fri Mar 29 07:29:14 2019 : Error monitoring Host sg.mydomain.com (2): Timeout executing ‘if [ -x “/var/tmp/one/im/run_probes” ]; then /var/tmp/one/im/run_probes kvm /var/lib/one/datastores 4124 60 2 sg.mydomain.com; else exit 42; fi’

This is the error log of the VM that caused the issues. Starting from a key error, then ended up failing because of a similar error message as the previous one and checkpoint.xml errors. I ended up deleting it without being able to retrieve it. I was not able to boot the VM anymore

Thu Mar 28 12:41:25 2019 [Z0][VM][I]: New state is CLONING
Thu Mar 28 12:41:26 2019 [Z0][VM][I]: New state is PENDING
Thu Mar 28 12:41:34 2019 [Z0][VM][I]: New state is ACTIVE
Thu Mar 28 12:41:34 2019 [Z0][VM][I]: New LCM state is PROLOG
Thu Mar 28 12:41:53 2019 [Z0][VM][I]: New LCM state is BOOT
Thu Mar 28 12:41:53 2019 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/42/deployment.0
Thu Mar 28 12:42:00 2019 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: pre: Executed “sudo brctl addbr onebr9”.
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: pre: Executed “sudo ip link set onebr9 up”.
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: pre: Executed “sudo brctl addif onebr9 enp1s0f0”.
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: ExitCode: 0
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: ExitCode: 0
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: ExitCode: 0
Thu Mar 28 12:42:04 2019 [Z0][VMM][I]: Successfully execute network driver operation: post.
Thu Mar 28 12:42:04 2019 [Z0][VM][I]: New LCM state is RUNNING
Thu Mar 28 13:15:05 2019 [Z0][VM][I]: New LCM state is SAVE_STOP
Thu Mar 28 13:15:09 2019 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/save: line 58: warning: command substitution: ignored null byte in input
Thu Mar 28 13:15:09 2019 [Z0][VMM][I]: ExitCode: 0
Thu Mar 28 13:15:09 2019 [Z0][VMM][I]: Successfully execute virtualization driver operation: save.
Thu Mar 28 13:15:09 2019 [Z0][VMM][I]: clean: Executed “sudo ip link delete onebr9”.
Thu Mar 28 13:15:09 2019 [Z0][VMM][I]: ExitCode: 0
Thu Mar 28 13:15:09 2019 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Thu Mar 28 13:15:09 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP
Thu Mar 28 13:15:18 2019 [Z0][TM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/ssh/mv sg.mydomain.com:/var/lib/one//datastores/0/42 opennebula:/var/lib/one//datastores/0/42 42 0
Thu Mar 28 13:15:18 2019 [Z0][TM][I]: mv: Moving sg.mydomain.com:/var/lib/one/datastores/0/42 to opennebula:/var/lib/one/datastores/0/42
Thu Mar 28 13:15:18 2019 [Z0][TM][E]: mv: Command “set -e -o pipefail
Thu Mar 28 13:15:18 2019 [Z0][TM][I]:
Thu Mar 28 13:15:18 2019 [Z0][TM][I]: tar -C /var/lib/one/datastores/0 --sparse -cf - 42 | ssh opennebula ‘tar -C /var/lib/one/datastores/0 --sparse -xf -’
Thu Mar 28 13:15:18 2019 [Z0][TM][I]: rm -rf /var/lib/one/datastores/0/42” failed: ssh: Could not resolve hostname opennebula: Name or service not known
Thu Mar 28 13:15:18 2019 [Z0][TM][E]: Error copying disk directory to target host
Thu Mar 28 13:15:18 2019 [Z0][TM][E]: Error executing image transfer script: Error copying disk directory to target host
Thu Mar 28 13:15:18 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP_FAILURE
Fri Mar 29 04:07:33 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP
Fri Mar 29 04:07:35 2019 [Z0][TM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/ssh/mv sg.mydomain.com:/var/lib/one//datastores/0/42 opennebula.mydomain.com:/var/lib/one//datastores/0/42 42 0
Fri Mar 29 04:07:35 2019 [Z0][TM][E]: Error creating directory /var/lib/one/datastores/0/42 at opennebula.mydomain.com: Host key verification failed.
Fri Mar 29 04:07:35 2019 [Z0][TM][E]: Error executing image transfer script: Error creating directory /var/lib/one/datastores/0/42 at opennebula.mydomain.com: Host key verification failed.
Fri Mar 29 04:07:35 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP_FAILURE
Fri Mar 29 04:15:19 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP
Fri Mar 29 04:15:20 2019 [Z0][TM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/ssh/mv sg.mydomain.com:/var/lib/one//datastores/0/42 opennebula.mydomain.com:/var/lib/one//datastores/0/42 42 0
Fri Mar 29 04:15:20 2019 [Z0][TM][E]: Error creating directory /var/lib/one/datastores/0/42 at opennebula.mydomain.com: Host key verification failed.
Fri Mar 29 04:15:20 2019 [Z0][TM][E]: Error executing image transfer script: Error creating directory /var/lib/one/datastores/0/42 at opennebula.mydomain.com: Host key verification failed.
Fri Mar 29 04:15:20 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP_FAILURE
Fri Mar 29 04:21:26 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP
Fri Mar 29 04:21:26 2019 [Z0][TM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/ssh/mv sg.mydomain.com:/var/lib/one//datastores/0/42 opennebula.mydomain.com:/var/lib/one//datastores/0/42 42 0
Fri Mar 29 04:21:26 2019 [Z0][TM][E]: Error creating directory /var/lib/one/datastores/0/42 at opennebula.mydomain.com: Host key verification failed.
Fri Mar 29 04:21:26 2019 [Z0][TM][E]: Error executing image transfer script: Error creating directory /var/lib/one/datastores/0/42 at opennebula.mydomain.com: Host key verification failed.
Fri Mar 29 04:21:26 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP_FAILURE
Fri Mar 29 04:23:54 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP
Fri Mar 29 04:23:54 2019 [Z0][TM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/tm/ssh/mv sg.mydomain.com:/var/lib/one//datastores/0/42 opennebula.mydomain.com:/var/lib/one//datastores/0/42 42 0
Fri Mar 29 04:23:54 2019 [Z0][TM][E]: Error creating directory /var/lib/one/datastores/0/42 at opennebula.mydomain.com: Host key verification failed.
Fri Mar 29 04:23:54 2019 [Z0][TM][E]: Error executing image transfer script: Error creating directory /var/lib/one/datastores/0/42 at opennebula.mydomain.com: Host key verification failed.
Fri Mar 29 04:23:54 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP_FAILURE
Fri Mar 29 04:25:03 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP
Fri Mar 29 04:25:25 2019 [Z0][VM][I]: New state is STOPPED
Fri Mar 29 04:25:25 2019 [Z0][VM][I]: New LCM state is LCM_INIT
Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42 Command execution failed (exit code: 1): /var/lib/one/remotes/tm/ssh/mv sg.mydomain.com:/var/lib/one//datastores/0/42 opennebula.mydomain.com:/var/lib/one//datastores/0/42 42 0

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42 mv: Moving sg.mydomain.com:/var/lib/one/datastores/0/42 to opennebula.mydomain.com:/var/lib/one/datastores/0/42

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG E 42 mv: Command "set -e -o pipefail

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42 tar -C /var/lib/one/datastores/0 --sparse -cf - 42 | ssh opennebula.mydomain.com ‘tar -C /var/lib/one/datastores/0 --sparse -xf -’

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42 rm -rf /var/lib/one/datastores/0/42" failed: tar: 42/disk.0: file changed as we read it

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42 tar: 42/disk.1: File removed before we read it

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42 tar: 42/checkpoint: File removed before we read it

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42 tar: 42/deployment.0: File removed before we read it

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG I 42 tar: 42: file changed as we read it

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: LOG E 42 Error copying disk directory to target host

Fri Mar 29 04:25:32 2019 [Z0][TM][W]: Ignored: TRANSFER FAILURE 42 Error copying disk directory to target host

Fri Mar 29 04:26:33 2019 [Z0][VM][I]: New state is PENDING
Fri Mar 29 04:26:56 2019 [Z0][VM][I]: New state is ACTIVE
Fri Mar 29 04:26:56 2019 [Z0][VM][I]: New LCM state is PROLOG_RESUME
Fri Mar 29 04:27:04 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: pre: Executed “sudo brctl addbr onebr9”.
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: pre: Executed “sudo ip link set onebr9 up”.
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: pre: Executed “sudo brctl addif onebr9 enp1s0f0”.
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: ExitCode: 0
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/restore ‘/var/lib/one//datastores/0/42/checkpoint’ ‘sg.mydomain.com’ ‘one-42’ 42 sg.mydomain.com
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/restore: line 43: warning: command substitution: ignored null byte in input
Fri Mar 29 04:27:04 2019 [Z0][VMM][E]: restore: Command “set -e -o pipefail
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]:
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: # extract the xml from the checkpoint
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]:
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: virsh --connect qemu:///system save-image-dumpxml /var/lib/one//datastores/0/42/checkpoint > /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]:
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: # Eeplace all occurrences of the DS_LOCATION/<DS_ID>/<VM_ID> with the specific
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: # DS_ID where the checkpoint is placed. This is done in case there was a
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: # system DS migration
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]:
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one//datastores/[0-9]+/42/%/var/lib/one//datastores/0/42/%g” /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one/datastores/[0-9]+//%/var/lib/one//datastores/0//%g” /var/lib/one//datastores/0/42/checkpoint.xml” failed: error: Failed to open file ‘/var/lib/one//datastores/0/42/checkpoint’: No such file or directory
Fri Mar 29 04:27:04 2019 [Z0][VMM][E]: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: ExitCode: 1
Fri Mar 29 04:27:04 2019 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Fri Mar 29 04:27:04 2019 [Z0][VMM][E]: Error restoring VM: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:27:04 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED_FAILURE
Fri Mar 29 04:45:47 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED
Fri Mar 29 04:50:09 2019 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: ExitCode: 0
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/restore ‘/var/lib/one//datastores/0/42/checkpoint’ ‘sg.mydomain.com’ ‘one-42’ 42 sg.mydomain.com
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/restore: line 43: warning: command substitution: ignored null byte in input
Fri Mar 29 04:52:21 2019 [Z0][VMM][E]: restore: Command “set -e -o pipefail
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]:
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: # extract the xml from the checkpoint
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]:
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: virsh --connect qemu:///system save-image-dumpxml /var/lib/one//datastores/0/42/checkpoint > /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]:
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: # Eeplace all occurrences of the DS_LOCATION/<DS_ID>/<VM_ID> with the specific
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: # DS_ID where the checkpoint is placed. This is done in case there was a
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: # system DS migration
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]:
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one//datastores/[0-9]+/42/%/var/lib/one//datastores/0/42/%g” /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one/datastores/[0-9]+//%/var/lib/one//datastores/0//%g” /var/lib/one//datastores/0/42/checkpoint.xml” failed: error: Failed to open file ‘/var/lib/one//datastores/0/42/checkpoint’: No such file or directory
Fri Mar 29 04:52:21 2019 [Z0][VMM][E]: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: ExitCode: 1
Fri Mar 29 04:52:21 2019 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Fri Mar 29 04:52:21 2019 [Z0][VMM][E]: Error restoring VM: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:52:21 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED_FAILURE
Fri Mar 29 04:52:59 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED
Fri Mar 29 04:57:22 2019 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: ExitCode: 0
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/restore ‘/var/lib/one//datastores/0/42/checkpoint’ ‘sg.mydomain.com’ ‘one-42’ 42 sg.mydomain.com
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/restore: line 43: warning: command substitution: ignored null byte in input
Fri Mar 29 04:59:33 2019 [Z0][VMM][E]: restore: Command “set -e -o pipefail
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]:
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: # extract the xml from the checkpoint
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]:
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: virsh --connect qemu:///system save-image-dumpxml /var/lib/one//datastores/0/42/checkpoint > /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]:
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: # Eeplace all occurrences of the DS_LOCATION/<DS_ID>/<VM_ID> with the specific
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: # DS_ID where the checkpoint is placed. This is done in case there was a
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: # system DS migration
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]:
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one//datastores/[0-9]+/42/%/var/lib/one//datastores/0/42/%g” /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one/datastores/[0-9]+//%/var/lib/one//datastores/0//%g” /var/lib/one//datastores/0/42/checkpoint.xml” failed: error: Failed to open file ‘/var/lib/one//datastores/0/42/checkpoint’: No such file or directory
Fri Mar 29 04:59:33 2019 [Z0][VMM][E]: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: ExitCode: 1
Fri Mar 29 04:59:33 2019 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Fri Mar 29 04:59:33 2019 [Z0][VMM][E]: Error restoring VM: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 04:59:33 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED_FAILURE
Fri Mar 29 05:03:55 2019 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: ExitCode: 0
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/restore ‘/var/lib/one//datastores/0/42/checkpoint’ ‘sg.mydomain.com’ ‘one-42’ 42 sg.mydomain.com
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/restore: line 43: warning: command substitution: ignored null byte in input
Fri Mar 29 05:06:06 2019 [Z0][VMM][E]: restore: Command “set -e -o pipefail
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]:
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: # extract the xml from the checkpoint
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]:
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: virsh --connect qemu:///system save-image-dumpxml /var/lib/one//datastores/0/42/checkpoint > /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]:
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: # Eeplace all occurrences of the DS_LOCATION/<DS_ID>/<VM_ID> with the specific
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: # DS_ID where the checkpoint is placed. This is done in case there was a
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: # system DS migration
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]:
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one//datastores/[0-9]+/42/%/var/lib/one//datastores/0/42/%g” /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one/datastores/[0-9]+//%/var/lib/one//datastores/0//%g” /var/lib/one//datastores/0/42/checkpoint.xml” failed: error: Failed to open file ‘/var/lib/one//datastores/0/42/checkpoint’: No such file or directory
Fri Mar 29 05:06:06 2019 [Z0][VMM][E]: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: ExitCode: 1
Fri Mar 29 05:06:06 2019 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Fri Mar 29 05:06:06 2019 [Z0][VMM][E]: Error restoring VM: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 05:09:48 2019 [Z0][VM][I]: New LCM state is RUNNING
Fri Mar 29 05:10:05 2019 [Z0][VM][I]: New LCM state is SAVE_STOP
Fri Mar 29 05:10:57 2019 [Z0][VM][I]: New LCM state is EPILOG_STOP
Fri Mar 29 05:11:17 2019 [Z0][VM][I]: New state is STOPPED
Fri Mar 29 05:11:17 2019 [Z0][VM][I]: New LCM state is LCM_INIT
Fri Mar 29 05:11:25 2019 [Z0][VM][I]: New state is PENDING
Fri Mar 29 05:12:14 2019 [Z0][VMM][W]: Ignored: LOG I 42 Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/save ‘one-42’ ‘/var/lib/one//datastores/0/42/checkpoint’ ‘sg.mydomain.com’ 42 sg.mydomain.com

Fri Mar 29 05:12:14 2019 [Z0][VMM][W]: Ignored: LOG E 42 save: Command “virsh --connect qemu:///system save one-42 /var/lib/one//datastores/0/42/checkpoint” failed: error: failed to get domain ‘one-42’

Fri Mar 29 05:12:14 2019 [Z0][VMM][W]: Ignored: LOG I 42 error: Domain not found: no domain with matching name ‘one-42’

Fri Mar 29 05:12:14 2019 [Z0][VMM][W]: Ignored: LOG E 42 Could not save one-42 to /var/lib/one//datastores/0/42/checkpoint

Fri Mar 29 05:12:14 2019 [Z0][VMM][W]: Ignored: LOG I 42 ExitCode: 1

Fri Mar 29 05:12:14 2019 [Z0][VMM][W]: Ignored: LOG I 42 Failed to execute virtualization driver operation: save.

Fri Mar 29 05:12:14 2019 [Z0][VMM][W]: Ignored: SAVE FAILURE 42 Could not save one-42 to /var/lib/one//datastores/0/42/checkpoint

Fri Mar 29 05:17:40 2019 [Z0][TM][W]: Ignored: TRANSFER SUCCESS 42 -

Fri Mar 29 05:46:08 2019 [Z0][VM][I]: New state is ACTIVE
Fri Mar 29 05:46:08 2019 [Z0][VM][I]: New LCM state is PROLOG_RESUME
Fri Mar 29 05:46:21 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED
Fri Mar 29 05:46:21 2019 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Fri Mar 29 05:46:21 2019 [Z0][VMM][I]: pre: Executed “sudo brctl addbr onebr9”.
Fri Mar 29 05:46:21 2019 [Z0][VMM][I]: pre: Executed “sudo ip link set onebr9 up”.
Fri Mar 29 05:46:21 2019 [Z0][VMM][I]: pre: Executed “sudo brctl addif onebr9 enp1s0f0”.
Fri Mar 29 05:46:21 2019 [Z0][VMM][I]: ExitCode: 0
Fri Mar 29 05:46:21 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri Mar 29 05:46:21 2019 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/restore ‘/var/lib/one//datastores/0/42/checkpoint’ ‘sg.mydomain.com’ ‘one-42’ 42 sg.mydomain.com
Fri Mar 29 05:46:21 2019 [Z0][VMM][I]: /var/tmp/one/vmm/kvm/restore: line 43: warning: command substitution: ignored null byte in input
Fri Mar 29 05:46:21 2019 [Z0][VMM][E]: restore: Command “set -e -o pipefail
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]:
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: # extract the xml from the checkpoint
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]:
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: virsh --connect qemu:///system save-image-dumpxml /var/lib/one//datastores/0/42/checkpoint > /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]:
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: # Eeplace all occurrences of the DS_LOCATION/<DS_ID>/<VM_ID> with the specific
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: # DS_ID where the checkpoint is placed. This is done in case there was a
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: # system DS migration
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]:
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one//datastores/[0-9]+/42/%/var/lib/one//datastores/0/42/%g” /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: sed -i “s%/var/lib/one/datastores/[0-9]+//%/var/lib/one//datastores/0//%g” /var/lib/one//datastores/0/42/checkpoint.xml” failed: error: operation failed: failed to read qemu header
Fri Mar 29 05:46:22 2019 [Z0][VMM][E]: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: ExitCode: 1
Fri Mar 29 05:46:22 2019 [Z0][VMM][I]: Failed to execute virtualization driver operation: restore.
Fri Mar 29 05:46:22 2019 [Z0][VMM][E]: Error restoring VM: Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml
Fri Mar 29 05:46:22 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED_FAILURE
Fri Mar 29 06:43:26 2019 [Z0][VM][I]: New LCM state is BOOT_STOPPED
Fri Mar 29 06:47:47 2019 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.

Was there any thing I could have done to save my VM ? Is there any way to stop my empty host to keep getting the current error I posted first ?

It turns out the timeout error was due to ipv6 networking issues, I didn’t noticed it because when sshing from front/end node and vice versa it’s using ipv4.

However regarding the checkpoint.xml error, in case it happens again, I’m curious to know if there are any procedure to retrieve a VM in failed state with error “Could not recalculate paths in /var/lib/one//datastores/0/42/checkpoint.xml”