Hello all, I am pretty new to OpenNebula, but I have inherited a misbehaving KVM-based OpenNebula cluster with a Ceph datastore (Ubuntu 14.04 and OpenNebula 4.8.0). I’ve looked into the how-to documents, and the implementation does not adhere strictly to them, but it was working until recently and nothing was changed manually.
I’ve been browsing the forum but cannot find the answer to my problem, which is this: when I try to instantiate a VM on the Ceph-backed KVM cluster, I hit the following errors in the log:
Tue Mar 8 10:55:15 2016 [Z0][TM][D]: Message received: TRANSFER SUCCESS 1667 -
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: LOG I 1667 ExitCode: 0
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: LOG I 1667 Successfully execute network driver operation: pre.
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: LOG I 1667 Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one//datastores/108/1667/deployment.0' 'fras004' 1667 fras004
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: LOG I 1667 error: Failed to create domain from /var/lib/one//datastores/108/1667/deployment.0
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: LOG I 1667 error: Failed to open file '/var/lib/one//datastores/108/1667/disk.1': No such file or directory
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: LOG E 1667 Could not create domain from /var/lib/one//datastores/108/1667/deployment.0
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: LOG I 1667 ExitCode: 255
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: LOG I 1667 Failed to execute virtualization driver operation: deploy.
Tue Mar 8 10:55:15 2016 [Z0][VMM][D]: Message received: DEPLOY FAILURE 1667 Could not create domain from /var/lib/one//datastores/108/1667/deployment.0
The VM log says pretty much the same:
Tue Mar 8 10:55:03 2016 [Z0][DiM][I]: New VM state is ACTIVE.
Tue Mar 8 10:55:03 2016 [Z0][LCM][I]: New VM state is PROLOG.
Tue Mar 8 10:55:15 2016 [Z0][LCM][I]: New VM state is BOOT
Tue Mar 8 10:55:15 2016 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/1667/deployment.0
Tue Mar 8 10:55:15 2016 [Z0][VMM][I]: ExitCode: 0
Tue Mar 8 10:55:15 2016 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Tue Mar 8 10:55:15 2016 [Z0][VMM][I]: Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one//datastores/108/1667/deployment.0' 'fras004' 1667 fras004
Tue Mar 8 10:55:15 2016 [Z0][VMM][I]: error: Failed to create domain from /var/lib/one//datastores/108/1667/deployment.0
Tue Mar 8 10:55:15 2016 [Z0][VMM][I]: error: Failed to open file '/var/lib/one//datastores/108/1667/disk.1': No such file or directory
Tue Mar 8 10:55:15 2016 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/108/1667/deployment.0
Tue Mar 8 10:55:15 2016 [Z0][VMM][I]: ExitCode: 255
Tue Mar 8 10:55:15 2016 [Z0][VMM][I]: Failed to execute virtualization driver operation: deploy.
Tue Mar 8 10:55:15 2016 [Z0][VMM][E]: Error deploying virtual machine: Could not create domain from /var/lib/one//datastores/108/1667/deployment.0
Tue Mar 8 10:55:15 2016 [Z0][DiM][I]: New VM state is FAILED
On the hypervisor, a directory is created containing only deployment.0 and a disk.1.iso symlink pointing to a nonexistent file.
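For anyone wanting to confirm the same symptom, a dangling link in the VM directory can be spotted with a few lines of shell. This is only a sketch: the function name `list_dangling_links` is made up for illustration, and it assumes `find` and `readlink` are available on the hypervisor.

```shell
#!/bin/sh
# Hypothetical helper: print every symlink under the given directory
# whose target does not exist, together with that target.
list_dangling_links() {
    dir="$1"
    # -type l selects symlinks; `[ -e ]` follows the link,
    # so it fails when the target is missing.
    find "$dir" -type l | while IFS= read -r link; do
        [ -e "$link" ] || printf '%s -> %s\n' "$link" "$(readlink "$link")"
    done
}

# Example, using the path from the logs above:
#   list_dangling_links /var/lib/one/datastores/108/1667
```

If disk.1.iso shows up in the output, the link was created but the file it points to never arrived, which matches the "No such file or directory" error from libvirt.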
From both controller and hypervisor I can list rbd images.
The KVM log of the VM, /var/log/libvirt/qemu/one-1667.log, says only:
2016-03-08 09:55:15.982+0000: shutting down
If I try to execute
virsh --connect qemu:///system create /var/lib/one/datastores/108/1667/deployment.0
I get the same error as before:
error: Failed to create domain from /var/lib/one/datastores/108/1667/deployment.0
error: Failed to open file '/var/lib/one//datastores/108/1667/disk.1': No such file or directory
As mentioned above, the disk.1 file is not transferred to the directory, but I cannot find the reason why.
oned.log, syslog, and dmesg provide no further insight into the problem.
Any idea where I can look to find more detail on the error, or how I can proceed with solving it?
It looks like datastore 108, the system datastore, is using TM_MAD=ceph. Can you change that to TM_MAD=ssh? In fact, you have named it “system_ssh_ceph”. Do you know if you changed the TM for some reason?
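To check which transfer driver the datastore is currently using, something like the following could help. It is a sketch under assumptions: it expects `onedatastore show` to print a header line of the form `TM_MAD : <driver>`, and the helper name `tm_mad_of` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `onedatastore show <id>` output on stdin
# and print the value of the first line that starts with TM_MAD.
tm_mad_of() {
    awk -F':' '/^TM_MAD/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Intended use on the front-end, as oneadmin (datastore ID from the logs above):
#   onedatastore show 108 | tm_mad_of
```

If this prints ceph for the system datastore, running `onedatastore update 108` should open the datastore template in an editor, where TM_MAD can be set to ssh.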
That was the problem for one of the controller nodes (we have two, clustered over pacemaker/corosync). The other one does not create the rbd context file, but that is another thing to deal with.
I don’t know how the driver was changed. I did restart the OpenNebula services, but I didn’t touch the config; I didn’t see the point in altering a configuration that was working previously.