"terminate" fails after upgrade to 5.0.2

We upgraded from 4.14 to 5.0.2, and thankfully the upgrade itself was successful. I can create VMs, live-migrate, etc. However, after stopping a test VM and terminating it, I get an error:

Fri Aug 19 10:55:41 2016 [Z0][TM][I]: Command execution fail: /var/lib/one/remotes/tm/ceph/delete 
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: delete: Deleting /var/lib/one/datastores/0/285/disk.0
Fri Aug 19 10:55:41 2016 [Z0][TM][E]: delete: Command " RBD="rbd --id libvirt"
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: 
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: if [ "$(rbd_format one/one-92-285-0)" = "2" ]; then
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: rbd_rm_r $(rbd_top_parent one/one-92-285-0)
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: 
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: if [ -n "285-0" ]; then
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: rbd_rm_snap one/one-92 285-0
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: fi
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: else
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: rbd --id libvirt rm one/one-92-285-0

Fri Aug 19 10:55:41 2016 [Z0][TM][I]: bash: line 109: rbd: command not found
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: bash: line 221: rbd: command not found
Fri Aug 19 10:55:41 2016 [Z0][TM][E]: Error deleting one/one-92-285-0 in xxxxxxx
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: ExitCode: 127
Fri Aug 19 10:55:41 2016 [Z0][TM][E]: Error executing image transfer script: Error deleting one/one-92-285-0
Fri Aug 19 10:55:41 2016 [Z0][VM][I]: New LCM state is EPILOG_FAILURE

I can then go into the Recover menu and delete the VM there. Any thoughts?
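For reference, I believe the CLI equivalent of that Recover delete action is something like this, with 285 being the VM ID from the log above:

onevm recover --delete 285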

I'd check why rbd cannot be executed, and then issue a "recover --retry" to clean up the disks.
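A minimal sketch of that check, assuming the oneadmin user on the front-end and host1 taken from the datastore's BRIDGE_LIST:

# as oneadmin on the front-end: can the bridge host actually run rbd?
ssh host1 'which rbd && rbd --id libvirt ls -p one'
# once rbd resolves there, retry the failed epilog to clean up the disks
onevm recover --retry <vmid>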

Somehow, since the upgrade from 4.14 to 5.0.2, it is not SSH-ing to the Ceph nodes to run the "rbd --id libvirt rm one/one-92-285-0" command. We don't have ceph-common installed on the host running OpenNebula. The BRIDGE_LIST variable looks correct, though, judging by the output of the "onedatastore show XXX" command; it does list the Ceph nodes.

Here is the relevant part of the "onedatastore show" output for the Ceph datastore:

DATASTORE TEMPLATE                                                              
BRIDGE_LIST="host1 host2 host3 host4"
CEPH_SECRET="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CEPH_USER="libvirt"
CLONE_TARGET="SELF"
DISK_TYPE="RBD"
DS_MAD="ceph"
LN_TARGET="NONE"
POOL_NAME="one"
TM_MAD="ceph"
TYPE="IMAGE_DS"
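For a quick check of just that attribute, something like the following works (1 being a placeholder for the datastore ID):

onedatastore show 1 | grep BRIDGE_LIST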

I added the CEPH_HOST variable to the Ceph datastore template (cf. http://docs.opennebula.org/5.0/deployment/open_cloud_storage_setup/ceph_ds.html), but it had no effect.
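For completeness, this is roughly how I applied it, with 1 standing in for our Ceph datastore ID and the monitor host names as placeholders for our setup:

onedatastore update 1
# appended in the editor that opens:
CEPH_HOST="mon1 mon2 mon3"

Creating a test VM, I noticed this time that "Terminate" was greyed out in the Sunstone GUI but "Terminate hard" was available. That command failed again, however, with the same errors as above. Deleting the VM with the "Recover -> Delete" button produced the following messages: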

Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:3808 UID:7 VirtualMachineInfo invoked , 287
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:3808 UID:7 VirtualMachineInfo result SUCCESS, "<VM><ID>287</ID><UID..."
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:5632 UID:7 VirtualMachineRecover invoked , 287, 3
Wed Sep  7 12:35:04 2016 [Z0][DiM][D]: Deleting VM 287
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:5632 UID:7 VirtualMachineRecover result SUCCESS, 287
Wed Sep  7 12:35:04 2016 [Z0][ONE][E]: Trying to remove VM 287, that it is not associated to host 12.
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:7776 UID:7 VirtualMachineInfo invoked , 287
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:7776 UID:7 VirtualMachineInfo result SUCCESS, "<VM><ID>287</ID><UID..."
Wed Sep  7 12:35:04 2016 [Z0][TM][D]: Message received: LOG I 287 Driver command for 287 cancelled
Wed Sep  7 12:35:05 2016 [Z0][ReM][D]: Req:1600 UID:0 VirtualMachineInfo invoked , 287
Wed Sep  7 12:35:05 2016 [Z0][ReM][D]: Req:1600 UID:0 VirtualMachineInfo result SUCCESS, "<VM><ID>287</ID><UID..."

Here the test VM is 287 and the Ceph host is host 12. Also, I checked on the Ceph cluster with "rbd ls -p one --id libvirt", and the VM's snapshot had indeed been deleted.

Another clue: "terminate" from the command line ("onevm terminate XXX") works.