"terminate" fails after upgrade to 5.0.2

We upgraded from 4.14 to 5.0.2, and thankfully the upgrade itself was successful. I can create VMs, live-migrate, etc. However, after stopping a test VM and terminating it, I get an error:

Fri Aug 19 10:55:41 2016 [Z0][TM][I]: Command execution fail: /var/lib/one/remotes/tm/ceph/delete 
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: delete: Deleting /var/lib/one/datastores/0/285/disk.0
Fri Aug 19 10:55:41 2016 [Z0][TM][E]: delete: Command " RBD="rbd --id libvirt"
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: 
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: if [ "$(rbd_format one/one-92-285-0)" = "2" ]; then
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: rbd_rm_r $(rbd_top_parent one/one-92-285-0)
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: 
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: if [ -n "285-0" ]; then
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: rbd_rm_snap one/one-92 285-0
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: fi
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: else
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: rbd --id libvirt rm one/one-92-285-0

Fri Aug 19 10:55:41 2016 [Z0][TM][I]: bash: line 109: rbd: command not found
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: bash: line 221: rbd: command not found
Fri Aug 19 10:55:41 2016 [Z0][TM][E]: Error deleting one/one-92-285-0 in xxxxxxx
Fri Aug 19 10:55:41 2016 [Z0][TM][I]: ExitCode: 127
Fri Aug 19 10:55:41 2016 [Z0][TM][E]: Error executing image transfer script: Error deleting one/one-92-285-0
Fri Aug 19 10:55:41 2016 [Z0][VM][I]: New LCM state is EPILOG_FAILURE

I can then go into the Recover menu and delete the VM there. Any thoughts?
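For reference, I believe the CLI equivalent of that Recover delete action is something like this, with 285 being the VM ID from the log above:

onevm recover --delete 285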

I'd check why rbd cannot be executed, and then issue a "recover --retry" to clean up the disks.
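A minimal sketch of that check, assuming the oneadmin user on the front-end and host1 taken from the datastore's BRIDGE_LIST:

# as oneadmin on the front-end: can the bridge host actually run rbd?
ssh host1 'which rbd && rbd --id libvirt ls -p one'
# once rbd resolves there, retry the failed epilog to clean up the disks
onevm recover --retry <vmid>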

Somehow, since the upgrade from 4.14 to 5.0.2, it is not SSH-ing to the Ceph nodes to run the "rbd --id libvirt rm one/one-92-285-0" command. We don't have ceph-common installed on the host running OpenNebula. The BRIDGE_LIST variable looks correct, though, judging by the output of the "onedatastore show XXX" command; it does list the Ceph nodes.

Here is the relevant part of the "onedatastore show" output for the Ceph datastore:

DATASTORE TEMPLATE                                                              
BRIDGE_LIST="host1 host2 host3 host4"
CEPH_SECRET="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CEPH_USER="libvirt"
CLONE_TARGET="SELF"
DISK_TYPE="RBD"
DS_MAD="ceph"
LN_TARGET="NONE"
POOL_NAME="one"
TM_MAD="ceph"
TYPE="IMAGE_DS"
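For a quick check of just that attribute, something like the following works (1 being a placeholder for the datastore ID):

onedatastore show 1 | grep BRIDGE_LIST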

I added the CEPH_HOST variable to the Ceph datastore template (cf. http://docs.opennebula.org/5.0/deployment/open_cloud_storage_setup/ceph_ds.html), but it had no effect.
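For completeness, this is roughly how I applied it, with 1 standing in for our Ceph datastore ID and the monitor host names as placeholders for our setup:

onedatastore update 1
# appended in the editor that opens:
CEPH_HOST="mon1 mon2 mon3"

Creating a test VM, I noticed this time that "Terminate" was greyed out in the Sunstone GUI but "Terminate hard" was available. That command failed again, however, with the same errors as above. Deleting the VM with the "Recover -> Delete" button produced the following messages: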

Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:3808 UID:7 VirtualMachineInfo invoked , 287
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:3808 UID:7 VirtualMachineInfo result SUCCESS, "<VM><ID>287</ID><UID..."
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:5632 UID:7 VirtualMachineRecover invoked , 287, 3
Wed Sep  7 12:35:04 2016 [Z0][DiM][D]: Deleting VM 287
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:5632 UID:7 VirtualMachineRecover result SUCCESS, 287
Wed Sep  7 12:35:04 2016 [Z0][ONE][E]: Trying to remove VM 287, that it is not associated to host 12.
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:7776 UID:7 VirtualMachineInfo invoked , 287
Wed Sep  7 12:35:04 2016 [Z0][ReM][D]: Req:7776 UID:7 VirtualMachineInfo result SUCCESS, "<VM><ID>287</ID><UID..."
Wed Sep  7 12:35:04 2016 [Z0][TM][D]: Message received: LOG I 287 Driver command for 287 cancelled
Wed Sep  7 12:35:05 2016 [Z0][ReM][D]: Req:1600 UID:0 VirtualMachineInfo invoked , 287
Wed Sep  7 12:35:05 2016 [Z0][ReM][D]: Req:1600 UID:0 VirtualMachineInfo result SUCCESS, "<VM><ID>287</ID><UID..."

Here the test VM is 287 and the Ceph host is host 12. Also, I checked on the Ceph cluster with "rbd ls -p one --id libvirt", and the VM's snapshot had indeed been deleted.

Another clue: "terminate" from the command line ("onevm terminate XXX") works.