VMs deleted after onevm stop

Hi,
I found a very weird and quite dangerous behaviour on only one of my two KVM hypervisors. I am using a shared NFS image and system datastore. OpenNebula version 5.0.2.

When I execute this:

onevm stop 128

the whole directory of the VM just disappears:

ls -l /var/lib/one//datastores/0/128
ls: cannot access '/var/lib/one//datastores/0/128': No such file or directory

It also happens when I “cold migrate” a VM off this host (because a cold migration involves the “vmm save” and “vmm restore” operations).

oned.log tells me that the save operation returned 0:

Tue Nov 22 13:00:41 2016 [Z0][VMM][D]: Message received: LOG I 128 ExitCode: 0
Tue Nov 22 13:00:41 2016 [Z0][VMM][D]: Message received: LOG I 128 Successfully execute virtualization driver operation: save.
Tue Nov 22 13:00:41 2016 [Z0][VMM][D]: Message received: LOG I 128 ExitCode: 0
Tue Nov 22 13:00:41 2016 [Z0][VMM][D]: Message received: LOG I 128 Successfully execute network driver operation: clean.
Tue Nov 22 13:00:41 2016 [Z0][VMM][D]: Message received: SAVE SUCCESS 128 -
Tue Nov 22 13:00:42 2016 [Z0][TM][D]: Message received: TRANSFER SUCCESS 128 -

If I put “exit 1” at the very end of the script /var/lib/one/remotes/vmm/kvm/save (followed by onehost sync to the hypervisor, of course), OpenNebula does not perform any further operations and logs an error in oned.log.
So it seems this script is not the problem at all; everything in the VM’s system datastore directory is as it’s supposed to be:

-rw-rw-rw- 1 oneadmin oneadmin 213571371 Nov  22 13:00 checkpoint
-rw-rw-r-- 1 oneadmin oneadmin      1002 Nov  22 11:51 deployment.0
-rw-r--r-- 1 oneadmin oneadmin 364773376 Nov  22 12:17 disk.0
-rw-r--r-- 1 oneadmin oneadmin    382976 Nov  22 11:51 disk.1

So this brings me to the conclusion that the deletion of the VM directory in the system DS happens right AFTER the vmm/kvm/save script.

So the question is: what exactly happens after this script?
According to the states flowchart (http://docs.opennebula.org/5.2/_images/states-complete.png), the action called after save_stop is named epilog_stop.

Can someone point me to the shell or Ruby script that defines what happens at epilog_stop?

And just to note this again: the weird thing is that it only happens on one of my two hypervisors, the one that is not the frontend!
Yes, I checked for problems with the NFS share. It seems to work (if it didn’t, I probably couldn’t perform all the other VM operations properly).

Thanks
all the best
Jojo

Hi Jojo,

After vmm/<driver>/save, the TM mv script is called in two contexts.
First, for each VM disk, the mv script of the corresponding image datastore’s TM driver is called.
Then the system datastore’s TM mv is called with the path to the VM home. I think something is going wrong here, because the script should do nothing if the source and destination paths are the same; otherwise, it will simply do mv src_path dst_path.
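Anton's no-op check can be sketched roughly like this. This is a simplified illustration, not the actual OpenNebula tm mv source; the argument handling and helper logic are reduced to the essentials:

```shell
#!/bin/bash
# Illustrative sketch of a shared-TM "mv" early exit (not the real driver code).
# OpenNebula passes arguments of the form: hostA:/path hostB:/path vmid dsid
SRC="$1"
DST="$2"

# Drop the "host:" prefix: on a shared export only the path matters,
# so dell2:/var/lib/one/datastores/0/132 and dell1:/... compare equal.
SRC_PATH="${SRC#*:}"
DST_PATH="${DST#*:}"

if [ "$SRC_PATH" = "$DST_PATH" ]; then
    # Same directory on the shared filesystem: nothing to move, nothing to delete.
    exit 0
fi

# Different paths: perform the actual move.
mv "$SRC_PATH" "$DST_PATH"
```

The ssh TM driver, by contrast, has no shared filesystem to rely on, so it copies the VM home between hosts and then removes the source directory, which matches the deletion you observed.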

The same scripts are called on cold migrate too.

Please take a look at this doc http://docs.opennebula.org/5.2/integration/infrastructure_integration/sd.html#an-example-vm

Hope this helps.

Kind Regards,
Anton Todorov

Thanks a lot Anton, that was exactly the information I needed to debug this further!

In the meantime I also came up with another option to debug this issue.
I just set all the VM’s files to immutable:

chattr +i ....datastores/0/132/*

to see exactly which commands are messing with my VM’s files:

Tue Nov 22 14:56:47 2016 [Z0][TM][D]: Message received: LOG I 132 Command execution fail: /var/lib/one/remotes/tm/ssh/mv dell2:/var/lib/one//datastores/0/132 dell1:/var/lib/one//datastores/0/132 132 0
Tue Nov 22 14:56:47 2016 [Z0][TM][D]: Message received: LOG I 132 mv: Moving dell2:/var/lib/one/datastores/0/132 to dell1:/var/lib/one/datastores/0/132
Tue Nov 22 14:56:47 2016 [Z0][TM][D]: Message received: LOG E 132 mv: Command "rm -rf '/var/lib/one/datastores/0/132'" failed: rm: cannot remove '/var/lib/one/datastores/0/132/deployment.0': Operation not permitted
Tue Nov 22 14:56:47 2016 [Z0][TM][D]: Message received: LOG I 132 rm: cannot remove '/var/lib/one/datastores/0/132/disk.0': Operation not permitted
Tue Nov 22 14:56:47 2016 [Z0][TM][D]: Message received: LOG I 132 rm: cannot remove '/var/lib/one/datastores/0/132/disk.1': Operation not permitted
Tue Nov 22 14:56:47 2016 [Z0][TM][D]: Message received: LOG E 132 Error removing target path to prevent overwrite errors
Tue Nov 22 14:56:47 2016 [Z0][TM][D]: Message received: LOG I 132 ExitCode: 1
Tue Nov 22 14:56:47 2016 [Z0][TM][D]: Message received: TRANSFER FAILURE 132 Error removing target path to prevent overwrite errors

Now I realized that all my datastores are set to the TM driver “ssh”, which is actually not what I want.
And the ssh driver is actually doing exactly what it is supposed to: copying everything back from the host’s system DS to the frontend’s system DS and then deleting it from the host’s system DS :wink:

I thought the TM type was “shared” by default. As far as I remember, I never touched this setting.
Was this default changed in 5.0.2? Wasn’t the default datastore TM type “shared”? In my other OpenNebula cluster (which I initially installed with 4.14 and then upgraded to 5.0.2), everything is and always was set to “shared”.

Alright, but finally we found the root of the problem, a wrong TM setting:

ID NAME                SIZE AVAIL CLUSTERS     IMAGES TYPE DS      TM      STAT
 0 system-nfs             - -     100               0 sys  -       ssh     on  
 1 image-nfs          28.7G 70%   100               3 img  fs      ssh     on  
 2 files-nfs          28.7G 70%   100               0 fil  fs      ssh     on
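For anyone hitting the same issue: on a shared NFS setup the TM driver of the affected datastores can be switched back with onedatastore update. A minimal sketch, assuming the datastore IDs from the listing above (check them against onedatastore list on your own cluster first, and make sure no VMs are running from the datastore while you change its TM driver; the --append flag appends the attribute instead of replacing the whole template):

```shell
# Write the corrected attribute to a small template fragment.
cat > tm_fix.txt <<'EOF'
TM_MAD="shared"
EOF

# Append it to each datastore that was wrongly set to "ssh".
onedatastore update 0 tm_fix.txt --append   # system-nfs
onedatastore update 1 tm_fix.txt --append   # image-nfs
onedatastore update 2 tm_fix.txt --append   # files-nfs
```

Afterwards, onedatastore list should show "shared" in the TM column.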