Possible race condition leading to a vm out of the herd

Hi,

just in case it helps to catch some bug…

We got to some state (we don’t now how), where a VM machine is defined and running in some host’s hypervisor (kvm), but it’s no longer in Opennebula UI (or database).
Checking the hypervisor with libvirt, we can see the machine “one-1437” running in our host, but the machine is no longer in Opennebula.

We realized the problem because we were seeing IP collisions (the IP the machine had in Opennebula has been reused, as the machine no longer exists for Opennebula).

The logs for the machine are:

Mon Apr 10 17:20:08 2017 [Z0][HKM][I]: Success executing Hook: notify_create_hook: {"status":"ok"}{"status":"ok"}.
Mon Apr 10 17:20:53 2017 [Z0][VM][I]: New state is ACTIVE
Mon Apr 10 17:20:53 2017 [Z0][VM][I]: New LCM state is PROLOG
Mon Apr 10 17:21:00 2017 [Z0][VM][I]: New LCM state is BOOT
Mon Apr 10 17:21:00 2017 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/1437/deployment.0
Mon Apr 10 17:21:02 2017 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Mon Apr 10 17:21:02 2017 [Z0][VMM][I]: ExitCode: 0
Mon Apr 10 17:21:02 2017 [Z0][VMM][I]: ExitCode: 0
Mon Apr 10 17:21:02 2017 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Mon Apr 10 17:21:03 2017 [Z0][VM][I]: New LCM state is CLEANUP_DELETE
Mon Apr 10 17:21:03 2017 [Z0][VMM][I]: Driver command for 1437 cancelled
Mon Apr 10 17:21:03 2017 [Z0][VM][I]: New state is DONE
Mon Apr 10 17:21:03 2017 [Z0][VM][I]: New LCM state is LCM_INIT
Mon Apr 10 17:21:04 2017 [Z0][VMM][W]: Ignored: LOG I 1437 Successfully execute transfer manager driver operation: tm_delete.
Mon Apr 10 17:21:04 2017 [Z0][VMM][W]: Ignored: LOG I 1437 Successfully execute transfer manager driver operation: tm_delete.
Mon Apr 10 17:21:04 2017 [Z0][VMM][W]: Ignored: CLEANUP SUCCESS 1437

Regards,
Andrés

What version are you using and which action do you use to kill the VM?

Version is 5.2.1

Copyright 2002-2017, OpenNebula Project, OpenNebula Systems
OpenNebula 5.2.1 is distributed and licensed for use under the terms of the
Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).

I think we killed the machine with the Terminate HARD option in the WebUI (but I’m not 100% sure).

This line means that the shutdown action was cancelled:

Mon Apr 10 17:21:03 2017 [Z0][VMM][I]: Driver command for 1437 cancelled

Maybe the terminate was issued twice in a row or recover --delete was executed while the VM was being shutdown. Does it happen to more VMs or only this one?

It only happened to this one…