I’m looking for the best solution to protect against duplicate running instances of the same VM across multiple hosts.
I’m using virtlockd with libvirt to lock each disk while a VM is in the running state. When an error appears on a host, an automatic migration is triggered by the hooks configured from the OpenNebula manager.
I’m using a SAN connected over iSCSI with an OCFS2 filesystem on each node and on the manager, and I use that shared filesystem as the datastore in OpenNebula. So all the disks (datastore 1) and all the running instances (datastore 0) are shared between the hosts and the manager.
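For context, the virtlockd setup I have on each host is roughly the following (the lockspace directory is an example path on my shared datastore; adjust it to your layout, and restart libvirtd/virtlockd after changing it):

```
# /etc/libvirt/qemu.conf -- enable the "lockd" lock manager for QEMU guests
lock_manager = "lockd"

# /etc/libvirt/qemu-lockd.conf -- take indirect leases in a shared directory
# instead of direct fcntl locks on the disk image paths themselves
file_lockspace_dir = "/var/lib/one/datastores/locks"
```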
That seemed like a good solution, but the VM still starts on another host when I stop the libvirtd process on the first node.
Is there something wrong with the architecture I have implemented, or is this simply not possible?
Does nobody have any idea about this point?
I’m still looking for a solution, but I haven’t resolved the problem yet.
I can’t understand why virtlockd wouldn’t lock the disk and prevent the VM from starting on the other node.
When an image is persistent, OpenNebula will not allow another VM to run from it. So I guess that is the best way to protect against a duplicate VM running from the same disk image.
The point is that the problem happens when libvirtd crashes or is stopped. The VM is automatically migrated to the other host by the hooks, but when the first node comes back, the VM is running on two nodes, so there are two concurrent accesses to the disks, which could corrupt the filesystem.
Fencing is the standard approach to prevent split brain conditions in a
distributed system. AFAIK the locking mechanism of libvirt is not safe… A
proper fence requires a dedicated network and device.
So I will install watchdog to act as a fencing system, which will reboot the host if the libvirtd process is detected as stopped or crashed. Do you think that is a good solution, or can you think of another fencing mechanism?
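The idea would be something along these lines in /etc/watchdog.conf: the watchdog daemon supervises libvirtd through its pid file, and if the process disappears it stops servicing the watchdog device, so the host reboots (the pid-file path may differ on your distribution, and /dev/watchdog can be backed by hardware or by the softdog module):

```
# /etc/watchdog.conf -- reboot the host if libvirtd dies
# Device driving the hardware (or softdog) watchdog timer.
watchdog-device = /dev/watchdog

# If the process owning this pid file disappears, the test fails
# and the daemon initiates a reboot of the host.
pidfile = /var/run/libvirtd.pid
```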
I succeeded in implementing the protection against duplicate running instances.
I used sanlock together with watchdog to create a lock file and monitor read/write access.
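For anyone hitting the same issue, the libvirt side of the sanlock setup looks roughly like this (host_id must be unique per node, and the lease directory is an example path that must live on the shared storage):

```
# /etc/libvirt/qemu.conf -- switch the lock manager from lockd to sanlock
lock_manager = "sanlock"

# /etc/libvirt/qemu-sanlock.conf
auto_disk_leases = 1                        # create a lease for each disk automatically
disk_lease_dir = "/var/lib/libvirt/sanlock" # lease directory, on shared storage
require_lease_for_disks = 1                 # refuse to start a disk without a lease
host_id = 1                                 # unique ID for this host
```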
I spoke with a libvirt engineer who told me that virtlockd is not yet compatible with the OCFS2 filesystem. That was the reason why the lock files were created but the VM could still start on the other node.
So right now, I get this error when a VM tries to boot on the other node while a lock file persists:
Fri Apr 24 14:37:39 2015 [Z0][VMM][I]: error: Failed to create domain from /var/lib/one//datastores/0/129/deployment.4
Fri Apr 24 14:37:39 2015 [Z0][VMM][I]: error: internal error: Failed to acquire lock: error -243
Fri Apr 24 14:37:39 2015 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/0/129/deployment.4
Fri Apr 24 14:37:39 2015 [Z0][VMM][I]: ExitCode: 255
Fri Apr 24 14:37:39 2015 [Z0][VMM][I]: Failed to execute virtualization driver operation: deploy.
Fri Apr 24 14:37:39 2015 [Z0][VMM][E]: Error deploying virtual machine: Could not create domain from /var/lib/one//datastores/0/129/deployment.4
Fri Apr 24 14:37:39 2015 [Z0][DiM][I]: New VM state is FAILED