I created a new node on our infrastructure and now whenever I want to start a VM on this new node I get these error messages:
error: qemu unexpectedly closed the monitor: 2019-10-28T17:09:21.244860Z qemu-kvm: -drive file=/var/lib/one//datastores/106/214/disk.0,format=qcow2,if=none,id=drive-ide0-0-0,cache=none: Failed to get "write" lock
Mon Oct 28 18:09:34 2019 [Z0][VMM][I]: Is another process using the image [/var/lib/one//datastores/106/214/disk.0]?
Mon Oct 28 18:09:34 2019 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/106/214/deployment.38
Interestingly, the file deployment.38 still exists afterwards, and when I run lsof on this file afterwards, no process is using it. This datastore is on NFS, and we run OpenNebula 5.8.5.
Does anybody have any ideas on how I could investigate what exactly is going on?
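One thing worth knowing for the investigation: lsof does not list fcntl()/flock() record locks, so an empty lsof output does not prove the image is unlocked. A minimal sketch of how to actually see such a lock, demonstrated on a scratch file (on the hypervisor you would point the same commands at /var/lib/one/datastores/106/214/disk.0):

```shell
# lsof does not show record locks; the kernel lock table (/proc/locks) does.
IMG=$(mktemp)                    # scratch stand-in for the real disk.0
flock -x "$IMG" sleep 3 &        # background lock holder, stand-in for qemu-kvm
sleep 1
# a second exclusive attempt fails, the same way QEMU's "write" lock does:
flock -xn "$IMG" -c true && echo "free" || echo "locked"   # prints "locked"
# the holder shows up in the kernel lock table by device:inode:
grep "$(stat -c %i "$IMG")" /proc/locks || echo "no lock entry"
wait
rm -f "$IMG"
```

On the real image, `qemu-img info -U /var/lib/one/datastores/106/214/disk.0` is also useful: with `-U` (`--force-share`), qemu-img reads the metadata without taking its own lock, which rules out qemu-img itself as the lock holder.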
UPDATE: I tested it now with a VM on the Ceph datastore, and there it works. So it must be something related to NFS.
Could you share the mount options for that NFS mount? Does it only happen with one image or several?
Sure, these are the mount options:
192.168.11.55:/volume1/datastore1 /var/lib/one/datastores/1 nfs rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.11.55,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=192.168.11.55 0 0
And we have the issue with 3 or 4 VMs but not more.
I have the same problem, but I'm testing host_hook. When I stop the libvirtd service on one node in the cluster, I get this error:
Message received: LOG I 108 Command execution fail: cat << EOT | /var/tmp/one/vmm/kvm/deploy '/var/lib/one//datastores/123/108/deployment.3' 'node1' 108 node2
Message received: LOG I 108 error: Failed to create domain from /var/lib/one//datastores/123/108/deployment.3
Message received: LOG I 108 error: internal error: qemu unexpectedly closed the monitor: 2020-03-05T14:39:36.080776Z qemu-kvm: -drive file=/var/lib/one//datastores/123/108/disk.0,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none: Failed to get "write" lock
QEMU has a locking mechanism for images in use when the shared filesystem supports locking. Locked files need some time to expire, because the NFS server gives its clients (the hypervisors) time to recover in case of an incident.
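A related point worth checking on this setup: QEMU's image lock is an fcntl() byte-range lock, and with vers=3 and local_lock=none those locks travel over the separate NLM side protocol (lockd/statd), not over NFS itself. A quick check that both ends actually offer that service (server IP taken from the fstab line above):

```shell
# NFSv3 has no in-protocol locking; fcntl() locks go over NLM (nlockmgr)
# plus NSM (status). If the NAS does not register these services, or its
# lockd state is stale, qemu-kvm's lock attempts can fail or hang.
rpcinfo -p 192.168.11.55 | grep -E 'nlockmgr|status'
# same check for the client (hypervisor) side:
rpcinfo -p localhost | grep nlockmgr
```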
You can disable locking support on the mounted volumes on the hypervisors by adding the nolock option to the mount options in fstab, or decrease these timeouts on the NFS server (check the --lease-time seconds and --grace-time seconds options).
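For concreteness, a sketch of both variants (the mount options are copied from the fstab line earlier in the thread; the server-side commands assume a Linux kernel nfsd, and the 10-second values are illustrative, not a recommendation):

```shell
# /etc/fstab on the hypervisor: add "nolock". Note this disables NLM on
# the client, so QEMU's lock no longer protects the image across hosts --
# only do this if OpenNebula itself guarantees exclusive use of the image.
192.168.11.55:/volume1/datastore1 /var/lib/one/datastores/1 nfs rw,relatime,vers=3,nolock,rsize=131072,wsize=131072,hard,proto=tcp,timeo=600,retrans=2 0 0

# Or, on a Linux NFS server, shorten the recovery windows (in seconds):
rpc.nfsd --lease-time 10 --grace-time 10
# current values can be read back from procfs (gracetime on newer kernels):
cat /proc/fs/nfsd/nfsv4leasetime
cat /proc/fs/nfsd/nfsv4gracetime
```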
In the fstab output above you can see that I have set "timeo=600" (600 tenths of a second, i.e. a 60-second retransmit timeout). My test VM has now been shut down for 2 hours and still cannot be started due to the lock problem. From my point of view it looks like there is another problem.
I have now opened a case with Synology, because the NFS share is on a Synology NAS. After several investigations they said the problem must be with qemu; from their side everything is fine. Maybe someone else knows more.