I created a new node on our infrastructure and now whenever I want to start a VM on this new node I get these error messages:
error: qemu unexpectedly closed the monitor: 2019-10-28T17:09:21.244860Z qemu-kvm: -drive file=/var/lib/one//datastores/106/214/disk.0,format=qcow2,if=none,id=drive-ide0-0-0,cache=none: Failed to get "write" lock
Mon Oct 28 18:09:34 2019 [Z0][VMM][I]: Is another process using the image [/var/lib/one//datastores/106/214/disk.0]?
Mon Oct 28 18:09:34 2019 [Z0][VMM][E]: Could not create domain from /var/lib/one//datastores/106/214/deployment.38
Interestingly, the file deployment.38 still exists afterwards, and when I run lsof on the disk image, no process shows up as using it. The datastore is on NFS, and we are running OpenNebula 5.8.5.
Does anybody have ideas on how I could investigate what exactly is going on?
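One avenue worth trying (a sketch, not OpenNebula-specific; the image path is the one from the error message): lsof only lists open file descriptors, while QEMU's image locks are advisory byte-range locks, which show up in the kernel lock table instead:

```shell
# lsof misses advisory byte-range locks; the kernel lock table shows them
cat /proc/locks

# util-linux's lslocks shows the same data with resolved paths, if installed
command -v lslocks >/dev/null && lslocks || true

# qemu-img honours the same image lock; -U (--force-share) bypasses it,
# so plain "info" failing while "info -U" succeeds confirms a held lock
command -v qemu-img >/dev/null \
  && qemu-img info -U /var/lib/one/datastores/106/214/disk.0 || true
```

On NFSv3 the lock can also be held server-side by the lock manager, so the NAS side is worth checking too.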
UPDATE: I tested it now with a VM on the ceph store and there it works. So it must be something related to NFS.
Sure, these are the mount options: 192.168.11.55:/volume1/datastore1 /var/lib/one/datastores/1 nfs rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.11.55,mountvers=3,mountport=892,mountproto=udp,local_lock=none,addr=192.168.11.55 0 0
And we only see the issue with 3 or 4 VMs, not more.
QEMU has a locking mechanism for images in use when the shared filesystem supports locking. The locked files need some time to expire, because the NFS server gives NFS clients (the hypervisors) a grace period to recover in case of an incident.
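The failure mode is easy to reproduce on any filesystem (a sketch using util-linux's flock(1) on a throwaway file; QEMU itself takes fcntl()-style byte-range locks on the image, but the effect is the same):

```shell
img=$(mktemp)                 # stand-in for the qcow2 image

# first QEMU process: holds an exclusive lock while the VM "runs"
flock -x "$img" sleep 5 &
sleep 1

# second start attempt: a non-blocking exclusive lock fails immediately
if ! flock -xn "$img" true; then
    echo 'Failed to get "write" lock'   # same situation as the QEMU error
fi
```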
You can disable locking support on the volumes mounted on the hypervisors by adding the nolock option to the mount options in fstab, or decrease these timeouts on the NFS server (check the --lease-time <seconds> and --grace-time <seconds> options).
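For the first option, the change would look like this in /etc/fstab (a sketch based on the mount shown above; note that with nolock, lock requests stay local to each hypervisor, so QEMU's lock no longer protects the image against a second hypervisor opening it):

```
# /etc/fstab on the hypervisor: same NFS mount, with "nolock" added
192.168.11.55:/volume1/datastore1 /var/lib/one/datastores/1 nfs rw,hard,vers=3,proto=tcp,timeo=600,retrans=2,nolock 0 0
```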
In the fstab output above you can see that I have set "timeo=600", which is the NFS RPC timeout in tenths of a second, i.e. 60 seconds. Either way, my test VM has now been shut down for 2 hours and still cannot be started due to the lock problem. From my point of view it looks like there is another problem.
I have now opened a case with Synology, because the NFS share is on a Synology NAS. After several investigations they said the problem must be with qemu; from their side everything is fine. Maybe someone else knows more.