Requested VNC port already assgined to a VM

Hi,

I have the problem, that some of my older VMs sometimes are not deployed by the scheduler (stuck in PENDING) since VNC is enabled in the template and the scheduler hesitates to believe that the VNC port is already assigned to a VM. I triple checked that this is not the case.

The problem is hard to reproduce. Sometimes restarting the one-deamon helps, in another case I could only fix the problem by removing VNC from the VMs config, power-cycling the VM and the re-enabling VNC in the config of that VM.

An fsck did not find any errors in the database. I’m running OpenNebula 5.0.2 with KVM hypervisors.

Does anyone have an idea what could cause this behavior?

Greetings
Wilma

Hello,

I have ran into the same problem today: I tried to resume an older VM, which has been in the UNDEPLOYED state for a month or so, and I am getting the same errors as above:

Mon Nov 21 08:51:19 2016 [Z0][VM][E]: Error deploying virtual machine 568 to HID: 46. Reason: [VirtualMachineDeploy] Requested VNC port already assgined to a VM
Mon Nov 21 08:51:19 2016 [Z0][VM][E]: Error deploying virtual machine 568 to HID: 47. Reason: [VirtualMachineDeploy] Requested VNC port already assgined to a VM
Mon Nov 21 08:51:19 2016 [Z0][VM][E]: Error deploying virtual machine 568 to HID: 48. Reason: [VirtualMachineDeploy] Requested VNC port already assgined to a VM
Mon Nov 21 08:51:19 2016 [Z0][VM][E]: Error deploying virtual machine 568 to HID: 29. Reason: [VirtualMachineDeploy] Requested VNC port already assgined to a VM
Mon Nov 21 08:51:19 2016 [Z0][VM][E]: Error deploying virtual machine 568 to HID: 40. Reason: [VirtualMachineDeploy] Requested VNC port already assgined to a VM
Mon Nov 21 08:51:19 2016 [Z0][VM][E]: Error deploying virtual machine 568 to HID: 38. Reason: [VirtualMachineDeploy] Requested VNC port already assgined to a VM

Restarting oned or scheduler did not help so far. Maybe the VNC port should be assigned each time the VM is resumed from the UNDEPLOYED state?

I made the VM running using the following commands:

  • in Sunstone, open the VM tab
  • click on “Conf”, and then “Update configuration”
  • click on “Input/Output”, and under “Graphics”, select “none”
  • click on “Update”
  • click again on “Update configuration”
  • click on “Input/Output”, and under “Graphics”, select “VNC”
  • click on “Update”
  • resume the VM
  • Profit!

In the template of the instantiated VM, there should be something like this:

GRAPHICS = [
  LISTEN = "0.0.0.0",
  TYPE = "VNC" ]

The VM which failed to resume had also a "PORT=6543" line in the GRAPHICS section.

This is ONe 5.2, but the VM in question was probably instantiated in 5.0.2.

-Yenya

I just ran into this issue again (one 5.2.0), one of my VMs is not deploying because of this. As you mentioned, the config that was generated probably several versions ago contains a port number, so I guess that’s the problem. Recovery is easy, the only problem is that it is not intuitive for an unexperienced user :wink:

From the administrator’s perspective, I don’t get the following: The port number is exactly the same as would have been calculated by ONE (5900 + VM_ID). Why is ONE thinking this port was already in use? I searched the database, there is no other VM with that VNC port.

I think, the open questions are:

  • Is this a bug? My intuition says yes
  • Would it work to remove all VNC ports from entries of undeployed VMs from the database?

Greetings

Hi Wilma!
I’ve been checking the code and that “Request VNC port already assgined to a VM” will show if OpenNebula tries to set to 1 a bit in a bitmap but that bit it’s already set/used. That bitmap is stored in the cluster_vnc_bitmap table in OpenNebula’s database, and there’s a row/bitmap for each cluster you have in OpenNebula.

In my testing lab this is the bitmap in my sqldatabase for cluster 0 (cluster_id is the first column):

eJztziEBAAAIA8FK0L8cC4AGcydeTK0KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA6HWM5PoLAAAAAAB8GbhQAtg=

If you want to check if the bitmap is the culprit of this situation, I’ve created a ruby script that you can try to run on a machine where you have ruby and you’ve permissions to install the bitmap gem: gem install bitmap --no-ri --no-rdoc.

The script is [here] at Gist (check_vncport_set.rb · GitHub)

You would only change the txt variable so it has the value for the second column in your cluster_vnc_bitmap tabl. This is a sample run:

VNC port set in bitmap: 5935
VNC port set in bitmap: 5914
VNC port set in bitmap: 5913
VNC port set in bitmap: 5912
VNC port set in bitmap: 5911
VNC port set in bitmap: 5909
VNC port set in bitmap: 5908

Which matches the VMs that I’ve on OpenNebula. This script won’t solve your issue but at least can tell us if the bitmap is the reason for the message so you can look if your 5900+VMID is on that list and help us know that to do next.

Cheers!

Thanks for the script! Yes, the port is on the list of the cluster the VM is to be deployed in, so that explains the problem.

I’m also pretty sure that there’s something wrong with these bitmaps. The bitmap of cluster 0 contains 73 used ports even though we stopped using cluster 0 several months ago. Currently, cluster 0 contains no host, no vnet, and no DS. There shouldn’t be ports in use I assume…

Hi Wilma!
maybe the script is not sharp enough and buggy and you’re getting weird values for cluster 0 :smiley: or indeed cluster 0 has a wrong bitmap. When the VM lifecycle ends the VNC associated port bit should be reset to 0, so something is wrong.

As Yenya and you suggest, you can solve the issue removing the VNC PORT from the VM configuration and then OpenNebula gets a new port, but to prevent this maybe we can check why is this happening and where’s the bug hiding.

Could you explain or confirm if this is what is happening?

  • You have a VM, and you undeploy it.
  • You try to deploy it again but the scheduler refuses as the VNC port set in the config “is used” according to the bitmap
  • It can only be deployed again removing the VNC port from config.

Is that so? I’ll try to follow the code with that undeploy behavior in mind.

Happy hoildays!

I can confirm your description, but it’s probably important to note that it only seems to happen for VMs that existed before ONE 5.0. The undeploy for that particular VM happend on 26/07/16. We were using ONE 4.14 at that time, we upgraded to 5.0 in early August.

This also holds for the VM that this thread started with (October 27th using ONE 5.0), it was undeployed on 22/07/16. If I understand @ruben correctly, the use of bitmaps started with ONE 5.0. Maybe the bug is in the migrator scripts for the database?

Greetings

I finally had the time to track this down and I think I found it (correct me if I’m wrong please):

The db migrator, which initially created the table with the bitmaps in the database, only excludes VMs that are in DONE state (see here). For all other states (including UNDEPLOYED), VNC ports are labeled to be in use if given in the template.

The onedb fsck tool follows the same approach (see here), that’s why it is not detected as db inconsistency.

Anyone with a good idea how to fix that?

Greetings

After taking a deep dive into the code, I think that VNC ports should only be reported to be in use if the VM is in RUNNING state. I am not sure though if a VM is SUSPENDED, but none of my VMs is in that state so I did not spend too much time on this.

My solution was to change line 817 in /usr/lib/one/ruby/onedb/fsck.rb to

if cid && port && state == 3

and run onedb fsck.

This removed the incorrectly blocked ports and resolved the issue for me.

Greetings
Wilma

This works for me in 5.2.1, thanks!

Maybe this fix should be included in upstream.

I made this little python snippet that actually does something similar to finding ports in use from the ruby code:

#!/usr/bin/env python3

import base64
import zlib

'''
Find the bitmap in the opennebula database table: cluster_vnc_bitmap.
MariaDB [opennebula]> describe cluster_vnc_bitmap
    -> ;
+-------+----------+------+-----+---------+-------+
| Field | Type     | Null | Key | Default | Extra |
+-------+----------+------+-----+---------+-------+
| id    | int(11)  | NO   | PRI | 0       |       |
| map   | longtext | YES  |     | NULL    |       |
+-------+----------+------+-----+---------+-------+

The table is indexed by clusterid.


'''
encoded ="eJztziEBAAAIA8FK0L8cC4AGcydeTK0KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA6HWM5PoLAAAAAAB8GbhQAtg="

decoded = base64.b64decode(encoded,altchars=None, validate=False)

decompressed = zlib.decompress(decoded)

bitmap = [ b-48 for b in decompressed ]
maxvnc=65535
# count from 0 to 65355 inclusive
for index in range(0,65356):
  # actual position in bitmap is maxvnc - index (counting backwards)
  if bitmap[maxvnc-index] == 1:
    # show if in use
    print("port {} in use".format(index))
#  else:
#    print("port {} free".format(index))

Ran the modified fsck script and this fixed my current VNC already assigned problems.
Thanks,

Hans