Hostname length issue (was: How to debug VM start issues?)

One of my OpenNebula setups has an issue that I can’t start VMs bound to the host’s ocal storage. I’m a bot at loss as to how to debug this.

Cluster 100 consists of 6 hosts, datastore 102 is of type ssh (to utilize host-local storage). I can happily instantiate VMs with shared storage (datastore 0) on or migrate them to all hosts, but deplyoing to host thu* onto datastore 102 somehow fails. See “lfs-” VMs in this screenshot:

If I grep for e. g. VM 38’s logentries, I only get these:

root@one-1:~# grep " 38" /var/log/one/oned.log | grep "Thu Feb 23 10:" | more
Thu Feb 23 10:10:20 2017 [Z0][ReM][D]: Req:2016 UID:0 TemplateInstantiate result SUCCESS, 38
Thu Feb 23 10:11:33 2017 [Z0][ReM][D]: Req:832 UID:0 VirtualMachineDeploy invoked , 38, 4, false, 102
Thu Feb 23 10:11:33 2017 [Z0][DiM][D]: Deploying VM 38
Thu Feb 23 10:11:33 2017 [Z0][ReM][D]: Req:832 UID:0 VirtualMachineDeploy result SUCCESS, 38
Thu Feb 23 10:11:35 2017 [Z0][ReM][D]: Req:5312 UID:0 VirtualMachineInfo invoked , 38
Thu Feb 23 10:12:34 2017 [Z0][TM][D]: Message received: TRANSFER SUCCESS 38 -
Thu Feb 23 10:16:52 2017 [Z0][ReM][D]: Req:8256 UID:0 VirtualMachineInfo invoked , 38

Whereas, for VM 39, selected by OpenNebula to run on a differen host, this gives:

root@one-1:~# grep " 39" /var/log/one/oned.log | grep "Thu Feb 23 10:" | more
Thu Feb 23 10:12:45 2017 [Z0][ReM][D]: Req:7568 UID:0 TemplateInstantiate result SUCCESS, 39
Thu Feb 23 10:13:34 2017 [Z0][ReM][D]: Req:832 UID:0 VirtualMachineDeploy invoked , 39, 1, false, 102
Thu Feb 23 10:13:34 2017 [Z0][DiM][D]: Deploying VM 39
Thu Feb 23 10:13:34 2017 [Z0][ReM][D]: Req:832 UID:0 VirtualMachineDeploy result SUCCESS, 39
Thu Feb 23 10:13:39 2017 [Z0][ReM][D]: Req:4832 UID:0 VirtualMachineInfo invoked , 39
Thu Feb 23 10:15:37 2017 [Z0][TM][D]: Message received: TRANSFER SUCCESS 39 -
Thu Feb 23 10:15:43 2017 [Z0][VMM][D]: Message received: LOG I 39 Successfully execute transfer manager driver operation: tm_c
ontext.
Thu Feb 23 10:15:44 2017 [Z0][VMM][D]: Message received: LOG I 39 ExitCode: 0
Thu Feb 23 10:15:44 2017 [Z0][VMM][D]: Message received: LOG I 39 Successfully execute network driver operation: pre.
Thu Feb 23 10:15:45 2017 [Z0][VMM][D]: Message received: LOG I 39 ExitCode: 0
Thu Feb 23 10:15:45 2017 [Z0][VMM][D]: Message received: LOG I 39 Successfully execute virtualization driver operation: deploy.
Thu Feb 23 10:15:45 2017 [Z0][VMM][D]: Message received: LOG I 39 ExitCode: 0
Thu Feb 23 10:15:45 2017 [Z0][VMM][D]: Message received: LOG I 39 Successfully execute network driver operation: post.
Thu Feb 23 10:15:45 2017 [Z0][VMM][D]: Message received: DEPLOY SUCCESS 39 one-39
Thu Feb 23 10:16:21 2017 [Z0][ReM][D]: Req:7328 UID:1 VirtualMachineInfo invoked , 39
Thu Feb 23 10:16:22 2017 [Z0][ReM][D]: Req:5968 UID:0 VirtualMachineUpdateTemplate invoked , 39, "READY=YES", 1
Thu Feb 23 10:16:22 2017 [Z0][ReM][D]: Req:5968 UID:0 VirtualMachineUpdateTemplate result SUCCESS, 39
Thu Feb 23 10:16:25 2017 [Z0][VMM][D]: VM 39 successfully monitored: DISK_SIZE=[ID=0,SIZE=2996] DISK_SIZE=[ID=1,SIZE=1]  STATE
=a CPU=107.94 MEMORY=1048576 NETRX=2722478 NETTX=37526
[…]

On the thu* host, only the VM’s disk image is showing up, whereas the other hosts have additional files:

root@thu*:~# ls -la /var/lib/one/datastores/102/36
total 3066316
drwxrwxr-x 2 oneadmin oneadmin       4096 Feb 23 10:09 .
drwxr-xr-x 8 oneadmin oneadmin       4096 Feb 23 10:16 ..
-rw-r--r-- 1 oneadmin oneadmin 3139895296 Feb 23 10:10 disk.0

root@con*:~# ls -la /var/lib/one/datastores/102/39
total 3574464
drwxrwxr-x 2 oneadmin     oneadmin       4096 Feb 23 10:15 .
drwxr-xr-x 6 oneadmin     oneadmin       4096 Feb 23 10:13 ..
-rw-rw-r-- 1 oneadmin     oneadmin       1218 Feb 23 10:15 deployment.0
-rw-r--r-- 1 libvirt-qemu kvm      3659857920 Feb 23 14:38 disk.0
-rw-r--r-- 1 libvirt-qemu kvm          374784 Feb 23 10:15 disk.1

Any hints on where to look would be apprciated.

Hello,

This means that the TM_MAD for the IMAGE datastore is working, but the SYSTEM’s TM_MAD is not and there is nothing logged which is weird. It it possible that you have in the log entries without the " 39" string which are hinting for the possible problem?

Kind Regards,
Anton Todorov

Actually, it seems to be a hostname length issue of sorts.

Looking into the logfile and checking for differences, I noticed that the target host in question has the longest (host)name of the configured hosts: thuwxyzabcde.zzzz.tld, 21 chars in total. Which OpenNebula seems to be quite happy about …

… but actually isn’t: I just renamed the host in question to tabcde.zzzz.tld (15 chars) in Sunstone, added appropriate DNS A/AAAA record (no CNAMEs, just to be sure), made sure oneadmin user has the keys to the new FQDN as well and then tried a “Recover” with “retry” on VM 41 via Sunstone, which worked just like a charm. Tried again with VM 38, bingo, works.

Renamed that host to thuwxyzabcde.zzzz.tld again, tried to Recover-retry on VM 36, didn’t work. Renamed to tabcde.zzzz.tld again, tried Revover-retry, and the VM 36 is now up and running.

Therefore I assume that there’s an random hostname length limit between 19 (works for me) and 21 (doesn’t work for me) characters somewhere in the code. Updated topic accordingly.


 


BTW, I’ve seen in the dump of the Sqlite DB, that only 128 chars are reserved per hostname; IIRC the actual limit per RFC is <=255 character, with each level of the FQDN being restricted to be <=63 chars … While unlikely that someone IS using OpenNebula with 128+ character FQDNs, in therory it’s ok per RFC :wink:

Totally baffling!

Can you check the HOSTNAME in the HISTORY records for the VM?

onevm show 36 -x

root@one-host:~# onevm show 36 -x | awk '{gsub(…); printf("%s\n", $0);}'
…
  <HISTORY_RECORDS>
    <HISTORY>
      <OID>36</OID>
      <SEQ>0</SEQ>
      <HOSTNAME>tabcde.zzzz.tld</HOSTNAME>
      <HID>4</HID>
      <CID>100</CID>
      <STIME>1487840943</STIME>
      <ETIME>1487867043</ETIME>
      <VM_MAD><![CDATA[kvm]]></VM_MAD>
      <TM_MAD><![CDATA[ssh]]></TM_MAD>
      <DS_ID>102</DS_ID>
      <PSTIME>1487840943</PSTIME>
      <PETIME>1487841019</PETIME>
      <RSTIME>1487841019</RSTIME>
      <RETIME>1487866975</RETIME>
      <ESTIME>1487866975</ESTIME>
      <EETIME>1487867043</EETIME>
      <REASON>2</REASON>
      <ACTION>27</ACTION>
    </HISTORY>
  </HISTORY_RECORDS>
</VM>

Nothing suspicious. it was long shot anyway, thanks. No idea what’s happening. Probably will try to reproduce it.

Thanks!