VM deployment fails when the template has two disks

I have a running VM with two hard disks attached. I converted this VM to a vCenter template and imported it into the OpenNebula dashboard. But when I try to deploy a VM from this template, the VM creation fails:


Mon May 29 09:21:41 2017 [Z0][VMM][I]: Deploy of VM 60 on host POC-Cluster with /var/lib/one/vms/60/deployment.0 failed due to "Cannot clone VM Template: Connection reset by peer"
Mon May 29 09:21:41 2017 [Z0][VMM][I]: ExitCode: 255
Mon May 29 09:21:41 2017 [Z0][VMM][I]: Failed to execute virtualization driver operation: deploy.
Mon May 29 09:21:41 2017 [Z0][VMM][E]: Error deploying virtual machine


Since all my other templates with one disk work fine, I removed the second hard disk from this particular template and tested again, and then the VM deployed successfully.

Please help me, because in some cases I need two disks in my template.

Thanks,
sanal

Hi Sanal!
In my tests I use a vCenter template which has three disks and the clone task works fine. So although you’re having problems with a template with two disks, I think the issue is not related to the number of disks in the template but to the time it takes to complete the clone task.

The log you’ve provided shows that the connection is reset by vCenter while the clone task is being performed, so somehow the network connection is interrupted. As the clone is a full clone operation, it will take some time depending on the disk size, and maybe there’s a timeout, or the connection is reset by a network device between OpenNebula’s front-end and your vCenter server.

I’d start by having a look at the vSphere Tasks console. Look for the task named “Clone virtual machine” that matches the deployment action performed by OpenNebula and try to answer these questions:

  • At what time did the clone task start? Assuming your clocks are in sync, how many minutes passed from the start of the clone until the “Connection reset by peer” message was thrown in OpenNebula?
  • Did the clone task finish in vCenter even though the “Connection reset by peer” message appeared?

If we can identify how much time passed until the connection was reset, we can check whether that matches a default timeout set by vCenter that can be extended, or you can check whether you have a network device (firewall, VPN…) between your OpenNebula front-end and vCenter that may close the connection abruptly when a timeout is reached.
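To measure that interval from the OpenNebula side, here is a minimal Python sketch. It assumes the VM log lines start with a timestamp in the same layout as the lines you pasted (for example `Mon May 29 09:21:41 2017`) and that the timestamps are printed in an English locale. The function names are my own and are not part of OpenNebula.

```python
from datetime import datetime

# Timestamp layout at the start of each OpenNebula VM log line,
# e.g. "Mon May 29 09:21:41 2017" (assumed from the pasted log).
ONE_LOG_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_one_timestamp(log_line: str) -> datetime:
    """Parse the leading timestamp of an OpenNebula VM log line."""
    # The timestamp is made of the first five whitespace-separated fields.
    stamp = " ".join(log_line.split()[:5])
    return datetime.strptime(stamp, ONE_LOG_FORMAT)


def seconds_between(start_line: str, failure_line: str) -> float:
    """Seconds elapsed between two OpenNebula VM log lines."""
    delta = parse_one_timestamp(failure_line) - parse_one_timestamp(start_line)
    return delta.total_seconds()
```

You would pass the line logged when the deployment starts (as an approximation of the clone start) and the line that carries the “Connection reset by peer” error. A round figure in the result would hint at a configured timeout rather than a random network glitch.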

Let’s see what you can investigate.

Cheers!

Hi,

Thank you for your reply,

I was away, hence the testing got delayed.

As per your suggestion, I have done the following:

  1. I tested another template with two disks and it worked well.
  2. I noticed that the failing template holds more data, which may be one reason for the failure.
  3. I checked that the vCenter and OpenNebula clocks are in sync.
  4. I increased the vCenter timeout setting from 30 seconds to 5000 seconds; the AD timeout setting was also increased to 5000 seconds.
  5. Between OpenNebula and vCenter I don’t have any firewall or VPN; both are connected to one switch.
  6. I started deploying the VM from the template at 7:12 AM and it failed at 7:18 AM in the OpenNebula dashboard. When I checked in vCenter, the clone was at around 84%. It then completed successfully in vCenter and I was able to start the VM from vCenter without any issue.
  7. But in OpenNebula it shows as failed.

Thanks,
Sanal

Here is the latest log.
Tue Jun 6 07:12:32 2017 [Z0][VM][I]: New state is ACTIVE
Tue Jun 6 07:12:32 2017 [Z0][VM][I]: New LCM state is PROLOG
Tue Jun 6 07:12:32 2017 [Z0][VM][I]: New LCM state is BOOT
Tue Jun 6 07:12:32 2017 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/317/deployment.0
Tue Jun 6 07:12:32 2017 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Tue Jun 6 07:18:34 2017 [Z0][VMM][I]: Command execution fail: /var/lib/one/remotes/vmm/vcenter/deploy '/var/lib/one/vms/317/deployment.0' 'POC-Cluster' 317 POC-Cluster
Tue Jun 6 07:18:34 2017 [Z0][VMM][I]: /usr/lib/one/ruby/vendors/rbvmomi/lib/rbvmomi/type_loader.rb:66: warning: already initialized constant RbVmomi::VIM::Datastore
Tue Jun 6 07:18:34 2017 [Z0][VMM][I]: /usr/lib/one/ruby/vcenter_driver.rb:50: warning: previous definition of Datastore was here
Tue Jun 6 07:18:34 2017 [Z0][VMM][I]: Deploy of VM 317 on host POC-Cluster with /var/lib/one/vms/317/deployment.0 failed due to "Cannot clone VM Template: Connection reset by peer"
Tue Jun 6 07:18:34 2017 [Z0][VMM][I]: ExitCode: 255
Tue Jun 6 07:18:34 2017 [Z0][VMM][I]: Failed to execute virtualization driver operation: deploy.
Tue Jun 6 07:18:34 2017 [Z0][VMM][E]: Error deploying virtual machine
Tue Jun 6 07:18:34 2017 [Z0][VM][I]: New LCM state is BOOT_FAILURE

Hi Sanal,
thanks a lot for your feedback and for the effort you put into reporting it.

It seems that your vCenter API service is closing the connection abruptly, which is why the “Connection reset by peer” message shows up. As you only have a switch between OpenNebula and vCenter, that RST should be coming from vCenter itself. Because the API connection is closed by the vCenter instance, OpenNebula cannot learn the outcome of the clone task, even though the task carries on “in the background” and completes. As this is a connection RST at the TCP level, it’s quite difficult to find what can be added to the code to deal with this issue.

In the meantime, could you give us more info and check a few more things?
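To illustrate what the client sees in this situation, here is a small self-contained Python sketch. It is not vCenter or OpenNebula code, only a loopback demo with a made-up function name: the “server” closes the socket with a zero `SO_LINGER` timeout, which makes the OS send an RST instead of a normal FIN, and the waiting client gets a “Connection reset by peer” error. The `struct.pack("ii", ...)` layout assumes Linux or macOS.

```python
import socket
import struct
import threading


def reset_by_peer_demo() -> bool:
    """Return True if the client observes a TCP reset sent by the server."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def abrupt_server():
        conn, _ = listener.accept()
        # SO_LINGER enabled with a 0 second timeout: close() sends RST, not FIN.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        conn.close()

    server = threading.Thread(target=abrupt_server)
    server.start()

    client = socket.create_connection(("127.0.0.1", port))
    server.join()  # by now the server side has aborted the connection
    try:
        client.recv(1024)  # the client is waiting for a reply
        return False       # an orderly close would return b"" here
    except ConnectionResetError:
        return True        # errno ECONNRESET: "Connection reset by peer"
    finally:
        client.close()
        listener.close()
```

The point of the demo is that the client only learns that the connection is gone. It gets no information about whatever work the server side was doing, which matches the clone task finishing in vCenter while OpenNebula reports a failure.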

  • What’s the size of the second disk you’re having problems with, and what vCenter version are you running? I’d like to try to reproduce the problem on my test machines.
  • Can you check in vpxd.log whether there’s any trace in your vCenter server that may explain why the connection was closed while the clone task was being performed?
  • Can you check if you have any performance alarms in vCenter (CPU, disk) or anything else suggesting that vCenter was under heavy load while the clone task was running and had to close API connections because of it?
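For the vpxd.log check, a small Python sketch like the following could narrow the log down to the minutes around the failure. It assumes each log entry starts with an ISO 8601 timestamp such as `2017-06-06T07:18:34`, which you should verify against your own file. The function name is hypothetical, and the vCenter log may use a different time zone (often UTC) than the OpenNebula log, so adjust the window accordingly.

```python
import re
from datetime import datetime

# Assumed layout: every entry starts with "YYYY-MM-DDTHH:MM:SS".
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def lines_in_window(lines, start, end):
    """Return the log lines whose timestamp falls within [start, end].

    Lines without a leading timestamp (continuation lines such as stack
    traces) are kept when the entry they belong to is inside the window.
    """
    selected = []
    keep = False
    for line in lines:
        match = STAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
            keep = start <= stamp <= end
        if keep:
            selected.append(line.rstrip("\n"))
    return selected
```

You could open your copy of vpxd.log, pass the file object as `lines`, and use a window of a minute or two around the moment the error appeared in OpenNebula.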

Cheers!