Multiple errors when creating edge clusters on AWS

Good afternoon. I don’t know where else to turn and I really need help. I have tried many times to fix multiple errors with OneProvision when creating a cluster on AWS (virtual provision type). I followed the video OpenNebula - Preventing Vendor Lock-in with an OpenNebula Multi-Cloud - YouTube step by step, but the virtual machine never starts successfully and does not appear in the VMs section of OpenNebula Sunstone. Also, during the automatic cluster configuration using the OneProvision built-in template, an Ansible “Unreachable” error occurs.

An earlier Ansible “environmentfilter” error was found and fixed with difficulty; it is caused by a breaking change in newer versions of Jinja2, starting with 3.1.* (there is no information about this problem in the OpenNebula documentation and, as far as I know, it is still not fixed in newer versions of the platform). Many other problems occur regardless of the choice of EC2 instance type, virtualization type (LXC, QEMU) or any other settings. In the video everything is very easy and fast, but in reality the cluster creation always fails (even when the status in OneProvision is green and shows as successful). Tested on Debian 11 and Ubuntu 20.04.
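For reference, the workaround I eventually arrived at for the environmentfilter error (assuming the Frontend uses the system pip3; adjust to your setup) was to pin Jinja2 below 3.1 on the Frontend and verify the resulting version:

pip3 install 'Jinja2<3.1'
python3 -c 'import jinja2; print(jinja2.__version__)'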

Also, searching the official OpenNebula documentation didn’t help. It gives only a vague explanation for this patch https://github.com/OpenNebula/docs/blob/master/source/quick_start/operation_basics/provisioning_edge_cluster.rst (“If you’re using OpenNebula 6.6.0 CE, before adding hosts to your environment, please apply this patch in all Frontend machines”).

The problem is similar to what is mentioned in another section of the documentation, Running Kubernetes Clusters — OpenNebula 6.6.3 documentation (the “Important” note).

This problem is also relevant for any OneProvision cluster deployment on AWS (QEMU, LXC) and is not explicitly mentioned in the documentation or on GitHub.

Our specialist believes the problem lies in Terraform, which fails when used with AWS through OneProvision. A new virtual machine is simply not created on the aws-cluster-system datastore: it always shows zero Capacity, regardless of how OpenNebula was configured. Will this very insidious bug be fixed in the future?


Versions of the related components and OS (frontend, hypervisors, VMs):
Baremetal (Host) OS – Ubuntu 22.04 (On Debian 11, the error is the same)
Frontend – 6.6.3

Steps to reproduce:

  1. Open the OneProvision interface.
  2. Add an AWS provider (IAM credentials with EC2FullAccess rights).
  3. Proceed to create a cluster (for example, in N. Virginia) with any options and resource types.
  4. Observe that the cluster creation is reported as successful (green status) even though it actually fails.
  5. On OpenNebula Sunstone, observe that aws-cluster-system shows zero capacity (also checkable from the CLI, see below).
  6. Try, unsuccessfully, to create a virtual machine as shown in the third minute of the video.
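For step 5, the same zero capacity can also be checked from the CLI (the datastore name is how it appears in my setup and may differ in yours):

onedatastore list
onedatastore show aws-cluster-system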

Current results:
The virtual machine/container is not running on the AWS side.

Expected results:
The virtual machine/container is successfully started as shown in the video.

Sorry to hear that.

We recently decided to improve and upgrade OneProvision; while this is ongoing, the current version hasn’t received much attention or maintenance. In any case, the errors you noticed are not acceptable.

I tried to investigate the issue, but unfortunately I wasn’t able to reproduce it.

What state is the VM that doesn’t boot in? What does onevm show say?
Could you paste the VM log (/var/log/one/$VM.log) or the specific error message?
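Roughly, something like this would collect what I need (where $VM is the numeric ID of the failing VM):

onevm list
onevm show $VM
cat /var/log/one/$VM.log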


For OneProvision, here are some hints and common pitfalls.

  1. It’s handy to run oneprovision from the command line with -D for debugging when things go wrong.
    e.g.
oneprovision create /usr/share/one/oneprovision/edge-clusters/metal/provisions/aws.yml --provider 0 -D
  2. To tidy up a failed provision, oneprovision delete --cleanup should help (see the example after the list below).

  3. The cluster playbooks are sensitive to the Ansible version; that’s why we install Ansible 2.9.9 using minione.

  4. When the cluster is created, make sure it is being monitored, i.e. the host and datastores are on:

# onehost list
  ID NAME                   CLUSTER    TVM      ALLOCATED_CPU      ALLOCATED_MEM STAT
   1 3.68.216.92            aws-edge-c   1    100 / 9600 (1%) 128M / 188.5G (0%) on  

# onedatastore list
  ID NAME                          SIZE AVA CLUSTERS IMAGES TYPE DS      TM      STAT
 107 aws-edge-cluster-system          - -   103           0 sys  -       ssh     on  
 106 aws-edge-cluster-image       39.3G 86% 103           1 img  fs      ssh     on  
  5. When you download (export) an appliance from the marketplace, make sure it lands on the datastore within the same cluster:
# onemarketapp export "Alpine Linux 3.17" -d 106 alpine
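For the cleanup in point 2, a session typically looks roughly like this (the provision ID 0 is just an example; check yours with the list command first):

# oneprovision list
# oneprovision delete 0 --cleanup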

Thanks for the reply! I apologize for not attaching the logs right away; they are here. The result is the same every time, on different OSes and servers. And yes, I did use minione for all deployments, and the correct Ansible version (2.9.9) was installed by minione automatically. One of the issues seems to have been resolved in a recent version of minione: the correct Jinja2 version (3.0.3) is now installed automatically (the environmentfilter error occurred during deployment because Jinja2==3.1.* was used, in which that code was removed). Then I applied the patch again and everything seemed to work. But now, for some reason, I’m getting new errors that I didn’t see before.

By the way, your listing also shows no SIZE and AVA for aws-edge-cluster-system, just like mine. However, in the video above it worked fine (IMHO). I have also tried LXC clusters on the provisions page, not just QEMU, and they also fail (I tried running an nginx container as in the video). I enabled all marketplaces with containers via the “onemarket enable *” command.

Now I’m seeing very different errors and can’t create a cluster at all, which is why there are only OneProvision logs for now. At the same time, virtual machines run successfully on the local server (KVM host), but not on AWS. Due to the new bugs, I can’t go back to solving the old ones yet. This behavior, with a huge number of errors, is observed specifically with OneProvision in conjunction with AWS; another specialist said there were no problems with other providers. I have been trying to deal with new and recurring errors for a very long time (when I fix old errors, new ones appear). Despite this, OneProvision looks very nice and promising.

Some of the errors disappeared after I installed OpenNebula without minione (i.e., installed the Frontend manually), following the instructions from the official documentation. In the end I installed Ansible 2.9.27 and Terraform 0.14.7 (minione currently installs Ansible 2.9.9 and Terraform 1.1.9).

Now I have a different error when OneProvision tries to install frr-pythontools on the cluster:


The following packages have unmet dependencies:
frr-pythontools : Depends: frr (>= 9.0-stable90-g2863e7e-20230808.170622-1~ubuntu20.04.1~) but 8.5.2-0~ubuntu20.04.1 is to be installed
E: Unable to correct problems, you have held broken packages.

So, as a temporary solution, where can I find and edit the files that contain the commands that run on the remote cluster (AWS EC2)? I think that instead of installing frr from the standard Ubuntu repository with apt, it should be possible to use the new 9.0 version of frr until an updated package is added to the main Ubuntu repository.
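In case it matters, this is how I have been trying to locate the relevant files on the Frontend so far (just a guess based on where the provision templates live; the exact layout may differ):

grep -rn frr /usr/share/one/oneprovision/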

Hi @Roman-x86 :wave:

I’ve recently faced the same error with the Ansible playbooks that oneprovision uses, and indeed, the error comes from a conflict between the package versions of frr-pythontools and frr itself.

As a temporary measure, I forced the installation of the old versions in the Ansible roles and it works, but I noticed that the updated package version has recently been added to the Ubuntu repositories, so the problem should be fixed now.
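Roughly, the temporary pin boiled down to forcing matching versions of both packages, something like the following on the affected host (the version string is taken from your error output, the actual change went into the Ansible role, and this assumes the matching frr-pythontools build is still available in the repository):

apt-get install frr=8.5.2-0~ubuntu20.04.1 frr-pythontools=8.5.2-0~ubuntu20.04.1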

Could you try it again? Does the problem persist?

Best,
Victor.

Hi. Yes, frr and frr-pythontools now install successfully. But, as expected, the old bug with aws-cluster-system is not fixed. Screenshots of the errors are here. Virtual machines hang in the PROLOG state. I think the problem is still the same datastore with zero Capacity.

I don’t think it’s the datastore with zero capacity, as the scheduler would not start the deployment of the VMs in that case.

If it hangs in PROLOG, it’s probably something else :thinking: but it should be very close to working.

Could you review the host and datastore states?
onehost list
onedatastore list
Is the host reachable from the FE? onehost sync -f
(all as oneadmin on the FE)

Some magic is happening in OpenNebula. Now those errors have disappeared, but the main problem, the inability to create a virtual AWS cluster, remains unsolved. I have recorded the errors here. I also tried running your commands.

Screenshot

The datastore (101) looks correct; an SSH datastore doesn’t show specific SIZE/AVAIL values as it can be colocated on multiple hosts.

The issue with the pending VM seems to be caused by the original VM template from minione, which references the “vnet” virtual network in a different cluster (0).

To fix that, you could clone the template and either remove the NIC=[ NETWORK="vnet"...] section or replace it with the provisioned VNET, which should be aws-edge-cluster-public.
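For example (the template ID and the new name are placeholders):

onetemplate clone <TEMPLATE_ID> alpine-aws
onetemplate update alpine-aws

and in the editor either delete the NIC = [ NETWORK = "vnet", ... ] attribute or change NETWORK to "aws-edge-cluster-public".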

The problem with FRR seems to be somewhat random.

Sorry, cloning the template alone is not sufficient. MiniONE creates a qcow2 datastore for the local KVM host, and it can’t be combined with the SSH edge datastore. The only option is to export the appliance again to the AWS edge image datastore.
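For example, in my case the re-export would look roughly like this (the -d value must be the ID of the AWS edge image datastore in my setup; 106 is just the example ID from your listing above):

# onemarketapp export "Alpine Linux 3.17" -d 106 alpine_aws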

Also, I’m sorry, and I really appreciate your efforts :+1:. I didn’t quite understand the steps to follow. I wish creating an edge cluster were as easy and straightforward as the video shows; an automated method would have been nice, since my client would not be able to handle complicated deployment methods. It would be nice to at least have some steps on how to do it manually.

I noticed that similar problems have been previously discussed here and here.