Firecracker VM on AWS edge provision

Has anyone gotten a VM to start in an AWS edge host with the Firecracker hypervisor? It gets stuck in boot for me. I have a kernel image added in a ‘files’ datastore on the AWS cluster.

$ onevm show 2557
VIRTUAL MACHINE 2557 INFORMATION
ID                  : 2557
NAME                : centos-docker-edge-temp1
USER                : chris
GROUP               : oneadmin
STATE               : ACTIVE
LCM_STATE           : BOOT
LOCK                : None
RESCHED             : No
HOST                : 3.151.32.97
CLUSTER ID          : 113
CLUSTER             : aws-cluster
START TIME          : 11/13 14:19:22
END TIME            : -
DEPLOY ID           : one-2557

On the AWS firecracker host in /srv/jailer/firecracker/one-2557/root the logs.fifo file doesn’t show any errors, just many lines like

2021-11-13T23:20:31.534918393 [one-2557:fc_api] The API server received a Get request on "/".
2021-11-13T23:20:31.534929481 [one-2557:fc_api] Total previous API call duration: 13 us.
2021-11-13T23:20:36.640023545 [one-2557:fc_api] The API server received a Get request on "/machine-config".
2021-11-13T23:20:36.640136853 [one-2557:fc_api] The request was executed successfully. Status code: 200 OK.
2021-11-13T23:20:36.640158989 [one-2557:fc_api] Total previous API call duration: 139 us.
2021-11-13T23:20:36.640833059 [one-2557:fc_api] The API server received a Get request on "/".
2021-11-13T23:20:36.640850351 [one-2557:fc_api] Total previous API call duration: 20 us.

HI @christaylor,

Are your AWS instances metal ones?

Also, are you seeing any kind of error in the VM log or oned.log?

Hello Christian- I’m provisioning on a metal AWS instance. I tried setting up a new provision with Firecracker on metal, and preparing a new Ubuntu image to deploy. I’m getting a new puzzling error now.

I get this in oned.log:

Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: Command execution failed (exit code: 1): /var/lib/one/remotes/tm/shared/clone hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel 2582 150
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: /var/lib/one/remotes/tm/shared/clone: line 94: [: : integer expression expected
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: clone: Cloning /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 in XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
Thu Nov 18 20:59:09 2021 [Z0][TrM][E]: clone: Command "    set -e -o pipefail
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]:
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: cd /var/lib/one//datastores/149/2582
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: rm -f /var/lib/one//datastores/149/2582/kernel
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: cp /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 /var/lib/one//datastores/149/2582/kernel
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]:
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: " failed: cp: cannot stat '/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45': No such file or directory
Thu Nov 18 20:59:09 2021 [Z0][TrM][E]: Error copying hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 to XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
Thu Nov 18 20:59:09 2021 [Z0][TrM][E]: Error executing image transfer script: Error copying hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 to XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel

Connectivity from my front end to the bare metal server seems fine:

$ ssh XX.XX.XX.XX
Activate the web console with: systemctl enable --now cockpit.socket

[oneadmin@ip-10-0-0-4 ~]$ logout
Connection to XX.XX.XX.XX closed

Running the command manually gives these errors:

$ /var/lib/one/remotes/tm/shared/clone hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel 2582 150
/var/lib/one/remotes/tm/shared/clone: line 94: [: : integer expression expected
INFO: clone: Cloning /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 in XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
ERROR: clone: Command "    set -e -o pipefail

    cd /var/lib/one//datastores/149/2582
    rm -f /var/lib/one//datastores/149/2582/kernel
    cp /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 /var/lib/one//datastores/149/2582/kernel

    " failed: cp: cannot stat '/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45': No such file or directory
ERROR MESSAGE --8<------
Error copying hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 to XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
ERROR MESSAGE ------>8--

But the file it’s trying to copy looks fine:

$ ls -l /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45
-rw-r--r-- 1 oneadmin oneadmin 21100896 Nov 18 20:48 /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45

Maybe I just need to work with it some more… could I be doing something wrong?

Oh, my mistake- I resolved the file copying problem. I forgot to set TM_MAD="ssh" in the files datastore where the firecracker kernel is downloaded.

However, I’m still back to my original problem of a VM deployed to the edge host is stuck in ‘boot’.

I don’t see any errors in /var/log/one/oned.log, and the VM log just says

$ cat /var/log/one/2588.log
Sat Nov 20 09:53:40 2021 [Z0][VM][I]: New state is ACTIVE
Sat Nov 20 09:53:40 2021 [Z0][VM][I]: New LCM state is PROLOG
Sat Nov 20 09:54:18 2021 [Z0][VM][I]: New LCM state is BOOT
Sat Nov 20 09:54:18 2021 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/2588/deployment.0

On the edge host, the logs.fifo just starts to say this over and over

$ cat logs.fifo
Running Firecracker v0.24.2
2021-11-20T17:54:20.907653301 [one-2588:main] Artificially kick devices.
2021-11-20T17:54:20.907689107 [one-2588:main] kick net eth0.
2021-11-20T17:54:20.907695441 [one-2588:main] kick block disk.1.
2021-11-20T17:54:20.907699135 [one-2588:main] kick block disk.0.
2021-11-20T17:54:20.908207404 [one-2588:main] Successfully started microvm that was configured from one single json
2021-11-20T17:54:21.064857353 [one-2588:main] block: activate event
2021-11-20T17:54:21.065985005 [one-2588:main] block: activate event
2021-11-20T17:54:21.067813006 [one-2588:main] net: activate event
2021-11-20T17:54:21.067958418 [one-2588:fc_vcpu 0] Failed to trigger i8042 kbd interrupt (disabled by guest OS)
2021-11-20T17:54:21.075466733 [one-2588:fc_vcpu 0] Failed to trigger i8042 kbd interrupt (disabled by guest OS)
2021-11-20T17:54:22.519085838 [one-2588:fc_api] The API server received a Get request on "/machine-config".
2021-11-20T17:54:22.519172431 [one-2588:fc_api] The request was executed successfully. Status code: 200 OK.
2021-11-20T17:54:22.519186615 [one-2588:fc_api] Total previous API call duration: 112 us.
2021-11-20T17:54:22.519689815 [one-2588:fc_api] The API server received a Get request on "/".

Anybody else run into this?

Interesting- on the edge host a VNC port is opened on :8488 like it should, and I can set up a SSH tunnel through my ONE frontend host and connect. It looks like the contextualization is running but the VM can’t route out through the default gateway:

It looks like the Ansible play to set up packet forwarding is working, but iptables isn’t being configured… :thinking:

This works fine on a KVM edge host.

net.ipv4.conf.all.forwarding = 1
net.ipv4.conf.br0.forwarding = 1
net.ipv4.conf.default.forwarding = 1
net.ipv4.conf.eth0.forwarding = 1
net.ipv4.conf.lo.forwarding = 1
net.ipv4.conf.one-2588-0.forwarding = 1
net.ipv4.ip_forward = 1
net.ipv4.ip_forward_update_priority = 1

[root@ip-10-0-0-4 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     udp  --  anywhere             anywhere             state NEW udp dpt:otv
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:bgp
ACCEPT     tcp  --  anywhere             anywhere             state NEW tcp dpt:ssh
REJECT     all  --  anywhere             anywhere             reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
[root@ip-10-0-0-4 ~]# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

[root@ip-10-0-0-4 ~]# ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     all      anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     ipv6-icmp    anywhere             anywhere
ACCEPT     all      anywhere             anywhere
ACCEPT     udp      anywhere             anywhere             state NEW udp dpt:otv
ACCEPT     udp      anywhere             fe80::/64            state NEW udp dpt:dhcpv6-client
ACCEPT     tcp      anywhere             anywhere             state NEW tcp dpt:bgp
ACCEPT     tcp      anywhere             anywhere             state NEW tcp dpt:ssh
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
[root@ip-10-0-0-4 ~]# ip6tables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Hi @christaylor,

Interesting- on the edge host a VNC port is opened on :8488 like it should, and I can set up a SSH tunnel through my ONE frontend host and connect.

So, are you able now to deploy the microVMs correctly or are they still stuck in boot state?

Regarding the networking issue, are you managing the iptables of the server somehow? or you just delegate to OpenNebula?

Hello @cgonzalez, right now my Firecracker VMs are still stuck in boot state- I’m letting the Terraform and Ansible tasks set up the iptables rules on the edge host, and it appears they aren’t working. On a KVM node, iptables does get set up correctly. It’s strange- hopefully I can dig into it some more over the holiday. I don’t see any errors when the provision runs in Fireedge.

I would focus on why the VM is stuck in boot state then first. As it might be the cause of some networking driver not being executed.

In order to debug that, could you share the oned.log and the vm log files containing the time when the VM is deployed? Also could you provide some details on the provision you’re deploying (i.e instance type, OS, and any other thing you think might be relevant)

Interesting- tonight I’m able to successfully deploy a Centos8 VM on a Firecracker edge host in AWS. The instance is sourced from the Dockerhub marketplace; I have outbound and inbound connectivity via SSH. I haven’t made any changes from before- I’ll continue to investigate

Ok, please let us know if you figure out the reason