Has anyone gotten a VM to start in an AWS edge host with the Firecracker hypervisor? It gets stuck in boot for me. I have a kernel image added in a ‘files’ datastore on the AWS cluster.
$ onevm show 2557
VIRTUAL MACHINE 2557 INFORMATION
ID : 2557
NAME : centos-docker-edge-temp1
USER : chris
GROUP : oneadmin
STATE : ACTIVE
LCM_STATE : BOOT
LOCK : None
RESCHED : No
HOST : 3.151.32.97
CLUSTER ID : 113
CLUSTER : aws-cluster
START TIME : 11/13 14:19:22
END TIME : -
DEPLOY ID : one-2557
On the AWS firecracker host in /srv/jailer/firecracker/one-2557/root the logs.fifo file doesn’t show any errors, just many lines like
2021-11-13T23:20:31.534918393 [one-2557:fc_api] The API server received a Get request on "/".
2021-11-13T23:20:31.534929481 [one-2557:fc_api] Total previous API call duration: 13 us.
2021-11-13T23:20:36.640023545 [one-2557:fc_api] The API server received a Get request on "/machine-config".
2021-11-13T23:20:36.640136853 [one-2557:fc_api] The request was executed successfully. Status code: 200 OK.
2021-11-13T23:20:36.640158989 [one-2557:fc_api] Total previous API call duration: 139 us.
2021-11-13T23:20:36.640833059 [one-2557:fc_api] The API server received a Get request on "/".
2021-11-13T23:20:36.640850351 [one-2557:fc_api] Total previous API call duration: 20 us.
Hello Christian- I’m provisioning on a metal AWS instance. I tried setting up a new provision with Firecracker on metal, and preparing a new Ubuntu image to deploy. I’m getting a new puzzling error now.
I get this in oned.log:
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: Command execution failed (exit code: 1): /var/lib/one/remotes/tm/shared/clone hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel 2582 150
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: /var/lib/one/remotes/tm/shared/clone: line 94: [: : integer expression expected
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: clone: Cloning /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 in XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
Thu Nov 18 20:59:09 2021 [Z0][TrM][E]: clone: Command " set -e -o pipefail
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]:
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: cd /var/lib/one//datastores/149/2582
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: rm -f /var/lib/one//datastores/149/2582/kernel
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: cp /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 /var/lib/one//datastores/149/2582/kernel
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]:
Thu Nov 18 20:59:09 2021 [Z0][TrM][I]: " failed: cp: cannot stat '/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45': No such file or directory
Thu Nov 18 20:59:09 2021 [Z0][TrM][E]: Error copying hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 to XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
Thu Nov 18 20:59:09 2021 [Z0][TrM][E]: Error executing image transfer script: Error copying hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 to XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
Connectivity from my front end to the bare metal server seems fine:
$ ssh XX.XX.XX.XX
Activate the web console with: systemctl enable --now cockpit.socket
[oneadmin@ip-10-0-0-4 ~]$ logout
Connection to XX.XX.XX.XX closed
Running the command manually gives these errors:
$ /var/lib/one/remotes/tm/shared/clone hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel 2582 150
/var/lib/one/remotes/tm/shared/clone: line 94: [: : integer expression expected
INFO: clone: Cloning /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 in XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
ERROR: clone: Command " set -e -o pipefail
cd /var/lib/one//datastores/149/2582
rm -f /var/lib/one//datastores/149/2582/kernel
cp /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 /var/lib/one//datastores/149/2582/kernel
" failed: cp: cannot stat '/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45': No such file or directory
ERROR MESSAGE --8<------
Error copying hpc-onet-admin:/var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45 to XX.XX.XX.XX:/var/lib/one//datastores/149/2582/kernel
ERROR MESSAGE ------>8--
But the file it’s trying to copy looks fine:
$ ls -l /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45
-rw-r--r-- 1 oneadmin oneadmin 21100896 Nov 18 20:48 /var/lib/one//datastores/150/f4ee56170e7f52b52ff7c6e8b0c72a45
Maybe I just need to work with it some more… could I be doing something wrong?
Oh, my mistake- I resolved the file copying problem. I forgot to set TM_MAD="ssh" in the files datastore where the firecracker kernel is downloaded.
However, I’m still back to my original problem of a VM deployed to the edge host is stuck in ‘boot’.
I don’t see any errors in /var/log/one/oned.log, and the VM log just says
$ cat /var/log/one/2588.log
Sat Nov 20 09:53:40 2021 [Z0][VM][I]: New state is ACTIVE
Sat Nov 20 09:53:40 2021 [Z0][VM][I]: New LCM state is PROLOG
Sat Nov 20 09:54:18 2021 [Z0][VM][I]: New LCM state is BOOT
Sat Nov 20 09:54:18 2021 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/2588/deployment.0
On the edge host, the logs.fifo just starts to say this over and over
$ cat logs.fifo
Running Firecracker v0.24.2
2021-11-20T17:54:20.907653301 [one-2588:main] Artificially kick devices.
2021-11-20T17:54:20.907689107 [one-2588:main] kick net eth0.
2021-11-20T17:54:20.907695441 [one-2588:main] kick block disk.1.
2021-11-20T17:54:20.907699135 [one-2588:main] kick block disk.0.
2021-11-20T17:54:20.908207404 [one-2588:main] Successfully started microvm that was configured from one single json
2021-11-20T17:54:21.064857353 [one-2588:main] block: activate event
2021-11-20T17:54:21.065985005 [one-2588:main] block: activate event
2021-11-20T17:54:21.067813006 [one-2588:main] net: activate event
2021-11-20T17:54:21.067958418 [one-2588:fc_vcpu 0] Failed to trigger i8042 kbd interrupt (disabled by guest OS)
2021-11-20T17:54:21.075466733 [one-2588:fc_vcpu 0] Failed to trigger i8042 kbd interrupt (disabled by guest OS)
2021-11-20T17:54:22.519085838 [one-2588:fc_api] The API server received a Get request on "/machine-config".
2021-11-20T17:54:22.519172431 [one-2588:fc_api] The request was executed successfully. Status code: 200 OK.
2021-11-20T17:54:22.519186615 [one-2588:fc_api] Total previous API call duration: 112 us.
2021-11-20T17:54:22.519689815 [one-2588:fc_api] The API server received a Get request on "/".
Interesting- on the edge host a VNC port is opened on :8488 like it should, and I can set up a SSH tunnel through my ONE frontend host and connect. It looks like the contextualization is running but the VM can’t route out through the default gateway:
Hello @cgonzalez, right now my Firecracker VMs are still stuck in boot state- I’m letting the Terraform and Ansible tasks set up the iptables rules on the edge host, and it appears they aren’t working. On a KVM node, iptables does get set up correctly. It’s strange- hopefully I can dig into it some more over the holiday. I don’t see any errors when the provision runs in Fireedge.
I would focus on why the VM is stuck in boot state then first. As it might be the cause of some networking driver not being executed.
In order to debug that, could you share the oned.log and the vm log files containing the time when the VM is deployed? Also could you provide some details on the provision you’re deploying (i.e instance type, OS, and any other thing you think might be relevant)
Interesting- tonight I’m able to successfully deploy a Centos8 VM on a Firecracker edge host in AWS. The instance is sourced from the Dockerhub marketplace; I have outbound and inbound connectivity via SSH. I haven’t made any changes from before- I’ll continue to investigate