VM on brand-new HCI Cluster entering LCM State "Unknown" immediately

Hello,

I wanted to test and play around with OpenNebula, so I deployed a minione and used the new HCI Provisioner to create a 3-node HCI cluster on three other VMs (nested virtualization is enabled).

First I ran into an issue where the /var/lib/one/.ssh folder was owned by UID 9869 (a user that doesn’t exist on the host); after chowning it to oneadmin I was able to deploy VMs.
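
In case anyone hits the same thing, the fix was roughly this, run as root on the affected node (the group may be named differently on your install):

chown -R oneadmin:oneadmin /var/lib/one/.ssh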

I was instantiating an Alpine Linux VM via Sunstone; right after booting, its LCM state immediately changed to UNKNOWN because the monitor state is ERROR. The log of the VM shows the following:

Mon May 23 14:18:54 2022 [Z0][VM][I]: New state is ACTIVE
Mon May 23 14:18:54 2022 [Z0][VM][I]: New LCM state is PROLOG
Mon May 23 14:18:58 2022 [Z0][VM][I]: New LCM state is BOOT
Mon May 23 14:18:58 2022 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/2/deployment.0
Mon May 23 14:18:59 2022 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Mon May 23 14:18:59 2022 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Mon May 23 14:19:00 2022 [Z0][VMM][I]: ExitCode: 0
Mon May 23 14:19:00 2022 [Z0][VMM][I]: Successfully execute virtualization driver operation: /bin/mkdir -p.
Mon May 23 14:19:00 2022 [Z0][VMM][I]: ExitCode: 0
Mon May 23 14:19:00 2022 [Z0][VMM][I]: Successfully execute virtualization driver operation: /bin/cat - >/var/lib/one//datastores/101/2/vm.xml.
Mon May 23 14:19:00 2022 [Z0][VMM][I]: ExitCode: 0
Mon May 23 14:19:00 2022 [Z0][VMM][I]: Successfully execute virtualization driver operation: /bin/cat - >/var/lib/one//datastores/101/2/ds.xml.
Mon May 23 14:19:02 2022 [Z0][VMM][I]: ExitCode: 0
Mon May 23 14:19:02 2022 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Mon May 23 14:19:02 2022 [Z0][VMM][I]: Successfully execute network driver operation: post.
Mon May 23 14:19:02 2022 [Z0][VM][I]: New LCM state is RUNNING
Mon May 23 14:19:03 2022 [Z0][VMM][I]: VM running but monitor state is ERROR.
Mon May 23 14:19:03 2022 [Z0][VM][I]: New LCM state is UNKNOWN
Mon May 23 14:19:18 2022 [Z0][VMM][I]: VM running but monitor state is ERROR.
Mon May 23 14:22:22 2022 [Z0][VMM][I]: VM running but monitor state is ERROR.

On the hypervisor host the syslog shows this:

May 23 14:43:17 lha-edge03 kernel: [ 4331.193669] audit: type=1400 audit(1653316997.636:68): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="libvirt-635a298a-1c9f-4f43-a61f-1fda339876ff" pid=37596 comm="apparmor_parser"
May 23 14:43:38 lha-edge03 kernel: [ 4352.095442] audit: type=1400 audit(1653317018.536:69): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" pid=37975 comm="apparmor_parser"
May 23 14:43:38 lha-edge03 kernel: [ 4352.351566] audit: type=1400 audit(1653317018.792:70): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" pid=37978 comm="apparmor_parser"
May 23 14:43:38 lha-edge03 kernel: [ 4352.545797] audit: type=1400 audit(1653317018.988:71): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" pid=37982 comm="apparmor_parser"
May 23 14:43:39 lha-edge03 kernel: [ 4352.738264] audit: type=1400 audit(1653317019.180:72): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" pid=37986 comm="apparmor_parser"
May 23 14:43:39 lha-edge03 kernel: [ 4353.531455] audit: type=1400 audit(1653317019.972:73): apparmor="DENIED" operation="open" profile="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" name="/etc/ceph/ceph.client.oneadmin.keyring" pid=37989 comm="qemu-kvm-one" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0
May 23 14:43:39 lha-edge03 kernel: [ 4353.531835] audit: type=1400 audit(1653317019.972:74): apparmor="DENIED" operation="open" profile="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" name="/etc/ceph/ceph.client.oneadmin.keyring" pid=37989 comm="qemu-kvm-one" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0
May 23 14:43:39 lha-edge03 kernel: [ 4353.534304] audit: type=1400 audit(1653317019.976:75): apparmor="DENIED" operation="open" profile="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" name="/etc/ceph/ceph.client.oneadmin.keyring" pid=37989 comm="qemu-kvm-one" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0
May 23 14:43:39 lha-edge03 kernel: [ 4353.534308] audit: type=1400 audit(1653317019.976:76): apparmor="DENIED" operation="open" profile="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" name="/etc/ceph/ceph.client.oneadmin.keyring" pid=37989 comm="qemu-kvm-one" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0
May 23 14:43:39 lha-edge03 kernel: [ 4353.539438] audit: type=1400 audit(1653317019.980:77): apparmor="DENIED" operation="open" profile="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" name="/etc/ceph/ceph.client.oneadmin.keyring" pid=37989 comm="qemu-kvm-one" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0
May 23 14:43:39 lha-edge03 kernel: [ 4353.539565] audit: type=1400 audit(1653317019.980:78): apparmor="DENIED" operation="open" profile="libvirt-747934b6-0dee-4b80-b5ce-77921f20611b" name="/etc/ceph/ceph.client.oneadmin.keyring" pid=37989 comm="qemu-kvm-one" requested_mask="r" denied_mask="r" fsuid=1001 ouid=0

The cluster was deployed on three freshly installed and up-to-date Ubuntu 20.04 (Focal) VMs using /usr/share/one/oneprovision/edge-clusters/metal/provisions/onprem-hci.yml.
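
For anyone reproducing this, the provision was created roughly like this on the minione Front-end (interactive user inputs omitted):

oneprovision create /usr/share/one/oneprovision/edge-clusters/metal/provisions/onprem-hci.yml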

Disabling AppArmor for libvirt appeared to fix the issue for me; the VM started correctly afterwards.
EDIT: It did not actually solve the issue; the VM still ends up stuck in the same error state.
Where can I look next? (What I disabled is shown below, in case it matters.)
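
For reference, disabling AppArmor confinement for libvirt guests on Ubuntu amounts to roughly the following (only for testing, since it weakens the host’s security posture):

# in /etc/libvirt/qemu.conf, set:
security_driver = "none"

# then restart libvirtd:
systemctl restart libvirtd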

Hi @Cookiefamily, is your Front-end node able to monitor the provisioned host? Which kind of provision did you use?

Good morning @cgonzalez ,
All provisioned hosts can be monitored successfully by the Front-end node; they show as “ON” and have working graphs.
I used the new /usr/share/one/oneprovision/edge-clusters/metal/provisions/onprem-hci.yml and set them up as KVM hosts.
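
For completeness, this is what I checked on the Front-end:

onehost list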

In that case it would be good to know why the monitoring is putting the VM into the error state:

Mon May 23 14:19:18 2022 [Z0][VMM][I]: VM running but monitor state is ERROR.

Could you share the /var/log/one/monitor.log contents for that time?

Sure,

Mon May 23 14:10:51 2022 [Z0][HMM][I]: --Mark--
Mon May 23 14:11:00 2022 [Z0][HMM][I]: Successfully monitored host: 1
Mon May 23 14:12:15 2022 [Z0][HMM][I]: Successfully monitored host: 0
Mon May 23 14:12:15 2022 [Z0][HMM][I]: Successfully monitored host: 2
Mon May 23 14:13:00 2022 [Z0][HMM][I]: Successfully monitored host: 1
Mon May 23 14:14:15 2022 [Z0][HMM][I]: Successfully monitored host: 0
Mon May 23 14:14:15 2022 [Z0][HMM][I]: Successfully monitored host: 2
Mon May 23 14:15:00 2022 [Z0][HMM][I]: Successfully monitored host: 1
Mon May 23 14:16:15 2022 [Z0][HMM][I]: Successfully monitored host: 0
Mon May 23 14:16:16 2022 [Z0][HMM][I]: Successfully monitored host: 2
Mon May 23 14:17:01 2022 [Z0][HMM][I]: Successfully monitored host: 1
Mon May 23 14:18:16 2022 [Z0][HMM][I]: Successfully monitored host: 0
Mon May 23 14:18:16 2022 [Z0][HMM][I]: Successfully monitored host: 2
Mon May 23 14:19:01 2022 [Z0][HMM][I]: Successfully monitored host: 1
Mon May 23 14:19:20 2022 [Z0][HMM][I]: Successfully monitored VM: 2
Mon May 23 14:19:50 2022 [Z0][HMM][I]: Successfully monitored VM: 2
Mon May 23 14:20:16 2022 [Z0][HMM][I]: Successfully monitored host: 0
Mon May 23 14:20:16 2022 [Z0][HMM][I]: Successfully monitored host: 2
Mon May 23 14:20:21 2022 [Z0][HMM][I]: Successfully monitored VM: 2
Mon May 23 14:20:51 2022 [Z0][HMM][I]: --Mark--

and it just continues this way

The monitoring info looks correct. Can you share the libvirt logs for the VM? They should be available on the hypervisor node at /var/log/libvirt/qemu/one-<vm-id>.log

Oh that looks more promising:

2022-05-23 14:43:39.199+0000: starting up libvirt version: 6.0.0, package: 0ubuntu8.16 (Marc Deslauriers <marc.deslauriers@ubuntu.com> Wed, 20 Apr 2022 11:31:12 -0400), qemu version: 4.2.1 (Debian 1:4.2-3ubuntu6.21), kernel: 5.4.0-110-generic, hostname: lha-edge03
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin \
HOME=/var/lib/libvirt/qemu/domain-2-one-3 \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-2-one-3/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-2-one-3/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-2-one-3/.config \
QEMU_AUDIO_DRV=none \
/usr/bin/qemu-kvm-one \
-name guest=one-3,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-one-3/master-key.aes \
-machine pc-i440fx-focal,accel=kvm,usb=off,dump-guest-core=off \
-cpu qemu64 \
-m 128 \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 747934b6-0dee-4b80-b5ce-77921f20611b \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=31,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-device virtio-scsi-pci,id=scsi0,num_queues=1,bus=pci.0,addr=0x3 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 \
-object secret,id=libvirt-2-storage-secret0,data=EDCmkZuNRjw4iB+Z//AWfJaRP1QKZAHBebj2DuNd9Ro=,keyid=masterKey0,iv=hAlhSStdYRZdhMejQ/nOYA==,format=base64 \
-blockdev '{"driver":"rbd","pool":"one","image":"one-1-3-0","user":"oneadmin","auth-client-required":["cephx","none"],"key-secret":"libvirt-2-storage-secret0","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=libvirt-2-format,id=virtio-disk0,bootindex=1,write-cache=on \
-blockdev '{"driver":"file","filename":"/var/lib/one//datastores/101/3/disk.1","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":true,"driver":"raw","file":"libvirt-1-storage"}' \
-device ide-cd,bus=ide.0,unit=0,drive=libvirt-1-format,id=ide0-0-0 \
-chardev socket,id=charchannel0,fd=33,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-vnc 0.0.0.0:3 \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2022-05-23T14:43:40.058465Z qemu-kvm-one: warning: host doesn't support requested feature: CPUID.80000001H:ECX.svm [bit 2]
KVM: entry failed, hardware error 0x7
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00000663
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000
Code=04 66 41 eb f1 66 83 c9 ff 66 89 c8 66 5b 66 5e 66 5f 66 c3 <ea> 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2022-05-23T14:45:29.473713Z qemu-kvm-one: terminating on signal 15 from pid 7464 (/usr/sbin/libvirtd)
2022-05-23 14:45:29.684+0000: shutting down, reason=destroyed

Seems like it chose the qemu64 CPU type, and that one requests the svm feature, which the host doesn’t support? AFAIK svm is the flag an AMD CPU exposes when it supports hardware virtualization (the Intel equivalent is vmx).
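
A quick way to check which virtualization flag the (virtual) host CPU actually exposes:

grep -Eo 'vmx|svm' /proc/cpuinfo | sort -u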

I’m running the ONE-Hypervisor with nested virtualization; I think that might be the cause of this. I haven’t had issues with this in the past, though.
The ONE-Hypervisor runs on an Intel hypervisor with KVM and the CPU type “host”. This is the ONE-Hypervisor’s lscpu output:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           62
Model name:                      Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Stepping:                        4
CPU MHz:                         2999.998
BogoMIPS:                        5999.99
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        16 MiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pdcm pcid sse4_1
                                  sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms xsaveopt arat umip md_clear arch_capabilities

I haven’t run into this issue with other nested virtualization scenarios before, but I don’t think I have tried the qemu64 CPU type in any of them.
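
In case it helps anyone debugging a similar nested setup, whether the outer hypervisor has nested KVM enabled can be checked like this (use kvm_amd instead of kvm_intel on AMD):

cat /sys/module/kvm_intel/parameters/nested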

Is there a way to specify which CPU type ONE uses for the VM? Can I set it to host and try it that way?

OK, I found a way to set it to host (shown below); the svm warning can be ignored and is not the real cause of the issue.
The star of the show is the “KVM: entry failed, hardware error 0x7”, which seems to be a bug in some kernels and has appeared again and again in different bug reports, one of the more recent ones being https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1952246
It only happens when doing nested virtualization on certain kernel versions with certain CPUs.
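
For anyone who wants to try the same, setting the CPU model in the VM template looks roughly like this (KVM driver; check the OpenNebula documentation for the exact attribute names):

CPU_MODEL = [
  MODEL = "host-passthrough" ]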

I guess I’ll concede defeat then :roll_eyes: I might have to see whether I can get my hands on some real hardware to try out ONE.

Thank you for your help!

Hi @Cookiefamily,

Thank you for sharing the information! I hope you can find some hardware to give it a try soon :crossed_fingers:. In the meantime, if you are interested and can afford it, you could give it a try on one of the supported cloud providers.