Cannot Boot Image in LVM Datastore after Host Reboot

Hi all,

I’ve been testing ONE with an LVM datastore back-end for the past month, and everything went well until I tested a host-failure scenario. After I restarted all the KVM host nodes, all the previously deployed VMs were fine. But when I instantiate new VMs, they cannot boot. The VNC console shows “Boot failed: not a bootable disk” or “no bootable device”.

I’m still new to OpenNebula. Am I missing something? Your help is really appreciated.

Best regards,
Ryan


Versions of the related components and OS (frontend, hypervisors, VMs):
Current components:
OpenNebula version 5.8.1
frontend: centos 7
host-node1: centos 7 (KVM)
host-node2: centos 7 (KVM)
SAN node: using open-iscsi for shared LUN

Image Datastore

DATASTORE 100 INFORMATION                                                       
ID             : 100                 
NAME           : image_ds-lvm        
USER           : oneadmin            
GROUP          : oneadmin            
CLUSTERS       : 0                   
TYPE           : IMAGE               
DS_MAD         : fs                  
TM_MAD         : fs_lvm              
BASE PATH      : /var/lib/one//datastores/100
DISK_TYPE      : BLOCK               
STATE          : READY               

DATASTORE CAPACITY                                                              
TOTAL:         : 8G                  
FREE:          : 3.5G                
USED:          : 4.5G                
LIMIT:         : -                   

PERMISSIONS                                                                     
OWNER          : um-                 
GROUP          : u--                 
OTHER          : ---                 

DATASTORE TEMPLATE                                                              
ALLOW_ORPHANS="NO"
CLONE_TARGET="SYSTEM"
DISK_TYPE="BLOCK"
DRIVER="raw"
DS_MAD="fs"
LN_TARGET="SYSTEM"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
TM_MAD="fs_lvm"
TYPE="IMAGE_DS"

IMAGES         
42             
43             
44                   

System Datastore

DATASTORE 101 INFORMATION                                                       
ID             : 101                 
NAME           : lvm_system          
USER           : oneadmin            
GROUP          : oneadmin            
CLUSTERS       : 0                   
TYPE           : SYSTEM              
DS_MAD         : -                   
TM_MAD         : fs_lvm              
BASE PATH      : /var/lib/one//datastores/101
DISK_TYPE      : FILE                
STATE          : READY               

DATASTORE CAPACITY                                                              
TOTAL:         : 20G                 
FREE:          : 16.6G               
USED:          : 3.4G                
LIMIT:         : -                   

PERMISSIONS                                                                     
OWNER          : um-                 
GROUP          : u--                 
OTHER          : ---                 

DATASTORE TEMPLATE                                                              
ALLOW_ORPHANS="NO"
BRIDGE_LIST="node1 node2"
DISK_TYPE="FILE"
DS_MIGRATE="YES"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="fs_lvm"
TYPE="SYSTEM_DS"

IMAGES         

Steps to reproduce:
Power off/reboot all KVM host nodes.

Current results:
- All newly instantiated VMs cannot boot.
- The previously deployed guest VMs can still run.

Can you check for errors in /var/log/one/[VM_ID].log and /var/log/one/oned.log?
Also, can you check whether LVM is working fine after the reboot, e.g. whether you can create an LV?
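One way to run that LVM check is a small create/list/remove round trip on the host. The following is only a sketch: the helper names `run` and `lvm_smoke_test`, the test LV name `lv-smoke-test`, and the 64M size are made up for illustration, and the VG name you pass in depends on your setup (with fs_lvm it is normally vg-one-<system datastore ID>). Setting DRY_RUN=1 only prints the commands instead of executing them.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical smoke test: create a small LV in the given VG, list it,
# then remove it again. Stops at the first failing step.
lvm_smoke_test() {
    vg="$1"
    lv="lv-smoke-test"   # assumed name, must not clash with a real LV
    run lvcreate -y -L 64M -n "$lv" "$vg" &&
    run lvs "$vg/$lv" &&
    run lvremove -f "$vg/$lv"
}

# Usage on a KVM host, as root (VG name is an assumption):
#   lvm_smoke_test vg-one-101
```

If any of the three steps fails on a host after the reboot, the error message from that LVM command is worth posting here.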

Hi @jorel,

Sorry for the late reply. The logs seem fine; no errors are showing.

/var/log/one/148.log:

Fri Sep 13 17:47:57 2019 [Z0][VM][I]: New state is ACTIVE
Fri Sep 13 17:47:57 2019 [Z0][VM][I]: New LCM state is PROLOG
Fri Sep 13 17:52:03 2019 [Z0][VM][I]: New LCM state is BOOT
Fri Sep 13 17:52:03 2019 [Z0][VMM][I]: Generating deployment file: /var/lib/one/vms/148/deployment.0
Fri Sep 13 17:52:03 2019 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_context.
Fri Sep 13 17:52:04 2019 [Z0][VMM][I]: ExitCode: 0
Fri Sep 13 17:52:04 2019 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri Sep 13 17:52:04 2019 [Z0][VMM][I]: ExitCode: 0
Fri Sep 13 17:52:04 2019 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.
Fri Sep 13 17:52:04 2019 [Z0][VMM][I]: ExitCode: 0
Fri Sep 13 17:52:04 2019 [Z0][VMM][I]: Successfully execute network driver operation: post.
Fri Sep 13 17:52:04 2019 [Z0][VM][I]: New LCM state is RUNNING
Fri Sep 13 17:52:56 2019 [Z0][VM][I]: New LCM state is SHUTDOWN
Fri Sep 13 17:52:57 2019 [Z0][VMM][I]: ExitCode: 0
Fri Sep 13 17:52:57 2019 [Z0][VMM][I]: Successfully execute virtualization driver operation: cancel.
Fri Sep 13 17:52:57 2019 [Z0][VMM][I]: ExitCode: 0
Fri Sep 13 17:52:57 2019 [Z0][VMM][I]: Successfully execute network driver operation: clean.
Fri Sep 13 17:52:57 2019 [Z0][VM][I]: New LCM state is EPILOG
Fri Sep 13 17:53:00 2019 [Z0][VM][I]: New state is DONE
Fri Sep 13 17:53:00 2019 [Z0][VM][I]: New LCM state is LCM_INIT

oned.log:

Fri Sep 13 17:51:58 2019 [Z0][ReM][D]: Req:400 UID:1 one.documentpool.info result SUCCESS, "<DOCUMENT_POOL></DOC..."
Fri Sep 13 17:52:03 2019 [Z0][TM][D]: Message received: TRANSFER SUCCESS 148 -
Fri Sep 13 17:52:03 2019 [Z0][VMM][D]: Message received: LOG I 148 Successfully execute transfer manager driver operation: tm_context.
Fri Sep 13 17:52:04 2019 [Z0][VMM][D]: Message received: LOG I 148 ExitCode: 0
Fri Sep 13 17:52:04 2019 [Z0][VMM][D]: Message received: LOG I 148 Successfully execute network driver operation: pre.
Fri Sep 13 17:52:04 2019 [Z0][VMM][D]: Message received: LOG I 148 ExitCode: 0
Fri Sep 13 17:52:04 2019 [Z0][VMM][D]: Message received: LOG I 148 Successfully execute virtualization driver operation: deploy.
Fri Sep 13 17:52:04 2019 [Z0][VMM][D]: Message received: LOG I 148 ExitCode: 0
Fri Sep 13 17:52:04 2019 [Z0][VMM][D]: Message received: LOG I 148 Successfully execute network driver operation: post.
Fri Sep 13 17:52:04 2019 [Z0][VMM][D]: Message received: DEPLOY SUCCESS 148 one-148
Fri Sep 13 17:52:08 2019 [Z0][AuM][D]: Message received: AUTHENTICATE SUCCESS 9 -
Fri Sep 13 17:52:08 2019 [Z0][ReM][D]: Req:5328 UID:0 IP:127.0.0.1 one.vmpool.info invoked , -2, 0, -200, -1

On the host, LVM is also working. As mentioned in the first post, it can deploy the VM and create its LV; the only problem is that the VM cannot boot into the OS.

[root@one-kvm1 ~]# vgs
  VG          #PV #LV #SN Attr   VSize  VFree 
  cl_one-kvm1   1   2   0 wz--n- <9,00g     0 
  vg-one-101    1   2   0 wz--n- 19,96g 14,96g
[root@one-kvm1 ~]# lvs
  LV           VG          Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root         cl_one-kvm1 -wi-ao---- <8,00g                                                    
  swap         cl_one-kvm1 -wi-ao----  1,00g                                                    
  lv-one-149-0 vg-one-101  -wi-------  2,50g                                                    
  lv-one-150-0 vg-one-101  -wi-ao----  2,50g  
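In the listing above, lv-one-149-0 has the attributes `-wi-------` while lv-one-150-0 has `-wi-ao----`. The fifth character of the lvs attribute field is the LV state, and `a` there means active. Below is a minimal sketch for listing the LVs that are not active on a given host. The function name `inactive_lvs` is made up for illustration. On a shared LUN an LV can legitimately be inactive on hosts where its VM is not running, so this only narrows things down and does not by itself explain the boot failure.

```shell
# Hypothetical helper: read "lv_name lv_attr" pairs on stdin and print the
# names of LVs whose state character (5th character of lv_attr) is not "a".
inactive_lvs() {
    while read -r name attr; do
        [ -z "$name" ] && continue            # skip blank lines
        state=$(printf '%s' "$attr" | cut -c5)
        [ "$state" = "a" ] || echo "$name"
    done
}

# Usage on a KVM host, as root (VG name taken from the output above):
#   lvs --noheadings -o lv_name,lv_attr vg-one-101 | inactive_lvs
```

Fed the two vg-one-101 rows from the output above, it prints only lv-one-149-0.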

The only solution that works so far is to wipe the iSCSI LUN (recreate the PV and VG), then re-upload the images into the datastore. But I cannot use this as a permanent fix if I want to run ONE in production, since I cannot lose the currently running VMs.

Regards,
Ryan