V6.8.2 - KVM hosts and SAN storage

Hi,
Hopefully some of you can help us get our storage fixed.

We have deployed a small lab with two KVM hosts and a front-end node:

Front-end node (no direct connection to the storage)
lab01 (mgmt 10.106.0.11/24, storage 10.20.0.11/24)
lab03 (mgmt 10.106.0.13/24, storage 10.20.0.13/24)
Storage: Synology, 10.20.0.1/24 (dedicated physical network, only reachable from the KVM hosts)

[root@localhost ~]# onedatastore list
  ID NAME        SIZE  AVA CLUSTERS IMAGES TYPE DS TM      STAT
 113 lvm_image   71.3G 79% 100      2      img  fs fs_lvm_ on
 103 lvm_system  3.9T  99% 100      0      sys  -  fs_lvm_ on

[root@localhost ~]# onedatastore list
  ID NAME        SIZE  AVA CLUSTERS IMAGES TYPE DS TM      STAT
 113 lvm_image   10G   21% 100      2      img  fs fs_lvm_ on
 103 lvm_system  10G   21% 100      0      sys  -  fs_lvm_ on

The problem is that the storage only works some of the time 😊. The image datastore sometimes reports a total capacity of 71.3G (which I believe is local storage?), but about 50% of the time it reports 10GB.
Miraculously we have been able to deploy a VM that actually runs fine with network connectivity, but migrating it is another chapter to be looked into once the storage issues have been dealt with 😊
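
For reference, this is roughly how I have been comparing what each host reports for the image datastore path (run from the front-end as oneadmin, against the default OpenNebula base path, so treat it as a sketch rather than exact output):

for h in lab01 lab03; do
    echo "== $h =="
    ssh "$h" df -h /var/lib/one/datastores/113
done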

The front-end node has passwordless SSH login to the hosts.
Hosts: Ubuntu Server 22.04.4
2 LUNs:
4TB for system (vg-one-103)
1TB for image (vg-one-113)
Both hosts see:

pvs

  PV         VG         Fmt  Attr PSize     PFree
  /dev/sda3  ubuntu-vg  lvm2 a--   <146.00g   73.00g
  /dev/sdg   vg-one-113 lvm2 a--  <1000.00g <1000.00g
  /dev/sdh   vg-one-103 lvm2 a--     <3.91t    3.88t
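
And just to confirm both hosts really see the same volume groups, this is roughly what I run from the front-end (assuming the oneadmin user has passwordless sudo rights for the LVM tools on the hosts):

for h in lab01 lab03; do
    echo "== $h =="
    ssh "$h" sudo vgs vg-one-103 vg-one-113
done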

[root@localhost ~]# onedatastore show 103
DATASTORE 103 INFORMATION
ID             : 103
NAME           : lvm_system
USER           : oneadmin
GROUP          : oneadmin
CLUSTERS       : 100
TYPE           : SYSTEM
DS_MAD         : -
TM_MAD         : fs_lvm_ssh
BASE PATH      : /var/lib/one//datastores/103
DISK_TYPE      : BLOCK
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 3.9T
FREE:          : 3.9T
USED:          : 22G
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="NO"
BRIDGE_LIST="lab01 lab03 10.46.0.4"
DISK_TYPE="BLOCK"
DS_MIGRATE="YES"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="fs_lvm_ssh"
TM_MAD_SYSTEM="shared"
TYPE="SYSTEM_DS"

[root@localhost ~]# onedatastore show 113
DATASTORE 113 INFORMATION
ID             : 113
NAME           : lvm_image
USER           : oneadmin
GROUP          : oneadmin
CLUSTERS       : 100
TYPE           : IMAGE
DS_MAD         : fs
TM_MAD         : fs_lvm_ssh
BASE PATH      : /var/lib/one//datastores/113
DISK_TYPE      : BLOCK
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 71.3G
FREE:          : 55G
USED:          : 12.7G
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="NO"
BRIDGE_LIST="lab01 lab03 10.46.0.4"
CLONE_TARGET="SYSTEM"
DISK_TYPE="BLOCK"
DRIVER="raw"
DS_MAD="fs"
LN_TARGET="SYSTEM"
SAFE_DIRS="/var/tmp /tmp"
TM_MAD="fs_lvm_ssh"
TYPE="IMAGE_DS"

Any help would be greatly appreciated.

Kind regards
Svela

Hi @CloudClown :smiley:

Welcome to the OpenNebula Forum! Thank you so much for the detailed description of your setup, it really helps me understand your case.

I would like to share with you some considerations:

  • I’ve seen that you have the BRIDGE_LIST attribute configured. That’s OK, since your front-end node doesn’t have access to the storage. But, apart from the lab01 and lab03 hosts, what about the host with the IP 10.46.0.4? Is this host managed by OpenNebula?
  • I’ve noticed you’re using the fs_lvm_ssh driver for your datastores 103 and 113. With this driver the image files are transferred to the host over SSH and stored as symbolic links to the block devices. However, additional VM files like checkpoints or deployment files are stored under /var/lib/one/datastores/<id>, so this driver uses the host's local storage for that purpose. If this is not your expected behavior, maybe you should change the driver to fs_lvm instead (a minimal sketch of how to switch it is shown after this list). More info about this can be found here.
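
In case it helps, switching the driver from the CLI could look roughly like this. It is only a sketch written from memory, using the datastore IDs from your output:

# put the new TM_MAD in a small template file and merge it into both datastores
cat > /tmp/tm_mad.txt <<'EOF'
TM_MAD="fs_lvm"
EOF
onedatastore update 113 /tmp/tm_mad.txt --append
onedatastore update 103 /tmp/tm_mad.txt --append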

Best,
Victor.


Thank you for your feedback @vpalma !

The 10.46.0.4 is actually the front-end node; I added its IP to the bridge list to make sure it could reach the storage, but I have removed it now.

The docs about creating datastores were not very clear to me, but my impression was that NFS required 'fs_lvm' and that iSCSI-based block storage should use 'fs_lvm_ssh'. I've changed to 'fs_lvm' now and it looks much better!

The only thing now is the capacity the image datastore reports:
TOTAL:         : 71.3G
but volume group vg-one-113 is 1000GB.

Is there a parameter to adjust anywhere to correct this?
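
For what it's worth, this is roughly how I compare the two sizes (base path taken from the datastore output above; the comments are just my guess at where the 71.3G comes from):

# size of the volume group that backs the image LVs
ssh lab01 sudo vgs vg-one-113
# size of the filesystem behind the datastore base path; my guess is the reported 71.3G comes from here
ssh lab01 df -h /var/lib/one/datastores/113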

Kind regards
Svela

I have moved on from the original image datastore, as I never could figure out the issues with its size.

Now there are two datastores:

[root@localhost ~]# onedatastore show 103
DATASTORE 103 INFORMATION
ID             : 103
NAME           : lvm_system
USER           : oneadmin
GROUP          : oneadmin
CLUSTERS       : 100
TYPE           : SYSTEM
DS_MAD         : -
TM_MAD         : fs_lvm
BASE PATH      : /var/lib/one//datastores/103
DISK_TYPE      : BLOCK
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 3.9T
FREE:          : 3.9T
USED:          : 7.2G
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="NO"
BRIDGE_LIST="lab01 lab03"
DISK_TYPE="BLOCK"
DS_MIGRATE="YES"
RESTRICTED_DIRS="/"
SAFE_DIRS="/var/tmp"
SHARED="YES"
TM_MAD="fs_lvm"
TM_MAD_SYSTEM="shared"
TYPE="SYSTEM_DS"
[root@localhost ~]# onedatastore show 118
DATASTORE 118 INFORMATION
ID             : 118
NAME           : nfs-images
USER           : oneadmin
GROUP          : oneadmin
CLUSTERS       : 100
TYPE           : IMAGE
DS_MAD         : fs
TM_MAD         : fs_lvm
BASE PATH      : /var/lib/one//datastores/118
DISK_TYPE      : BLOCK
STATE          : READY

DATASTORE CAPACITY
TOTAL:         : 96.9G
FREE:          : 88G
USED:          : 4G
LIMIT:         : -

PERMISSIONS
OWNER          : um-
GROUP          : u--
OTHER          : ---

DATASTORE TEMPLATE
ALLOW_ORPHANS="NO"
CLONE_TARGET="SYSTEM"
DISK_TYPE="BLOCK"
DRIVER="raw"
DS_MAD="fs"
LN_TARGET="SYSTEM"
SAFE_DIRS="/var/tmp /tmp"
TM_MAD="fs_lvm"
TYPE="IMAGE_DS"

IMAGES
29
30
31
32
33

Now it's possible to build new VMs, but not to migrate them. It gives this error:

Fri Apr 12 11:05:01 2024 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_premigrate.
Fri Apr 12 11:05:01 2024 [Z0][VMM][I]: Successfully execute network driver operation: pre.
Fri Apr 12 11:05:04 2024 [Z0][VMM][I]: Command execution fail (exit code: 1): cat << 'EOT' | /var/tmp/one/vmm/kvm/migrate '50125069-4be7-4a34-8c0e-11285ebf23dc' 'lab01' 'lab03' 60 lab03
Fri Apr 12 11:05:04 2024 [Z0][VMM][I]: error: Cannot access storage file '/var/lib/one//datastores/103/60/disk.1': No such file or directory
Fri Apr 12 11:05:04 2024 [Z0][VMM][I]: Could not migrate 50125069-4be7-4a34-8c0e-11285ebf23dc to lab01
Fri Apr 12 11:05:04 2024 [Z0][VMM][I]: ExitCode: 1
Fri Apr 12 11:05:04 2024 [Z0][VMM][I]: Successfully execute transfer manager driver operation: tm_failmigrate.
Fri Apr 12 11:05:04 2024 [Z0][VMM][I]: Failed to execute virtualization driver operation: migrate.
Fri Apr 12 11:05:04 2024 [Z0][VMM][E]: MIGRATE: error: Cannot access storage file '/var/lib/one//datastores/103/60/disk.1': No such file or directory Could not migrate 50125069-4be7-4a34-8c0e-11285ebf23dc to lab01 ExitCode: 1
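
For reference, this is roughly what I check next on the two hosts (path and VM directory taken from the log above, so only a sketch):

# does the system datastore directory for this VM exist on both source and destination?
for h in lab01 lab03; do
    echo "== $h =="
    ssh "$h" ls -l /var/lib/one/datastores/103/60/
done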

On the other hand, if I change TM_MAD to fs_lvm_ssh, migration works, but then it's no longer possible to deploy new VMs… Is it possible to use the fs_lvm_ssh setting only when migrating somehow?

Kind regards
Svela

What error do you get on deploy after the change?
Did you change the TM_MAD on both the IMAGE and SYSTEM datastores?
The migration scripts are called only in the context of the SYSTEM datastore, so did you try to change the TM_MAD to fs_lvm_ssh on the SYSTEM datastore only?
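
If it helps, changing it on the SYSTEM datastore only would look roughly like this (just a sketch, using the IDs from your output):

# opens the datastore template in $EDITOR; change only the TM_MAD line
onedatastore update 103        # TM_MAD="fs_lvm" -> TM_MAD="fs_lvm_ssh"
# leave the IMAGE datastore 118 untouched (TM_MAD="fs_lvm")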

BR,
Anton