Error creating images from Ceph datastore after 5.12.0.1 upgrade

Hi

We have found a new issue in our clusters after upgrading to the 5.12.0.1 release: OpenNebula is not able to create or delete images in our Ceph datastore. oneimage create does not show any error:

# oneimage create -d ceph.altaria --name datatest3 --type DATABLOCK --size 400
ID: 14

but:

# oneimage list
  ID USER     GROUP    NAME                                                                                                                             DATASTORE     SIZE TYPE PER STAT RVMS
  14 oneadmin oneadmin datatest3                                                                                                                        ceph.altar    400M DB    No err     0

but oneimage show displays this strange error (it was working before the upgrade):

IMAGE TEMPLATE                                                                  
DEV_PREFIX="vd"
ERROR="Sat Aug 15 12:45:59 2020 : Error creating datablock: Datastore driver 'ceph' not available"

From the opennebula logs we can also see this error message:

2020-08-15T11:53:09.331756+02:00 one02 oned[676348]: [Z0][ImM][I]: Creating disk at of 400Mb (type: raw)
2020-08-15T11:53:09.344430+02:00 one02 oned[676348]: [Z0][ImM][E]: Error creating datablock: Datastore driver 'ceph' not available
2020-08-15T11:53:09.352447+02:00 one02 oned[676348]: [Z0][ImM][E]: Error monitoring datastore 100: Datastore driver 'ceph' not available
2020-08-15T11:53:14.847495+02:00 one02 one_monitor[676512]: [Z0][HMM][I]: Successfully monitored VM: 16
2020-08-15T11:53:22.567502+02:00 one02 oned[676348]: [Z0][AuM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate
2020-08-15T11:53:22.567626+02:00 one02 oned[676348]: [Z0][AuM][I]: login token expired
2020-08-15T11:53:22.567928+02:00 one02 oned[676348]: [Z0][AuM][E]: Auth Error: login token expired
2020-08-15T11:53:22.568042+02:00 one02 oned[676348]: [Z0][ReM][E]: Req:4464 UID:- one.vmpool.infoextended result FAILURE [one.vmpool.infoextended] User couldn't be authenticated, aborting call.

This setup was working before the upgrade; this is our Ceph datastore template:

ALLOW_ORPHANS="mixed"
BRIDGE_LIST="one20.swablu.os"
CEPH_HOST="ceph031.swablu.data ceph032.swablu.data ceph033.swablu.data"
CEPH_SECRET="xxxxxxxxxxxxxxxxxxxxxxxx"
CEPH_USER="libvirt"
CLONE_TARGET="SELF"
CLONE_TARGET_SHARED="SELF"
CLONE_TARGET_SSH="SYSTEM"
DATASTORE_CAPACITY_CHECK="yes"
DISK_TYPE="RBD"
DISK_TYPE_SHARED="rbd"
DISK_TYPE_SSH="FILE"
DRIVER="raw"
DS_MAD="ceph"
LN_TARGET="NONE"
LN_TARGET_SHARED="NONE"
LN_TARGET_SSH="SYSTEM"
NAME="ceph.swablu"
POOL_NAME="one"
QUATTOR="1"
RBD_FORMAT="2"
TM_MAD="ceph"
TM_MAD_SYSTEM="ssh,shared"
TYPE="IMAGE_DS"

Any idea why the Ceph auth is not working now? Has anyone else found the same issue?

Cheers
Álvaro

Hi

More info about this issue: we downgraded from 5.12.0.1 to 5.12.0 and now the Ceph datastore is working again:

$ oneimage create -d ceph.altaria --name data --type DATABLOCK --size 400
ID: 24
$ oneimage list
  ID USER     GROUP    NAME                                                                                                                             DATASTORE     SIZE TYPE PER STAT RVMS
  24 oneadmin oneadmin data                                                                                                                             ceph.altar    400M DB    No rdy     0
   6 oneadmin oneadmin node2201.shuppet.os_vda                                                                                                          ceph.altar     40G DB   Yes used    1

Has anyone else had the same issue using a Ceph datastore with 5.12.0.1? As far as we know, only Ceph was affected (our rdm datastore kept working without issues after the minor upgrade).

Cheers
Álvaro

Hello @alvaro_simongarcia,

Could you retry using the 5.12.0.1 version and, if the error appears again, replace /usr/lib/one/mads/one_datastore.rb with this version with debug information: one_datastore.rb (9.3 KB), then share your oned.log and the generated /tmp/debug?

NOTE: To generate /tmp/debug, just restart the OpenNebula service once the file has been replaced. Remember to back up the current version of the file so you can restore it later.

Hi @cgonzalez

Thanks a lot for the reply.
I have replaced the Ruby script and restarted OpenNebula. After that I tried to create a new image on 5.12.0.1 and it failed again; here is the output from the oned log and /tmp/debug:

messages.log


2020-08-17T14:06:04.523561+02:00 one11 systemd[1]: Started OpenNebula Cloud Controller Daemon.
2020-08-17T14:06:04.525222+02:00 one11 polkitd[749547]: Unregistered Authentication Agent for unix-process:3900238:468184510 (system bus name :1.195144, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnected from bus)
2020-08-17T14:06:04.528794+02:00 one11 one_monitor[3900435]: [Z0][HMM][I]: Raft status: SOLO
2020-08-17T14:06:06.644303+02:00 one11 oned[3900270]: [Z0][AuM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate
2020-08-17T14:06:06.644474+02:00 one11 oned[3900270]: [Z0][AuM][I]: login token expired
2020-08-17T14:06:06.644830+02:00 one11 oned[3900270]: [Z0][AuM][E]: Auth Error: login token expired
2020-08-17T14:06:06.645051+02:00 one11 oned[3900270]: [Z0][ReM][E]: Req:1952 UID:- one.vmpool.infoextended result FAILURE [one.vmpool.infoextended] User couldn't be authenticated, aborting call.


2020-08-17T14:06:14.709748+02:00 one11 oned[3900270]: [Z0][ImM][E]: Error monitoring datastore 100: Datastore driver 'ceph' not available

2020-08-17T14:06:26.772579+02:00 one11 oned[3900270]: [Z0][ImM][I]: Creating disk at of 400Mb (type: raw)
2020-08-17T14:06:26.788593+02:00 one11 oned[3900270]: [Z0][ImM][E]: Error creating datablock: Datastore driver 'ceph' not available

and from /tmp/debug:

# cat debug 
["dummy", "fs", "lvm", "dev", "iscsi_libvirt", "vcenter"]
["dummy", "fs", "lvm", "dev", "iscsi_libvirt", "vcenter"]

It looks like the ceph DS type is not in that list (and we didn't change our oned.conf file since the minor upgrade).
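For what it's worth, the driver list printed to /tmp/debug looks like it comes straight from the -d option in the DATASTORE_MAD arguments in oned.conf. A minimal Ruby sketch of that idea (the option names come from oned.conf; the parsing code itself is an assumption for illustration, not the actual one_datastore.rb implementation):

```ruby
require 'optparse'

# Arguments string as in our (broken) oned.conf, with ceph missing from -d
args = '-t 15 -d dummy,fs,lvm,dev,iscsi_libvirt,vcenter ' \
       '-s shared,ssh,ceph,fs_lvm,qcow2,vcenter'

ds_types = []
OptionParser.new do |opts|
  opts.on('-t THREADS')  { }                              # worker threads, unused here
  opts.on('-d TYPES')    { |t| ds_types = t.split(',') }  # datastore driver types
  opts.on('-s TYPES')    { }                              # system datastore types
end.parse!(args.split)

puts ds_types.inspect            # same list as /tmp/debug
puts ds_types.include?('ceph')   # prints "false"
```

This reproduces exactly the list seen in /tmp/debug, with ceph absent, which would explain the "Datastore driver 'ceph' not available" errors.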

Cheers
Álvaro

Could you share your oned.conf file?

Hi @cgonzalez

Yes, and we found the issue!
It was in our oned.conf indeed (but it was ignored until the 5.12.0.1 upgrade): ceph was missing from the -d list in our DATASTORE_MAD section:

 $ diff oned.conf oned.conf.nopasswd 
51c51
<     arguments = "-t 15 -d dummy,fs,lvm,ceph,dev,iscsi_libvirt,vcenter -s shared,ssh,ceph,fs_lvm,qcow2,vcenter",
---
>     arguments = "-t 15 -d dummy,fs,lvm,dev,iscsi_libvirt,vcenter -s shared,ssh,ceph,fs_lvm,qcow2,vcenter",

So we have changed that:

DATASTORE_MAD = [
    EXECUTABLE = "one_datastore",
    ARGUMENTS  = "-t 15 -d dummy,fs,lvm,ceph,dev,iscsi_libvirt,vcenter -s shared,ssh,ceph,fs_lvm,qcow2,vcenter"
]

Then we restarted the OpenNebula service, and after that Ceph images are managed correctly again.
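As a quick sanity check (a sketch only; the sample line is copied from the diff above, and the extraction logic is our own, not an official tool), you can pull the -d list out of the DATASTORE_MAD arguments and confirm ceph made it in:

```shell
# Sample ARGUMENTS line as it appears in the corrected oned.conf
line='ARGUMENTS  = "-t 15 -d dummy,fs,lvm,ceph,dev,iscsi_libvirt,vcenter -s shared,ssh,ceph,fs_lvm,qcow2,vcenter"'

# Extract the comma-separated driver list that follows -d
drivers=$(echo "$line" | sed -n 's/.*-d \([^ ]*\).*/\1/p')
echo "$drivers"

# Check that ceph is one of the listed drivers
case ",$drivers," in
  *,ceph,*) echo "ceph enabled" ;;
  *)        echo "ceph MISSING" ;;
esac
```

On a live frontend you would run the same sed against /etc/one/oned.conf instead of the sample line.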

It is weird that we didn't find this issue before; probably the value was ignored until 5.12.0.1?

Thanks a lot for the help, and sorry for the noise!

Cheers
Álvaro

2 Likes