Hosts in error after upgrading to 5.12.0

Please, describe the problem here and provide additional information below (if applicable) …

Upgrade from 5.10.1 to 5.12.0 with migration package on CentOS7 following the documentation. Everything went smooth until I wanted to deploy new VMs. All my hosts are now in ERROR. onehost sync fails to update hosts, with just this message:

Failed to update the following hosts:
* hostXXX

but ssh connection for oneadmin to hosts are ok.

When I try to enable any hosts, I’ve got this in oned.log:

Thu Jun 25 14:05:18 2020 [Z0][ReM][D]: Req:2464 UID:1 IP:127.0.0.1 one.documentpool.info invoked , -2, -1, -1, 100
Thu Jun 25 14:05:18 2020 [Z0][AuM][D]: Message received: LOG I 20 Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate

Thu Jun 25 14:05:18 2020 [Z0][AuM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate
Thu Jun 25 14:05:18 2020 [Z0][AuM][D]: Message received: LOG E 20 login token expired

Thu Jun 25 14:05:18 2020 [Z0][AuM][I]: login token expired
Thu Jun 25 14:05:18 2020 [Z0][AuM][D]: Message received: AUTHENTICATE FAILURE 20 login token expired

Thu Jun 25 14:05:18 2020 [Z0][AuM][E]: Auth Error: login token expired
Thu Jun 25 14:05:18 2020 [Z0][ReM][D]: Req:1872 UID:-1 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1, ""
Thu Jun 25 14:05:18 2020 [Z0][ReM][E]: Req:1872 UID:- one.vmpool.infoextended result FAILURE [one.vmpool.infoextended] User couldn't be authenticated, aborting call.

Is there something new here ? Where can I configure this IP >> Req:1872 UID:-1 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1, “”

It also seems that the scheduler is constently restarting. Can’t find any error in the sched.log, just this:

juin 25 14:02:07 t110.lan systemd[1]: Started OpenNebula Cloud Scheduler Daemon.
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: Init Scheduler Log system
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: Starting Scheduler Daemon
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: ----------------------------------------
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]:      Scheduler Configuration File
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: ----------------------------------------
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: DEFAULT_DS_SCHED=POLICY=1
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: DEFAULT_NIC_SCHED=POLICY=1
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: DEFAULT_SCHED=POLICY=1
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: DIFFERENT_VNETS=YES
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: LIVE_RESCHEDS=0
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: LOG=DEBUG_LEVEL=5,SYSTEM=syslog
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: MAX_DISPATCH=30
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: MAX_HOST=1
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: MAX_VM=5000
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: MEMORY_SYSTEM_DS_SCALE=0
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: MESSAGE_SIZE=1073741824
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: ONE_XMLRPC=http://localhost:2633/RPC2
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: SCHED_INTERVAL=30
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: TIMEOUT=60
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: ----------------------------------------
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: XML-RPC client using 1073741824 bytes for response buffer.
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: oned successfully contacted.
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: Configuring scheduler for Zone ID: 0
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: Starting scheduler loop...
juin 25 14:02:07 t110.lan mm_sched[6213]: [Z0][SCHED][I]: Scheduler loop started.
juin 25 14:02:37 t110.lan systemd[1]: opennebula-scheduler.service: main process exited, code=killed, status=6/ABRT
juin 25 14:02:37 t110.lan systemd[1]: Unit opennebula-scheduler.service entered failed state.
juin 25 14:02:37 t110.lan systemd[1]: opennebula-scheduler.service failed.
juin 25 14:02:43 t110.lan systemd[1]: opennebula-scheduler.service holdoff time over, scheduling restart.
juin 25 14:02:43 t110.lan systemd[1]: Stopped OpenNebula Cloud Scheduler Daemon.
juin 25 14:02:43 t110.lan systemd[1]: Starting OpenNebula Cloud Scheduler Daemon...
juin 25 14:02:43 t110.lan systemd[1]: Started OpenNebula Cloud Scheduler Daemon.

Versions of the related components and OS (frontend, hypervisors, VMs):

up-to-date CentOS7
OpenNebula 5.12.0 RPMS

Steps to reproduce:

Upgrade from 5.10 following the documentation

Current results:

Hosts are all in ERROR state with the following message in sunstone:

Thu Jun 25 14:07:00 2020 : Error monitoring Host t110.lan (2):

Expected results:

Hosts should be OK

I also have this when mm_sched abort:
type=ANOM_ABEND msg=audit(1593088003.125:2510): auid=4294967295 uid=9869 gid=9869 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=10302 comm="mm_sched" reason="memory violation" sig=6

I don’t know if the scheduler and the hosts errors are related. But it seems that the scheduler crash because of UTF8. Here is a strace output for mm_shed:

    11647 sendto(4, "POST /RPC2 HTTP/1.1\r\nHost: localhost:2633\r\nAccept: */*\r\nContent-Type: text/xml\r\nUser-Agent: Xmlrpc-c/1.40.0 Curl/7.29.0\r\nContent-Length: 402\r\n\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<methodCall>\r\n<methodName>one.vmpool.infoextended</methodName>\r\n<params>\r\n<param><value><string>oneadmin:7cc0bdbc0f9b02b8f598ada6ef3171b7</string></value></param>\r\n<param><value><i4>-2</i4></value></param>\r\n<param><value><i4>-1</i4></value></param>\r\n<param><value><i4>-1</i4></value></param>\r\n<param><value><i4>-1</i4></value></param>\r\n</params>\r\n</methodCall>\r\n", 546, MSG_NOSIGNAL, NULL, 0) = 546
    11647 poll([{fd=4, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 0 (Timeout)
    11647 pselect6(5, [4], [], [], {tv_sec=3, tv_nsec=0}, {NULL, 8}) = 1 (in [4], left {tv_sec=2, tv_nsec=995252388})
    11647 poll([{fd=4, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=4, revents=POLLIN|POLLRDNORM}])
    11647 recvfrom(4, "HTTP/1.1 200 OK\r\nContent-type: text/xml; charset=utf-8\r\nContent-length: 579\r\nConnection: close\r\nDate: Thu, 25 Jun 2020 13:47:55 UTC\r\nServer: Xmlrpc-c_Abyss/1.40.0\r\n\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<methodResponse>\r\n<fault>\r\n<value><struct>\r\n<member><name>faultCode</name>\r\n<value><i4>-500</i4></value></member>\r\n<member><name>faultString</name>\r\n<value><string>Unexpected error executing code for particular method, detected by Xmlrpc-c method registry code.  Method did not fail; rather, it did not complete at all.  385050-byte supposed UTF-8 string is not valid UTF-8.  UTF-8 multibyte sequence contains character 0x20, which does not indicate continuation.</string></value></member>\r\n</struct></value>\r\n</fault>\r\n</methodResponse>\r\n", 16384, 0, NULL, NULL) = 745
    11647 close(4)                          = 0
    11647 futex(0x7f177970f190, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    11647 write(2, "terminate called after throwing an instance of '", 48) = 48
    11647 write(2, "girerr::error", 13)     = 13
    11647 write(2, "'\n", 2)                = 2
    11647 write(2, "  what():  ", 11)       = 11
    11647 write(2, "Reference to xmlrpc_c::value that has not been instantiated.  (xmlrpc_c::value::isInstantiated may be useful in diagnosing)", 123) = 123
    11647 write(2, "\n", 1)                 = 1
    11647 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
    11647 tgkill(11644, 11647, SIGABRT)     = 0
    11647 --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=11644, si_uid=9869} ---
    11644 <... rt_sigtimedwait resumed> <unfinished ...>) = ?
    11647 +++ killed by SIGABRT +++
    11644 +++ killed by SIGABRT +++

Any idea ?

Best regards,
Edouard

ok so I have converted all my tables to UTF8 and now mm_sched is not crashing anymore. Do I need to open a ticket somewhere ? mm_sched could be more resilient, or onedb fsck could check for table encoding ?

But still, I can’t get opennebula to work correctly. Hosts are still in error. It seems to be related to oneflow, and encoding:

Thu Jun 25 15:14:31 2020 [Z0][ReM][E]: Req:1296 UID:1 one.document.info result FAILURE [one.document.info] Error getting document [53].
Thu Jun 25 15:14:31 2020 [Z0][ReM][D]: Req:5888 UID:1 IP:127.0.0.1 one.document.info invoked , 53, false
Thu Jun 25 15:14:31 2020 [Z0][ReM][E]: Req:5888 UID:1 one.document.info result FAILURE [one.document.info] Error getting document [53].

These documents all have accents (é) in their template. Could it be related ?

Best regards,
Edouard

I’d delete /var/tmp/one on the hosts and retry the host sync (It should succeed).

Best Regards,
Anton Todorov

nope, I have deleted the /var/tmp/one, onehost sync (even with --force and/or --ssh) is still failing. But the files are back in /var/tmp/one

Thanks for your help

Another option - the monitoring in 5.12 was reworked, so check the new monitoring setup for something like “host is not reporting to the frontend ip”. I’d check does the hosts send the monitoring data to the frontend.

Best Regards,
Anton Todorov

can’t find anything on this monitord part. What troubles me the most are those repeating errors:

/var/log/one/oned.log:Fri Jun 26 11:20:13 2020 [Z0][AuM][I]: authenticate: Authenticating serveradmin, with password XXX (YYY)
/var/log/one/oned.log-Fri Jun 26 11:20:13 2020 [Z0][AuM][D]: Message received: LOG E 31 login token expired
/var/log/one/oned.log-
/var/log/one/oned.log-Fri Jun 26 11:20:13 2020 [Z0][AuM][I]: login token expired
/var/log/one/oned.log-Fri Jun 26 11:20:13 2020 [Z0][AuM][D]: Message received: AUTHENTICATE FAILURE 31 login token expired

I don’t understand how this login token could have expired.

Hi @madko

There are a couple of issues, so let us go through them:

  • Database encoding. So this encoding should be preserved by the upgrade process and should be not a problem if it was not a problem in the previous version of OpenNebula. So how did you change the encoding of the DB? What is the output of (assuming MySQL and db_name = opennebula)

    select default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = 'opennebula';
    

    and

    select CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA ' WHERE CCSA.collation_name = T.table_collation AND T.table_schema = 'opennebula' AND T.table_name = 'system_attributes';
    

    For example in my case I have in both cases:

    +----------------------------+
    | default_character_set_name |
    +----------------------------+
    | latin1                     |
    +----------------------------+
    1 row in set (0.00 sec)
    
  • Then you have problems monitoring the hosts. Please check the upgrade notes, as monitoring has changed in the last version. If you follow the steps:

    1. Increase logging information for monitord in /etc/one/monitord.conf to 5 and send the /var/log/one/monitor.log file
    2. Check that OpenNebula has no problems accessing the hosts (onehost show ) we are looking for failures getting the host from the DB.
  • And then you have some problems with accessing documents can you execute (oneflow-template list) for example. Again we are looking for a confirmation of a DB problem here.

  • Finally the expiration token can be perfectly normal from server-admin user by Sunstone, it should refresh the token.

Cheers

Ruben

Thank you Ruben for your help, that’s the first time I see a so complete mess after an upgrade. I agree there was multiple problems.

So first, for the encoding, you are right the encoding was preserved, it was full latin1. But I wasn’t able to read some documents (images, flows), any document containing é ç, encoded in utf8 (inside latin1 tables). I guess that is from sunstone. But before upgrade mm_sched never crashed because of that.

So I switched from latin1 contening utf8 (I guess from sunstone inputs from users), to plain utf8:

MariaDB [opennebula]> select default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = 'opennebula';                                                                                                                    +----------------------------+
| default_character_set_name |
+----------------------------+
| utf8                       |
+----------------------------+
1 row in set (0,000 sec)

MariaDB [opennebula]> select CCSA.character_set_name FROM information_schema.`TABLES` T, information_schema.`COLLATION_CHARACTER_SET_APPLICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = 'opennebula' AND T.table_name = 'system_attributes';
+--------------------+
| character_set_name |
+--------------------+
| utf8               |
+--------------------+
1 row in set (0,001 sec)

MariaDB [opennebula]> 

After that, mm_sched is not crashing anymore. But I can’t still read some documents:
oneflow show 99
[one.document.info] Error getting document [99].

I’m trying to disable oneflow to fix the more urgent problems first.

For monitord, I’m not sure, was there something to do during the upgrade process ?

I have already put loglevel at 5 for monitord but I don’t find any valuable information:

Sat Jun 27 10:29:00 2020 [Z0][MON][I]: Init Monitor Log system
Sat Jun 27 10:29:00 2020 [Z0][MON][I]: Starting Monitor Daemon
----------------------------------------
       Monitor Configuration File       
----------------------------------------
DATASTORE_LOCATION=/var/lib/one//datastores
DB=CONNECTIONS=15
HOST_MONITORING_EXPIRATION_TIME=43200
IM_MAD=ARGUMENTS=-r 3 -t 15 -w 90 kvm,EXECUTABLE=one_im_ssh,NAME=kvm,SUNSTONE_NAME=KVM,THREADS=0
IM_MAD=ARGUMENTS=-r 3 -t 15 -w 90 lxd,EXECUTABLE=one_im_ssh,NAME=lxd,SUNSTONE_NAME=LXD,THREADS=0
IM_MAD=ARGUMENTS=-r 3 -t 15 -w 90 firecracker,EXECUTABLE=one_im_ssh,NAME=firecracker,SUNSTONE_NAME=Firecracker,THREADS=0
IM_MAD=ARGUMENTS=-c -t 15 -r 0 vcenter,EXECUTABLE=one_im_sh,NAME=vcenter,SUNSTONE_NAME=VMWare vCenter
LOG=DEBUG_LEVEL=5,SYSTEM=FILE
MANAGER_TIMER=15
MONITORING_INTERVAL_HOST=180
NETWORK=ADDRESS=192.168.2.172,MONITOR_ADDRESS=192.168.2.172,PORT=4124,PRIKEY=,PUBKEY=,THREADS=8
PROBES_PERIOD=BEACON_HOST=30,MONITOR_HOST=120,MONITOR_VM=30,STATE_VM=5,SYNC_STATE_VM=180,SYSTEM_HOST=600
VM_MONITORING_EXPIRATION_TIME=43200
----------------------------------------
Sat Jun 27 10:29:00 2020 [Z0][ONE][I]: Set up 15 DB connections using encoding utf8
Sat Jun 27 10:29:00 2020 [Z0][ONE][I]: Using 10.3.20-mariadb (100320) with:
Sat Jun 27 10:29:00 2020 [Z0][ONE][I]: 	FTS enabled
Sat Jun 27 10:29:00 2020 [Z0][DrM][I]: Loading drivers.
Sat Jun 27 10:29:00 2020 [Z0][InM][I]: Loading driver: kvm
Sat Jun 27 10:29:00 2020 [Z0][InM][I]: 	Driver loaded: kvm
Sat Jun 27 10:29:00 2020 [Z0][InM][I]: Loading driver: lxd
Sat Jun 27 10:29:00 2020 [Z0][InM][I]: 	Driver loaded: lxd
Sat Jun 27 10:29:00 2020 [Z0][InM][I]: Loading driver: firecracker
Sat Jun 27 10:29:00 2020 [Z0][InM][I]: 	Driver loaded: firecracker
Sat Jun 27 10:29:00 2020 [Z0][InM][I]: Loading driver: vcenter
Sat Jun 27 10:29:00 2020 [Z0][InM][I]: 	Driver loaded: vcenter
Sat Jun 27 10:29:04 2020 [Z0][HMM][I]: Raft status: SOLO
Sat Jun 27 10:32:19 2020 [Z0][HMM][D]: Monitoring host t110.lan(2)
Sat Jun 27 10:32:21 2020 [Z0][MDP][I]: 
Sat Jun 27 10:32:21 2020 [Z0][MDP][I]: 
Sat Jun 27 10:32:21 2020 [Z0][MDP][I]: 
Sat Jun 27 10:32:21 2020 [Z0][MDP][I]: 
Sat Jun 27 10:32:21 2020 [Z0][MDP][I]: 
Sat Jun 27 10:32:21 2020 [Z0][MDP][D]: [1:0:0] Recieved START_MONITOR message from host 2:

Sat Jun 27 10:32:21 2020 [Z0][MDP][W]: Start monitor failed for host 2: 
Sat Jun 27 10:32:21 2020 [Z0][HMM][E]: Unable to monitor host id: 2
Sat Jun 27 10:35:34 2020 [Z0][HMM][D]: Monitoring host t110.lan(2)
Sat Jun 27 10:35:35 2020 [Z0][MDP][I]: 
Sat Jun 27 10:35:35 2020 [Z0][MDP][I]: 
Sat Jun 27 10:35:35 2020 [Z0][MDP][I]: 
Sat Jun 27 10:35:35 2020 [Z0][MDP][I]: 
Sat Jun 27 10:35:35 2020 [Z0][MDP][I]: 
Sat Jun 27 10:35:35 2020 [Z0][MDP][D]: [1:0:0] Recieved START_MONITOR message from host 2:

Sat Jun 27 10:35:35 2020 [Z0][MDP][W]: Start monitor failed for host 2: 
Sat Jun 27 10:35:35 2020 [Z0][HMM][E]: Unable to monitor host id: 2
Sat Jun 27 10:38:49 2020 [Z0][HMM][D]: Monitoring host t110.lan(2)
Sat Jun 27 10:38:50 2020 [Z0][MDP][I]: 
Sat Jun 27 10:38:50 2020 [Z0][MDP][I]: 
Sat Jun 27 10:38:50 2020 [Z0][MDP][I]: 
Sat Jun 27 10:38:50 2020 [Z0][MDP][I]: 
Sat Jun 27 10:38:50 2020 [Z0][MDP][I]: 
Sat Jun 27 10:38:50 2020 [Z0][MDP][D]: [1:0:0] Recieved START_MONITOR message from host 2:

Sat Jun 27 10:38:50 2020 [Z0][MDP][W]: Start monitor failed for host 2: 
Sat Jun 27 10:38:50 2020 [Z0][HMM][E]: Unable to monitor host id: 2
Sat Jun 27 10:39:04 2020 [Z0][HMM][I]: --Mark--
Sat Jun 27 10:42:04 2020 [Z0][HMM][D]: Monitoring host t110.lan(2)
Sat Jun 27 10:42:05 2020 [Z0][MDP][I]: 
Sat Jun 27 10:42:05 2020 [Z0][MDP][I]: 
Sat Jun 27 10:42:05 2020 [Z0][MDP][I]: 
Sat Jun 27 10:42:05 2020 [Z0][MDP][I]: 
Sat Jun 27 10:42:05 2020 [Z0][MDP][I]: 
Sat Jun 27 10:42:05 2020 [Z0][MDP][D]: [1:0:0] Recieved START_MONITOR message from host 2:

Sat Jun 27 10:42:05 2020 [Z0][MDP][W]: Start monitor failed for host 2: 
Sat Jun 27 10:42:05 2020 [Z0][HMM][E]: Unable to monitor host id: 2

To simplify I have disabled all my hosts but one, the t110.lan is a host and also the frontend.

So the expired token is not a good lead, good to know.
I know that’s a lot of problems, I also have a new strange “search_data” column in sunstone VM view. Not a priority at the moment.
Thanks again for your help, we need to fix that before going to production.

Best regards,
Edouard

stracing monitord process gives me this:

[pid 10724] clock_gettime(CLOCK_REALTIME, {tv_sec=1593426551, tv_nsec=127675185}) = 0
[pid 10724] futex(0x7ffd4e9be670, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 10724] futex(0x7ffd4e9be6a4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 7, {tv_sec=1593426566, tv_nsec=127675185}, 0xffffffff <unfinished ...>
[pid 10634] <... read resumed>"UPDATE_HOST - 2 0 eJy1VtuSmzgQ/RXtW/JgA+JmuygqDNYaNsNlAXtnNpWiXLYy4xoMDmPPZvbrV0iABHE2eZkHm+4+R92tVutieVGa2Za/tKElkX8rdAJknxVFnhbb0pKoaqWZkyFbtiQmWHGCNnlvFDTLD/LAIW5+++Quncz59PRy/PzZJq5b++YHeGd3b9dphpKcpEI8C1qH2Hv8ZXspzj1oW80U8tRzEhI/QEG+Tp0VshVzBnVzTqbFbZYbr1sRao3/XrWyKHNuc0Jt4nKltRMitzeKFTh3HbsTqa1ldqKVrMPQD1f5Jkhtw5JElRQjzTMvQWTiSlOCXrOa2qRZlCDCWvrpxy7LrmwyLZqI/J4glDcGTtKV+Vw3KVNAm8yGPNXU5ZmsUSJH1ylajolQlU1VoUQBlsRsY9fPl2jju40iDbRwHTh5GC2pTD5kOaKELUoqTG0xYRGYeYmWvktaS+iZMGKz50gzPY4r/ZRpY4+K1likNvCV8MqbhVd+JTx8s/DwV8KrbxZeHYf31isU9607XiVma/B0DLXG1P9bGAVlbUZRZr66XVqjxCOTXRsl93SLZU7oXttgrX2Yo6rIhqHPDSFVejSIW2WuqIbG3LTY/yXVbCehiDPNlDWt32g2PcRoqs3Gyb/v6t4ssZ0liZtNEk9IevIQpqmy5SA/fS73sjHndlOHgizatXYpqbMMBfEtPf2dxPV4Zt9mRm6wSTCgaaoYiRNVoan3/cYglCRRwhlBVYI/LiWAc6DAhaIudA1AGcpgAVBdVzU4VuXhXNWH8gF41fMZdHcXeAffLwB13rpsqsDus855x6UsDnv3MUo2firm0V1UIvazy+4judWaKyYgyyD0Rrot96839WH/gCf+TZLSMSPuQBW2gDYzwAmX58Pl2H1hJ6idcKorsKtqvL9UoISmTBXYaF/x8aJCQBIl/7vTxdAm9SMudEE2GtTQKJV83KqsKwxiXNavJQjx47bAx+5L0wd/4efzEde4F5hZmCYYTxn4Lx"..., 2047) = 1423
[pid 10634] read(0,  <unfinished ...>
[pid 10724] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid 10724] clock_gettime(CLOCK_REALTIME, {tv_sec=1593426566, tv_nsec=128045580}) = 0
[pid 10724] sendto(34, "=\0\0\0\3DELETE FROM host_monitoring WHERE last_mon_time < 1593383366", 65, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 65
[pid 10724] recvfrom(34, "\7\0\0\1\0\0\0\"\0\0\0", 16384, MSG_DONTWAIT, NULL, NULL) = 11
[pid 10724] sendto(35, "7\0\0\0\3DELETE FROM vm_monitoring WHERE last_poll < 1593383366", 59, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 59
[pid 10724] recvfrom(35, "\7\0\0\1\0\0\0\"\0\0\0", 16384, MSG_DONTWAIT, NULL, NULL) = 11
[pid 10724] clock_gettime(CLOCK_REALTIME, {tv_sec=1593426566, tv_nsec=129493222}) = 0

or

[pid 13619] recvfrom(35, "\7\0\0\1\0\0\0\"\0\0\0", 16384, MSG_DONTWAIT, NULL, NULL) = 11
[pid 13619] sendto(36, "7\0\0\0\3DELETE FROM vm_monitoring WHERE last_poll < 1593383825", 59, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 59
[pid 13619] recvfrom(36, 0x5615edfb41f0, 16384, MSG_DONTWAIT, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 13619] poll([{fd=36, events=POLLIN}], 1, -1) = 1 ([{fd=36, revents=POLLIN}])
[pid 13619] recvfrom(36, "\7\0\0\1\0\0\0\"\0\0\0", 16384, MSG_DONTWAIT, NULL, NULL) = 11
[pid 13619] clock_gettime(CLOCK_REALTIME, {tv_sec=1593427025, tv_nsec=57558218}) = 0

Is it a timeout from the Mysql DB connection ?

Best regards,
Edouard

Hello @madko

Could you please share the output of the following command to see if running the monitoring probes manually we can see any error?

echo '<MONITOR_CONFIGURATION><DATASTORE_LOCATION><![CDATA[/var/lib/one//datastores]]></DATASTORE_LOCATION><DB><CONNECTIONS><![CDATA[15]]></CONNECTIONS></DB><HOST_MONITORING_EXPIRATION_TIME><![CDATA[43200]]></HOST_MONITORING_EXPIRATION_TIME><IM_MAD><ARGUMENTS><![CDATA[-r 3 -t 15 -w 90 kvm]]></ARGUMENTS><EXECUTABLE><![CDATA[one_im_ssh]]></EXECUTABLE><NAME><![CDATA[kvm]]></NAME><SUNSTONE_NAME><![CDATA[KVM]]></SUNSTONE_NAME><THREADS><![CDATA[0]]></THREADS></IM_MAD><IM_MAD><ARGUMENTS><![CDATA[-r 3 -t 15 -w 90 lxd]]></ARGUMENTS><EXECUTABLE><![CDATA[one_im_ssh]]></EXECUTABLE><NAME><![CDATA[lxd]]></NAME><SUNSTONE_NAME><![CDATA[LXD]]></SUNSTONE_NAME><THREADS><![CDATA[0]]></THREADS></IM_MAD><IM_MAD><ARGUMENTS><![CDATA[-r 3 -t 15 -w 90 firecracker]]></ARGUMENTS><EXECUTABLE><![CDATA[one_im_ssh]]></EXECUTABLE><NAME><![CDATA[firecracker]]></NAME><SUNSTONE_NAME><![CDATA[Firecracker]]></SUNSTONE_NAME><THREADS><![CDATA[0]]></THREADS></IM_MAD><IM_MAD><ARGUMENTS><![CDATA[-c -t 15 -r 0 vcenter]]></ARGUMENTS><EXECUTABLE><![CDATA[one_im_sh]]></EXECUTABLE><NAME><![CDATA[vcenter]]></NAME><SUNSTONE_NAME><![CDATA[VMWare vCenter]]></SUNSTONE_NAME></IM_MAD><IM_MAD><ARGUMENTS><![CDATA[-r 3 -t 15 -w 90 dummy]]></ARGUMENTS><EXECUTABLE><![CDATA[one_im_sh]]></EXECUTABLE><NAME><![CDATA[dummy]]></NAME><SUNSTONE_NAME><![CDATA[Dummy]]></SUNSTONE_NAME><THREADS><![CDATA[0]]></THREADS></IM_MAD><LOG><DEBUG_LEVEL><![CDATA[3]]></DEBUG_LEVEL><SYSTEM><![CDATA[FILE]]></SYSTEM></LOG><MANAGER_TIMER><![CDATA[15]]></MANAGER_TIMER><MONITORING_INTERVAL_HOST><![CDATA[180]]></MONITORING_INTERVAL_HOST><NETWORK><ADDRESS><![CDATA[0.0.0.0]]></ADDRESS><MONITOR_ADDRESS><![CDATA[auto]]></MONITOR_ADDRESS><PORT><![CDATA[4124]]></PORT><PRIKEY><![CDATA[]]></PRIKEY><PUBKEY><![CDATA[]]></PUBKEY><THREADS><![CDATA[8]]></THREADS></NETWORK><PROBES_PERIOD><BEACON_HOST><![CDATA[30]]></BEACON_HOST><MONITOR_HOST><![CDATA[120]]></MONITOR_HOST><MONITOR_VM><![CDATA[30]]></MONITOR_VM><STATE_VM><![CDATA[5]]></STATE_VM><SYNC_STATE_VM><![CDATA[180]]></SYNC_STATE_VM><SYSTEM_HOST><![CDATA[600]]></SYSTEM_HOST></PROBES_PERIOD><VM_MONITORING_EXPIRATION_TIME><![CDATA[43200]]></VM_MONITORING_EXPIRATION_TIME><HOST_ID>0</HOST_ID></MONITOR_CONFIGURATION>' | /var/tmp/one/im/run_monitord_client kvm 0 localhost-kvm

Note that the XML we are passing via STDIN is the content of monitord.conf encoded in XML. In order for the command to works correctly you need to replace the value of <MONITOR_ADDRESS><![CDATA[auto]]></MONITOR_ADDRESS> by your frontend IP address.

Hi @cgonzalez,

with your command monitord received something !

Mon Jun 29 19:00:44 2020 [Z0][MDP][D]: [19:0:44] Recieved BEACON_HOST message from host 2:
1593450044
Mon Jun 29 19:00:45 2020 [Z0][MDP][D]: [19:0:45] Recieved MONITOR_HOST message from host 2:
HYPERVISOR=kvm
USEDMEMORY=1180020
FREEMEMORY=31733616
FREECPU=376
USEDCPU=24
NETRX=58317933
NETTX=28052114
HUGEPAGE = [ NODE_ID = "0", SIZE = "2048", FREE = "0" ]
MEMORY_NODE = [ NODE_ID = "0", FREE = "30990036", USED = "1923600" ]
Mon Jun 29 19:00:45 2020 [Z0][HMM][W]: message ignored for unknown host 2
Mon Jun 29 19:00:46 2020 [Z0][MDP][D]: [19:0:46] Recieved SYSTEM_HOST message from host 2:
ARCH=x86_64
MODELNAME="Intel(R) Xeon(R) CPU E31220 @ 3.10GHz"
HYPERVISOR=kvm
TOTALCPU=400
CPUSPEED=3000
TOTALMEMORY=32913636
KVM_MACHINES="pc-i440fx-rhel7.6.0 pc pc-i440fx-rhel7.0.0 pc-q35-rhel7.6.0 q35 rhel6.3.0 rhel6.4.0 rhel6.0.0 pc-i440fx-rhel7.5.0 pc-i440fx-rhel7.1.0 pc-i440fx-rhel7.2.0 pc-q35-rhel7.3.0 rhel6.5.0 pc-q35-rhel7.4.0 rhel6.6.0 rhel6.1.0 rhel6.2.0 pc-i440fx-rhel7.3.0 pc-i440fx-rhel7.4.0 pc-q35-rhel7.5.0"
KVM_CPU_MODELS="486 pentium pentium2 pentium3 pentiumpro coreduo n270 core2duo qemu32 kvm32 cpu64-rhel5 cpu64-rhel6 kvm64 qemu64 Conroe Penryn Nehalem Nehalem-IBRS Westmere Westmere-IBRS SandyBridge SandyBridge-IBRS IvyBridge IvyBridge-IBRS Haswell-noTSX Haswell-noTSX-IBRS Haswell Haswell-IBRS Broadwell-noTSX Broadwell-noTSX-IBRS Broadwell Broadwell-IBRS Skylake-Client Skylake-Client-IBRS Skylake-Server Skylake-Server-IBRS Icelake-Client Icelake-Server athlon phenom Opteron_G1 Opteron_G2 Opteron_G3 Opteron_G4 Opteron_G5 EPYC EPYC-IBPB"
KVM_CPU_MODEL="SandyBridge-IBRS"
DS_LOCATION_USED_MB  = 3233813
DS_LOCATION_TOTAL_MB = 3750804
DS_LOCATION_FREE_MB  = 516875
HOSTNAME=t110.lan
HUGEPAGE = [ NODE_ID = "0", SIZE = "2048", PAGES = "0" ]
CORE = [ NODE_ID = "0", ID = "0", CPUS = "0" ]
CORE = [ NODE_ID = "0", ID = "1", CPUS = "1" ]
CORE = [ NODE_ID = "0", ID = "2", CPUS = "2" ]
CORE = [ NODE_ID = "0", ID = "3", CPUS = "3" ]
MEMORY_NODE = [ NODE_ID = "0", TOTAL = "32913636", DISTANCE = "0" ]
VERSION="5.12.0"
Mon Jun 29 19:00:46 2020 [Z0][HMM][W]: message ignored for unknown host 2
Mon Jun 29 19:01:14 2020 [Z0][MDP][D]: [19:1:14] Recieved BEACON_HOST message from host 2:
1593450074
Mon Jun 29 19:01:44 2020 [Z0][MDP][D]: [19:1:44] Recieved BEACON_HOST message from host 2:
1593450104
Mon Jun 29 19:02:14 2020 [Z0][MDP][D]: [19:2:14] Recieved BEACON_HOST message from host 2:
1593450134
Mon Jun 29 19:02:44 2020 [Z0][MDP][D]: [19:2:44] Recieved BEACON_HOST message from host 2:
1593450164
Mon Jun 29 19:02:45 2020 [Z0][MDP][D]: [19:2:45] Recieved MONITOR_HOST message from host 2:
HYPERVISOR=kvm
USEDMEMORY=1175772
FREEMEMORY=31737864
FREECPU=300
USEDCPU=100
NETRX=58531537
NETTX=28217281
HUGEPAGE = [ NODE_ID = "0", SIZE = "2048", FREE = "0" ]
MEMORY_NODE = [ NODE_ID = "0", FREE = "30994604", USED = "1919032" ]
Mon Jun 29 19:02:45 2020 [Z0][HMM][W]: message ignored for unknown host 2
Mon Jun 29 19:03:14 2020 [Z0][MDP][D]: [19:3:14] Recieved BEACON_HOST message from host 2:
1593450194
Mon Jun 29 19:03:44 2020 [Z0][MDP][D]: [19:3:44] Recieved BEACON_HOST message from host 2:
1593450224
Mon Jun 29 19:03:47 2020 [Z0][MDP][D]: [19:3:47] Recieved STATE_VM message from host 2:
SYNC_STATE=yes
MISSING_STATE=POWEROFF
Mon Jun 29 19:03:47 2020 [Z0][HMM][W]: message ignored for unknown host 2

I have also change the HOST_ID to match my host, is there anything else I need to change ?

[root@t110 one]# onehost list
  ID NAME                                                                                                                                                                            CLUSTER    TVM      ALLOCATED_CPU      ALLOCATED_MEM STAT
   2 t110.lan                                                                                                                                                                        default      6            240 / -            17G / - err 

Best regards,
Edouard

Ok, so it seems that probes are working as expected. Could you share your /etc/one/monitord.conf file?

Also, could you try to replace the value of MONITOR_ADDRESS by your frontend IP address at /etc/one/monitord.conf and restart OpenNebula?

I’ve only changed ADDRESS and MONITOR_ADDRESS thinking it would help. I have also tried to add the complete DB definition, doesn’t change anything.
Here is my monitord.conf:

 DB = [
  CONNECTIONS = 15
]
LOG = [
  system      = "FILE",
  debug_level = 5
]
NETWORK = [
    ADDRESS         = "192.168.2.172",
    MONITOR_ADDRESS = "192.168.2.172",
    PORT    = 4124,
    THREADS = 8,
    PUBKEY  = "",
    PRIKEY  = ""
]
PROBES_PERIOD = [
    BEACON_HOST    = 30,
    SYSTEM_HOST    = 600,
    MONITOR_HOST   = 120,
    STATE_VM       = 5,
    MONITOR_VM     = 30,
    SYNC_STATE_VM  = 180
]
IM_MAD = [
    NAME          = "kvm",
    SUNSTONE_NAME = "KVM",
    EXECUTABLE    = "one_im_ssh",
    ARGUMENTS     = "-r 3 -t 15 -w 90 kvm",
    THREADS       = 0
]
IM_MAD = [
    NAME          = "lxd",
    SUNSTONE_NAME = "LXD",
    EXECUTABLE    = "one_im_ssh",
    ARGUMENTS     = "-r 3 -t 15 -w 90 lxd",
    THREADS       = 0
]
IM_MAD = [
    NAME          = "firecracker",
    SUNSTONE_NAME = "Firecracker",
    EXECUTABLE    = "one_im_ssh",
    ARGUMENTS     = "-r 3 -t 15 -w 90 firecracker",
    THREADS       = 0
]
IM_MAD = [
      NAME          = "vcenter",
      SUNSTONE_NAME = "VMWare vCenter",
      EXECUTABLE    = "one_im_sh",
      ARGUMENTS     = "-c -t 15 -r 0 vcenter" ]

Thank you for your help Christian

By the way, if I do your command while my host is in INIT state, it works. It’s then in OK state. I can again then deploy VM etc But if I reboot or restart OpenNebula, the host is back in ERROR state.

Hello @madko,

Just trying to figure out where the problem can be, if you only modified the fields HOST_ID and MONITOR_ADDRESS of the XML I provided, the only difference I can see is ADDRESS attribute which in your conf file is set to 192.168.2.172 and in the XML its value is 0.0.0.0.

Could you try to modify your monitord.conf file again and set ADDRESS = "0.0.0.0" to make sure that the information retrieved from your configuration file is the same than when you run the command manually?

Same with original config file, and/or with MONITOR_ADDRESS set to frontend IP.

DB = [
  CONNECTIONS = 15
]
LOG = [
  system      = "FILE",
  debug_level = 5
]
NETWORK = [
    ADDRESS         = "0.0.0.0",
    MONITOR_ADDRESS = "192.168.2.172",
    PORT    = 4124,
    THREADS = 8,
    PUBKEY  = "",
    PRIKEY  = ""
]
PROBES_PERIOD = [
    BEACON_HOST    = 30,
    SYSTEM_HOST    = 600,
    MONITOR_HOST   = 120,
    STATE_VM       = 5,
    MONITOR_VM     = 30,
    SYNC_STATE_VM  = 180
]
IM_MAD = [
    NAME          = "kvm",
    SUNSTONE_NAME = "KVM",
    EXECUTABLE    = "one_im_ssh",
    ARGUMENTS     = "-r 3 -t 15 -w 90 kvm",
    THREADS       = 0
]
IM_MAD = [
    NAME          = "lxd",
    SUNSTONE_NAME = "LXD",
    EXECUTABLE    = "one_im_ssh",
    ARGUMENTS     = "-r 3 -t 15 -w 90 lxd",
    THREADS       = 0
]
IM_MAD = [
    NAME          = "firecracker",
    SUNSTONE_NAME = "Firecracker",
    EXECUTABLE    = "one_im_ssh",
    ARGUMENTS     = "-r 3 -t 15 -w 90 firecracker",
    THREADS       = 0
]
IM_MAD = [
      NAME          = "vcenter",
      SUNSTONE_NAME = "VMWare vCenter",
      EXECUTABLE    = "one_im_sh",
      ARGUMENTS     = "-c -t 15 -r 0 vcenter" ]

I’ve just realised that your command only works as root, with oneadmin user I’ve got this:

./monitord-client_control.sh: line 51: /tmp/one-monitord-0.error: Permission denied
ERROR MESSAGE --8<------
Error executing monitord-client_control.sh: 
ERROR MESSAGE ------>8--

I’m trying to figure out if it comes from the fact I had previously run this with root

I’d guess that this is leftover from running as root. Could you try deleting the file and retry?

yes that was just a leftover. Command is now ok with oneadmin user, and host goes to INIT to OK when I manually fire this command. But still nothing after reboot, I don’t really understand why