Odd problems with 5.12 upgrade

Hi folks. I just finished upgrading my ONE 5.6 / Ubuntu 18.04 installation to 5.12 / Ubuntu 20.04. After clearing a few bumps in the road, everything is going well, aside from one (possibly two) oddities.

(1) Every couple of minutes, I see the following in oned.log:
[Z0][AuM][D]: Message received: LOG I 0 Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate
[Z0][AuM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate
[Z0][AuM][D]: Message received: LOG D 0 authenticate: Authenticating serveradmin, with password ----- (------)
[Z0][AuM][I]: authenticate: Authenticating serveradmin, with password ------ (------)
[Z0][AuM][D]: Message received: LOG E 0 bad decrypt
[Z0][AuM][I]: bad decrypt
[Z0][AuM][D]: Message received: AUTHENTICATE FAILURE 0 bad decrypt
[Z0][AuM][E]: Auth Error: bad decrypt
[Z0][ReM][D]: Req:9632 UID:-1 IP:127.0.0.1 one.documentpool.info invoked , -2, -1, -1, 100
[Z0][ReM][E]: Req:9632 UID:- one.documentpool.info result FAILURE [one.documentpool.info] User couldn't be authenticated, aborting call.
[Z0][AuM][D]: Message received: LOG I 1 Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate
[Z0][AuM][I]: Command execution failed (exit code: 255): /var/lib/one/remotes/auth/server_cipher/authenticate
[Z0][AuM][D]: Message received: LOG D 1 authenticate: Authenticating serveradmin, with password ------ (------)
[Z0][AuM][I]: authenticate: Authenticating serveradmin, with password ------ (------)
[Z0][AuM][D]: Message received: LOG E 1 bad decrypt
[Z0][AuM][I]: bad decrypt
[Z0][AuM][D]: Message received: AUTHENTICATE FAILURE 1 bad decrypt
[Z0][AuM][E]: Auth Error: bad decrypt
[Z0][ReM][D]: Req:2688 UID:-1 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1, ""
[Z0][ReM][E]: Req:2688 UID:- one.vmpool.infoextended result FAILURE [one.vmpool.infoextended] User couldn't be authenticated, aborting call.
(I’ve redacted the passwords, though I really didn’t need to: they in no way resemble the actual serveradmin password. I’m assuming these are hashes and/or tokens.)
I’ve tried everything I can think of to straighten it out: changing the serveradmin password (with --sha256), patching the database directly (which I botched, locking myself out; good thing I had a backup!), and even going so far as to create a Linux serveradmin user and change the auth method from “server_cipher” to “core”. All of this with multiple service restarts and server reboots mixed in. Nothing gets rid of the auth failures.
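For what it’s worth, the password reset I attempted amounted to roughly the following sketch. The paths assume a stock /var/lib/one layout, the auth file names are from my install (verify yours), and the lines that touch a live oned are shown commented out:

```shell
# Sketch of a serveradmin password reset (hedged: paths and file names
# assume a standard 5.12 layout; oned-touching commands are commented out).
PASS=$(openssl rand -hex 16)        # 32 hex chars, same shape as my old password
AUTH_LINE="serveradmin:$PASS"

# 1. Change the core password (run as oneadmin against a running oned):
# oneuser passwd --sha256 serveradmin "$PASS"

# 2. Keep every service auth file in sync with the same plaintext:
# for f in sunstone_auth oneflow_auth onegate_auth; do
#     echo "$AUTH_LINE" > /var/lib/one/.one/$f
# done

# 3. Restart the services and re-check oned.log:
# systemctl restart opennebula opennebula-sunstone

echo "$AUTH_LINE" | cut -d: -f1     # sanity check: prints "serveradmin"
```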
The only possible clue that I have is that:
(2) syslog is showing the following at frequent intervals:
Condition check resulted in OpenNebula SSH persistent connection cleaner being skipped.
I have no idea what this is or what to do about it, and I can’t make a chicken/egg determination: it may be a cause, an effect, or an entirely unrelated problem.

I’m stumped. Does anyone have any thoughts as to how I can straighten this out?

RH

Edit (1): To clarify: the password shown in the “oneuser show serveradmin” output matches the password shown in the log output. Neither matches the password as set in the various _auth files. I tried plugging the logged password into sunstone_auth and the subsequent login failed; I restored the original string and login succeeded. I assume, because I must, that the log/show string is the SHA-256 hash of the actual password.

Edit (2): The condition-check message is coming from systemd. No sign of a “connection cleaner” job in /etc/systemd or systemctl list-unit-files.

Hello @Rich.H,

Could you follow the steps below and try again:

You’ve probably already executed this (or something similar), but please run it again anyway just to make sure. If the error persists afterwards, we’ll try to debug it and find out whether there is a bug.

Well. I followed the steps in your list. The decrypt errors are gone, and remain absent after a server reboot. (The systemd SSH cleanup messages in syslog are still appearing.)
There are two significant differences between your procedure and the one I followed during installation:

  • The length of the changed serveradmin password was 64 characters, containing numbers and mixed-case letters. The previous password, which is again the current password, is 32 characters with numbers and lower-case letters.
  • The oneuser command I used at upgrade time specified the password directly, instead of fetching it from sunstone_auth: “oneuser passwd --sha256 serveradmin [password]”. I assume the command in the docs is removing “serveradmin:” from the sunstone_auth file contents. Is it modifying the password string in any other way?
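If --sha256 simply stores the SHA-256 hex digest of the plaintext, that would explain both observations: the digest is always 64 hex characters, and it looks nothing like the plaintext. A quick sanity check (the plaintext below is a made-up example, not my real password):

```shell
# Hypothetical check: the 64-character string shown by "oneuser show" should
# just be the SHA-256 hex digest of the plaintext password.
PLAINTEXT='example-password'                  # made-up value for illustration
HASH=$(printf '%s' "$PLAINTEXT" | sha256sum | awk '{print $1}')
echo "${#HASH}"                               # 64, regardless of plaintext length
```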

At any rate, the authentication errors in oned.log are gone, and I suspect that the systemd messages in syslog are benign. (I hope so, because there are a lot of them: 27 in ten minutes.) If I’m wrong about this, please advise. Regardless, your assistance is greatly appreciated. Thanks!

The log messages about the connection cleaner come from opennebula-ssh-socks-cleaner.timer and opennebula-ssh-socks-cleaner.service. If you are using persistent SSH connections (the default for new OpenNebula 5.12 installations), this scheduled service terminates persistent SSH connections a fixed time after they were created, to work around unclean socket cleanup handling in OpenSSH. The condition check in systemd ensures the service runs only when necessary, i.e. when there is something to clean. There is nothing to worry about; leave it as is.
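For context, the “Condition check resulted in … being skipped” line is standard systemd behavior for a unit with a Condition* directive. The fragment below is purely illustrative (not the shipped unit file; the paths are placeholders), just to show the mechanism:

```ini
# Illustrative unit fragment (NOT the actual shipped file). When the
# Condition* directive evaluates false, systemd skips the run and logs
# "Condition check resulted in ... being skipped".
[Unit]
Description=OpenNebula SSH persistent connection cleaner
ConditionPathExistsGlob=/path/to/ssh-sockets/*

[Service]
Type=oneshot
ExecStart=/path/to/cleaner-script
```

You can inspect the real definitions with `systemctl cat opennebula-ssh-socks-cleaner.timer opennebula-ssh-socks-cleaner.service`.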

The connection cleaner works around the problem described in the Important box at http://docs.opennebula.io/5.12/deployment/node_installation/ssh.html#persistent-connections

sigh. So as of this morning a one.vmpool.infoextended call is failing with “login token expired.” Oddly, one.documentpool.info, which was failing earlier with the bad-decrypt error, is now running to completion.

What I find really odd is that the failing call seems to be invoked with UID -1:

Fri Aug 28 11:39:35 2020 [Z0][AuM][E]: Auth Error: login token expired
Fri Aug 28 11:39:35 2020 [Z0][ReM][D]: Req:1568 UID:-1 IP:127.0.0.1 one.vmpool.infoextended invoked , -2, -1, -1, -1, ""
Fri Aug 28 11:39:35 2020 [Z0][ReM][E]: Req:1568 UID:- one.vmpool.infoextended result FAILURE [one.vmpool.infoextended] User couldn't be authenticated, aborting call.

It may be a typo in the log message, but the many successful one.vmpool.infoextended calls made as oneadmin log as UID:0. The documentpool call that is now working logs UID:1. Curiouser and curiouser. Any thoughts?

Got it. Thanks.