OneD not responding after failed VM Creation attempt

Version: OpenNebula 6.10.3 (64156dd6) Enterprise Edition
Setup: 1 Zone with 3 Frontends

Problem description:
After an unsuccessful VM creation attempt (via the Sunstone GUI), requests to OneD (via the GUI or via CLI commands such as "onevm") are no longer processed by the oned daemon, and all services are affected.

The oned service log (/var/log/one/oned.log) shows the following error every second:

Mon Jul  7 12:29:08 2025 [Z0][ONE][E]: SQL command was: INSERT INTO vm_pool (
  oid, name, body, uid, gid, state, lcm_state, owner_u, group_u, other_u, short_body, body_json
) VALUES (
  1339,
  'vm-machine-name',
  '<VM>...</VM>',   -- (very long XML, see below)
  5, 0, 1, 0, 1, 0, 0,
  '<VM>...</VM>',   -- (shorter XML, see below)
  '{...}'           -- (JSON, see below)
)

The JSON body contains the template used to create the VM with its specific values (private data redacted):

{
  "VM": {
    "ID": "1339",
    "UID": "...",
    "GID": "...",
    "UNAME": "...",
    "GNAME": "...",
    "NAME": "...",
    "LAST_POLL": "...",
    "STATE": "...",
    "LCM_STATE": "...",
    "PREV_STATE": "...",
    "PREV_LCM_STATE": "...",
    "RESCHED": "...",
    "STIME": "...",
    "ETIME": "...",
    "DEPLOY_ID": "...",
    "TEMPLATE": {
      "AUTOMATIC_DS_REQUIREMENTS": "...",
      "AUTOMATIC_NIC_REQUIREMENTS": "...",
      "AUTOMATIC_REQUIREMENTS": "...",
      "CONTEXT": {
        "DISK_ID": "...",
        "ETH0_DNS": "...",
        "ETH0_EXTERNAL": "...",
        "ETH0_GATEWAY": "...",
        "ETH0_IP": "...",
        "ETH0_IP6": "...",
        "ETH0_IP6_GATEWAY": "...",
        "ETH0_IP6_METHOD": "...",
        "ETH0_IP6_METRIC": "...",
        "ETH0_IP6_PREFIX_LENGTH": "...",
        "ETH0_IP6_ULA": "...",
        "ETH0_MAC": "...",
        "ETH0_MASK": "...",
        "ETH0_METHOD": "...",
        "ETH0_METRIC": "...",
        "ETH0_MTU": "...",
        "ETH0_NETWORK": "...",
        "ETH0_SEARCH_DOMAIN": "...",
        "ETH0_VLAN_ID": "...",
        "ETH0_VROUTER_IP": "...",
        "ETH0_VROUTER_IP6": "...",
        "ETH0_VROUTER_MANAGEMENT": "...",
        "NETWORK": "...",
        "PASSWORD": "...",
        "PCI0_ADDRESS": "...",
        "PCI0_IP": "...",
        "PCI0_MAC": "...",
        "SET_HOSTNAME": "...",
        "SSH_PUBLIC_KEY": "...",
        "START_SCRIPT_BASE64": "...",
        "TARGET": "..."
      },
      "CPU": "...",
      "CPU_MODEL": {
        "MODEL": "..."
      },
      "DISK": [
        {
          "ALLOW_ORPHANS": "...",
          "CEPH_HOST": "...",
          "CEPH_SECRET": "...",
          "CEPH_USER": "...",
          "CLONE": "...",
          "CLONE_TARGET": "...",
          "CLUSTER_ID": "...",
          "DATASTORE": "...",
          "DATASTORE_ID": "...",
          "DEV_PREFIX": "...",
          "DISK_ID": "...",
          "DISK_SNAPSHOT_TOTAL_SIZE": "...",
          "DISK_TYPE": "...",
          "DRIVER": "...",
          "FORMAT": "...",
          "IMAGE": "...",
          "IMAGE_ID": "...",
          "IMAGE_STATE": "...",
          "IMAGE_UNAME": "...",
          "LN_TARGET": "...",
          "ORIGINAL_SIZE": "...",
          "POOL_NAME": "...",
          "READONLY": "...",
          "SAVE": "...",
          "SIZE": "...",
          "SOURCE": "...",
          "TARGET": "...",
          "TM_MAD": "...",
          "TYPE": "..."
        }
      ],
      "GRAPHICS": {
        "LISTEN": "...",
        "TYPE": "..."
      },
      "MEMORY": "...",
      "MEMORY_MAX": "...",
      "MEMORY_RESIZE_MODE": "...",
      "NIC": [
        {
          "AR_ID": "...",
          "BRIDGE": "...",
          "BRIDGE_TYPE": "...",
          "CLUSTER_ID": "...",
          "DNS": "...",
          "GATEWAY": "...",
          "IP": "...",
          "MAC": "...",
          "NAME": "...",
          "NETWORK": "...",
          "NETWORK_ID": "...",
          "NETWORK_UNAME": "...",
          "NIC_ID": "...",
          "SECURITY_GROUPS": "...",
          "TARGET": "...",
          "VN_MAD": "..."
        }
      ],
      "OS": {
        "UUID": "..."
      },
      "PCI": {
        "AR_ID": "...",
        "BRIDGE": "...",
        "BRIDGE_TYPE": "...",
        "CLASS": "...",
        "CLUSTER_ID": "...",
        "DEVICE": "...",
        "IP": "...",
        "MAC": "...",
        "NAME": "...",
        "NETWORK": "...",
        "NETWORK_ID": "...",
        "NIC_ID": "...",
        "PCI_ID": "...",
        "SECURITY_GROUPS": "...",
        "TARGET": "...",
        "TYPE": "...",
        "VENDOR": "...",
        "VM_ADDRESS": "...",
        "VM_BUS": "...",
        "VM_DOMAIN": "...",
        "VM_FUNCTION": "...",
        "VM_SLOT": "...",
        "VN_MAD": "..."
      },
      "SECURITY_GROUP_RULE": [
        {
          "PROTOCOL": "...",
          "RULE_TYPE": "...",
          "SECURITY_GROUP_ID": "...",
          "SECURITY_GROUP_NAME": "..."
        }
        // ... (possibly more rules)
      ],
      "TEMPLATE_ID": "...",
      "VCPU": "...",
      "VCPU_MAX": "...",
      "VMID": "..."
    },
    "USER_TEMPLATE": {
      "HOT_RESIZE": {
        "CPU_HOT_ADD_ENABLED": "...",
        "MEMORY_HOT_ADD_ENABLED": "..."
      },
      "HYPERVISOR": "...",
      "LOGO": "...",
      "MEMORY_UNIT_COST": "...",
      "SCHED_REQUIREMENTS": "..."
    }
  }
}

Steps tried to solve:

  • Restart the OpenNebula services on the 3 zone hosts (systemctl restart opennebula)
  • Stop the services and repair DB consistency (onedb fsck), which produced the following output:
Removing possibly corrupted records from VM showback please run 'oneshowback calculate` to recalculate the showback
VM 1339 is in Image XXX VM id list, but it should not
VNet XX AR 0 has leased ... to VM 1339, but it is actually free
VNet XX has 20 used leases, but it is actually 19
VNet XXX AR 0 has leased ... to VM 1339, but it is actually free
....

Several actions were then performed to clean up the leftover configuration of the failed VM, but the failing operation keeps hitting the service, preventing any other action (stopping a VM, creating a VNet…) and making the service unusable.

  • Additionally, the scheduler logs (/var/log/one/sched.log) on the different hosts only show the following entry:
Mon Jul  7 11:37:50 2025 [Z0][SCHED][E]: oned is not leader

We haven’t found any workaround to clear this operation and unblock requests to the daemon.

Hello @e-discoveryinfra-tid,

If you are using the Enterprise Edition, you have dedicated support; please check your portal: https://support.opennebula.pro/hc/en-us

If you are using an Elemental License, please let us know.

Regards,

Hello,

There may be some illegal characters in your database. Can you please check for error 4025 in the OpenNebula logs?

A grep 4025 /var/log/one/oned.log should tell whether this is related. In that case, it will also give you the SQL command that triggered the problem.
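For illustration, here is what that check surfaces against a fabricated two-line sample (real logs will of course carry the full INSERT statement and your own timestamps):

```shell
# Fabricated oned.log sample, modeled on the error format reported in this
# thread; only the line carrying "error 4025" should match the grep.
cat > /tmp/oned.sample.log <<'EOF'
Mon Jul  7 12:29:07 2025 [Z0][ONE][I]: (fabricated info line, no error here)
Mon Jul  7 12:29:08 2025 [Z0][ONE][E]: SQL command was: INSERT INTO vm_pool (...) VALUES (1339, ...), error 4025 : CONSTRAINT 'vm_pool.body_json' failed for 'opennebula'.'vm_pool'
EOF

# Same invocation as suggested above, pointed at the sample file:
grep 4025 /tmp/oned.sample.log
```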

Hi Bruno! Thank you for answering.
You’re right, oned.log shows the command that failed (a VM creation from a template with an invalid name):

[Z0][ONE][E]: SQL command was: INSERT INTO vm_pool (oid, name, body, uid, gid, state, lcm_state, owner_u, group_u, other_u, short_body, body_json) VALUES (1339,' pqcpoc-locust-backend'... .... error 4025 : CONSTRAINT 'vm_pool.body_json' failed for 'opennebula'.'vm_pool'

The problem, in this case, is that after stopping/restarting the oned service and cleaning up the DB, the error seems to be stuck on the control plane and prevents us from performing any action on OpenNebula.

Thanks!

This looks related to a bug fixed in 6.10.4, triggered when an entity (e.g., a VM) is created from the API forcing its name to contain special characters (\t or similar).
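To see why the insert is rejected: a raw tab inside a JSON string is not valid JSON, so the validity check behind the vm_pool.body_json constraint fails. As a quick stand-in illustration using Python's equally strict JSON parser (assuming python3 is available; the VM name below is made up):

```shell
# A raw TAB smuggled into a JSON string value breaks JSON validity,
# which is what the body_json constraint rejects at INSERT time.
printf '{"VM":{"NAME":"\tsome-vm-name"}}' | python3 -m json.tool \
  || echo "invalid JSON"
```

The same body with the tab stripped parses fine, which is exactly what the trigger-based workaround further down this thread does before the row is inserted.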

In this case, the best way to proceed is to back up the database and fix it before restarting OpenNebula, and then upgrade to 6.10.4.

Hi again, Bruno,
we’ve tried the following queries on the “vm_pool” table:

update vm_pool v set v.name = REPLACE(v.name, '\t', '') where v.name LIKE '%\t%';
update vm_pool v set v.body = REPLACE(v.body, '\t', '') where v.body LIKE '%\t%';
update vm_pool v set v.short_body = REPLACE(v.short_body, '\t', '') where v.short_body LIKE '%\t%';

The queries performed no changes, however, as the problematic record is never actually created in the DB.
The error is still present in the daemon log (oned.log), on both leader and follower hosts, even after restarting all the services.
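For reference, two read-only checks on the MySQL/MariaDB backend confirm both points (the oid below is our failing VM; adapt as needed):

```sql
-- Confirm the failing row (oid 1339 in our case) was never committed:
SELECT oid, name FROM vm_pool WHERE oid = 1339;

-- Look for any other rows carrying control characters (TAB, newline, ...) in the name:
SELECT oid, name FROM vm_pool WHERE name REGEXP '[[:cntrl:]]';
```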

Is there any cache or queue where the API stores pending operations, or anything else we can clean up on the different frontends?

Thank you for the support!

We finally found a way to clear the DB error from the OneD services.

The workaround, and the procedure to clear the failing INSERT operation, was the following:

  • Stop oned on 3 frontends (systemctl stop opennebula)
  • On one of the frontends, create a trigger in the “opennebula” DB (MySQL backend in our case): mysql -h [ONED_DB_SERVER] -u [ONED_DB_USER] opennebula
DELIMITER $$
CREATE TRIGGER vm_pool_replace_fields
BEFORE INSERT ON vm_pool
FOR EACH ROW
BEGIN
    IF NEW.oid = 1339 THEN
        SET NEW.name = REPLACE(NEW.name, '\t', '');
        SET NEW.body = REPLACE(NEW.body, '\t', '');
        SET NEW.short_body = REPLACE(NEW.short_body, '\t', '');
        SET NEW.body_json = REPLACE(NEW.body_json, '\t', '');
    END IF;
END$$
DELIMITER ;

This trigger is conditioned on oid = 1339 (our specific case), but the condition can be removed to perform the replacement for every INSERT on the “vm_pool” table.
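A note we’d add for anyone reusing this: once the stuck INSERT has been replayed, drop the trigger so it does not keep silently rewriting future rows:

```sql
-- Remove the one-off trigger after the stuck INSERT has gone through:
DROP TRIGGER IF EXISTS vm_pool_replace_fields;
```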

  • Restart oned on that frontend (systemctl restart opennebula)

    • The malformed query finally succeeds on the DB, so the SQL 4025 errors stop appearing in the “oned.log” file
  • To replicate the DB to the rest of the frontends:

    • onedb backup /path/to/bkp.sql
    • On the rest of the nodes: onedb restore /path/to/bkp.sql -f
  • [Optional] In our case, we’ve also tuned some XML-RPC parameters, as the replication initially failed due to XML-RPC calls between leader and followers (scheduler) timing out

    • On /etc/one/oned.conf;
      • adjusted XML-RPC parameters (TIMEOUT, KEEPALIVE_TIMEOUT…)
      • adjusted RAFT parameters: XMLRPC_TIMEOUT_MS
  • Start oned (systemctl restart opennebula) on the remaining frontend nodes and wait for the replication to complete

    • Check zone status: onezone show [ID]
    • Review OpenNebula scheduler logs: /var/log/one/sched.log

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.