New VMs stuck in LCM_INIT state

Our long-running OpenNebula environment has recently started to have this issue:

As the title states, deployments get stuck in the LCM_INIT state, right before the scheduler decision is communicated to the selected host.

Logs are as expected for this phase of the deployment process:
[screenshot of the logs]

Our setup uses 3 FrontEnd (virtualized) hosts running in HA mode. All of them have connectivity to all hosts, and have enough disk space and computing resources. The VIP is also on the correct (active) FE.

Also, the FEs can SSH into the hosts without a password using the oneadmin user.
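
For completeness, this is roughly how we verify the HA leader and the passwordless SSH (the node name is just an example):

# Shows the zone servers and which FE is the current leader
onezone show 0
# Check passwordless SSH from the FE to a node as oneadmin (example node name)
su - oneadmin -c 'ssh node01 hostname'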

Our “temporary solution” to this issue has consisted of restarting OpenNebula on the active FE and retrying everything again. This allows us to use OpenNebula as normal for a while, but it is only a temporary fix, as not long after, it happens again.
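
Concretely, the restart on the active FE is just the standard service restart (assuming the default systemd unit name):

# On the active (leader) FE
systemctl restart opennebula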


**Versions of the related components and OS (frontend, hypervisors, VMs):** Frontends on v6.10.2, hypervisors on v6.10.3

**Steps to reproduce:** Nothing unusual, just instantiating a normal VM from any template

**Current results:** As shown in the screenshot, VMs get stuck in the LCM_INIT state, and it's impossible to delete them or restart the process

**Expected results:** VMs deploying correctly

Hello @Alvaro,

Welcome to this forum, hope we can help you out.

The first quick solution is to update your OpenNebula to the latest 6.10.x release and check if the issue persists.

In the meantime, and since you already have the logs, have you tried forcing a synchronization of the hosts? I will check with the team to see if we can give you some ideas about a potential solution.
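
In case it helps, a forced host synchronization looks roughly like this, run as oneadmin on the leader FE:

# Push the remote probes/scripts to all hosts again
onehost sync --force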

Cheers,

Hello @Alvaro,
thank you for your post with detailed report.

LCM_INIT is a very unusual state, I’ve never seen it before; it actually looks like a possible deadlock in OpenNebula. A few questions:

  1. How often does it happen? Once per week? Per month?
  2. When it happens, is it possible to execute actions on other VMs that are already in the RUNNING state, e.g. attach a disk? Does it work, or does the VM stay in the HOTPLUG state?
  3. In case it happens, can you please send us a core dump? Please send a download link to pczerny@opennebula.io, with a reference to this post.

Here are steps to create a core dump:

# As root user:
# Get process ID of oned
pidof oned
# Use the <pid> to get the core dump, it will be stored in working directory as core.<pid>
pgrep <pid>

Good morning @pczerny, and thank you for your answer.
Well, this time it happened after a weekend, so I can’t tell exactly when it started happening again, but the last “fix” was applied on Thursday for urgent reasons. So it held for 2 days minimum?

The HOTPLUG experiment was indeed very interesting.
I tried hotplugging a NIC to an existing VM and these are the results:

  • The VM has its NIC
  • Logs look fine
Mon Jun 9 09:29:36 2025 [Z0][VMM][I]: Successfully execute virtualization driver operation: reconfigure.
Mon Jun 9 09:29:36 2025 [Z0][VMM][I]: VM NIC Successfully attached.
  • But the LCM state is still HOTPLUG_NIC (see the check sketched below)
  • The Network tab shows the NICs as “attach in progress”
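
The stuck state can also be double-checked from the CLI with something like this (the VM ID is just an example):

# LCM_STATE should return to RUNNING once the attach finishes
onevm show 42 | grep LCM_STATE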

I’ll proceed to send you a core dump.
Best regards!

In order to ensure consistency in the DB between the FEs again, I’ve decided to upgrade the FEs from 6.10.2 to the new version 6.10.4.
The old broken deployments are still unmanageable, but after a purge I’m able to do normal operations from Sunstone, so it looks repaired to me. Now let’s wait a few days to see whether the update is a stable solution, or it breaks again by Friday :smiley:
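
For reference, if DB consistency is still in doubt, a check can be run with something like this (a rough sketch; onedb may need explicit connection options, and HA setups need a maintenance window since stopping the leader triggers a failover):

# Run on one FE with OpenNebula stopped
systemctl stop opennebula
onedb fsck
systemctl start opennebula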
Thank you so much for the help anyways!

Well, unfortunately I wasn’t able to get the core dump.
The instructions weren’t clear, as pgrep is not enough to produce the core dump.
I tried doing it with gdb (the gcore command), but apt asked me to restart some services due to unattended upgrades, and now the leader FrontEnd has changed.

Nothing appears to be fixed, though: the VM is still in HOTPLUG and the previous 2 are still in LCM_INIT. Performing the recover → retry action is also useless.
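
For clarity, the CLI equivalent of what I’m trying from Sunstone would be roughly this (the VM ID is an example):

# Retry the pending action on a stuck VM
onevm recover --retry 123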

Any other recommended way to get the core dump? Could the problem be in the database?

Regards!

Sorry for the late reply, I missed the notification for the new post.

The correct command for generating the coredump is gcore <pid> (included in the gdb package).
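
Putting the whole procedure in one place (the gdb package must be installed for gcore):

# As root on the leader FE
# Get the process ID of oned
pidof oned
# Dump the process; the file is written as core.<pid> in the current directory
gcore <pid>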

I’m afraid we don’t have a command to recover from the LCM_INIT state. The only option is to delete the VMs using onevm recover --delete and create them again.

For the VM in HOTPLUG, onevm recover --failure should return the VM to its previous state.
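
For example (the VM IDs are placeholders):

# Delete a VM stuck in LCM_INIT so it can be re-created from its template
onevm recover --delete 123
# Return a VM stuck in HOTPLUG to its previous state
onevm recover --failure 124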

Hi @pczerny thank you for your answer.
None of those commands worked, but I guess the problem was in the communication between the FrontEnd and the nodes due to a version mismatch.
Monitoring worked, but operations with libvirt didn’t.
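
For anyone checking for a similar mismatch (assuming a Debian/Ubuntu setup, since apt is involved here), the installed package versions on the FE and on each node can be compared with something like:

# Run on the FrontEnd and on every hypervisor node
dpkg -l | grep opennebula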
I tried upgrading everything to the new version 6.10.4 and the environment seems stable again.
So I guess we can close this issue :smiley:
Thank you all again for your support!
Best Regards!