FAQ: VM in Error (STATE_UNSPECIFIED) state

Introduction

This article addresses common questions about the ERROR VM state in the Crusoe platform, which is reported as STATE_UNSPECIFIED in error messages. ERROR is a VM lifecycle state that appears when the Crusoe control plane cannot confirm a VM's actual running status. Because this is always a Crusoe-side infrastructure issue, it cannot be resolved through the console or CLI. This guide explains what the state means, why it occurs, and the correct steps to take when you encounter it.

What does ERROR / STATE_UNSPECIFIED mean?

Every Crusoe VM has a lifecycle state, such as RUNNING or SHUTOFF. The ERROR state appears when the control plane loses coherent knowledge of a VM's true state, typically because the underlying hypervisor became unreachable or an internal operation did not complete cleanly.

The most common way customers encounter this is when trying to start a VM and receiving the following error:

Could not start VM. Bad request, check request parameters: failed to start vm: unable to start a VM that is not currently shut off, VM current state: STATE_UNSPECIFIED.

You may also see UNSPECIFIED listed directly in the Crusoe console or in the output of crusoe compute vms list.
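
If you wrap VM start requests in your own automation, the following minimal sketch shows one way to detect this condition from the error text so the script can stop retrying and prompt for a support ticket instead. It assumes the message ends with "VM current state: <STATE>" as in the example above; the function names extract_vm_state and needs_support_ticket are hypothetical and not part of any Crusoe tool or SDK.

```python
import re
from typing import Optional

# Assumption: the failure text contains "VM current state: <STATE>",
# matching the example error message shown in this article.
_STATE_PATTERN = re.compile(r"VM current state:\s*([A-Z_]+)")


def extract_vm_state(error_message: str) -> Optional[str]:
    """Return the state named in a start-failure message, or None if absent."""
    match = _STATE_PATTERN.search(error_message)
    return match.group(1) if match else None


def needs_support_ticket(error_message: str) -> bool:
    """Hypothetical helper: True when the reported state is STATE_UNSPECIFIED."""
    return extract_vm_state(error_message) == "STATE_UNSPECIFIED"


message = (
    "Could not start VM. Bad request, check request parameters: "
    "failed to start vm: unable to start a VM that is not currently shut off, "
    "VM current state: STATE_UNSPECIFIED."
)

if needs_support_ticket(message):
    print("VM is in STATE_UNSPECIFIED; stop retrying and contact Crusoe support.")
```

Matching on message text is fragile if the wording changes, so treat this as a convenience check rather than a reliable contract.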

What causes a VM to enter STATE_UNSPECIFIED?

There are three common causes:

Hardware failure on the underlying hypervisor. The physical server hosting your VM develops a fault (for example, a failed CPU component or GPU error) that causes the hypervisor to stop responding. The Crusoe control plane loses contact with the node and marks all VMs on it as UNSPECIFIED because their actual running status can no longer be confirmed. This is the most common cause.
Network or storage outage affecting the hypervisor. A network infrastructure event such as a switch losing upstream connectivity or a storage fabric disruption causes one or more hypervisors to become temporarily unreachable. VMs on those hypervisors enter UNSPECIFIED state. Once connectivity is restored, most VMs recover automatically. VMs that were mid-operation (start/stop/reset in progress) at the time of the outage may remain stuck and require manual recovery.
Control-plane error during a VM lifecycle operation. In some cases, a stop, start, or reset operation stalls internally and does not complete. The VM ends up in UNSPECIFIED because the cleanup step did not run. This typically affects a single VM and leaves a hung operation visible in list-ops output.

Can I fix this myself?

No. ERROR / STATE_UNSPECIFIED requires intervention from Crusoe support or engineering. Standard lifecycle operations (start, stop, and delete) will fail while the VM is in this state. Attempting to delete a stuck VM will return UNKNOWN_ERROR until the underlying hypervisor is reachable again.

⚠️ Warning: Do not repeatedly attempt to delete and recreate the VM. This can complicate the recovery process.

Is my data safe?

Yes. Persistent disks (boot disks and shared volumes) are stored independently of the hypervisor and are not affected by STATE_UNSPECIFIED. Your disk contents are intact.

⚠️ Warning: Data written exclusively to local NVMe ephemeral storage may not survive if the VM has to be migrated to a new hypervisor without a clean shutdown.

What will Crusoe support do to resolve this?

The recovery path depends on the cause:

Hardware failure: Engineers migrate your VM to a healthy spare node in the same network. The faulty hypervisor is placed into maintenance mode for hardware repair. Typical resolution time is 2–6 hours depending on spare availability.
Network or storage outage: Engineers restore connectivity to the affected hypervisors. VMs that self-recover are verified. VMs that remain stuck after the outage is mitigated are manually recovered.
Control-plane error: Engineers correct the VM's state record. The VM is then started on a healthy node.

What should I include when contacting support?

Please provide the following when opening a ticket:

VM name(s) and VM ID(s)
Region (for example, us-east1-a or eu-iceland1-a)
When the state change was first noticed
Whether a preceding operation (stop, start, reset) was in progress when the state changed

Mark your ticket priority as High for a single affected VM, or Urgent if multiple VMs are down or production workloads are blocked.
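
As an illustration of the checklist and priority rule above, here is a minimal sketch that collects the requested details and renders them as plain text for pasting into a ticket. The class names, field names, and sample values are hypothetical placeholders; this is not a Crusoe API and does not submit anything.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AffectedVM:
    name: str
    vm_id: str


@dataclass
class TicketDetails:
    vms: List[AffectedVM]
    region: str
    first_noticed: str                         # when the state change was first seen
    preceding_operation: Optional[str] = None  # "stop", "start", "reset", or None
    production_blocked: bool = False

    def priority(self) -> str:
        # Rule from this article: Urgent if multiple VMs are down or
        # production workloads are blocked, otherwise High.
        if len(self.vms) > 1 or self.production_blocked:
            return "Urgent"
        return "High"

    def render(self) -> str:
        lines = [
            f"Priority: {self.priority()}",
            f"Region: {self.region}",
            f"First noticed: {self.first_noticed}",
            f"Preceding operation: {self.preceding_operation or 'none'}",
            "Affected VMs:",
        ]
        lines += [f"  - {vm.name} ({vm.vm_id})" for vm in self.vms]
        return "\n".join(lines)


# Placeholder values for illustration only.
ticket = TicketDetails(
    vms=[AffectedVM(name="example-vm", vm_id="<vm-id>")],
    region="us-east1-a",
    first_noticed="<timestamp>",
    preceding_operation="start",
)
print(ticket.render())
```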

Several nodes in my CMK cluster went into STATE_UNSPECIFIED at once. Is this the same issue?

Yes. CMK node VMs follow the same lifecycle states as standalone VMs. Multiple nodes entering UNSPECIFIED simultaneously is a strong indicator of a network or infrastructure event. Open a support ticket and include your cluster ID and node pool ID in addition to the VM names. CMK will attempt to self-heal by replacing affected nodes once the underlying hypervisors recover.
