CMK Worker Node Boots Into Emergency Mode After Stop/Start Due to NVMe UUID Mismatch

Overview

When a CMK worker node with Ephemeral Storage enabled is stopped and restarted, the ephemeral NVMe disk is wiped and a new UUID is assigned to the block device. If the VM's /etc/fstab still references the old UUID, the OS will fail to mount the disk during boot and halt in Emergency Mode. The node will appear as RUNNING in the Crusoe Console but remain NotReady in Kubernetes, with SSH access unavailable.

The correct resolution is to delete and recreate the affected VM rather than attempting to manually repair the filesystem. Because Ephemeral Storage keeps both container images and the container filesystem (containerd) on the NVMe device, all container state is lost as soon as the disk is wiped. Manual filesystem repair does not recover this data, and it leaves the node in a non-standard configuration that is prone to future instability.

Prerequisites

Access to the Crusoe Console or CLI
kubectl access to the CMK cluster
Serial Console access via the Crusoe CLI (for diagnosis)

Steps

Confirm the Node Is in Emergency Mode
- If a worker node shows as RUNNING in the Crusoe Console but is NotReady in Kubernetes and SSH is unresponsive, use the Crusoe CLI Serial Console to inspect the boot state:

     crusoe compute vms serial-console --name <vm-name>

Emergency Mode output will typically include a message referencing a failed /etc/fstab mount and a prompt to enter the root password for maintenance.
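
To find candidate nodes before opening the Serial Console, the NotReady check can be scripted. The following is a minimal sketch, not part of the official procedure. The function name not_ready_nodes is hypothetical, and it assumes the default column layout of kubectl get nodes, where the second column is STATUS.

```shell
# Print the names of nodes whose STATUS is not "Ready".
# Reads the output of `kubectl get nodes --no-headers` on stdin.
# "NotReady" and "NotReady,SchedulingDisabled" are reported;
# "Ready,SchedulingDisabled" (cordoned but healthy) is not.
not_ready_nodes() {
  awk '$2 != "Ready" && $2 !~ /^Ready,/ { print $1 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Any node this prints that also shows RUNNING in the Crusoe Console is worth inspecting through the Serial Console as described above.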

Understand the Data Impact Before Proceeding
- With Ephemeral Storage enabled, the NVMe device stores container images, the containerd filesystem, and any manual host-level configurations.
- All of this data was wiped when the VM was stopped. Proceeding with VM deletion does not result in additional data loss beyond what has already occurred.

Delete the Affected VM
- Delete the affected nodepool VM via the Crusoe Console or API. The nodepool will automatically provision a fresh replacement.
- Do not attempt to manually repair /etc/fstab or recreate the filesystem on the wiped NVMe device. This creates a non-standard node configuration and does not restore lost container state.

Verify the Replacement Node Joins the Cluster
- Confirm the new VM comes up RUNNING in the Crusoe Console and transitions to Ready in Kubernetes:

     kubectl get nodes

Then, from a shell on the new node (for example, over SSH), verify that the NVMe mount is healthy:

     mount | grep nvme
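
Because the original failure was an /etc/fstab entry pointing at a UUID that no longer existed, it can also be useful to confirm that every UUID in /etc/fstab matches a filesystem that is actually present. This is a diagnostic sketch only, not a repair procedure. The function names fstab_uuids and missing_uuids are hypothetical, and the sketch assumes fstab entries use the unquoted UUID=<value> form.

```shell
# List the UUIDs referenced by non-comment entries of an fstab-style file.
fstab_uuids() {
  awk '!/^[[:space:]]*#/ && $1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "${1:-/etc/fstab}"
}

# Print each fstab UUID that is absent from the list of present UUIDs
# supplied on stdin (one per line). No output means every entry resolves.
missing_uuids() {
  present=$(cat)
  fstab_uuids "${1:-/etc/fstab}" | while read -r uuid; do
    printf '%s\n' "$present" | grep -qxF -- "$uuid" || printf '%s\n' "$uuid"
  done
}

# Usage on the node (blkid typically needs root):
#   sudo blkid -s UUID -o value | missing_uuids /etc/fstab
```

Empty output suggests every UUID-based entry in /etc/fstab refers to a filesystem that currently exists. Any UUID printed corresponds to an entry that could fail to mount on the next boot.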

Reconfigure Any Node-Level Dependencies
- Any software or configuration that was applied directly to the previous node (e.g., Run:ai node configuration, manually installed drivers) will need to be reapplied to the new node, as this state does not persist across VM recreation.

How to Avoid This Issue

The safest way to reboot an unresponsive CMK worker node without wiping the ephemeral NVMe disk is to use RESET instead of STOP. A reset performs a hard reboot at the hypervisor level without deallocating the instance, preserving the ephemeral disk and its UUID.

Note: Currently, VM reset is only available via the Crusoe CLI:

     crusoe compute vms reset --name <vm-name>

A Product Feature Request (PFR) has been filed to expose this option in the Crusoe Cloud Console UI.
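
The reset can be paired with a wait for the node to report Ready again. The following is a minimal sketch under stated assumptions: the function name reset_and_wait is hypothetical, the VM name and the Kubernetes node name are passed separately because they may differ, and the 10-minute default timeout is an arbitrary choice. It combines the reset command from this article with the standard kubectl wait subcommand.

```shell
# Hard-reset a VM, then block until its Kubernetes node reports Ready.
# $1: Crusoe VM name   $2: Kubernetes node name   $3: timeout (default 10m)
reset_and_wait() {
  vm_name=$1
  node_name=$2
  timeout=${3:-10m}
  # Stop here if the reset itself fails.
  crusoe compute vms reset --name "$vm_name" || return 1
  kubectl wait --for=condition=Ready "node/$node_name" --timeout="$timeout"
}

# Usage:
#   reset_and_wait <vm-name> <node-name>
```

A node can keep reporting its last known status for a short time after a reset, so an immediate success from kubectl wait is worth rechecking with kubectl get nodes a minute or two later.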

Resolution

The following describes how this issue was resolved in a confirmed case:

A CMK worker node became unresponsive. The customer issued a STOP command to attempt recovery, wiping the ephemeral NVMe disk and assigning it a new UUID.
When the VM was restarted, the OS failed to mount the disk using the old UUID in /etc/fstab and halted in Emergency Mode. The node appeared RUNNING in the Crusoe Console but was NotReady in Kubernetes with no SSH access.
Crusoe Support used the Serial Console to confirm Emergency Mode and identify the fstab UUID mismatch as the cause.
After confirming that all container state had already been lost when the disk was wiped, the affected VM was deleted and the nodepool provisioned a healthy replacement.
The new node came up Ready in Kubernetes with the NVMe mount intact.
