Overview
When a CMK worker node with Ephemeral Storage enabled is stopped and restarted, the ephemeral NVMe disk is wiped and a new UUID is assigned to the block device. If the VM's /etc/fstab still references the old UUID, the OS will fail to mount the disk during boot and halt in Emergency Mode. The node will appear as RUNNING in the Crusoe Console but remain NotReady in Kubernetes, with SSH access unavailable.
The correct resolution is to delete and recreate the affected VM rather than attempting to manually repair the filesystem. Because Ephemeral Storage stores both container images and the container filesystem (containerd) on the NVMe device, all container state is lost when the disk is wiped regardless. Manual filesystem repair does not recover this data and leaves the node in a non-standard configuration prone to future instability.
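To illustrate the mismatch described above, the following is a minimal sketch that lists UUIDs referenced in an fstab file that no current block device carries. The function names are hypothetical, and it assumes a root shell on the node (for example, the Emergency Mode maintenance prompt) and the standard Linux /dev/disk/by-uuid directory. It is for diagnosis only and does not repair anything.

```shell
#!/usr/bin/env bash
# Sketch: find fstab UUID entries that no longer match any block device.
# Both paths are parameters so the logic can be tried against copies.

# Print every UUID referenced by a "UUID=..." entry in the given fstab file.
fstab_uuids() {
  local fstab="${1:-/etc/fstab}"
  # Skip comment lines; the first field of each entry is the device spec.
  awk '!/^[[:space:]]*#/ && $1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$fstab"
}

# Print the fstab UUIDs that have no matching entry in the by-uuid directory.
stale_fstab_uuids() {
  local fstab="${1:-/etc/fstab}"
  local by_uuid="${2:-/dev/disk/by-uuid}"
  local uuid
  for uuid in $(fstab_uuids "$fstab"); do
    [ -e "$by_uuid/$uuid" ] || echo "$uuid"
  done
}
```

Running stale_fstab_uuids with no arguments checks /etc/fstab against the live device list. Any UUID it prints is one the OS cannot resolve at boot, which is consistent with the Emergency Mode halt described in this article.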
Prerequisites
- Access to the Crusoe Console or CLI
- kubectl access to the CMK cluster
- Serial Console access via the Crusoe CLI (for diagnosis)
Steps
- Confirm the Node Is in Emergency Mode
- If a worker node shows as RUNNING in the Crusoe Console but is NotReady in Kubernetes and SSH is unresponsive, use the Crusoe CLI Serial Console to inspect the boot state:
crusoe compute vms serial-console --name <vm-name>
- Emergency Mode output will typically include a message referencing a failed /etc/fstab mount and a prompt to enter the root password for maintenance.
- Understand the Data Impact Before Proceeding
- With Ephemeral Storage enabled, the NVMe device stores container images, the containerd filesystem, and any manual host-level configurations.
- All of this data was wiped when the VM was stopped. Proceeding with VM deletion does not result in additional data loss beyond what has already occurred.
- Delete the Affected VM
- Delete the affected nodepool VM via the Crusoe Console or API. The nodepool will automatically provision a fresh replacement.
- Do not attempt to manually repair /etc/fstab or recreate the filesystem on the wiped NVMe device. This creates a non-standard node configuration and does not restore lost container state.
- Verify the Replacement Node Joins the Cluster
- Confirm the new VM comes up RUNNING in the Crusoe Console and transitions to Ready in Kubernetes:
kubectl get nodes
- Verify the NVMe mount is healthy on the new node:
mount | grep nvme
- Reconfigure Any Node-Level Dependencies
- Any software or configuration that was applied directly to the previous node (e.g., Run:ai node configuration, manually installed drivers) will need to be reapplied to the new node, as this state does not persist across VM recreation.
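The Kubernetes-side checks in the steps above can be sketched as small filters over command output. This is an illustrative sketch rather than Crusoe tooling: the function names are hypothetical, and it assumes the default column layout of kubectl get nodes and that the ephemeral disk appears as a /dev/nvme* device in mount output.

```shell
#!/usr/bin/env bash
# Sketch: helpers that filter `kubectl get nodes` and `mount` output.

# Read `kubectl get nodes --no-headers` output on stdin and print the names of
# nodes whose STATUS column does not start with "Ready" (for example NotReady).
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Read `mount` output on stdin; succeed only if an NVMe device is mounted.
has_nvme_mount() {
  grep -q '^/dev/nvme'
}

# Example wiring (requires kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl wait --for=condition=Ready node/<node-name> --timeout=10m
# On the replacement node:
#   mount | has_nvme_mount && echo "NVMe mounted"
```

Before deletion, not_ready_nodes should list the affected node. After the nodepool provisions a replacement, kubectl wait blocks until the new node reports Ready or the timeout expires.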
How to Avoid This Issue
The safest way to reboot an unresponsive CMK worker node without wiping the ephemeral NVMe disk is to use RESET instead of STOP. A reset performs a hard reboot at the hypervisor level without deallocating the instance, preserving the ephemeral disk and its UUID.
Note: Currently, VM reset is only available via the Crusoe CLI:
crusoe compute vms reset --name <vm-name>
A Product Feature Request (PFR) has been filed to expose this option in the Crusoe Cloud Console UI.
Resolution
The following describes how this issue was resolved in a confirmed case:
- A CMK worker node became unresponsive. The customer issued a STOP command to attempt recovery, wiping the ephemeral NVMe disk and assigning it a new UUID.
- When the VM was restarted, the OS failed to mount the disk using the old UUID in /etc/fstab and halted in Emergency Mode. The node appeared RUNNING in the Crusoe Console but was NotReady in Kubernetes with no SSH access.
- Crusoe Support used the Serial Console to confirm Emergency Mode and identify the fstab UUID mismatch as the cause.
- After confirming that all container state had already been lost when the disk was wiped, the affected VM was deleted and the nodepool provisioned a healthy replacement.
- The new node came up Ready in Kubernetes with the NVMe mount intact.